2008-08-28 17:42:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET] FUSE: extend FUSE to support more operations

This patchset extends FUSE such that it supports more file operations
and is consisted of the following seven patches.

0001-FUSE-add-include-protectors.patch
0002-FUSE-pass-nonblock-flag-to-client.patch
0003-FUSE-implement-nonseekable-open.patch
0004-FUSE-implement-direct-lseek-support.patch
0005-FUSE-implement-ioctl-support.patch
0006-FUSE-implement-unsolicited-notification.patch
0007-FUSE-implement-poll-support.patch

The added features will be used primarily for CUSE but can be used by
any FUSE client. Accompanying libfuse updates will be posted
separately.

This patchset is on top of

2.6.27-rc4 (b8e6c91c74e9f0279b7c51048779b3d62da60b88)
+ [1] 9p-use-single-poller patchset
+ [2] wait-kill-is_sync_wait
+ [3] poll-allow-f_op_poll-to-sleep

The above three patches allow f_op->poll() to sleep and 0007 depends
on it.

This patchset is available in the following git tree.

http://git.kernel.org/?p=linux/kernel/git/tj/misc.git;a=shortlog;h=extend-fuse
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git extend-fuse

and contains the following changes.

fs/fuse/dev.c | 49 +++++
fs/fuse/file.c | 446 +++++++++++++++++++++++++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 31 +++
fs/fuse/inode.c | 5
include/linux/fuse.h | 82 +++++++++
5 files changed, 599 insertions(+), 14 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/726098
[2] http://article.gmane.org/gmane.linux.kernel/726176
[3] http://article.gmane.org/gmane.linux.kernel/726178


2008-08-28 17:42:54

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/7] FUSE: pass nonblock flag to client

Pass O_NONBLOCK to client on reads and writes using FUSE_READ_NONBLOCK
and FUSE_WRITE_NONBLOCK respectively.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/file.c | 10 +++++++---
include/linux/fuse.h | 3 +++
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2bada6b..d405865 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -386,14 +386,15 @@ static size_t fuse_send_read(struct fuse_req *req, struct file *file,
fl_owner_t owner)
{
struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_read_in *inarg = &req->misc.read.in;

fuse_read_fill(req, file, inode, pos, count, FUSE_READ);
if (owner != NULL) {
- struct fuse_read_in *inarg = &req->misc.read.in;
-
inarg->read_flags |= FUSE_READ_LOCKOWNER;
inarg->lock_owner = fuse_lock_owner_id(fc, owner);
}
+ if (file->f_flags & O_NONBLOCK)
+ inarg->read_flags |= FUSE_READ_NONBLOCK;
request_send(fc, req);
return req->out.args[0].size;
}
@@ -628,12 +629,15 @@ static size_t fuse_send_write(struct fuse_req *req, struct file *file,
fl_owner_t owner)
{
struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_write_in *inarg = &req->misc.write.in;
+
fuse_write_fill(req, file, file->private_data, inode, pos, count, 0);
if (owner != NULL) {
- struct fuse_write_in *inarg = &req->misc.write.in;
inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
inarg->lock_owner = fuse_lock_owner_id(fc, owner);
}
+ if (file->f_flags & O_NONBLOCK)
+ inarg->write_flags |= FUSE_WRITE_NONBLOCK;
request_send(fc, req);
return req->misc.write.out.size;
}
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 776ab72..724ca19 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -137,14 +137,17 @@ struct fuse_file_lock {
*
* FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
* FUSE_WRITE_LOCKOWNER: lock_owner field is valid
+ * FUSE_WRITE_NONBLOCK: perform non-blocking write
*/
#define FUSE_WRITE_CACHE (1 << 0)
#define FUSE_WRITE_LOCKOWNER (1 << 1)
+#define FUSE_WRITE_NONBLOCK (1 << 2)

/**
* Read flags
*/
#define FUSE_READ_LOCKOWNER (1 << 1)
+#define FUSE_READ_NONBLOCK (1 << 2)

enum fuse_opcode {
FUSE_LOOKUP = 1,
--
1.5.4.5

2008-08-28 17:43:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/7] FUSE: implement nonseekable open

Let the client request nonseekable open using FOPEN_NONSEEKABLE and
call nonseekable_open() on the file if requested.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/file.c | 2 ++
include/linux/fuse.h | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d405865..9c44f9c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -101,6 +101,8 @@ void fuse_finish_open(struct inode *inode, struct file *file,
file->f_op = &fuse_direct_io_file_operations;
if (!(outarg->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
+ if (outarg->open_flags & FOPEN_NONSEEKABLE)
+ nonseekable_open(inode, file);
ff->fh = outarg->fh;
file->private_data = fuse_file_get(ff);
}
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 724ca19..431666e 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -101,9 +101,11 @@ struct fuse_file_lock {
*
* FOPEN_DIRECT_IO: bypass page cache for this open file
* FOPEN_KEEP_CACHE: don't invalidate the data cache on open
+ * FOPEN_NONSEEKABLE: the file is not seekable
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)
+#define FOPEN_NONSEEKABLE (1 << 2)

/**
* INIT request/reply flags
--
1.5.4.5

2008-08-28 17:43:34

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/7] FUSE: add include protectors

Add include protectors to include/linux/fuse.h and fs/fuse/fuse_i.h.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/fuse_i.h | 5 +++++
include/linux/fuse.h | 5 +++++
2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3a87607..e2b3b72 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -6,6 +6,9 @@
See the file COPYING.
*/

+#ifndef __FS_FUSE_I_H
+#define __FS_FUSE_I_H
+
#include <linux/fuse.h>
#include <linux/fs.h>
#include <linux/mount.h>
@@ -655,3 +658,5 @@ void fuse_set_nowrite(struct inode *inode);
void fuse_release_nowrite(struct inode *inode);

u64 fuse_get_attr_version(struct fuse_conn *fc);
+
+#endif /* __FS_FUSE_I_H */
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 265635d..776ab72 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -19,6 +19,9 @@
* - add file flags field to fuse_read_in and fuse_write_in
*/

+#ifndef __LINUX_FUSE_H
+#define __LINUX_FUSE_H
+
#include <asm/types.h>
#include <linux/major.h>

@@ -409,3 +412,5 @@ struct fuse_dirent {
#define FUSE_DIRENT_ALIGN(x) (((x) + sizeof(__u64) - 1) & ~(sizeof(__u64) - 1))
#define FUSE_DIRENT_SIZE(d) \
FUSE_DIRENT_ALIGN(FUSE_NAME_OFFSET + (d)->namelen)
+
+#endif /* __LINUX_FUSE_H */
--
1.5.4.5

2008-08-28 17:43:54

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/7] FUSE: implement direct lseek support

Allow clients to implement private lseek. The feature is negotiated
using FUSE_DIRECT_LSEEK flag during INIT. If the client doesn't
request direct lseek, the original implicit lseek is used.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/file.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
fs/fuse/fuse_i.h | 3 ++
fs/fuse/inode.c | 4 ++-
include/linux/fuse.h | 14 +++++++++++++
4 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9c44f9c..fa27edb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1448,18 +1448,53 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)

static loff_t fuse_file_llseek(struct file *file, loff_t offset, int origin)
{
- loff_t retval;
+ loff_t retval = -EINVAL;
struct inode *inode = file->f_path.dentry->d_inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (is_bad_inode(inode))
+ return -EIO;

mutex_lock(&inode->i_mutex);
- switch (origin) {
- case SEEK_END:
- offset += i_size_read(inode);
- break;
- case SEEK_CUR:
- offset += file->f_pos;
+
+ if (fc->direct_lseek) {
+ struct fuse_file *ff = file->private_data;
+ struct fuse_lseek_in inarg = { .fh = ff->fh, .pos = file->f_pos,
+ .offset = offset, .origin = origin };
+ struct fuse_lseek_out outarg;
+ struct fuse_req *req;
+ int err;
+
+ req = fuse_get_req(fc);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ req->in.h.opcode = FUSE_LSEEK;
+ req->in.h.nodeid = get_node_id(inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(inarg);
+ req->in.args[0].value = &inarg;
+ req->out.numargs = 1;
+ req->out.args[0].size = sizeof(outarg);
+ req->out.args[0].value = &outarg;
+ request_send(fc, req);
+ err = req->out.h.error;
+ fuse_put_request(fc, req);
+
+ if (err)
+ return err;
+
+ offset = outarg.pos;
+ } else {
+ switch (origin) {
+ case SEEK_END:
+ offset += i_size_read(inode);
+ break;
+ case SEEK_CUR:
+ offset += file->f_pos;
+ }
}
- retval = -EINVAL;
+
if (offset >= 0 && offset <= inode->i_sb->s_maxbytes) {
if (offset != file->f_pos) {
file->f_pos = offset;
@@ -1467,6 +1502,7 @@ static loff_t fuse_file_llseek(struct file *file, loff_t offset, int origin)
}
retval = offset;
}
+
mutex_unlock(&inode->i_mutex);
return retval;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e2b3b72..2bf2209 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -366,6 +366,9 @@ struct fuse_conn {
/** Do not send separate SETATTR request before open(O_TRUNC) */
unsigned atomic_o_trunc : 1;

+ /** Do direct lseek */
+ unsigned direct_lseek : 1;
+
/** Filesystem supports NFS exporting. Only set in INIT */
unsigned export_support : 1;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d2249f1..7693dfc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -757,6 +757,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
}
if (arg->flags & FUSE_BIG_WRITES)
fc->big_writes = 1;
+ if (arg->flags & FUSE_DIRECT_LSEEK)
+ fc->direct_lseek = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -781,7 +783,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
arg->minor = FUSE_KERNEL_MINOR_VERSION;
arg->max_readahead = fc->bdi.ra_pages * PAGE_CACHE_SIZE;
arg->flags |= FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC |
- FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES;
+ FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DIRECT_LSEEK;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 431666e..508d54e 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -118,6 +118,7 @@ struct fuse_file_lock {
#define FUSE_ATOMIC_O_TRUNC (1 << 3)
#define FUSE_EXPORT_SUPPORT (1 << 4)
#define FUSE_BIG_WRITES (1 << 5)
+#define FUSE_DIRECT_LSEEK (1 << 6)

/**
* Release flags
@@ -188,6 +189,7 @@ enum fuse_opcode {
FUSE_INTERRUPT = 36,
FUSE_BMAP = 37,
FUSE_DESTROY = 38,
+ FUSE_LSEEK = 39,
};

/* The read buffer is required to be at least 8k, but may be much larger */
@@ -388,6 +390,18 @@ struct fuse_bmap_out {
__u64 block;
};

+struct fuse_lseek_in {
+ __u64 fh;
+ __u64 pos; /* current file position */
+ __u64 offset; /* seek offset */
+ __u32 origin; /* SEEK_* */
+ __u32 padding;
+};
+
+struct fuse_lseek_out {
+ __u64 pos; /* new position */
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
--
1.5.4.5

2008-08-28 17:44:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 6/7] FUSE: implement unsolicited notification

Clients always used to write only in response to read requests. To
implement poll efficiently, clients should be able to issue
unsolicited notifications. This patch implements basic notification
support.

Zero fuse_out_header.unique is now accepted and considered unsolicited
notification and the error field contains notification code. This
patch doesn't implement any actual notification.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/dev.c | 34 ++++++++++++++++++++++++++++++++--
include/linux/fuse.h | 4 ++++
2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 87250b6..2fb65a5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -813,6 +813,21 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
return err;
}

+static int fuse_handle_notify(struct fuse_conn *fc, enum fuse_notify_code code,
+ unsigned int size, struct fuse_copy_state *cs)
+{
+ int err;
+
+ switch (code) {
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ fuse_copy_finish(cs);
+ return err;
+}
+
/* Look up request on processing list by unique ID */
static struct fuse_req *request_find(struct fuse_conn *fc, u64 unique)
{
@@ -876,9 +891,24 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, const struct iovec *iov,
err = fuse_copy_one(&cs, &oh, sizeof(oh));
if (err)
goto err_finish;
+
+ if (oh.len != nbytes)
+ goto err_finish;
+
+ /*
+ * Zero oh.unique indicates unsolicited notification message
+ * and error contains notification code.
+ */
+ if (!oh.unique) {
+ err = fuse_handle_notify(fc, oh.error, nbytes - sizeof(oh),
+ &cs);
+ if (err)
+ return err;
+ return nbytes;
+ }
+
err = -EINVAL;
- if (!oh.unique || oh.error <= -1000 || oh.error > 0 ||
- oh.len != nbytes)
+ if (oh.error <= -1000 || oh.error > 0)
goto err_finish;

spin_lock(&fc->lock);
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 5a396d3..0044590 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -206,6 +206,10 @@ enum fuse_opcode {
FUSE_IOCTL = 40,
};

+enum fuse_notify_code {
+ FUSE_NOTIFY_CODE_MAX,
+};
+
/* The read buffer is required to be at least 8k, but may be much larger */
#define FUSE_MIN_READ_BUFFER 8192

--
1.5.4.5

2008-08-28 17:44:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 5/7] FUSE: implement ioctl support

ioctl support is tricky to implement because only the ioctl
implementation itself knows which memory regions need to be read
and/or written. To support this, fuse client can request retry of
ioctl specifying memory regions to read and write. Deep copying
(nested pointers) can be implemented by retrying multiple times
resolving one depth of dereference at a time.

Plese read the comment on top of fs/fuse/file.c::fuse_file_do_ioctl()
for more information.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/file.c | 241 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fuse.h | 30 ++++++
2 files changed, 271 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index fa27edb..75a9b53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1507,6 +1507,243 @@ static loff_t fuse_file_llseek(struct file *file, loff_t offset, int origin)
return retval;
}

+static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
+ unsigned int nr_segs, size_t bytes, bool to_user)
+{
+ struct iov_iter ii;
+ int page_idx = 0;
+
+ if (!bytes)
+ return 0;
+
+ iov_iter_init(&ii, iov, nr_segs, bytes, 0);
+
+ while (iov_iter_count(&ii)) {
+ struct page *page = pages[page_idx++];
+ size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
+ void *kaddr, *map;
+
+ kaddr = map = kmap(page);
+
+ while (todo) {
+ char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
+ size_t iov_len = ii.iov->iov_len - ii.iov_offset;
+ size_t copy = min(todo, iov_len);
+ size_t left;
+
+ if (!to_user)
+ left = copy_from_user(kaddr, uaddr, copy);
+ else
+ left = copy_to_user(uaddr, kaddr, copy);
+
+ if (unlikely(left))
+ return -EFAULT;
+
+ iov_iter_advance(&ii, copy);
+ todo -= copy;
+ kaddr += copy;
+ }
+
+ kunmap(map);
+ }
+
+ return 0;
+}
+
+/*
+ * For ioctls, there is no generic way to determine how much memory
+ * needs to be read and/or written. Furthermore, ioctls are allowed
+ * to dereference the passed pointer, so the parameter requires deep
+ * copying but FUSE has no idea whatsoever about what to copy in or
+ * out.
+ *
+ * This is solved by allowing FUSE client to retry ioctl with
+ * necessary in/out iovecs. Let's assume the ioctl implementation
+ * needs to read in the following structure.
+ *
+ * struct a {
+ * char *buf;
+ * size_t buflen;
+ * }
+ *
+ * On the first callout to FUSE client, inarg->in_size and
+ * inarg->out_size will be NULL; then, the client completes the ioctl
+ * with FUSE_IOCTL_RETRY set in out->flags, out->in_iovs set to 1 and
+ * the actual iov array to
+ *
+ * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) } }
+ *
+ * which tells FUSE to copy in the requested area and retry the ioctl.
+ * On the second round, the client has access to the structure and
+ * from that it can tell what to look for next, so on the invocation,
+ * it sets FUSE_IOCTL_RETRY, out->in_iovs to 2 and iov array to
+ *
+ * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) },
+ * { .iov_base = a.buf, .iov_len = a.buflen } }
+ *
+ * FUSE will copy both struct a and the pointed buffer from the
+ * process doing the ioctl and retry ioctl with both struct a and the
+ * buffer.
+ *
+ * This time, FUSE client has everything it needs and completes ioctl
+ * without FUSE_IOCTL_RETRY which finishes the ioctl call.
+ *
+ * Copying data out works the same way.
+ */
+static long fuse_file_do_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg, unsigned int flags)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct fuse_file *ff = file->private_data;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_ioctl_in inarg = { .fh = ff->fh, .cmd = cmd, .arg = arg,
+ .flags = flags };
+ struct fuse_ioctl_out outarg;
+ struct fuse_req *req = NULL;
+ struct page **pages = NULL;
+ struct page *iov_page = NULL;
+ struct iovec *in_iov = NULL, *out_iov = NULL;
+ unsigned int in_iovs = 0, out_iovs = 0, num_pages = 0, max_pages;
+ size_t in_size, out_size, transferred;
+ int err;
+
+ /* assume all the iovs returned by client always fits in a page */
+ BUILD_BUG_ON(sizeof(struct iovec) * FUSE_IOCTL_MAX_IOV > PAGE_SIZE);
+
+ err = -EIO;
+ if (is_bad_inode(inode))
+ goto out;
+
+ err = -ENOMEM;
+ pages = kzalloc(sizeof(pages[0]) * FUSE_MAX_PAGES_PER_REQ, GFP_KERNEL);
+ iov_page = alloc_page(GFP_KERNEL);
+ if (!pages || !iov_page)
+ goto out;
+
+ retry:
+ inarg.in_size = in_size = iov_length(in_iov, in_iovs);
+ inarg.out_size = out_size = iov_length(out_iov, out_iovs);
+
+ /*
+ * Out data can be used either for actual out data or iovs,
+ * make sure there always is at least one page.
+ */
+ out_size = max_t(size_t, out_size, PAGE_SIZE);
+ max_pages = DIV_ROUND_UP(max(in_size, out_size), PAGE_SIZE);
+
+ /* make sure there are enough buffer pages and init request with them */
+ err = -ENOMEM;
+ if (max_pages > FUSE_MAX_PAGES_PER_REQ)
+ goto out;
+ while (num_pages < max_pages) {
+ pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
+ if (!pages[num_pages])
+ goto out;
+ num_pages++;
+ }
+
+ req = fuse_get_req(fc);
+ if (IS_ERR(req)) {
+ req = NULL;
+ err = PTR_ERR(req);
+ goto out;
+ }
+ memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages);
+ req->num_pages = num_pages;
+
+ /* okay, let's send it to the client */
+ req->in.h.opcode = FUSE_IOCTL;
+ req->in.h.nodeid = get_node_id(inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(inarg);
+ req->in.args[0].value = &inarg;
+ if (in_size) {
+ req->in.numargs++;
+ req->in.args[1].size = in_size;
+ req->in.argpages = 1;
+
+ err = fuse_ioctl_copy_user(pages, in_iov, in_iovs, in_size,
+ false);
+ if (err)
+ goto out;
+ }
+
+ req->out.numargs = 2;
+ req->out.args[0].size = sizeof(outarg);
+ req->out.args[0].value = &outarg;
+ req->out.args[1].size = out_size;
+ req->out.argpages = 1;
+ req->out.argvar = 1;
+
+ request_send(fc, req);
+ err = req->out.h.error;
+ transferred = req->out.args[1].size;
+ fuse_put_request(fc, req);
+ req = NULL;
+ if (err)
+ goto out;
+
+ /* did it ask for retry? */
+ if (outarg.flags & FUSE_IOCTL_RETRY) {
+ char *vaddr;
+
+ in_iovs = outarg.in_iovs;
+ out_iovs = outarg.out_iovs;
+
+ /*
+ * Make sure things are in boundary, separate checks
+ * are to protect against overflow.
+ */
+ err = -ENOMEM;
+ if (in_iovs > FUSE_IOCTL_MAX_IOV ||
+ out_iovs > FUSE_IOCTL_MAX_IOV ||
+ in_iovs + out_iovs > FUSE_IOCTL_MAX_IOV)
+ goto out;
+
+ err = -EBADE;
+ if ((in_iovs + out_iovs) * sizeof(struct iovec) != transferred)
+ goto out;
+
+ /* okay, copy in iovs and retry */
+ vaddr = kmap_atomic(pages[0], KM_USER0);
+ memcpy(page_address(iov_page), vaddr, transferred);
+ kunmap_atomic(vaddr, KM_USER0);
+
+ in_iov = page_address(iov_page);
+ out_iov = in_iov + in_iovs;
+
+ goto retry;
+ }
+
+ err = -EBADE;
+ if (transferred > inarg.out_size)
+ goto out;
+
+ err = fuse_ioctl_copy_user(pages, out_iov, out_iovs, transferred, true);
+ out:
+ if (req)
+ fuse_put_request(fc, req);
+ if (iov_page)
+ __free_page(iov_page);
+ while (num_pages)
+ __free_page(pages[--num_pages]);
+ kfree(pages);
+
+ return err ? err : outarg.result;
+}
+
+static long fuse_file_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ return fuse_file_do_ioctl(file, cmd, arg, 0);
+}
+
+static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ return fuse_file_do_ioctl(file, cmd, arg, FUSE_IOCTL_COMPAT);
+}
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read = do_sync_read,
@@ -1521,6 +1758,8 @@ static const struct file_operations fuse_file_operations = {
.lock = fuse_file_lock,
.flock = fuse_file_flock,
.splice_read = generic_file_splice_read,
+ .unlocked_ioctl = fuse_file_ioctl,
+ .compat_ioctl = fuse_file_compat_ioctl,
};

static const struct file_operations fuse_direct_io_file_operations = {
@@ -1533,6 +1772,8 @@ static const struct file_operations fuse_direct_io_file_operations = {
.fsync = fuse_fsync,
.lock = fuse_file_lock,
.flock = fuse_file_flock,
+ .unlocked_ioctl = fuse_file_ioctl,
+ .compat_ioctl = fuse_file_compat_ioctl,
/* no mmap and splice_read */
};

diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 508d54e..5a396d3 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -152,6 +152,19 @@ struct fuse_file_lock {
#define FUSE_READ_LOCKOWNER (1 << 1)
#define FUSE_READ_NONBLOCK (1 << 2)

+/**
+ * Ioctl flags
+ *
+ * FUSE_IOCTL_COMPAT: 32bit compat ioctl on 64bit machine
+ * FUSE_IOCTL_RETRY: retry with new iovecs
+ *
+ * FUSE_IOCTL_MAX_IOV: maximum of in_iovecs + out_iovecs
+ */
+#define FUSE_IOCTL_COMPAT (1 << 0)
+#define FUSE_IOCTL_RETRY (1 << 1)
+
+#define FUSE_IOCTL_MAX_IOV 256
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -190,6 +203,7 @@ enum fuse_opcode {
FUSE_BMAP = 37,
FUSE_DESTROY = 38,
FUSE_LSEEK = 39,
+ FUSE_IOCTL = 40,
};

/* The read buffer is required to be at least 8k, but may be much larger */
@@ -402,6 +416,22 @@ struct fuse_lseek_out {
__u64 pos; /* new position */
};

+struct fuse_ioctl_in {
+ __u64 fh;
+ __u32 flags;
+ __u32 cmd;
+ __u64 arg;
+ __u32 in_size;
+ __u32 out_size;
+};
+
+struct fuse_ioctl_out {
+ __s32 result;
+ __u32 flags;
+ __u32 in_iovs;
+ __u32 out_iovs;
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
--
1.5.4.5

2008-08-28 17:44:37

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 7/7] FUSE: implement poll support

Implement poll support. Polled files are indexed using fh in a RB
tree rooted at fuse_conn->polled_files. All pollable files should
have unique fh as that's how notifications are matched to files. If
duplicate fhs are detected, FUSE spits out warning message. Poll will
malfunction but otherwise it will work fine.

Client should send FUSE_NOTIFY_POLL notification once after processing
FUSE_POLL which has FUSE_POLL_SCHEDULE_NOTIFY set. Sending
notification unconditionally after the latest poll or everytime file
content might have changed is inefficient but won't cause malfunction.

fuse_file_poll() can sleep and requires patches from the following
thread which allows f_op->poll() to sleep.

http://thread.gmane.org/gmane.linux.kernel/726176

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/dev.c | 15 +++++
fs/fuse/file.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/fuse_i.h | 23 ++++++++
fs/fuse/inode.c | 1 +
include/linux/fuse.h | 24 +++++++++
5 files changed, 204 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2fb65a5..1c422f9 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -819,11 +819,26 @@ static int fuse_handle_notify(struct fuse_conn *fc, enum fuse_notify_code code,
int err;

switch (code) {
+ case FUSE_NOTIFY_POLL: {
+ struct fuse_notify_poll_wakeup_out outarg;
+
+ err = -EINVAL;
+ if (size != sizeof(outarg))
+ goto out;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto out;
+
+ err = fuse_notify_poll_wakeup(fc, &outarg);
+ break;
+ }
default:
err = -EINVAL;
break;
}

+ out:
fuse_copy_finish(cs);
return err;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 75a9b53..5d704e3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -59,6 +59,8 @@ struct fuse_file *fuse_file_alloc(void)
INIT_LIST_HEAD(&ff->write_entry);
atomic_set(&ff->count, 0);
}
+ RB_CLEAR_NODE(&ff->polled_node);
+ init_waitqueue_head(&ff->poll_wait);
}
return ff;
}
@@ -167,7 +169,11 @@ int fuse_release_common(struct inode *inode, struct file *file, int isdir)

spin_lock(&fc->lock);
list_del(&ff->write_entry);
+ if (!RB_EMPTY_NODE(&ff->polled_node))
+ rb_erase(&ff->polled_node, &fc->polled_files);
spin_unlock(&fc->lock);
+
+ wake_up_interruptible_sync(&ff->poll_wait);
/*
* Normally this will send the RELEASE request,
* however if some asynchronous READ or WRITE requests
@@ -1744,6 +1750,139 @@ static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
return fuse_file_do_ioctl(file, cmd, arg, FUSE_IOCTL_COMPAT);
}

+/*
+ * All files which have been polled are linked to RB tree
+ * fuse_conn->polled_files which is indexed by fh. Walk the tree and
+ * find the matching one.
+ */
+static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 fh,
+ struct rb_node **parent_out)
+{
+ struct rb_node **link = &fc->polled_files.rb_node;
+ struct rb_node *last = NULL;
+
+ while (*link) {
+ struct fuse_file *ff = rb_entry(*link, struct fuse_file,
+ polled_node);
+ last = *link;
+
+ if (fh < ff->fh)
+ link = &(*link)->rb_left;
+ else if (fh > ff->fh)
+ link = &(*link)->rb_right;
+ else
+ return link;
+ }
+
+ if (parent_out)
+ *parent_out = last;
+ return link;
+}
+
+/*
+ * The file is about to be polled. Make sure it's on the polled_files
+ * RB tree. Note that files once added to the polled_files tree are
+ * not removed before the file is released. This is because a file
+ * polled once is likely to be polled again.
+ */
+static void fuse_register_polled_file(struct fuse_conn *fc,
+ struct fuse_file *ff)
+{
+ if (!RB_EMPTY_NODE(&ff->polled_node))
+ return;
+
+ spin_lock(&fc->lock);
+
+ if (RB_EMPTY_NODE(&ff->polled_node)) {
+ struct rb_node **link, *parent;
+
+ link = fuse_find_polled_node(fc, ff->fh, &parent);
+ if (likely(!*link)) {
+ rb_link_node(&ff->polled_node, parent, link);
+ rb_insert_color(&ff->polled_node, &fc->polled_files);
+ } else if (!fc->warned_duplicate_fh) {
+ printk(KERN_WARNING "FUSE %u:%u duplicate fh detected, "
+ "poll will malfunction, please fix client\n",
+ MAJOR(fc->dev), MINOR(fc->dev));
+ fc->warned_duplicate_fh = 1;
+ }
+ }
+
+ spin_unlock(&fc->lock);
+}
+
+static unsigned fuse_file_poll(struct file *file, poll_table *wait)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct fuse_file *ff = file->private_data;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_poll_in inarg = { .fh = ff->fh };
+ struct fuse_poll_out outarg;
+ struct fuse_req *req;
+ int err;
+
+ if (fc->no_poll)
+ return DEFAULT_POLLMASK;
+
+ poll_wait(file, &ff->poll_wait, wait);
+
+ /*
+ * Ask for notification iff there's someone waiting for it.
+ * The client may ignore the flag and always notify.
+ */
+ if (waitqueue_active(&ff->poll_wait)) {
+ inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY;
+ fuse_register_polled_file(fc, ff);
+ }
+
+ req = fuse_get_req(fc);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ req->in.h.opcode = FUSE_POLL;
+ req->in.h.nodeid = get_node_id(inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(inarg);
+ req->in.args[0].value = &inarg;
+ req->out.numargs = 1;
+ req->out.args[0].size = sizeof(outarg);
+ req->out.args[0].value = &outarg;
+ request_send(fc, req);
+ err = req->out.h.error;
+ fuse_put_request(fc, req);
+
+ if (!err)
+ return outarg.revents;
+ if (err == -ENOSYS) {
+ fc->no_poll = 1;
+ return DEFAULT_POLLMASK;
+ }
+ return POLLERR;
+}
+
+/*
+ * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and
+ * wakes up the poll waiters.
+ */
+int fuse_notify_poll_wakeup(struct fuse_conn *fc,
+ struct fuse_notify_poll_wakeup_out *outarg)
+{
+ u64 fh = outarg->fh;
+ struct rb_node **link;
+
+ spin_lock(&fc->lock);
+
+ link = fuse_find_polled_node(fc, fh, NULL);
+ if (*link) {
+ struct fuse_file *ff = rb_entry(*link, struct fuse_file,
+ polled_node);
+ wake_up_interruptible_sync(&ff->poll_wait);
+ }
+
+ spin_unlock(&fc->lock);
+ return 0;
+}
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read = do_sync_read,
@@ -1760,6 +1899,7 @@ static const struct file_operations fuse_file_operations = {
.splice_read = generic_file_splice_read,
.unlocked_ioctl = fuse_file_ioctl,
.compat_ioctl = fuse_file_compat_ioctl,
+ .poll = fuse_file_poll,
};

static const struct file_operations fuse_direct_io_file_operations = {
@@ -1774,6 +1914,7 @@ static const struct file_operations fuse_direct_io_file_operations = {
.flock = fuse_file_flock,
.unlocked_ioctl = fuse_file_ioctl,
.compat_ioctl = fuse_file_compat_ioctl,
+ .poll = fuse_file_poll,
/* no mmap and splice_read */
};

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2bf2209..f884160 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -19,6 +19,8 @@
#include <linux/backing-dev.h>
#include <linux/mutex.h>
#include <linux/rwsem.h>
+#include <linux/rbtree.h>
+#include <linux/poll.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -108,6 +110,12 @@ struct fuse_file {

/** Entry on inode's write_files list */
struct list_head write_entry;
+
+ /** RB node to be linked on fuse_conn->polled_files */
+ struct rb_node polled_node;
+
+ /** Wait queue head for poll */
+ wait_queue_head_t poll_wait;
};

/** One input argument of a request */
@@ -322,6 +330,9 @@ struct fuse_conn {
/** The list of requests under I/O */
struct list_head io;

+ /** rbtree of fuse_files waiting for poll events */
+ struct rb_root polled_files;
+
/** Number of requests currently in the background */
unsigned num_background;

@@ -413,9 +424,15 @@ struct fuse_conn {
/** Is bmap not implemented by fs? */
unsigned no_bmap : 1;

+ /** Is poll not implemented by fs? */
+ unsigned no_poll : 1;
+
/** Do multi-page cached writes */
unsigned big_writes : 1;

+ /** Warned about duplicate fh's when using poll? */
+ unsigned warned_duplicate_fh : 1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;

@@ -522,6 +539,12 @@ int fuse_fsync_common(struct file *file, struct dentry *de, int datasync,
int isdir);

/**
+ * Notify poll wakeup
+ */
+int fuse_notify_poll_wakeup(struct fuse_conn *fc,
+ struct fuse_notify_poll_wakeup_out *outarg);
+
+/**
* Initialize file operations on a regular file
*/
void fuse_init_file_inode(struct inode *inode);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7693dfc..088ba6e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -485,6 +485,7 @@ static struct fuse_conn *new_conn(struct super_block *sb)
fc->bdi.unplug_io_fn = default_unplug_io_fn;
/* fuse does it's own writeback accounting */
fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
+ fc->polled_files = RB_ROOT;
fc->dev = sb->s_dev;
err = bdi_init(&fc->bdi);
if (err)
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 0044590..b772b4a 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -165,6 +165,13 @@ struct fuse_file_lock {

#define FUSE_IOCTL_MAX_IOV 256

+/**
+ * Poll flags
+ *
+ * FUSE_POLL_SCHEDULE_NOTIFY: request poll notify
+ */
+#define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -204,9 +211,11 @@ enum fuse_opcode {
FUSE_DESTROY = 38,
FUSE_LSEEK = 39,
FUSE_IOCTL = 40,
+ FUSE_POLL = 41,
};

enum fuse_notify_code {
+ FUSE_NOTIFY_POLL = 1,
FUSE_NOTIFY_CODE_MAX,
};

@@ -436,6 +445,21 @@ struct fuse_ioctl_out {
__u32 out_iovs;
};

+struct fuse_poll_in {
+ __u64 fh;
+ __u32 flags;
+ __u32 padding;
+};
+
+struct fuse_poll_out {
+ __u32 revents;
+ __u32 padding;
+};
+
+struct fuse_notify_poll_wakeup_out {
+ __u64 fh;
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
--
1.5.4.5

2008-08-28 17:54:16

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Fri, Aug 29, 2008 at 02:41:01AM +0900, Tejun Heo wrote:
> ioctl support is tricky to implement because only the ioctl
> implementation itself knows which memory regions need to be read
> and/or written. To support this, fuse client can request retry of
> ioctl specifying memory regions to read and write. Deep copying
> (nested pointers) can be implemented by retrying multiple times
> resolving one depth of dereference at a time.

Why do we need ioctls? For CUSE? In that case, would we need to copy
the memory from userspace, into the kernel, and then back out into
userspace again? Can't we just have a "pass-through" type fixed ioctl
instead?

thanks,

greg k-h

2008-08-28 18:01:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Greg KH wrote:
> Why do we need ioctls? For CUSE? In that case, would we need to copy
> the memory from userspace, into the kernel, and then back out into
> userspace again? Can't we just have a "pass-through" type fixed ioctl
> instead?

Can you elaborate a bit? How the fixed ioctl would know how much to
copy in and out and from where?

Thanks.

--
tejun

2008-08-28 18:04:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Tejun Heo wrote:
> Greg KH wrote:
>> Why do we need ioctls? For CUSE? In that case, would we need to copy
>> the memory from userspace, into the kernel, and then back out into
>> userspace again? Can't we just have a "pass-through" type fixed ioctl
>> instead?
>
> Can you elaborate a bit? How the fixed ioctl would know how much to
> copy in and out and from where?

If you're worried about the double copying due to performance reasons,
what we can do is implementing userspace to userspace copying. I just
took easier path of copying twice as u-u copy can't be done using the
existing FUSE mechanics.

--
tejun

2008-08-28 18:08:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Tejun Heo wrote:
> Greg KH wrote:
>> Why do we need ioctls? For CUSE? In that case, would we need to copy
>> the memory from userspace, into the kernel, and then back out into
>> userspace again? Can't we just have a "pass-through" type fixed ioctl
>> instead?
>
> Can you elaborate a bit? How the fixed ioctl would know how much to
> copy in and out and from where?

If you're worried about the double copying due to performance reasons,
what we can do is implementing userspace to userspace copying. I just
took easier path of copying twice as u-u copy can't be done using the
existing FUSE mechanics, but frankly, at this point, I think that would
be an premature optimization.

--
tejun

2008-08-28 18:14:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, Tejun Heo wrote:
> Tejun Heo wrote:
> > Greg KH wrote:
> >> Why do we need ioctls? For CUSE? In that case, would we need to copy
> >> the memory from userspace, into the kernel, and then back out into
> >> userspace again? Can't we just have a "pass-through" type fixed ioctl
> >> instead?
> >
> > Can you elaborate a bit? How the fixed ioctl would know how much to
> > copy in and out and from where?
>
> If you're worried about the double copying due to performance reasons,
> what we can do is implementing userspace to userspace copying. I just
> took easier path of copying twice as u-u copy can't be done using the
> existing FUSE mechanics.

How about not copying anything in the kernel, just passing the virtual
address to the filesystem? It can get/modify the data of the calling
task by mapping /proc/pid/mem. Ugly, yes, but ioctls are inherently
ugly.

Miklos

2008-08-28 18:19:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
>> If you're worried about the double copying due to performance reasons,
>> what we can do is implementing userspace to userspace copying. I just
>> took easier path of copying twice as u-u copy can't be done using the
>> existing FUSE mechanics.
>
> How about not copying anything in the kernel, just passing the virtual
> address to the filesystem? It can get/modify the data of the calling
> task by mapping /proc/pid/mem. Ugly, yes, but ioctls are inherently
> ugly.

Hmmm... I was trying to stay within similar operation mechanics as other
ops. Directly accessing the caller's memory has performance benefits
but that benefit can also be used by reads and writes. So, if we're
gonna do direct memory access, maybe doing it in more generic way is a
better idea?

Thanks.

--
tejun

2008-08-28 18:20:26

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Fri, 29 Aug 2008, Tejun Heo wrote:
> This patchset extends FUSE such that it supports more file operations
> and is consisted of the following seven patches.
>
> 0001-FUSE-add-include-protectors.patch
> 0002-FUSE-pass-nonblock-flag-to-client.patch
> 0003-FUSE-implement-nonseekable-open.patch
> 0004-FUSE-implement-direct-lseek-support.patch
> 0005-FUSE-implement-ioctl-support.patch
> 0006-FUSE-implement-unsolicited-notification.patch
> 0007-FUSE-implement-poll-support.patch
>
> The added features will be used primarily for CUSE but can be used by
> any FUSE client. Accompanying libfuse updates will be posted
> separately.

Guessing that CUSE stands for "Character Devices in Userspace"?

Is there a description of this? It's for sound device emulation,
right?

What is direct-lseek for? It doesn't sound like a feature needed by
char devices.

Thanks,
Miklos

2008-08-28 18:22:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Tejun Heo wrote:
> ioctl support is tricky to implement because only the ioctl
> implementation itself knows which memory regions need to be read
> and/or written. To support this, fuse client can request retry of
> ioctl specifying memory regions to read and write. Deep copying
> (nested pointers) can be implemented by retrying multiple times
> resolving one depth of dereference at a time.

Okay, I'm going to say it... wouldn't it be better to have some kind of
data structure description language which the userspace can register
with the kernel to linearize the data on a per-ioctl basis (kind of like
rpcgen)?

-hpa

2008-08-28 18:23:13

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, Aug 28, 2008 at 08:02:44PM +0200, Tejun Heo wrote:
> Tejun Heo wrote:
> > Greg KH wrote:
> >> Why do we need ioctls? For CUSE? In that case, would we need to copy
> >> the memory from userspace, into the kernel, and then back out into
> >> userspace again? Can't we just have a "pass-through" type fixed ioctl
> >> instead?
> >
> > Can you elaborate a bit? How the fixed ioctl would know how much to
> > copy in and out and from where?
>
> If you're worried about the double copying due to performance reasons,
> what we can do is implementing userspace to userspace copying. I just
> took easier path of copying twice as u-u copy can't be done using the
> existing FUSE mechanics, but frankly, at this point, I think that would
> be an premature optimization.

No, I'm not worried about the performance, just that this should be
simple as we can just pass "arg" to userspace without touching it as it
just came from userspace, right?

Oh wait, there are process space issues at play here that I'm totally
forgetting about, right?

thanks,

greg k-h

2008-08-28 18:24:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, Tejun Heo wrote:
> Miklos Szeredi wrote:
> >> If you're worried about the double copying due to performance reasons,
> >> what we can do is implementing userspace to userspace copying. I just
> >> took easier path of copying twice as u-u copy can't be done using the
> >> existing FUSE mechanics.
> >
> > How about not copying anything in the kernel, just passing the virtual
> > address to the filesystem? It can get/modify the data of the calling
> > task by mapping /proc/pid/mem. Ugly, yes, but ioctls are inherently
> > ugly.
>
> Hmmm... I was trying to stay within similar operation mechanics as other
> ops. Directly accessing the caller's memory has performance benefits
> but that benefit can also be used by reads and writes. So, if we're
> gonna do direct memory access, maybe doing it in more generic way is a
> better idea?

On the contrary: playing VM games is going to be slow. I think this
approach is best suited for generic ioctl support because it
simplifies the kernel part. I'd hate to add all that complexity to
the kernel if not absolutely necessary.

Miklos

2008-08-28 18:24:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Miklos Szeredi wrote:
> Guessing that CUSE stands for "Character Devices in Userspace"?

Yeap, right.

> Is there a description of this? It's for sound device emulation,
> right?

I did it for OSS emulation but it can be used for anything. Hmm... I'm
still in the process of pushing out patches, so please wait just a bit.
The libfuse changes include simple examples so that should make things
a bit clearer.

> What is direct-lseek for? It doesn't sound like a feature needed by
> char devices.

Sound device just sets nonseekable. The direct lseek bit is for API
completeness as there are chardevs which have special semantics
regarding lseek.

Thanks.

--
tejun

2008-08-28 18:27:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Greg KH wrote:
> No, I'm not worried about the performance, just that this should be
> simple as we can just pass "arg" to userspace without touching it as it
> just came from userspace, right?

For simple ones, that's how it's gonna be handled. Those copying in and
out kicks in when the @arg is pointer or worse data structure which can
contain pointers.

> Oh wait, there are process space issues at play here that I'm totally
> forgetting about, right?

:-)

Thanks.

--
tejun

2008-08-28 18:29:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

H. Peter Anvin wrote:
> Tejun Heo wrote:
>> ioctl support is tricky to implement because only the ioctl
>> implementation itself knows which memory regions need to be read
>> and/or written. To support this, fuse client can request retry of
>> ioctl specifying memory regions to read and write. Deep copying
>> (nested pointers) can be implemented by retrying multiple times
>> resolving one depth of dereference at a time.
>
> Okay, I'm going to say it... wouldn't it be better to have some kind of
> data structure description language which the userspace can register
> with the kernel to linearize the data on a per-ioctl basis (kind of like
> rpcgen)?

Ah.... funky. If this retry thing is too repulsive, I guess the best
alternative would be directly accessing caller's memory as Miklos suggested.

Thanks.

--
tejun

2008-08-28 18:35:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
>> Hmmm... I was trying to stay within similar operation mechanics as other
>> ops. Directly accessing the caller's memory has performance benefits
>> but that benefit can also be used by reads and writes. So, if we're
>> gonna do direct memory access, maybe doing it in more generic way is a
>> better idea?
>
> On the contrary: playing VM games is going to be slow. I think this
> approach is best suited for generic ioctl support because it
> simplifies the kernel part. I'd hate to add all that complexity to
> the kernel if not absolutely necessary.

Well, it's only 240 lines with good amount of comments and iovec copying
function. The ioctl itself isn't too complex. I'm a bit skeptical
about direct access. It can easily introduce security vulnerabilities
as there really is no way to hold a pid.

Thanks.

--
tejun

2008-08-28 19:09:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Tejun Heo wrote:
>
> Ah.... funky. If this retry thing is too repulsive, I guess the best
> alternative would be directly accessing caller's memory as Miklos suggested.
>

Be careful -- there are some serious dragons there in the presence of
multiple threads.

-hpa

2008-08-28 19:18:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, H. Peter Anvin wrote:
> Tejun Heo wrote:
> >
> > Ah.... funky. If this retry thing is too repulsive, I guess the best
> > alternative would be directly accessing caller's memory as Miklos suggested.
> >
>
> Be careful -- there are some serious dragons there in the presence of
> multiple threads.

OK, it should map /proc/pid/task/tid/mem. Or rather
/proc/tid/task/tid/mem, as the pid (tgid) of the caller is not
currently passed to the filesystem.

Miklos

2008-08-28 19:25:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, Tejun Heo wrote:
> Miklos Szeredi wrote:
> >> Hmmm... I was trying to stay within similar operation mechanics as other
> >> ops. Directly accessing the caller's memory has performance benefits
> >> but that benefit can also be used by reads and writes. So, if we're
> >> gonna do direct memory access, maybe doing it in more generic way is a
> >> better idea?
> >
> > On the contrary: playing VM games is going to be slow. I think this
> > approach is best suited for generic ioctl support because it
> > simplifies the kernel part. I'd hate to add all that complexity to
> > the kernel if not absolutely necessary.
>
> Well, it's only 240 lines with good amount of comments and iovec copying
> function. The ioctl itself isn't too complex. I'm a bit skeptical
> about direct access. It can easily introduce security vulnerabilities
> as there really is no way to hold a pid.

I don't understand. No new vulnerabilities are introduced, since it
would just use existing infrastructure.

Why is it better if the kernel does the copying of memory regions
instructed by the userspace filesystem, than if the userspace
filesystem does that copying itself? I feel they are totally
equivalent, except that the latter needs more complexity in the
kernel.

Miklos

2008-08-28 19:44:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
>> Well, it's only 240 lines with good amount of comments and iovec copying
>> function. The ioctl itself isn't too complex. I'm a bit skeptical
>> about direct access. It can easily introduce security vulnerabilities
>> as there really is no way to hold a pid.
>
> I don't understand. No new vulnerabilities are introduced, since it
> would just use existing infrastructure.
>
> Why is it better if the kernel does the copying of memory regions
> instructed by the userspace filesystem, than if the userspace
> filesystem does that copying itself? I feel they are totally
> equivalent, except that the latter needs more complexity in the
> kernel.

I'm no security expert but it feels pretty dangerous to me. First of
all, there are cases where the calling process can exit before the
userland FUSE is finished with an operation, so it might not be always
possible for the FUSE client to tell the PID it got is the correct one.

Another thing is that as it currently stands, the kernel side FUSE
implementation forms a nice safety net taking responsibility of most
security concerns and insulating the mistakes the client may make.
Letting userland client to access and possibly modify the caller's
memory directly weakens that insulation.

Pushing memory access to userland feels a bit too risky to me. There
seem to be too many loose components in security sensitive path and I
have a nagging feeling that someone will come up with something we can't
think of at the moment.

Thanks.

--
tejun

2008-08-28 20:03:06

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, Tejun Heo wrote:
> Miklos Szeredi wrote:
> >> Well, it's only 240 lines with good amount of comments and iovec copying
> >> function. The ioctl itself isn't too complex. I'm a bit skeptical
> >> about direct access. It can easily introduce security vulnerabilities
> >> as there really is no way to hold a pid.
> >
> > I don't understand. No new vulnerabilities are introduced, since it
> > would just use existing infrastructure.
> >
> > Why is it better if the kernel does the copying of memory regions
> > instructed by the userspace filesystem, than if the userspace
> > filesystem does that copying itself? I feel they are totally
> > equivalent, except that the latter needs more complexity in the
> > kernel.
>
> I'm no security expert but it feels pretty dangerous to me. First of
> all, there are cases where the calling process can exit before the
> userland FUSE is finished with an operation, so it might not be always
> possible for the FUSE client to tell the PID it got is the correct one.

OK, I grant this one. But then it's easy to protect against by
getting a ref on the task (or just the task ID, I don't know if that's
possible) for the duration of the ioctl.

> Another thing is that as it currently stands, the kernel side FUSE
> implementation forms a nice safety net taking responsibility of most
> security concerns and insulating the mistakes the client may make.
> Letting userland client to access and possibly modify the caller's
> memory directly weakens that insulation.

The same stupid mistakes can be done by giving the wrong instructions
to the kernel about what to modify, thus thrashing the calling
process.

> Pushing memory access to userland feels a bit too risky to me. There
> seem to be too many loose components in security sensitive path and I
> have a nagging feeling that someone will come up with something we can't
> think of at the moment.

I don't see the difference. You have to be careful either way, it's
not possible to do ioctls safely as the rest of fuse unfortunately.
This obviously also means, that it's impossible to run the filesystem
as an unprivileged user, as it has to have access to the whole address
space of the calling process either way (or ioctls have to be
restricted somehow).

Miklos

2008-08-28 20:21:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
> On Thu, 28 Aug 2008, H. Peter Anvin wrote:
>> Tejun Heo wrote:
>>> Ah.... funky. If this retry thing is too repulsive, I guess the best
>>> alternative would be directly accessing caller's memory as Miklos suggested.
>>>
>> Be careful -- there are some serious dragons there in the presence of
>> multiple threads.
>
> OK, it should map /proc/pid/task/tid/mem. Or rather
> /proc/tid/task/tid/mem, as the pid (tgid) of the caller is not
> currently passed to the filesystem.
>

Uhm, no. You can still have it change underneath you as long as you
have any thread of execution with access to the same memory.

This is *hard* to get right, and we screw this up in the kernel with
painful regularity. The throught of having user-space processes, which
don't have access to the kernel locking primitives and functions like
copy_from_user() dealing with this stuff scares me crazy.

That is why I'm suggesting using an in-kernel linearizer.

-hpa

2008-08-28 20:56:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, H. Peter Anvin wrote:
> This is *hard* to get right, and we screw this up in the kernel with
> painful regularity. The throught of having user-space processes, which
> don't have access to the kernel locking primitives and functions like
> copy_from_user() dealing with this stuff scares me crazy.

What issues exactly are you thinking of?

> That is why I'm suggesting using an in-kernel linearizer.

Lots of complexity, ugh... Even Tejun's current scheme is better IMO.

Miklos

2008-08-28 21:06:35

by Alan

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

> Well, it's only 240 lines with good amount of comments and iovec copying
> function. The ioctl itself isn't too complex. I'm a bit skeptical
> about direct access. It can easily introduce security vulnerabilities
> as there really is no way to hold a pid.

Agreed entirely. If and only if there is a future problem with
performance someone can introduce a method for the client to pre-describe
certain ioctls in terms of number/length/read/write.

Alan

2008-08-28 21:27:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
> On Thu, 28 Aug 2008, H. Peter Anvin wrote:
>> This is *hard* to get right, and we screw this up in the kernel with
>> painful regularity. The throught of having user-space processes, which
>> don't have access to the kernel locking primitives and functions like
>> copy_from_user() dealing with this stuff scares me crazy.
>
> What issues exactly are you thinking of?

Memory changing underneath you. It can be dealt with by very careful
sequencing only.

>> That is why I'm suggesting using an in-kernel linearizer.
>
> Lots of complexity, ugh... Even Tejun's current scheme is better IMO.

And then you get *no* privilege separation, for one thing, so why even
bother doing it in userspace?

-hpa

2008-08-29 02:20:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
>> Another thing is that as it currently stands, the kernel side FUSE
>> implementation forms a nice safety net taking responsibility of most
>> security concerns and insulating the mistakes the client may make.
>> Letting userland client to access and possibly modify the caller's
>> memory directly weakens that insulation.
>
> The same stupid mistakes can be done by giving the wrong instructions
> to the kernel about what to modify, thus thrashing the calling
> process.

Yeah, ioctl, by design, has that potential. Whether it's implemented in
kernel or userland, it's gonna access arbitrary memory regions from deep
down the implementation, and it can corrupt things, but by giving the
responsibility to move data to kernel part of FUSE, we can at least
guarantee that it's only gonna ruin the calling address space even when
it screws up and when that happens it will be easy to track down by
tracing the communication between the kernel and FUSE client.

>> Pushing memory access to userland feels a bit too risky to me. There
>> seem to be too many loose components in security sensitive path and I
>> have a nagging feeling that someone will come up with something we can't
>> think of at the moment.
>
> I don't see the difference. You have to be careful either way, it's
> not possible to do ioctls safely as the rest of fuse unfortunately.
> This obviously also means, that it's impossible to run the filesystem
> as an unprivileged user, as it has to have access to the whole address
> space of the calling process either way (or ioctls have to be
> restricted somehow).

I'm not worried about the client accessing wrong memory regions or even
corrupting it. It's pointless to try to protect against that. From the
calling process's POV, it runs the same risk whether it calls an
in-kernel ioctl or a CUSE one and FUSE already has sufficient protection
against allowing unprivileged FS implementation to serve other users.

What I'm worried about is the possibility of CUSE client being able to
break out of that privilege protection which is currently ensured by the
kernel. Also, what about containers? How would it work then?

Thanks.

--
tejun

2008-08-29 07:32:42

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Thu, 28 Aug 2008, H. Peter Anvin wrote:
> Miklos Szeredi wrote:
> > On Thu, 28 Aug 2008, H. Peter Anvin wrote:
> >> This is *hard* to get right, and we screw this up in the kernel with
> >> painful regularity. The throught of having user-space processes, which
> >> don't have access to the kernel locking primitives and functions like
> >> copy_from_user() dealing with this stuff scares me crazy.
> >
> > What issues exactly are you thinking of?
>
> Memory changing underneath you. It can be dealt with by very careful
> sequencing only.

That's just handwaving. Apps don't normally change memory under
system call arguments. Or if they do the only thing we ever guarantee
is that the thing won't blow up in a big fireball.

I don't see how getting the data from userspace is different from
doing the same in the kernel. Care to explain?

> >> That is why I'm suggesting using an in-kernel linearizer.
> >
> > Lots of complexity, ugh... Even Tejun's current scheme is better IMO.
>
> And then you get *no* privilege separation, for one thing, so why even
> bother doing it in userspace?

And with ioctls (at least if the filesystem supplies the linearizer
instructions) you simply _cannot_ get proper privilege separation.
Generic ioctl support will always be a privileged thing.

Alternatively we can restrict ioctls. Most ioctls conform to some
convention for encoding the format (size/in/out) in the command, no?

Miklos

2008-08-29 07:59:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Fri, 29 Aug 2008, Tejun Heo wrote:
> I'm not worried about the client accessing wrong memory regions or even
> corrupting it. It's pointless to try to protect against that. From the
> calling process's POV, it runs the same risk whether it calls an
> in-kernel ioctl or a CUSE one and FUSE already has sufficient protection
> against allowing unprivileged FS implementation to serve other users.

Yes and no. Fuse allows this protection to be relaxed
(-oallow_other), because it does provide quite good privilege
separation. This ioctl thing breaks that, so we should disable ioctls
with 'allow_other' or require the filesystem to be privileged. But
the latter is not easy because mount(2) is always privileged, we don't
know if the process calling fusermount was privileged or not.

So, your current patch actually _introduces_ a security vulnerability
with the 'allow_other' mount option.

> What I'm worried about is the possibility of CUSE client being able to
> break out of that privilege protection which is currently ensured by the
> kernel.

What do you call client? If you mean the app using the char dev, then
I don't see how it could break out of any protection.

> Also, what about containers? How would it work then?

Dunno. Isn't there some transformation of pids going on, so that the
global namespace can access pids in all containers but under a
different alias? I do hope somethinig like this works, otherwise it's
not only fuse that will break.

Miklos

2008-08-29 08:13:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
> On Fri, 29 Aug 2008, Tejun Heo wrote:
>> I'm not worried about the client accessing wrong memory regions or even
>> corrupting it. It's pointless to try to protect against that. From the
>> calling process's POV, it runs the same risk whether it calls an
>> in-kernel ioctl or a CUSE one and FUSE already has sufficient protection
>> against allowing unprivileged FS implementation to serve other users.
>
> Yes and no. Fuse allows this protection to be relaxed
> (-oallow_other), because it does provide quite good privilege
> separation. This ioctl thing breaks that, so we should disable ioctls
> with 'allow_other' or require the filesystem to be privileged. But
> the latter is not easy because mount(2) is always privileged, we don't
> know if the process calling fusermount was privileged or not.
>
> So, your current patch actually _introduces_ a security vulnerability
> with the 'allow_other' mount option.

Ah.. right. allow_other. Yeah, restricting ioctl implementation only
to root or !allow_other sounds like a good idea.

>> What I'm worried about is the possibility of CUSE client being able to
>> break out of that privilege protection which is currently ensured by the
>> kernel.
>
> What do you call client? If you mean the app using the char dev, then
> I don't see how it could break out of any protection.

I first used 'server' for userland [FC]USE server but then I noticed
there were places in FUSE they were referred as clients so now I use
'client' for those and call the app using the FUSE fs the 'caller'.
What are the established terms?

Anyways, doing it directly from the server (or is it client) opens up a
lot of new possibilities to screw up and I'd really much prefer staying
in similar ballpark with other operations. Maybe we can restrict it to
two stages (query size & transfer) and linear consecutive ranges but
then again adding retry doesn't contribute too much to the complexity.
Oh.. and BTW, the in-ioctl length coding is not used universally, so it
can't be depended upon.

>> Also, what about containers? How would it work then?
>
> Dunno. Isn't there some transformation of pids going on, so that the
> global namespace can access pids in all containers but under a
> different alias? I do hope somethinig like this works, otherwise it's
> not only fuse that will break.

I'm not sure either. Any idea who we should be asking about it?

Thanks.

--
tejun

2008-08-29 08:30:39

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Fri, 29 Aug 2008, Tejun Heo wrote:
> I first used 'server' for userland [FC]USE server but then I noticed
> there were places in FUSE they were referred as clients so now I use
> 'client' for those and call the app using the FUSE fs the 'caller'.
> What are the established terms?

Umm

- userspace filesystem
- filesystem daemon
- filesystem process
- server

Yes it's also a client of the fuse device, but that term is confusing.

> Anyways, doing it directly from the server (or is it client) opens up a
> lot of new possibilities to screw up and I'd really much prefer staying
> in similar ballpark with other operations. Maybe we can restrict it to
> two stages (query size & transfer) and linear consecutive ranges but
> then again adding retry doesn't contribute too much to the complexity.
> Oh.. and BTW, the in-ioctl length coding is not used universally, so it
> can't be depended upon.

I know it's not universal, some horrors I've seen in the old wireless
interfaces. The question is: do we want to support such "extended"
ioctls? For exmaple, does OSS have non-conformant ioctls?

> >> Also, what about containers? How would it work then?
> >
> > Dunno. Isn't there some transformation of pids going on, so that the
> > global namespace can access pids in all containers but under a
> > different alias? I do hope somethinig like this works, otherwise it's
> > not only fuse that will break.
>
> I'm not sure either. Any idea who we should be asking about it?

Serge Hallyn and Eric Biederman.

Miklos

2008-08-29 09:05:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Miklos Szeredi wrote:
> On Fri, 29 Aug 2008, Tejun Heo wrote:
>> I first used 'server' for userland [FC]USE server but then I noticed
>> there were places in FUSE they were referred as clients so now I use
>> 'client' for those and call the app using the FUSE fs the 'caller'.
>> What are the established terms?
>
> Umm
>
> - userspace filesystem
> - filesystem daemon
> - filesystem process
> - server
>
> Yes it's also a client of the fuse device, but that term is confusing.

Okay, will do s/client/server/g

>> Anyways, doing it directly from the server (or is it client) opens up a
>> lot of new possibilities to screw up and I'd really much prefer staying
>> in similar ballpark with other operations. Maybe we can restrict it to
>> two stages (query size & transfer) and linear consecutive ranges but
>> then again adding retry doesn't contribute too much to the complexity.
>> Oh.. and BTW, the in-ioctl length coding is not used universally, so it
>> can't be depended upon.
>
> I know it's not universal, some horrors I've seen in the old wireless
> interfaces. The question is: do we want to support such "extended"
> ioctls? For exmaple, does OSS have non-conformant ioctls?

OSS ioctls are all pretty simple and I think they all use the proper
encoding. For the question, my answer would be yes (naturally). It
will suck later when implementing some other device only to find out
that there's this one ioctl that needs to dereference a pointer but
there's no supported way to do it but everything else works.

I don't think the performance or the complexity of specific ioctl
implementation is of the determining importance as long as it can be
made to work with minimal impact on the rest of the whole thing, so
the current retry implementation.

>>>> Also, what about containers? How would it work then?
>>> Dunno. Isn't there some transformation of pids going on, so that the
>>> global namespace can access pids in all containers but under a
>>> different alias? I do hope somethinig like this works, otherwise it's
>>> not only fuse that will break.
>> I'm not sure either. Any idea who we should be asking about it?
>
> Serge Hallyn and Eric Biederman.

Okay, cc'd both. Hello, Eric Biederman, Serge Hallyn. For
implementing ioctl in FUSE, it's suggested that to access the address
space of the caller directly from the FUSE server using its pid via
/proc/pid/mem (or task/tid/mem). It's most likely that the calling
process's tid will be used. As I don't know much about the
containers, I'm not sure how such approach will play out when combined
with containers. Can you enlighten us a bit? W/o containers, it will
look like the following.


FUSE ----------------
^ |
| | kernel
------ ioctl ----------- /dev/fuse ------------
| | userland
| v
--------------- -------------
| caller | | FUSE server |---> reads and writes
| with tid CTID | | | /proc/PID/task/TID/mem
--------------- -------------

The FUSE server gets task->pid. IIUC, if the FUSE server is not in a
container, task->pid should work fine whether the caller is in
container or not, right? And if the FUSE server is in a container,
it's hell lot more complex and FUSE may have to map task->pid to what
FUSE server would know if possible?

Thanks.

--
tejun

2008-08-29 11:37:55

by Roger Willcocks

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCH 5/7] FUSE: implement ioctl support

One thing to bear in mind is the possibility of 32-bit client programs
running on a 64-bit platform, so your user-space driver will need be be
aware of the flavour of the client, and parse the ioctl arguments (and
in particular, pointers) appropriately.

That's true of a kernel implementation too, of course.

--
Roger


2008-08-29 11:56:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCH 5/7] FUSE: implement ioctl support

Roger Willcocks wrote:
> One thing to bear in mind is the possibility of 32-bit client programs
> running on a 64-bit platform, so your user-space driver will need be be
> aware of the flavour of the client, and parse the ioctl arguments (and
> in particular, pointers) appropriately.
>
> That's true of a kernel implementation too, of course.

Yeap, for 32bit client on 64bit machine, FUSE_IOCTL_COMPAT will be set.
For OSS emulation, it doesn't make any difference tho. It's the same
for both word sizes.

Thanks.

--
tejun

2008-08-29 19:24:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Tejun Heo <[email protected]> writes:

> Miklos Szeredi wrote:
>> On Fri, 29 Aug 2008, Tejun Heo wrote:
>>> I first used 'server' for userland [FC]USE server but then I noticed
>>> there were places in FUSE they were referred as clients so now I use
>>> 'client' for those and call the app using the FUSE fs the 'caller'.
>>> What are the established terms?
>>
>> Umm
>>
>> - userspace filesystem
>> - filesystem daemon
>> - filesystem process
>> - server
>>
>> Yes it's also a client of the fuse device, but that term is confusing.
>
> Okay, will do s/client/server/g
>
>>> Anyways, doing it directly from the server (or is it client) opens up a
>>> lot of new possibilities to screw up and I'd really much prefer staying
>>> in similar ballpark with other operations. Maybe we can restrict it to
>>> two stages (query size & transfer) and linear consecutive ranges but
>>> then again adding retry doesn't contribute too much to the complexity.
>>> Oh.. and BTW, the in-ioctl length coding is not used universally, so it
>>> can't be depended upon.
>>
>> I know it's not universal, some horrors I've seen in the old wireless
>> interfaces. The question is: do we want to support such "extended"
>> ioctls? For exmaple, does OSS have non-conformant ioctls?
>
> OSS ioctls are all pretty simple and I think they all use the proper
> encoding. For the question, my answer would be yes (naturally). It
> will suck later when implementing some other device only to find out
> that there's this one ioctl that needs to dereference a pointer but
> there's no supported way to do it but everything else works.
>
> I don't think the performance or the complexity of specific ioctl
> implementation is of the determining importance as long as it can be
> made to work with minimal impact on the rest of the whole thing, so
> the current retry implementation.
>
>>>>> Also, what about containers? How would it work then?
>>>> Dunno. Isn't there some transformation of pids going on, so that the
>>>> global namespace can access pids in all containers but under a
>>>> different alias? I do hope somethinig like this works, otherwise it's
>>>> not only fuse that will break.
>>> I'm not sure either. Any idea who we should be asking about it?
>>
>> Serge Hallyn and Eric Biederman.
>
> Okay, cc'd both. Hello, Eric Biederman, Serge Hallyn. For
> implementing ioctl in FUSE, it's suggested that to access the address
> space of the caller directly from the FUSE server using its pid via
> /proc/pid/mem (or task/tid/mem). It's most likely that the calling
> process's tid will be used. As I don't know much about the
> containers, I'm not sure how such approach will play out when combined
> with containers. Can you enlighten us a bit? W/o containers, it will
> look like the following.
>
>
> FUSE ----------------
> ^ |
> | | kernel
> ------ ioctl ----------- /dev/fuse ------------
> | | userland
> | v
> --------------- -------------
> | caller | | FUSE server |---> reads and writes
> | with tid CTID | | | /proc/PID/task/TID/mem
> --------------- -------------
>
> The FUSE server gets task->pid. IIUC, if the FUSE server is not in a
> container, task->pid should work fine whether the caller is in
> container or not, right? And if the FUSE server is in a container,
> it's hell lot more complex and FUSE may have to map task->pid to what
> FUSE server would know if possible?

Implementation wise it is not too bad.

FUSE ----------------
pid = get_pid(task_tid(current))
^ |
| | kernel
pid_vnr(pid)
------ ioctl ----------- /dev/fuse ------------
| | userland
| v
--------------- -------------
| caller | | FUSE server |---> reads and writes
| with tid CTID | | | /proc/PID/task/TID/mem
--------------- -------------

However it is a largely an insane idea.
- Write is not implemented for /proc/PID/task/TID/mem
- It would be better if the kernel handed you back a file descriptor to the
other process memory rather than you having to generate one.
- To access /proc/PID/task/TID/mem you need to have CAP_PTRACE.
- This seems to allow for random ioctls. With the compat_ioctl thing we have
largely stomped on that idea. So you should only need to deal with well
defined ioctls. At which point why do you need to directly access the memory
of another process.

So why not just only support well defined ioctls and serialize them in the kernel
and allow the receiving process to deserialize them?

That would allow all of this to happen with a non-privileged server which
makes the functionality much more useful.

Given the pain it is to maintain ioctls I would be very surprised if we wanted
to open up that pandoras box even wider by allowing arbitrary user space
processes to support random ioctls. How would you do 32/64bit support
and the like?

Eric

2008-08-29 19:48:16

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

On Friday 29 August 2008, Eric W. Biederman wrote:
> However it is a largely an insane idea.

I fully agree.

> Given the pain it is to maintain ioctls I would be very surprised if we wanted
> to open up that pandoras box even wider by allowing arbitrary user space
> processes to support random ioctls. ?How would you do 32/64bit support
> and the like?

I think that is not too much of a problem: Just like the file operations in
the kernel have two callbacks (ioctl and compat_ioctl), you need to provide
both operations from user space if there is a difference between them, or
at least a stub otherwise.

Arnd <><

2008-08-30 11:42:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] FUSE: implement ioctl support

Hello, Eric.

Eric W. Biederman wrote:
> Implementation wise it is not too bad.
>
> FUSE ----------------
> pid = get_pid(task_tid(current))
> ^ |
> | | kernel
> pid_vnr(pid)
> ------ ioctl ----------- /dev/fuse ------------
> | | userland
> | v
> --------------- -------------
> | caller | | FUSE server |---> reads and writes
> | with tid CTID | | | /proc/PID/task/TID/mem
> --------------- -------------

Ah hah.. thanks for the info.

> However it is a largely an insane idea.
> - Write is not implemented for /proc/PID/task/TID/mem
> - It would be better if the kernel handed you back a file descriptor to the
> other process memory rather than you having to generate one.

Yeah, that would at least ensure we're not meddling with the wrong
address space.

> - To access /proc/PID/task/TID/mem you need to have CAP_PTRACE.
> - This seems to allow for random ioctls. With the compat_ioctl
> thing we have largely stomped on that idea. So you should only
> need to deal with well defined ioctls. At which point why do you
> need to directly access the memory of another process.

What do you mean by "compat_ioctl" thing? We still allow random
memory accesses to ioctls.

> So why not just only support well defined ioctls and serialize them
> in the kernel and allow the receiving process to deserialize them?

Yeap, that certainly is an option.

> That would allow all of this to happen with a non-privileged server which
> makes the functionality much more useful.

When !allow_others, there's no reason to prevent one user's FUSE
server to corruption the user's processes. Everyone has the innate
right to shoot him/herself in the foot.

> Given the pain it is to maintain ioctls I would be very surprised if
> we wanted to open up that pandoras box even wider by allowing
> arbitrary user space processes to support random ioctls. How would
> you do 32/64bit support and the like?

32/64bit is not really much problem tho. The kernel just flags the
request as-such and let the userland deal with it.

Miklos, I think Eric's reply pretty much blew the direct access idea
out of water, which leaves us with three options.

1. Support only the proper ioctls. This would be the simplest but
someone might run into its limitations in the future. Given that
the most probable use cases for CUSE are replacing in-kernel legacy
code or faking some proprietary interfaces which are likely to be
somewhat weird, the chance isn't too low IMHO.

2. Do what hpa suggested, that is a simple language to describe the
needed data regions. To me, it seems like an overkill. If there's
already such infrastructure in kernel, I would be happy to
piggyback on it but separate serialization language for FUSE ioctl
support seems extreme.

3. And my favorite (obviously), keep the current implementation. Ugly
it may look but given the horror of the problem at hand I think it
actually hits a pretty good balance between interface peculiarness
and complexity that a more elegant solution would require. It's
only ~250 lines of code which can support any ioctl. Complex ones
will have to go through a few duplicate copy operations but I don't
really think those are gonna hurt anyone when the whole thing is
being relayed to userland.

So, what do you think?

Thanks.

--
tejun

2008-10-14 08:24:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello, Miklos.

I'm about ready to send the next round of this patchset w/ mmap support,
which I'm sure gonna stir up some discussion. :-) Several things I wanna
hear your comments about.

1. I'll try to incorporate all other comments here but regarding ioctl I
don't think we've reached any better solution, so I'm sticking with the
original one.

2. You told me that the version branching in the userland library wasn't
necessary. Can you explain to me when FUSE version bumping is necessary?

3. Any other things on you mind?

Thanks.

--
tejun

2008-10-14 09:37:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hi Tejun,

On Tue, 14 Oct 2008, Tejun Heo wrote:
> Hello, Miklos.
>
> I'm about ready to send the next round of this patchset w/ mmap support,
> which I'm sure gonna stir up some discussion. :-) Several things I wanna
> hear your comments about.

Great. Just yesterday evening I was looking through your patches to
see which ones I can submit for 2.6.28, and I've already applied

0001-FUSE-add-include-protectors.patch
0003-FUSE-implement-nonseekable-open.patch

Comments about the others:

0002-FUSE-pass-nonblock-flag-to-client.patch

this is not needed, f_flags are already passed to userspace for read
and write.

0004-FUSE-implement-direct-lseek-support.patch

this is trickier to get the interface right I think. If we want to
allow filesystems to implement a custom lseek, then we also want them
to keep track of the file position, which means we must differentiate
between a write(2) and a pwrite(2) and similarly for reads. AFAICS
this isn't needed for CUSE so we can leave this to later.

0005-FUSE-implement-ioctl-support.patch

See below.

0006-FUSE-implement-unsolicited-notification.patch
0007-FUSE-implement-poll-support.patch

This would be nice, but... I don't really like the fact that it uses
the file handle. Could we have a separate "poll handle" that is
returned in the POLL reply?

> 1. I'll try to incorporate all other comments here but regarding ioctl I
> don't think we've reached any better solution, so I'm sticking with the
> original one.

I still got qualms about this ioctl thing. One is the security
aspect, but that could be dealt with. The other is that I really
really don't want people to start implementing new custom ioctls for
their filesystems, as I think that way lies madness. We could limit
ioctls to CUSE and that would be fine with me. Or for non-CUSE users
we could enforce the "standard" format where the type and length is
encoded in the command number.

I don't have any problems with the iterative way you implemented
ioctls. We just need some additional restrictions to the current
implementation, I think.

> 2. You told me that the version branching in the userland library wasn't
> necessary. Can you explain to me when FUSE version bumping is necessary?

The version number has to be bumped anyway. But if you are only
adding new functions to the end of fuse_operations and
fuse_lowlevel_ops, then the interface can handle that, without needing
new compatibility functions.

> 3. Any other things on you mind?

One other thing I was thinking about is that do we really need
emulated char devices to be char devices? What I mean is, what would
happen if instead of a char device /dev/dsp would be a regular file
mounted on /dev/dsp (which implements all the necessary interfaces:
ioctls, poll, etc)?

Thanks,
Miklos

2008-10-14 12:26:07

by Szabolcs Szakacsits

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations


On Tue, 14 Oct 2008, Miklos Szeredi wrote:

> I still got qualms about this ioctl thing. One is the security
> aspect, but that could be dealt with. The other is that I really
> really don't want people to start implementing new custom ioctls for
> their filesystems, as I think that way lies madness. We could limit
> ioctls to CUSE and that would be fine with me. Or for non-CUSE users
> we could enforce the "standard" format where the type and length is
> encoded in the command number.

Is this enough to support the most useful/requested file system ioctls,
like FIEMAP?

Thanks,
Szaka

--
NTFS-3G: http://ntfs-3g.org

2008-10-14 12:44:16

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

On Tue, 14 Oct 2008, Szabolcs Szakacsits wrote:
> On Tue, 14 Oct 2008, Miklos Szeredi wrote:
>
> > I still got qualms about this ioctl thing. One is the security
> > aspect, but that could be dealt with. The other is that I really
> > really don't want people to start implementing new custom ioctls for
> > their filesystems, as I think that way lies madness. We could limit
> > ioctls to CUSE and that would be fine with me. Or for non-CUSE users
> > we could enforce the "standard" format where the type and length is
> > encoded in the command number.
>
> Is this enough to support the most useful/requested file system ioctls,
> like FIEMAP?

Looking at FIEMAP patches...

No, FIEMAP doesn't seem fit in that category. But FIEMAP has a
specific handler, so it would not get as far as the generic ioctl
handler anyway. If we need FIEMAP support in fuse, that will need to
be handled independently of the generic ioctls (just like FIBMAP).

Miklos

2008-11-12 08:41:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello, Miklos. Sorry about the extra long delay. I was buried alive
under bugs and regressions with SLE11 release date nearing and all.

Miklos Szeredi wrote:
> Great. Just yesterday evening I was looking through your patches to
> see which ones I can submit for 2.6.28, and I've already applied
>
> 0001-FUSE-add-include-protectors.patch
> 0003-FUSE-implement-nonseekable-open.patch

Thanks.

> Comments about the others:
>
> 0002-FUSE-pass-nonblock-flag-to-client.patch
>
> this is not needed, f_flags are already passed to userspace for read
> and write.

Hmmm... I'll try to find out whether I can use f_flags. There was
something that prevented it from working properly. I'll dig.

> 0004-FUSE-implement-direct-lseek-support.patch
>
> this is trickier to get the interface right I think. If we want to
> allow filesystems to implement a custom lseek, then we also want them
> to keep track of the file position, which means we must differentiate
> between a write(2) and a pwrite(2) and similarly for reads. AFAICS
> this isn't needed for CUSE so we can leave this to later.

Read/write already passes @offset, so the only thing required is an
extra flag there. I mainly wanted a way for a CUSE server to veto lseek
with proper error and still think it's better to have this as we don't
really know what wacky users are out there. What do you think about an
extra flag?

> 0005-FUSE-implement-ioctl-support.patch
>
> See below.
>
> 0006-FUSE-implement-unsolicited-notification.patch
> 0007-FUSE-implement-poll-support.patch
>
> This would be nice, but... I don't really like the fact that it uses
> the file handle. Could we have a separate "poll handle" that is
> returned in the POLL reply?

Sure thing.

>> 1. I'll try to incorporate all other comments here but regarding ioctl I
>> don't think we've reached any better solution, so I'm sticking with the
>> original one.
>
> I still got qualms about this ioctl thing. One is the security
> aspect, but that could be dealt with. The other is that I really
> really don't want people to start implementing new custom ioctls for
> their filesystems, as I think that way lies madness. We could limit
> ioctls to CUSE and that would be fine with me. Or for non-CUSE users
> we could enforce the "standard" format where the type and length is
> encoded in the command number.

For now, I'll limit ioctl to CUSE. Hmmm... Yeah, limiting ioctl to
well-formatted ones sounds like a good idea.

> I don't have any problems with the iterative way you implemented
> ioctls. We just need some additional restrictions to the current
> implementation, I think.

Cool.

>> 2. You told me that the version branching in the userland library wasn't
>> necessary. Can you explain to me when FUSE version bumping is necessary?
>
> The version number has to be bumped anyway. But if you are only
> adding new functions to the end of fuse_operations and
> fuse_lowlevel_ops, then the interface can handle that, without needing
> new compatibility functions.

Alright.

>> 3. Any other things on you mind?
>
> One other thing I was thinking about is that do we really need
> emulated char devices to be char devices? What I mean is, what would
> happen if instead of a char device /dev/dsp would be a regular file
> mounted on /dev/dsp (which implements all the necessary interfaces:
> ioctls, poll, etc)?

For most it would work, I suppose, but there are all sorts of wonky
users out in the wild (and quite a few that we don't have source access
to) and different configurations, so I think it's better to appear as
proper character device if it is a character device. It will also help
udev and other desktop thingies deal with devices implemented in userland.

Thanks.

--
tejun

2008-11-12 09:14:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Wed, Nov 12, 2008 at 05:41:18PM +0900, Tejun Heo wrote:
> > emulated char devices to be char devices? What I mean is, what would
> > happen if instead of a char device /dev/dsp would be a regular file
> > mounted on /dev/dsp (which implements all the necessary interfaces:
> > ioctls, poll, etc)?
>
> For most it would work, I suppose, but there are all sorts of wonky
> users out in the wild (and quite a few that we don't have source access
> to) and different configurations, so I think it's better to appear as
> proper character device if it is a character device. It will also help
> udev and other desktop thingies deal with devices implemented in userland.

Note that mounting a char device on /dev/dsp is exactly the same as
mounting a regular file on /dev/dsp - the only difference is what you
set i_mode to.

2008-11-12 09:30:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Christoph Hellwig wrote:
> On Wed, Nov 12, 2008 at 05:41:18PM +0900, Tejun Heo wrote:
>>> emulated char devices to be char devices? What I mean is, what would
>>> happen if instead of a char device /dev/dsp would be a regular file
>>> mounted on /dev/dsp (which implements all the necessary interfaces:
>>> ioctls, poll, etc)?
>> For most it would work, I suppose, but there are all sorts of wonky
>> users out in the wild (and quite a few that we don't have source access
>> to) and different configurations, so I think it's better to appear as
>> proper character device if it is a character device. It will also help
>> udev and other desktop thingies deal with devices implemented in userland.
>
> Note that mounting a char device on /dev/dsp is exactly the same as
> mounting a regular file on /dev/dsp - the only difference is what you
> set i_mode to.

A difference is how the device is located. With proper character device
emulation, any char device node on any filesystem would work. Not sure
how relevant that would be tho.

Thanks.

--
tejun

2008-11-12 09:36:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Wed, 12 Nov 2008, Tejun Heo wrote:
> Christoph Hellwig wrote:
> > On Wed, Nov 12, 2008 at 05:41:18PM +0900, Tejun Heo wrote:
> >>> emulated char devices to be char devices? What I mean is, what would
> >>> happen if instead of a char device /dev/dsp would be a regular file
> >>> mounted on /dev/dsp (which implements all the necessary interfaces:
> >>> ioctls, poll, etc)?
> >> For most it would work, I suppose, but there are all sorts of wonky
> >> users out in the wild (and quite a few that we don't have source access
> >> to) and different configurations, so I think it's better to appear as
> >> proper character device if it is a character device. It will also help
> >> udev and other desktop thingies deal with devices implemented in userland.
> >
> > Note that mounting a char device on /dev/dsp is exactly the same as
> > mounting a regular file on /dev/dsp - the only difference is what you
> > set i_mode to.
>
> A difference is how the device is located. With proper character device
> emulation, any char device node on any filesystem would work. Not sure
> how relevant that would be tho.

Yeah, I think it would be a bit hacky to lie about being a device with
a specific major/minor number, when in fact we just set those in the
stat fields, and the real char device would return ENODEV or implement
something else.

So I'd prefer the CUSE approach if we want the file to actually look
like a device.

Miklos

2008-11-12 10:01:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Wed, 12 Nov 2008, Tejun Heo wrote:
> > Comments about the others:
> >
> > 0002-FUSE-pass-nonblock-flag-to-client.patch
> >
> > this is not needed, f_flags are already passed to userspace for read
> > and write.
>
> Hmmm... I'll try to find out whether I can use f_flags. There was
> something that prevented it from working properly. I'll dig.

Support for this was missing from libfuse, but now I fixed that in the
CVS version.

> > 0004-FUSE-implement-direct-lseek-support.patch
> >
> > this is trickier to get the interface right I think. If we want to
> > allow filesystems to implement a custom lseek, then we also want them
> > to keep track of the file position, which means we must differentiate
> > between a write(2) and a pwrite(2) and similarly for reads. AFAICS
> > this isn't needed for CUSE so we can leave this to later.
>
> Read/write already passes @offset, so the only thing required is an
> extra flag there. I mainly wanted a way for a CUSE server to veto lseek
> with proper error and still think it's better to have this as we don't
> really know what wacky users are out there. What do you think about an
> extra flag?

OK, but that's gonna involve a fair bit of API churn, and I'm not sure
it's worth it at this stage. If this is not needed for the OSS
emulation, I think we should postpone it.

Thanks,
Miklos

2008-11-12 10:25:27

by Mike Hommey

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

On Wed, Nov 12, 2008 at 10:36:25AM +0100, Miklos Szeredi <[email protected]> wrote:
> On Wed, 12 Nov 2008, Tejun Heo wrote:
> > Christoph Hellwig wrote:
> > > On Wed, Nov 12, 2008 at 05:41:18PM +0900, Tejun Heo wrote:
> > >>> emulated char devices to be char devices? What I mean is, what would
> > >>> happen if instead of a char device /dev/dsp would be a regular file
> > >>> mounted on /dev/dsp (which implements all the necessary interfaces:
> > >>> ioctls, poll, etc)?
> > >> For most it would work, I suppose, but there are all sorts of wonky
> > >> users out in the wild (and quite a few that we don't have source access
> > >> to) and different configurations, so I think it's better to appear as
> > >> proper character device if it is a character device. It will also help
> > >> udev and other desktop thingies deal with devices implemented in userland.
> > >
> > > Note that mounting a char device on /dev/dsp is exactly the same as
> > > mounting a regular file on /dev/dsp - the only difference is what you
> > > set i_mode to.
> >
> > A difference is how the device is located. With proper character device
> > emulation, any char device node on any filesystem would work. Not sure
> > how relevant that would be tho.
>
> Yeah, I think it would be a bit hacky to lie about being a device with
> a specific major/minor number, when in fact we just set those in the
> stat fields, and the real char device would return ENODEV or implement
> something else.
>
> So I'd prefer the CUSE approach if we want the file to actually look
> like a device.

You're talking about fusd, here.
http://www.circlemud.org/~jelson/software/fusd/

Mike

2008-11-13 05:55:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Miklos Szeredi wrote:
> On Wed, 12 Nov 2008, Tejun Heo wrote:
>>> Comments about the others:
>>>
>>> 0002-FUSE-pass-nonblock-flag-to-client.patch
>>>
>>> this is not needed, f_flags are already passed to userspace for read
>>> and write.
>> Hmmm... I'll try to find out whether I can use f_flags. There was
>> something that prevented it from working properly. I'll dig.
>
> Support for this was missing from libfuse, but now I fixed that in the
> CVS version.

Yeah, right, it can just use fi.flags. 0002 dropped.

>>> 0004-FUSE-implement-direct-lseek-support.patch
>>>
>>> this is trickier to get the interface right I think. If we want to
>>> allow filesystems to implement a custom lseek, then we also want them
>>> to keep track of the file position, which means we must differentiate
>>> between a write(2) and a pwrite(2) and similarly for reads. AFAICS
>>> this isn't needed for CUSE so we can leave this to later.
>> Read/write already passes @offset, so the only thing required is an
>> extra flag there. I mainly wanted a way for a CUSE server to veto lseek
>> with proper error and still think it's better to have this as we don't
>> really know what wacky users are out there. What do you think about an
>> extra flag?
>
> OK, but that's gonna involve a fair bit of API churn, and I'm not sure
> it's worth it at this stage. If this is not needed for the OSS
> emulation, I think we should postpone it.

Alright.

Thanks.

--
tejun

2008-11-13 06:06:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Tejun Heo wrote:
>> I still got qualms about this ioctl thing. One is the security
>> aspect, but that could be dealt with. The other is that I really
>> really don't want people to start implementing new custom ioctls for
>> their filesystems, as I think that way lies madness. We could limit
>> ioctls to CUSE and that would be fine with me. Or for non-CUSE users
>> we could enforce the "standard" format where the type and length is
>> encoded in the command number.
>
> For now, I'll limit ioctl to CUSE. Hmmm... Yeah, limiting ioctl to
> well-formatted ones sounds like a good idea.
>
>> I don't have any problems with the iterative way you implemented
>> ioctls. We just need some additional restrictions to the current
>> implementation, I think.

I've been thinking about this a bit more. What do you think about
putting the following restrictions?

1. FUSE server can only support well-formed ioctls. At the kernel side,
the interfaces remains the same for both FUSE and CUSE but libfuse only
exports well-formed ioctl API.

2. ioctl can only be used by FUSE server running as root (would this be
necessary? I'm not sure. To me it seems all the necessary protections
are already there).

Thanks.

--
tejun

2008-11-13 06:26:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello, Miklos.

Miklos Szeredi wrote:
> 0006-FUSE-implement-unsolicited-notification.patch
> 0007-FUSE-implement-poll-support.patch
>
> This would be nice, but... I don't really like the fact that it uses
> the file handle. Could we have a separate "poll handle" that is
> returned in the POLL reply?

Eh... I replied too early for this. I'm now trying to convert it to its
own handle but there is a rather serious problem. It's usually much
easier to have the entity to be waken up registered before calling
->poll so that ->poll can use the same notification path from ->poll ans
for later.

However, if we allocate poll handle from ->poll and tell it to kernel
via reply, it creates two problem. 1. the entity which is to be waken
up can't be registered prior to calling ->poll as there's nothing to
identify it, 2. the interval from reply write and in-kernel polled
entity registration must be made atomic so that no notification can come
through inbetween. #1 means that ->poll can't call the same
notification path from ->poll itself and #2 means that there needs to be
special provision from dev.c::fuse_dev_write() to
file.c::fuse_file_poll() so that atomicity can be guaranteed. Both of
which can be done but I'm not really sure whether using a separate
handle would be a good idea even with the involved cost.

Why do you think using separate poll handle would be better? And do you
still think the overhead is justifiable?

Thanks.

--
tejun

2008-11-13 11:20:26

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> Tejun Heo wrote:
> >> I still got qualms about this ioctl thing. One is the security
> >> aspect, but that could be dealt with. The other is that I really
> >> really don't want people to start implementing new custom ioctls for
> >> their filesystems, as I think that way lies madness. We could limit
> >> ioctls to CUSE and that would be fine with me. Or for non-CUSE users
> >> we could enforce the "standard" format where the type and length is
> >> encoded in the command number.
> >
> > For now, I'll limit ioctl to CUSE. Hmmm... Yeah, limiting ioctl to
> > well-formatted ones sounds like a good idea.
> >
> >> I don't have any problems with the iterative way you implemented
> >> ioctls. We just need some additional restrictions to the current
> >> implementation, I think.
>
> I've been thinking about this a bit more. What do you think about
> putting the following restrictions?
>
> 1. FUSE server can only support well-formed ioctls. At the kernel side,
> the interfaces remains the same for both FUSE and CUSE but libfuse only
> exports well-formed ioctl API.
>
> 2. ioctl can only be used by FUSE server running as root (would this be
> necessary? I'm not sure. To me it seems all the necessary protections
> are already there).

Not with '-oallow_other'. Consider the case that the caller invoked a
non well formed ioctl, but since there's no way to know this we
allowed the fuse server to tinker with the caller's address space
_as if_ the ioctl was well formed.

So we should always make sure that the server has enough privilege to
read/write the caller's memory, i.e. it can ptrace the caller.

At this point we could allow any ioctls, not just well formed ones.
But I don't want that for a different reason: if the possibility is
there people will find new "innovative" uses for it and just get
themselves into a big mess.

Miklos

2008-11-13 11:30:00

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello,

Miklos Szeredi wrote:
> Not with '-oallow_other'. Consider the case that the caller invoked a
> non well formed ioctl, but since there's no way to know this we
> allowed the fuse server to tinker with the caller's address space
> _as if_ the ioctl was well formed.

Right, allow_other.

> So we should always make sure that the server has enough privilege to
> read/write the caller's memory, i.e. it can ptrace the caller.
>
> At this point we could allow any ioctls, not just well formed ones.
> But I don't want that for a different reason: if the possibility is
> there people will find new "innovative" uses for it and just get
> themselves into a big mess.

I don't really mind people doing strange things in userland as long as
it's safe but you're the maintainer. It's a bit strange to export the
feature only for CUSE, so I'm a little bit hesitant. I wanna make it
useful for both. So, at the kernel level, only well formed for FUSE and
everything goes for CUSE. Does that sound good enough?

--
tejun

2008-11-13 11:48:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> Hello, Miklos.
>
> Miklos Szeredi wrote:
> > 0006-FUSE-implement-unsolicited-notification.patch
> > 0007-FUSE-implement-poll-support.patch
> >
> > This would be nice, but... I don't really like the fact that it uses
> > the file handle. Could we have a separate "poll handle" that is
> > returned in the POLL reply?
>
> Eh... I replied too early for this. I'm now trying to convert it to its
> own handle but there is a rather serious problem. It's usually much
> easier to have the entity to be waken up registered before calling
> ->poll so that ->poll can use the same notification path from ->poll ans
> for later.
>
> However, if we allocate poll handle from ->poll and tell it to kernel
> via reply, it creates two problem. 1. the entity which is to be waken
> up can't be registered prior to calling ->poll as there's nothing to
> identify it, 2. the interval from reply write and in-kernel polled
> entity registration must be made atomic so that no notification can come
> through inbetween. #1 means that ->poll can't call the same
> notification path from ->poll itself and #2 means that there needs to be
> special provision from dev.c::fuse_dev_write() to
> file.c::fuse_file_poll() so that atomicity can be guaranteed. Both of
> which can be done but I'm not really sure whether using a separate
> handle would be a good idea even with the involved cost.

Yeah, I see the problems.

> Why do you think using separate poll handle would be better? And do you
> still think the overhead is justifiable?

Because it would be a change in the semantics of the file handle.
Previously it was just an opaque cookie that the kernel stored for the
filesystem, not making any assumptions about it (like uniqueness).

OK, we can say that if the filesystems wants to implement poll, it has
to make the file handle unique. Also now the filesystem (or
something) has to deal with races between poll notification and
reuse of the file handle (release/open).

With a new poll handle we'd have more room to properly deal with these
without overloading the file handle with extra requirements.

How about this: the poll handle is allocated by the kernel, not by the
filesystem. This guarantees uniqueness, so the filesystem cannot get
this wrong. Releasing the poll handle is still tricky, there could be
various races... only the userspace filesystem knows if it has no
outstanding notificiatons on a poll handle, so the release has to come
after all outstanding notifications have been ack'ed. Something like
this:

(userspace <- kernel)

<- POLL-request(pollhandle) (alloc handle)
-> POLL-reply
...
-> POLL-notification(pollhandle)
<- POLL-ack
...
<- POLL_RELEASE(pollhandle)
-> POLL_RELEASE-reply (free handle)

Thanks,
Miklos

2008-11-13 11:55:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello,

Miklos Szeredi wrote:
>> Why do you think using separate poll handle would be better? And do you
>> still think the overhead is justifiable?
>
> Because it would be a change in the semantics of the file handle.
> Previously it was just an opaque cookie that the kernel stored for the
> filesystem, not making any assumptions about it (like uniqueness).
>
> OK, we can say that if the filesystems wants to implement poll, it has
> to make the file handle unique. Also now the filesystem (or
> something) has to deal with races between poll notification and
> reuse of the file handle (release/open).
>
> With a new poll handle we'd have more room to properly deal with these
> without overloading the file handle with extra requirements.
>
> How about this: the poll handle is allocated by the kernel, not by the
> filesystem. This guarantees uniqueness, so the filesystem cannot get
> this wrong. Releasing the poll handle is still tricky, there could be
> various races... only the userspace filesystem knows if it has no
> outstanding notificiatons on a poll handle, so the release has to come
> after all outstanding notifications have been ack'ed. Something like
> this:
>
> (userspace <- kernel)
>
> <- POLL-request(pollhandle) (alloc handle)
> -> POLL-reply
> ...
> -> POLL-notification(pollhandle)
> <- POLL-ack
> ...
> <- POLL_RELEASE(pollhandle)
> -> POLL_RELEASE-reply (free handle)

Hmm... yeah, allocating handle from kernel should work fine, but I
wouldn't worry about race here. We can just use 64 bit and guarantee
that any handle won't be reused ever.

Thanks.

--
tejun

2008-11-13 11:57:31

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> I don't really mind people doing strange things in userland as long as
> it's safe but you're the maintainer. It's a bit strange to export the
> feature only for CUSE, so I'm a little bit hesitant.

You are starting from the fact that ioctl is a good API. It's not,
it's a bad API, so I don't want to encourage the use of it.

> I wanna make it
> useful for both. So, at the kernel level, only well formed for FUSE and
> everything goes for CUSE. Does that sound good enough?

With additional restrictions for ptraceability yes. But if you just
restrict it to CUSE at first, that's fine by me as well :)

Thanks,
Miklos

2008-11-13 11:58:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> Hmm... yeah, allocating handle from kernel should work fine, but I
> wouldn't worry about race here. We can just use 64 bit and guarantee
> that any handle won't be reused ever.

Right, that sounds perfect.

Thanks,
Miklos

2008-11-13 12:14:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Miklos Szeredi wrote:
> On Thu, 13 Nov 2008, Tejun Heo wrote:
>> I don't really mind people doing strange things in userland as long as
>> it's safe but you're the maintainer. It's a bit strange to export the
>> feature only for CUSE, so I'm a little bit hesitant.
>
> You are starting from the fact that ioctl is a good API. It's not,
> it's a bad API, so I don't want to encourage the use of it.

Heh heh, I don't think ioctl is a good API but it's fun to watch people
screw up themselves. :-P

>> I wanna make it
>> useful for both. So, at the kernel level, only well formed for FUSE and
>> everything goes for CUSE. Does that sound good enough?
>
> With additional restrictions for ptraceability yes. But if you just
> restrict it to CUSE at first, that's fine by me as well :)

It actually wasn't a big change. Now only well formed ioctls are
allowed for FUSE server and kernel will dispatch the correct amount of
input data and feed back the correct amount of output data without any
retry. With proper massaging of FUSE ioctl API, ioctl, at least for
FUSE, will be very straight forward.

Thanks.

--
tejun

2008-11-13 12:34:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:

> Hmm... yeah, allocating handle from kernel should work fine, but I
> wouldn't worry about race here. We can just use 64 bit and guarantee
> that any handle won't be reused ever.

Btw, the protocol could be simplified even more by getting rid of all
acknowledgements:

<- POLL-request(pollhandle) (alloc handle)
...
-> POLL-notification(pollhandle)
...
<- POLL_RELEASE(pollhandle)

So normally ->poll() wouldn't have to sleep at all. If the POLL
request fails for some reason, userspace just sends a notification
with the respective error.

Miklos

2008-11-13 13:24:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hi,

Miklos Szeredi wrote:
>> Hmm... yeah, allocating handle from kernel should work fine, but I
>> wouldn't worry about race here. We can just use 64 bit and guarantee
>> that any handle won't be reused ever.
>
> Btw, the protocol could be simplified even more by getting rid of all
> acknowledgements:
>
> <- POLL-request(pollhandle) (alloc handle)
> ...
> -> POLL-notification(pollhandle)
> ...
> <- POLL_RELEASE(pollhandle)
>
> So normally ->poll() wouldn't have to sleep at all. If the POLL
> request fails for some reason, userspace just sends a notification
> with the respective error.

We don't need POLL_RELEASE but we still need POLL-reply (to request) to
send revents. We can put that into notification too. Hmmm... Yeah,
that could be simpler for FUSE servers. I'll venture that way.

Thanks.

--
tejun

2008-11-13 13:43:00

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> We don't need POLL_RELEASE

What happens on timeout? Do we just let userspace continue polling
the file descriptor, and then ignore the notification?

> but we still need POLL-reply (to request) to
> send revents. We can put that into notification too. Hmmm... Yeah,
> that could be simpler for FUSE servers. I'll venture that way.

Then in fact the notification could just become the reply:

<- POLL-request (sent with request_send_nowait())
...
-> POLL-reply (calls req->end())

So there won't even be a need to implement notification (we'll need
that for other things in the future) simplifying things even further.
Even if we want to cancel the request because of a timeout, that could
be done with the existing INTERRUPT request.

Miklos

2008-11-13 14:29:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Miklos Szeredi wrote:
> On Thu, 13 Nov 2008, Tejun Heo wrote:
>> We don't need POLL_RELEASE
>
> What happens on timeout? Do we just let userspace continue polling
> the file descriptor, and then ignore the notification?

Yeap, spurious notifications don't cause any problems.

>> but we still need POLL-reply (to request) to
>> send revents. We can put that into notification too. Hmmm... Yeah,
>> that could be simpler for FUSE servers. I'll venture that way.
>
> Then in fact the notification could just become the reply:
>
> <- POLL-request (sent with request_send_nowait())
> ...
> -> POLL-reply (calls req->end())
>
> So there won't even be a need to implement notification (we'll need
> that for other things in the future) simplifying things even further.
> Even if we want to cancel the request because of a timeout, that could
> be done with the existing INTERRUPT request.

poll/select/epoll can poll on massive number of files. I don't think
it's wise to have that many outstanding requests. FUSE currently uses
linear list to match replies to requests and libfuse will consume one
thread per each poll if implemented like other requests. It can be made
asynchronous from libfuse tho.

I kind of like the original implementation tho. The f_ops->poll
interface is designed to be used like ->poll returning events if
available immediately and queue for later notification as necessary.
Notification is asynchronous and can be spurious (this actually comes
pretty handy for low level implementation). When notified, upper layer
queries the same way using ->poll. This is quite convenient for low
level implementation as the actual logic of poll can live in ->poll
proper while notifications can be scattered around places where events
can occur.

Thanks.

--
tejun

2008-11-13 14:48:35

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Thu, 13 Nov 2008, Tejun Heo wrote:
> poll/select/epoll can poll on massive number of files. I don't think
> it's wise to have that many outstanding requests. FUSE currently uses
> linear list to match replies to requests and libfuse will consume one
> thread per each poll if implemented like other requests. It can be made
> asynchronous from libfuse tho.
>
> I kind of like the original implementation tho. The f_ops->poll
> interface is designed to be used like ->poll returning events if
> available immediately and queue for later notification as necessary.
> Notification is asynchronous and can be spurious (this actually comes
> pretty handy for low level implementation). When notified, upper layer
> queries the same way using ->poll. This is quite convenient for low
> level implementation as the actual logic of poll can live in ->poll
> proper while notifications can be scattered around places where events
> can occur.

Yes, that kind of interface is nice for f_ops->poll, and for libfuse.

But for the kernel interface it's inefficient. A wake up event is 3
context switches instead of one. And that's inherent in the interface
itself for no good reason.

Also there's again the question of userspace filesystem messing with
the caller: your original implementation allows the userspace
filesystem to block f_ops->poll() forever, which really isn't what
poll/select is about.

So I'd still argue for the simple POLL-request/POLL-notify protocol on
the kernel API, and possibly have the async notification similar to
the kernel interface on the library API.

Implementation wise I don't care all that much, but I'd actually
prefer if it was implemented using the traditional request/reply thing
and optimized (possibly later) to find requests in a more efficient
way than searching the linear list, which would benefit not just poll
but all requests.

Miklos

2008-11-13 15:11:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello,

Miklos Szeredi wrote:
>> I kind of like the original implementation tho. The f_ops->poll
>> interface is designed to be used like ->poll returning events if
>> available immediately and queue for later notification as necessary.
>> Notification is asynchronous and can be spurious (this actually comes
>> pretty handy for low level implementation). When notified, upper layer
>> queries the same way using ->poll. This is quite convenient for low
>> level implementation as the actual logic of poll can live in ->poll
>> proper while notifications can be scattered around places where events
>> can occur.
>
> Yes, that kind of interface is nice for f_ops->poll, and for libfuse.
>
> But for the kernel interface it's inefficient. A wake up event is 3
> context switches instead of one. And that's inherent in the interface
> itself for no good reason.

Event notification performance problem is usually in its scalability
not in each notification. It's nice to optimize that too but I don't
think it weighs too much especially for FUSE. Doing it request/reply
way could have scalability concerns, please see below.

> Also there's again the question of userspace filesystem messing with
> the caller: your original implementation allows the userspace
> filesystem to block f_ops->poll() forever, which really isn't what
> poll/select is about.

That would simply be a broken poll implementation just as O_NONBLOCK
read can block in ->read forever.

> So I'd still argue for the simple POLL-request/POLL-notify protocol on
> the kernel API, and possibly have the async notification similar to
> the kernel interface on the library API.
>
> Implementation wise I don't care all that much, but I'd actually
> prefer if it was implemented using the traditional request/reply thing
> and optimized (possibly later) to find requests in a more efficient
> way than searching the linear list, which would benefit not just poll
> but all requests.

Given that the number of in-flight requests are not too high, I think
linear search is fine for now but switching it to b-tree shouldn't be
difficult.

So, pros for req/reply approach.

* Less context switch per event notification.

* No need for separate async notification mechanism.

Cons.

* More interface impedence matching from libfuse.

* Higher overhead when poll/select finishes. Either all outstanding
requests need to be cancelled using INTERRUPT whenever poll/select
returns or kernel needs to keep persistent list of outstanding polls
so that later poll/select can reuse them. The problem here is that
kernel doesn't know when or whether they'll be re-used. We can put
in LRU-based heuristics but it's getting too complex. Note that
it's different from userland server keeping track. The same problem
exists with userland based tracking but for many servers it would be
just a bit in existing structure and we can be much more lax on
userland. ie. actual storage backed files usually don't need
notification at all as data is always available, so the amount of
overhead is limited in most cases but we can't assume things like
that for the kernel.

Overall, I think being lazy about cancellation and let userland notify
asynchronously would be better performance and simplicity wise. What
do you think?

Thanks.

--
tejun

2008-11-13 15:53:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

On Fri, 14 Nov 2008, Tejun Heo wrote:
> Given that the number of in-flight requests are not too high, I think
> linear search is fine for now but switching it to b-tree shouldn't be
> difficult.
>
> So, pros for req/reply approach.
>
> * Less context switch per event notification.
>
> * No need for separate async notification mechanism.
>
> Cons.
>
> * More interface impedence matching from libfuse.
>
> * Higher overhead when poll/select finishes. Either all outstanding
> requests need to be cancelled using INTERRUPT whenever poll/select
> returns or kernel needs to keep persistent list of outstanding polls
> so that later poll/select can reuse them. The problem here is that
> kernel doesn't know when or whether they'll be re-used. We can put
> in LRU-based heuristics but it's getting too complex.

Why not just link the outstanding poll requests into a list anchored
in 'fuse_file'? Easy to reuse, don't care about cancellation.

> Note that
> it's different from userland server keeping track. The same problem
> exists with userland based tracking but for many servers it would be
> just a bit in existing structure and we can be much more lax on
> userland. ie. actual storage backed files usually don't need
> notification at all as data is always available, so the amount of
> overhead is limited in most cases but we can't assume things like
> that for the kernel.
>
> Overall, I think being lazy about cancellation and let userland notify
> asynchronously would be better performance and simplicity wise. What
> do you think?

Lazy cancellation (no cancellation, esentially) sounds good. But that
works fine with the simplified protocol.

Think of it this way, this is what a poll event would look like with
your scheme:

1) -> POLL-notification
2) <- POLL-req
3) -> POLL-reply (revents)

Notice, how 1) and 2) don't carry _any_ information (the notification
can be spurious, the events in the POLL request is just repeated from
the original request). All the info is in 3), so I really don't see
any reason why the above would be better than just omitting the first
two steps.

Miklos

2008-11-13 16:01:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello,

Miklos Szeredi wrote:
>> * Higher overhead when poll/select finishes. Either all outstanding
>> requests need to be cancelled using INTERRUPT whenever poll/select
>> returns or kernel needs to keep persistent list of outstanding polls
>> so that later poll/select can reuse them. The problem here is that
>> kernel doesn't know when or whether they'll be re-used. We can put
>> in LRU-based heuristics but it's getting too complex.
>
> Why not just link the outstanding poll requests into a list anchored
> in 'fuse_file'? Easy to reuse, don't care about cancellation.

Ah, that's the right place.

>> Overall, I think being lazy about cancellation and let userland notify
>> asynchronously would be better performance and simplicity wise. What
>> do you think?
>
> Lazy cancellation (no cancellation, esentially) sounds good. But that
> works fine with the simplified protocol.
>
> Think of it this way, this is what a poll event would look like with
> your scheme:
>
> 1) -> POLL-notification
> 2) <- POLL-req
> 3) -> POLL-reply (revents)
>
> Notice, how 1) and 2) don't carry _any_ information (the notification
> can be spurious, the events in the POLL request is just repeated from
> the original request). All the info is in 3), so I really don't see
> any reason why the above would be better than just omitting the first
> two steps.

Alrighty then. I'll convert it.

Thanks.

--
tejun

2008-11-17 09:17:57

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] FUSE: extend FUSE to support more operations

Hello, Miklos.

I tried to implement poll as you suggested but it doesn't work because
poll actually is synchronous. Please consider the following scenario.
A file system implements a file which supports poll and other file
operations and there's a single threaded client which does the
following.

1. open the file
2. do polling (timeout == 0) poll on the fd
3-1. if POLLIN, consume data and goto #2
3-2. if ! POLLIN, do a ioctl (or whatever) on the fd and goto #2

For a client with single stream of syscalls (single threaded), it's
generally guaranteed that the attempt to consume data is successful
after POLLIN unless the fd dies inbetween. I don't think this is
something guaranteed in POSIX but for most in-kernel poll
implementations, this holds and I've seen good amount of code
depending on it.

To satisfy the above assumption, if ->poll is always asynchronous,
FUSE has to cache revents from previous ->poll attempts and clear it
when something which could have consumed data has occurred.
Unfortunately, in the above case, FUSE has no idea what constitutes
"consume data" but, double unfortunately, it can't take big hammer
approach (clearing on any access) either, because intervening non-data
consuming call like 3-2 above would mean that poll() will never
succeed.

Because data availability should be determined atomically && only the
filesystem knows when or how data availability changes, revents return
from ->poll() must be synchronous.

We can still use req -> reply approach where there's a flag telling
the FUSE server whether the request is synchronous or not but at that
point it seems just obfuscating to me.

So, ->poll() needs to be the combination of synchronous data
availability check + asynchronous notification which can be spurious
to implement the required semantics and I think the original interface
was much more natural for such functionality.

What do you think?

Thanks.

--
tejun

2008-11-17 10:16:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

Hi Tejun,

On Mon, 17 Nov 2008, Tejun Heo wrote:
> Hello, Miklos.
>
> I tried to implement poll as you suggested but it doesn't work because
> poll actually is synchronous. Please consider the following scenario.
> A file system implements a file which supports poll and other file
> operations and there's a single threaded client which does the
> following.

Hmm, I do see your point.

> 1. open the file
> 2. do polling (timeout == 0) poll on the fd
> 3-1. if POLLIN, consume data and goto #2
> 3-2. if ! POLLIN, do a ioctl (or whatever) on the fd and goto #2
>
> For a client with single stream of syscalls (single threaded), it's
> generally guaranteed that the attempt to consume data is successful
> after POLLIN unless the fd dies inbetween. I don't think this is
> something guaranteed in POSIX but for most in-kernel poll
> implementations, this holds and I've seen good amount of code
> depending on it.
>
> To satisfy the above assumption, if ->poll is always asynchronous,
> FUSE has to cache revents from previous ->poll attempts and clear it
> when something which could have consumed data has occurred.
> Unfortunately, in the above case, FUSE has no idea what constitutes
> "consume data" but, double unfortunately, it can't take big hammer
> approach (clearing on any access) either, because intervening non-data
> consuming call like 3-2 above would mean that poll() will never
> succeed.
>
> Because data availability should be determined atomically && only the
> filesystem knows when or how data availability changes, revents return
> from ->poll() must be synchronous.
>
> We can still use req -> reply approach where there's a flag telling
> the FUSE server whether the request is synchronous or not but at that
> point it seems just obfuscating to me.
>
> So, ->poll() needs to be the combination of synchronous data
> availability check + asynchronous notification which can be spurious
> to implement the required semantics and I think the original interface
> was much more natural for such functionality.

OK, lets do it with the original interface. There's still room for
optimization, though, because the _normal_ operation of poll() is
absolutely asynchronous. I think the 'flags' field in poll_in should
be adequate to make the interface extensible in the future.

Thanks,
Miklos

2008-11-18 03:32:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

Miklos Szeredi wrote:
> OK, lets do it with the original interface. There's still room for
> optimization, though, because the _normal_ operation of poll() is
> absolutely asynchronous.

Hmm... I'm not sure I understand what you're saying but if you're
talking about optimizing async case where notification occurs while the
poller is sleeping, I don't think it really matters. That could be
common but they're not performance sensitive path. As select/poll users
become busy, ->poll operation becomes more and more synchronous.

If the client is using better interface like epoll, sending revents via
notification could help a bit but again the problem is that the ->poll
vfs interface is not ready for that. e.g. How do you tell whether
->poll is for epoll polling after the notification or an asynchronous
poll(2) being called after a read(2) after the notification arrived.
There simply isn't enough information to determine when the cached
revents (no matter how short the period of caching is) can be used or
not. One thing we can do is to invalidate the received revents value on
every file operation and then reverting to synchronous call only when
cached revents is not available, but I don't really see good
justifications for such over complexity.

Thanks.

--
tejun

2008-11-18 09:34:22

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

On Tue, 18 Nov 2008, Tejun Heo wrote:
> Miklos Szeredi wrote:
> > OK, lets do it with the original interface. There's still room for
> > optimization, though, because the _normal_ operation of poll() is
> > absolutely asynchronous.
>
> Hmm... I'm not sure I understand what you're saying but if you're
> talking about optimizing async case where notification occurs while the
> poller is sleeping, I don't think it really matters. That could be
> common but they're not performance sensitive path. As select/poll users
> become busy, ->poll operation becomes more and more synchronous.

If poll does have to sleep (however short), then it's alwasy more
synchronous than necessary. It introduces an extra latency of two
context switches per poll. Yes, this means the performance hit will
decrease as the sleep lengthens, but short sleep != no sleep.

> If the client is using better interface like epoll, sending revents via
> notification could help a bit but again the problem is that the ->poll
> vfs interface is not ready for that. e.g. How do you tell whether
> ->poll is for epoll polling after the notification or an asynchronous
> poll(2) being called after a read(2) after the notification arrived.
> There simply isn't enough information to determine when the cached
> revents (no matter how short the period of caching is) can be used or
> not. One thing we can do is to invalidate the received revents value on
> every file operation and then reverting to synchronous call only when
> cached revents is not available, but I don't really see good
> justifications for such over complexity.

It's an optimization, and not a very complex one at that. But yes, we
should leave that till later, when the simple interface prooved itself
working.

Miklos

2008-11-18 10:30:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCHSET] FUSE: extend FUSE to support more operations

Hello,

Miklos Szeredi wrote:
>> Hmm... I'm not sure I understand what you're saying but if you're
>> talking about optimizing async case where notification occurs while the
>> poller is sleeping, I don't think it really matters. That could be
>> common but they're not performance sensitive path. As select/poll users
>> become busy, ->poll operation becomes more and more synchronous.
>
> If poll does have to sleep (however short), then it's alwasy more
> synchronous than necessary. It introduces an extra latency of two
> context switches per poll. Yes, this means the performance hit will
> decrease as the sleep lengthens, but short sleep != no sleep.

If we're talking about poll(2) and select(2), in many cases, they don't
sleep at all as the load nears its limit due to the cost for scanning
idle fds. Those apps usually reach 100% before hitting their bandwidth
limits (no sleep at all) and until they hit the bandwidth limit only the
overhead of scanning idle fds decreases.

What I was trying to say was that for poll(2) and select(2), the
optimization would be a bit empty as they're not gonna help any with
increasing load. Optimization would still be helpful but cost/benefit
doesn't seem to make much sense to me.

>> If the client is using better interface like epoll, sending revents via
>> notification could help a bit but again the problem is that the ->poll
>> vfs interface is not ready for that. e.g. How do you tell whether
>> ->poll is for epoll polling after the notification or an asynchronous
>> poll(2) being called after a read(2) after the notification arrived.
>> There simply isn't enough information to determine when the cached
>> revents (no matter how short the period of caching is) can be used or
>> not. One thing we can do is to invalidate the received revents value on
>> every file operation and then reverting to synchronous call only when
>> cached revents is not available, but I don't really see good
>> justifications for such over complexity.
>
> It's an optimization, and not a very complex one at that. But yes, we
> should leave that till later, when the simple interface prooved itself
> working.

Yeap, we can cache @revents per file and clear it on any operation on
the file but as you said let's leave it for the day when it actually
comes biting.

Thanks.

--
tejun