2016-03-11 16:35:42

by Seth Forshee

Subject: [PATCH 0/2] Fix async io races

These patches are fixes for the problems reported at
https://sourceforge.net/p/fuse/mailman/message/34537139/. The first
patch is one of those that Robert posted on that thread, and the second
is a different patch that Robert and I agree represents a more robust
solution to the race.

The first race is that is_sync_kiocb() is called on an iocb that could
have been freed if async io has already completed. The fix in this case
is simple and obvious: cache the result before starting io.
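
As a minimal sketch of the pattern patch 1 applies (the names
direct_io_sketch, submit_fuse_requests and wait_for_sync_result are
made up for illustration; the real change is the one-line caching in
fuse_direct_IO() shown in the patch):

static ssize_t direct_io_sketch(struct kiocb *iocb, struct iov_iter *iter)
{
	/* read the iocb only while it is known to be valid */
	bool is_sync = is_sync_kiocb(iocb);
	ssize_t ret;

	/* hypothetical submission; async io may complete and free the iocb here */
	ret = submit_fuse_requests(iocb, iter);

	/* from here on use the cached value, never is_sync_kiocb(iocb) */
	if (!is_sync)
		return -EIOCBQUEUED;

	/* hypothetical: collect the result for the synchronous caller */
	return wait_for_sync_result(iocb, ret);
}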

In the second case, aio completion wakes a waiting task and then
unlocks the fuse_io_priv object. The waiting task will free that object
once awoken, though, so it's possible for the object to be freed before
spin_unlock() is called. The simple fix is to move the unlock before
the wake, but this really points to a more fundamental problem in the
lifecycle of fuse_io_priv objects: the request count in the object
serves two purposes, counting outstanding aio requests to the fuse
server and acting as a reference count on the object. For synchronous
io these roles come into conflict.
The task which submits io must wait for all requests to the fuse server
to finish and then retrieve the result of the operation from the
fuse_io_priv object, yet it will not be woken until the outstanding
request count reaches 0. This means the task must release its reference
to the object before waiting, which leads to a somewhat confusing set of
rules for when it is safe to free the object.
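
To make the ordering concrete, here is an illustration of the race and
of the simple reordering fix (generic code, not the actual fuse
functions; the function names are made up and the fields just mirror
fuse_io_priv):

static void complete_io_buggy(struct fuse_io_priv *io)
{
	spin_lock(&io->lock);
	/* ... record result, drop the last outstanding request ... */
	complete(io->done);	/* waiter may wake and free 'io' right here */
	spin_unlock(&io->lock);	/* ... so this unlock can touch freed memory */
}

static void complete_io_fixed(struct fuse_io_priv *io)
{
	struct completion *done;

	spin_lock(&io->lock);
	/* ... record result, drop the last outstanding request ... */
	done = io->done;	/* take what we need before letting go of 'io' */
	spin_unlock(&io->lock);
	complete(done);		/* safe: the completion lives on the waiter's stack */
}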

This becomes much more straightforward if a separate reference count is
maintained for fuse_io_priv objects. Then the waiting task can hold a
reference while waiting, and there is no ambiguity about whether or not
it is safe to free an object. The second patch implements this solution.

Thanks,
Seth

Robert Doebbelin (1):
fuse: do not use iocb after it may have been freed

Seth Forshee (1):
fuse: Add reference counting for fuse_io_priv

fs/fuse/cuse.c | 12 ++++++++++--
fs/fuse/file.c | 47 +++++++++++++++++++++++++++++++++++++----------
fs/fuse/fuse_i.h | 15 +++++++++++++++
3 files changed, 62 insertions(+), 12 deletions(-)


2016-03-11 16:35:46

by Seth Forshee

Subject: [PATCH 2/2] fuse: Add reference counting for fuse_io_priv

The 'reqs' member of fuse_io_priv serves two purposes. First is to
track the number of outstanding async requests to the server and
to signal that the io request is completed. The second is to be a
reference count on the structure to know when it can be freed.

For sync io requests these purposes can be at odds.
fuse_direct_IO() wants to block until the request is done, and
since the signal is sent when 'reqs' reaches 0 it cannot keep a
reference to the object. Yet it needs to use the object after the
userspace server has completed processing requests. This leads to
some handshaking and special casing that is needlessly
complicated and responsible for at least one race condition.

It's much cleaner and safer to maintain a separate reference
count for the object lifecycle and to let req just be a count of
outstanding requests to the userspace server. Then we can know
for sure when it is safe to free the object without any
handshaking or special cases.

The catch here is that most of the time these objects are stack
allocated and should not be freed. Initializing these objects
with a single reference that is never released prevents
accidental attempts to free the objects.
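
Condensed from the hunks below (illustration only, with variable names
chosen for the example), the two initialization patterns this produces
are:

/* On-stack, synchronous users: start at 1 and never unref, so the
 * count can never reach zero and kfree() can never be attempted. */
struct fuse_io_priv sync_io = {
	.refcnt = ATOMIC_INIT(1),
	.async = 0,
	.file = file,
};

/* fuse_direct_IO(): heap allocated, also starts at 1; every reference
 * is eventually dropped with fuse_io_unref(), and the object is freed
 * on the drop that takes the count to zero.  (NULL check omitted in
 * this sketch.) */
struct fuse_io_priv *aio = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
spin_lock_init(&aio->lock);
atomic_set(&aio->refcnt, 1);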

Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
Cc: [email protected] # v4.1+
Signed-off-by: Seth Forshee <[email protected]>
---
fs/fuse/cuse.c | 12 ++++++++++--
fs/fuse/file.c | 42 ++++++++++++++++++++++++++++++++++--------
fs/fuse/fuse_i.h | 15 +++++++++++++++
3 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 8e3ee1936c7e..990436a21533 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -90,7 +90,11 @@ static struct list_head *cuse_conntbl_head(dev_t devt)

static ssize_t cuse_read_iter(struct kiocb *kiocb, struct iov_iter *to)
{
- struct fuse_io_priv io = { .async = 0, .file = kiocb->ki_filp };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = kiocb->ki_filp,
+ };
loff_t pos = 0;

return fuse_direct_io(&io, to, &pos, FUSE_DIO_CUSE);
@@ -98,7 +102,11 @@ static ssize_t cuse_read_iter(struct kiocb *kiocb, struct iov_iter *to)

static ssize_t cuse_write_iter(struct kiocb *kiocb, struct iov_iter *from)
{
- struct fuse_io_priv io = { .async = 0, .file = kiocb->ki_filp };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = kiocb->ki_filp,
+ };
loff_t pos = 0;
/*
* No locking or generic_write_checks(), the server is
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 34a23df361ca..6403c35fa39a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -585,8 +585,9 @@ static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
}

io->iocb->ki_complete(io->iocb, res, 0);
- kfree(io);
}
+
+ fuse_io_unref(io);
}

static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req)
@@ -613,6 +614,7 @@ static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req,
size_t num_bytes, struct fuse_io_priv *io)
{
spin_lock(&io->lock);
+ fuse_io_ref(io);
io->size += num_bytes;
io->reqs++;
spin_unlock(&io->lock);
@@ -691,7 +693,11 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,

static int fuse_do_readpage(struct file *file, struct page *page)
{
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = file,
+ };
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
@@ -984,7 +990,11 @@ static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
size_t res;
unsigned offset;
unsigned i;
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = file,
+ };

for (i = 0; i < req->num_pages; i++)
fuse_wait_on_page_writeback(inode, req->pages[i]->index);
@@ -1398,7 +1408,11 @@ static ssize_t __fuse_direct_read(struct fuse_io_priv *io,

static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- struct fuse_io_priv io = { .async = 0, .file = iocb->ki_filp };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = iocb->ki_filp,
+ };
return __fuse_direct_read(&io, to, &iocb->ki_pos);
}

@@ -1406,7 +1420,11 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = {
+ .refcnt = ATOMIC_INIT(1),
+ .async = 0,
+ .file = file,
+ };
ssize_t res;

if (is_bad_inode(inode))
@@ -2864,6 +2882,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
if (!io)
return -ENOMEM;
spin_lock_init(&io->lock);
+ atomic_set(&io->refcnt, 1);
io->reqs = 1;
io->bytes = -1;
io->size = 0;
@@ -2887,8 +2906,15 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
iov_iter_rw(iter) == WRITE)
io->async = false;

- if (io->async && is_sync)
- io->done = &wait;
+ if (is_sync) {
+ /*
+ * Additional reference to keep io around after
+ * calling fuse_aio_complete()
+ */
+ fuse_io_ref(io);
+ if (io->async)
+ io->done = &wait;
+ }

if (iov_iter_rw(iter) == WRITE) {
ret = fuse_direct_io(io, iter, &pos, FUSE_DIO_WRITE);
@@ -2908,7 +2934,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
ret = fuse_get_res_by_io(io);
}

- kfree(io);
+ fuse_io_unref(io);

if (iov_iter_rw(iter) == WRITE) {
if (ret > 0)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ce394b5fe6b4..77fb63e6dbae 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -22,6 +22,7 @@
#include <linux/rbtree.h>
#include <linux/poll.h>
#include <linux/workqueue.h>
+#include <linux/atomic.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -243,6 +244,7 @@ struct fuse_args {

/** The request IO state (for asynchronous processing) */
struct fuse_io_priv {
+ atomic_t refcnt;
int async;
spinlock_t lock;
unsigned reqs;
@@ -256,6 +258,19 @@ struct fuse_io_priv {
struct completion *done;
};

+static inline void fuse_io_ref(struct fuse_io_priv *io)
+{
+ atomic_inc(&io->refcnt);
+}
+
+static inline void fuse_io_unref(struct fuse_io_priv *io)
+{
+ int cnt = atomic_dec_return(&io->refcnt);
+ BUG_ON(cnt < 0);
+ if (cnt == 0)
+ kfree(io);
+}
+
/**
* Request flags
*
--
1.9.1

2016-03-11 16:35:50

by Seth Forshee

Subject: [PATCH 1/2] fuse: do not use iocb after it may have been freed

From: Robert Doebbelin <[email protected]>

Reading wrong data may cause the IO thread to starve waiting for completion.
The issue was introduced with commit 04b2fa9 and thus affects kernel >= 4.1. It
was discovered by KASan:

kernel: ==================================================================
kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390
kernel: Read of size 8 by task qemu-system-x86/2784
kernel: =============================================================================
kernel: BUG kmalloc-128 (Tainted: G I ): kasan: bad access detected
kernel: -----------------------------------------------------------------------------
kernel: Disabling lock debugging due to kernel taint
kernel: INFO: Slab 0xffffea000db10500 objects=32 used=26 fp=0xffff88036c414e80 flags=0x2ffff0000000080
kernel: INFO: Object 0xffff88036c414380 @offset=896 fp=0x (null)
kernel: Bytes b4 ffff88036c414370: 18 00 00 00 40 27 a3 1f 3b 56 00 00 00 00 00 00 ....@'..;V......
kernel: Object ffff88036c414380: 00 00 00 00 00 00 00 00 00 f0 75 35 00 00 00 00 ..........u5....
kernel: Object ffff88036c414390: 80 27 67 81 ff ff ff ff 00 00 00 00 00 00 00 00 .'g.............
kernel: Object ffff88036c4143a0: 05 00 00 00 00 00 00 00 80 82 44 ad 05 88 ff ff ..........D.....
kernel: Object ffff88036c4143b0: 00 00 00 00 00 00 00 00 10 e1 bc 56 49 56 00 00 ...........VIV..
kernel: Object ffff88036c4143c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: Object ffff88036c4143d0: 00 00 00 00 00 00 00 00 80 f6 85 6d 03 88 ff ff ...........m....
kernel: Object ffff88036c4143e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: Object ffff88036c4143f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: CPU: 0 PID: 2784 Comm: qemu-system-x86 Tainted: G B I 4.2.0-25-generic 0000030
kernel: Hardware name: IBM System x3550 M2 -[794654G]-/49Y6512 , BIOS -[D6E131CUS-1.05]- 11/25/2009
kernel: ffff88036c414380 00000000d939cde9 ffff8805adf0f7c8 ffffffff828cafee
kernel: 0000000000000080 ffff880373803680 ffff8805adf0f7f8 ffffffff81546759
kernel: ffff880373803680 ffffea000db10500 ffff88036c414380 ffff8805ad56d600
kernel: Call Trace:
kernel: [<ffffffff828cafee>] dump_stack+0x45/0x57
kernel: [<ffffffff81546759>] print_trailer+0xf9/0x150
kernel: [<ffffffff8154b9c8>] object_err+0x38/0x50
kernel: [<ffffffff8154e3d8>] kasan_report_error+0x1e8/0x3f0
kernel: [<ffffffff8154de94>] ? kasan_slab_free+0x44/0x50
kernel: [<ffffffff8154e791>] __asan_report_load8_noabort+0x61/0x70
kernel: [<ffffffff818d8bfa>] ? fuse_direct_IO+0xb1a/0xcc0
kernel: [<ffffffff818d8bfa>] fuse_direct_IO+0xb1a/0xcc0
kernel: [<ffffffff818d80e0>] ? fuse_direct_write_iter+0x1f0/0x1f0
kernel: [<ffffffff815d1d90>] ? SyS_pselect6+0x460/0x460
kernel: [<ffffffff8193528e>] ? cap_inode_need_killpriv+0x8e/0xb0
kernel: [<ffffffff8145eda6>] generic_file_direct_write+0x246/0x540
kernel: [<ffffffff815e1ced>] ? file_remove_privs+0xfd/0x270
kernel: [<ffffffff8145eb60>] ? generic_file_read_iter+0x1090/0x1090
kernel: [<ffffffff8159beaa>] ? __sb_start_write+0x10a/0x2d0
kernel: [<ffffffff8159bda0>] ? __sb_end_write+0xc0/0xc0
kernel: [<ffffffff818da16c>] fuse_file_write_iter+0x31c/0x780
kernel: [<ffffffff81a0e91d>] ? aa_file_perm+0x24d/0xa00
kernel: [<ffffffff81673aba>] aio_run_iocb+0x68a/0x870
kernel: [<ffffffff815cfa80>] ? poll_select_copy_remaining+0x3a0/0x3a0
kernel: [<ffffffff818d9e50>] ? fuse_perform_write+0xe80/0xe80
kernel: [<ffffffff81673430>] ? aio_complete+0xcb0/0xcb0
kernel: [<ffffffff8166f52a>] ? eventfd_ctx_read+0xda/0x6d0
kernel: [<ffffffff8166f450>] ? eventfd_write+0x6e0/0x6e0
kernel: [<ffffffff815cfa80>] ? poll_select_copy_remaining+0x3a0/0x3a0
kernel: [<ffffffff811d3280>] ? wake_up_q+0xe0/0xe0
kernel: [<ffffffff81652110>] ? __fsnotify_inode_delete+0x10/0x10
kernel: [<ffffffff815e872e>] ? __fget_light+0x7e/0x1f0
kernel: [<ffffffff8154de4d>] ? kasan_slab_alloc+0xd/0x10
kernel: [<ffffffff81676567>] do_io_submit+0x4a7/0xb40
kernel: [<ffffffff816760c0>] ? SyS_io_destroy+0x200/0x200
kernel: [<ffffffff81597fc0>] ? do_sendfile+0x1280/0x1280
kernel: [<ffffffff81676c10>] SyS_io_submit+0x10/0x20
kernel: [<ffffffff828dc632>] entry_SYSCALL_64_fastpath+0x16/0x75
kernel: Memory state around the buggy address:
kernel: ffff88036c414280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: ffff88036c414300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
kernel: >ffff88036c414380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: ^
kernel: ffff88036c414400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
kernel: ffff88036c414480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
kernel: ==================================================================
kernel: ==================================================================
kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88054c8ad210
kernel: Read of size 8 by task qemu-system-x86/2747
kernel: =============================================================================
kernel: BUG kmalloc-128 (Tainted: G B I ): kasan: bad access detected
kernel: -----------------------------------------------------------------------------
kernel: INFO: Slab 0xffffea0015322b40 objects=32 used=10 fp=0xffff88054c8ad200 flags=0x6ffff0000000080
kernel: INFO: Object 0xffff88054c8ad200 @offset=512 fp=0xffff88054c8ade00
kernel: Bytes b4 ffff88054c8ad1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: Object ffff88054c8ad200: 00 de 8a 4c 05 88 ff ff 00 00 12 07 00 00 00 00 ...L............
kernel: Object ffff88054c8ad210: 80 27 67 81 ff ff ff ff 00 00 00 00 00 00 00 00 .'g.............
kernel: Object ffff88054c8ad220: 05 00 00 00 00 00 00 00 00 80 44 ad 05 88 ff ff ..........D.....
kernel: Object ffff88054c8ad230: 00 00 00 00 00 00 00 00 90 a3 6e 75 94 55 00 00 ..........nu.U..
kernel: Object ffff88054c8ad240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: Object ffff88054c8ad250: 00 00 00 00 00 00 00 00 00 ed 81 ae 05 88 ff ff ................
kernel: Object ffff88054c8ad260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: Object ffff88054c8ad270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
kernel: CPU: 3 PID: 2747 Comm: qemu-system-x86 Tainted: G B I 4.2.0-25-generic 0000030
kernel: Hardware name: IBM System x3550 M2 -[794654G]-/49Y6512 , BIOS -[D6E131CUS-1.05]- 11/25/2009
kernel: ffff88054c8ad200 000000005e8ba930 ffff880371aef7c8 ffffffff828cafee
kernel: 0000000000000080 ffff880373803680 ffff880371aef7f8 ffffffff81546759
kernel: ffff880373803680 ffffea0015322b40 ffff88054c8ad200 ffff8805aa4d1e00
kernel: Call Trace:
kernel: [<ffffffff828cafee>] dump_stack+0x45/0x57
kernel: [<ffffffff81546759>] print_trailer+0xf9/0x150
kernel: [<ffffffff8154b9c8>] object_err+0x38/0x50
kernel: [<ffffffff8154e3d8>] kasan_report_error+0x1e8/0x3f0
kernel: [<ffffffff8154de94>] ? kasan_slab_free+0x44/0x50
kernel: [<ffffffff8154e791>] __asan_report_load8_noabort+0x61/0x70
kernel: [<ffffffff818d8bfa>] ? fuse_direct_IO+0xb1a/0xcc0
kernel: [<ffffffff818d8bfa>] fuse_direct_IO+0xb1a/0xcc0
kernel: [<ffffffff818d80e0>] ? fuse_direct_write_iter+0x1f0/0x1f0
kernel: [<ffffffff8193528e>] ? cap_inode_need_killpriv+0x8e/0xb0
kernel: [<ffffffff8145eda6>] generic_file_direct_write+0x246/0x540
kernel: [<ffffffff815e1ced>] ? file_remove_privs+0xfd/0x270
kernel: [<ffffffff8145eb60>] ? generic_file_read_iter+0x1090/0x1090
kernel: [<ffffffff8159beaa>] ? __sb_start_write+0x10a/0x2d0
kernel: [<ffffffff8159bda0>] ? __sb_end_write+0xc0/0xc0
kernel: [<ffffffff818da16c>] fuse_file_write_iter+0x31c/0x780
kernel: [<ffffffff81673aba>] aio_run_iocb+0x68a/0x870
kernel: [<ffffffff815cfa80>] ? poll_select_copy_remaining+0x3a0/0x3a0
kernel: [<ffffffff818d9e50>] ? fuse_perform_write+0xe80/0xe80
kernel: [<ffffffff81673430>] ? aio_complete+0xcb0/0xcb0
kernel: [<ffffffff8120390d>] ? should_numa_migrate_memory+0xfd/0x410
kernel: [<ffffffff8153a963>] ? mpol_misplaced+0x393/0x520
kernel: [<ffffffff815e872e>] ? __fget_light+0x7e/0x1f0
kernel: [<ffffffff8154de4d>] ? kasan_slab_alloc+0xd/0x10
kernel: [<ffffffff81676567>] do_io_submit+0x4a7/0xb40
kernel: [<ffffffff816760c0>] ? SyS_io_destroy+0x200/0x200
kernel: [<ffffffff81597fc0>] ? do_sendfile+0x1280/0x1280
kernel: [<ffffffff81676c10>] SyS_io_submit+0x10/0x20
kernel: [<ffffffff828dc632>] entry_SYSCALL_64_fastpath+0x16/0x75
kernel: Memory state around the buggy address:
kernel: ffff88054c8ad100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: ffff88054c8ad180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: >ffff88054c8ad200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: ^
kernel: ffff88054c8ad280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
kernel: ffff88054c8ad300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
kernel: ==================================================================

Fixes: bcba24ccdc82 ("fuse: enable asynchronous processing direct IO")
Cc: [email protected] # 3.10+
Signed-off-by: Robert Doebbelin <[email protected]>
---
fs/fuse/file.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b03d253ece15..34a23df361ca 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2843,6 +2843,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
loff_t i_size;
size_t count = iov_iter_count(iter);
struct fuse_io_priv *io;
+ bool is_sync = is_sync_kiocb(iocb);

pos = offset;
inode = file->f_mapping->host;
@@ -2882,11 +2883,11 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
* to wait on real async I/O requests, so we must submit this request
* synchronously.
*/
- if (!is_sync_kiocb(iocb) && (offset + count > i_size) &&
+ if (!is_sync && (offset + count > i_size) &&
iov_iter_rw(iter) == WRITE)
io->async = false;

- if (io->async && is_sync_kiocb(iocb))
+ if (io->async && is_sync)
io->done = &wait;

if (iov_iter_rw(iter) == WRITE) {
@@ -2900,7 +2901,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
fuse_aio_complete(io, ret < 0 ? ret : 0, -1);

/* we have a non-extending, async request, so return */
- if (!is_sync_kiocb(iocb))
+ if (!is_sync)
return -EIOCBQUEUED;

wait_for_completion(&wait);
--
1.9.1

2016-03-14 13:52:04

by Miklos Szeredi

Subject: Re: [PATCH 2/2] fuse: Add reference counting for fuse_io_priv

On Fri, Mar 11, 2016 at 10:35:34AM -0600, Seth Forshee wrote:
> The 'reqs' member of fuse_io_priv serves two purposes. First is to
> track the number of outstanding async requests to the server and
> to signal that the io request is completed. The second is to be a
> reference count on the structure to know when it can be freed.
>
> For sync io requests these purposes can be at odds.
> fuse_direct_IO() wants to block until the request is done, and
> since the signal is sent when 'reqs' reaches 0 it cannot keep a
> reference to the object. Yet it needs to use the object after the
> userspace server has completed processing requests. This leads to
> some handshaking and special casing that is needlessly
> complicated and responsible for at least one race condition.
>
> It's much cleaner and safer to maintain a separate reference
> count for the object lifecycle and to let req just be a count of
> outstanding requests to the userspace server. Then we can know
> for sure when it is safe to free the object without any
> handshaking or special cases.
>
> The catch here is that most of the time these objects are stack
> allocated and should not be freed. Initializing these objects
> with a single reference that is never released prevents
> accidental attempts to free the objects.
>
> Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
> Cc: [email protected] # v4.1+
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/fuse/cuse.c | 12 ++++++++++--
> fs/fuse/file.c | 42 ++++++++++++++++++++++++++++++++++--------
> fs/fuse/fuse_i.h | 15 +++++++++++++++
> 3 files changed, 59 insertions(+), 10 deletions(-)
>

[snip]

> @@ -2864,6 +2882,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> if (!io)
> return -ENOMEM;
> spin_lock_init(&io->lock);
> + atomic_set(&io->refcnt, 1);
> io->reqs = 1;
> io->bytes = -1;
> io->size = 0;
> @@ -2887,8 +2906,15 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> iov_iter_rw(iter) == WRITE)
> io->async = false;
>
> - if (io->async && is_sync)
> - io->done = &wait;
> + if (is_sync) {
> + /*
> + * Additional reference to keep io around after
> + * calling fuse_aio_complete()
> + */
> + fuse_io_ref(io);

AFAICS, the additional reference should only be needed for the io->async case,
no?

Updated, prettified patch below. Could you please test?

Thanks,
Miklos
---

From: Seth Forshee <[email protected]>
Subject: fuse: Add reference counting for fuse_io_priv
Date: Fri, 11 Mar 2016 10:35:34 -0600

The 'reqs' member of fuse_io_priv serves two purposes. First is to track
the number of outstanding async requests to the server and to signal that
the io request is completed. The second is to be a reference count on the
structure to know when it can be freed.

For sync io requests these purposes can be at odds. fuse_direct_IO() wants
to block until the request is done, and since the signal is sent when
'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
use the object after the userspace server has completed processing
requests. This leads to some handshaking and special casing that is
needlessly complicated and responsible for at least one race condition.

It's much cleaner and safer to maintain a separate reference count for the
object lifecycle and to let 'reqs' just be a count of outstanding requests
to the userspace server. Then we can know for sure when it is safe to free
the object without any handshaking or special cases.

The catch here is that most of the time these objects are stack allocated
and should not be freed. Initializing these objects with a single reference
that is never released prevents accidental attempts to free the objects.

Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
Cc: [email protected] # v4.1+
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/cuse.c | 4 ++--
fs/fuse/file.c | 28 +++++++++++++++++++++-------
fs/fuse/fuse_i.h | 10 ++++++++++
3 files changed, 33 insertions(+), 9 deletions(-)

--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -90,7 +90,7 @@ static struct list_head *cuse_conntbl_he

static ssize_t cuse_read_iter(struct kiocb *kiocb, struct iov_iter *to)
{
- struct fuse_io_priv io = { .async = 0, .file = kiocb->ki_filp };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(kiocb->ki_filp);
loff_t pos = 0;

return fuse_direct_io(&io, to, &pos, FUSE_DIO_CUSE);
@@ -98,7 +98,7 @@ static ssize_t cuse_read_iter(struct kio

static ssize_t cuse_write_iter(struct kiocb *kiocb, struct iov_iter *from)
{
- struct fuse_io_priv io = { .async = 0, .file = kiocb->ki_filp };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(kiocb->ki_filp);
loff_t pos = 0;
/*
* No locking or generic_write_checks(), the server is
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -528,6 +528,11 @@ static void fuse_release_user_pages(stru
}
}

+static void fuse_io_release(struct kref *kref)
+{
+ kfree(container_of(kref, struct fuse_io_priv, refcnt));
+}
+
static ssize_t fuse_get_res_by_io(struct fuse_io_priv *io)
{
if (io->err)
@@ -585,8 +590,9 @@ static void fuse_aio_complete(struct fus
}

io->iocb->ki_complete(io->iocb, res, 0);
- kfree(io);
}
+
+ kref_put(&io->refcnt, fuse_io_release);
}

static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req)
@@ -613,6 +619,7 @@ static size_t fuse_async_req_send(struct
size_t num_bytes, struct fuse_io_priv *io)
{
spin_lock(&io->lock);
+ kref_get(&io->refcnt);
io->size += num_bytes;
io->reqs++;
spin_unlock(&io->lock);
@@ -691,7 +698,7 @@ static void fuse_short_read(struct fuse_

static int fuse_do_readpage(struct file *file, struct page *page)
{
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
@@ -984,7 +991,7 @@ static size_t fuse_send_write_pages(stru
size_t res;
unsigned offset;
unsigned i;
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);

for (i = 0; i < req->num_pages; i++)
fuse_wait_on_page_writeback(inode, req->pages[i]->index);
@@ -1398,7 +1405,7 @@ static ssize_t __fuse_direct_read(struct

static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- struct fuse_io_priv io = { .async = 0, .file = iocb->ki_filp };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb->ki_filp);
return __fuse_direct_read(&io, to, &iocb->ki_pos);
}

@@ -1406,7 +1413,7 @@ static ssize_t fuse_direct_write_iter(st
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
- struct fuse_io_priv io = { .async = 0, .file = file };
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;

if (is_bad_inode(inode))
@@ -2864,6 +2871,7 @@ fuse_direct_IO(struct kiocb *iocb, struc
if (!io)
return -ENOMEM;
spin_lock_init(&io->lock);
+ kref_init(&io->refcnt);
io->reqs = 1;
io->bytes = -1;
io->size = 0;
@@ -2887,8 +2895,14 @@ fuse_direct_IO(struct kiocb *iocb, struc
iov_iter_rw(iter) == WRITE)
io->async = false;

- if (io->async && is_sync)
+ if (io->async && is_sync) {
+ /*
+ * Additional reference to keep io around after
+ * calling fuse_aio_complete()
+ */
+ kref_get(&io->refcnt);
io->done = &wait;
+ }

if (iov_iter_rw(iter) == WRITE) {
ret = fuse_direct_io(io, iter, &pos, FUSE_DIO_WRITE);
@@ -2908,7 +2922,7 @@ fuse_direct_IO(struct kiocb *iocb, struc
ret = fuse_get_res_by_io(io);
}

- kfree(io);
+ kref_put(&io->refcnt, fuse_io_release);

if (iov_iter_rw(iter) == WRITE) {
if (ret > 0)
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -22,6 +22,8 @@
#include <linux/rbtree.h>
#include <linux/poll.h>
#include <linux/workqueue.h>
+#include <linux/atomic.h>
+#include <linux/kref.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -243,6 +245,7 @@ struct fuse_args {

/** The request IO state (for asynchronous processing) */
struct fuse_io_priv {
+ struct kref refcnt;
int async;
spinlock_t lock;
unsigned reqs;
@@ -256,6 +259,13 @@ struct fuse_io_priv {
struct completion *done;
};

+#define FUSE_IO_PRIV_SYNC(f) \
+{ \
+ .refcnt = { ATOMIC_INIT(1) }, \
+ .async = 0, \
+ .file = f, \
+}
+
/**
* Request flags
*

2016-03-14 13:53:12

by Miklos Szeredi

Subject: Re: [PATCH 1/2] fuse: do not use iocb after it may have been freed

On Fri, Mar 11, 2016 at 10:35:33AM -0600, Seth Forshee wrote:
> From: Robert Doebbelin <[email protected]>
>
> Reading wrong data may cause the IO thread to starve waiting for completion.
> The issue was introduced with commit 04b2fa9 and thus affects kernel >= 4.1. It
> was discovered by KASan:

Thanks, queued.

Miklos

2016-03-14 15:13:08

by Seth Forshee

Subject: Re: [PATCH 2/2] fuse: Add reference counting for fuse_io_priv

On Mon, Mar 14, 2016 at 02:53:11PM +0100, Miklos Szeredi wrote:
> On Fri, Mar 11, 2016 at 10:35:34AM -0600, Seth Forshee wrote:
> > The 'reqs' member of fuse_io_priv serves two purposes. First is to
> > track the number of outstanding async requests to the server and
> > to signal that the io request is completed. The second is to be a
> > reference count on the structure to know when it can be freed.
> >
> > For sync io requests these purposes can be at odds.
> > fuse_direct_IO() wants to block until the request is done, and
> > since the signal is sent when 'reqs' reaches 0 it cannot keep a
> > reference to the object. Yet it needs to use the object after the
> > userspace server has completed processing requests. This leads to
> > some handshaking and special casing that is needlessly
> > complicated and responsible for at least one race condition.
> >
> > It's much cleaner and safer to maintain a separate reference
> > count for the object lifecycle and to let req just be a count of
> > outstanding requests to the userspace server. Then we can know
> > for sure when it is safe to free the object without any
> > handshaking or special cases.
> >
> > The catch here is that most of the time these objects are stack
> > allocated and should not be freed. Initializing these objects
> > with a single reference that is never released prevents
> > accidental attempts to free the objects.
> >
> > Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
> > Cc: [email protected] # v4.1+
> > Signed-off-by: Seth Forshee <[email protected]>
> > ---
> > fs/fuse/cuse.c | 12 ++++++++++--
> > fs/fuse/file.c | 42 ++++++++++++++++++++++++++++++++++--------
> > fs/fuse/fuse_i.h | 15 +++++++++++++++
> > 3 files changed, 59 insertions(+), 10 deletions(-)
> >
>
> [snip]
>
> > @@ -2864,6 +2882,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> > if (!io)
> > return -ENOMEM;
> > spin_lock_init(&io->lock);
> > + atomic_set(&io->refcnt, 1);
> > io->reqs = 1;
> > io->bytes = -1;
> > io->size = 0;
> > @@ -2887,8 +2906,15 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> > iov_iter_rw(iter) == WRITE)
> > io->async = false;
> >
> > - if (io->async && is_sync)
> > - io->done = &wait;
> > + if (is_sync) {
> > + /*
> > + * Additional reference to keep io around after
> > + * calling fuse_aio_complete()
> > + */
> > + fuse_io_ref(io);
>
> AFAICS, the additional reference should only be needed for the io->async case,
> no?

That seems right, good catch.

> Updated, prettified patch below. Could you please test?

That looks good to me. I'll have to leave it to Robert or Martin to test
though as I could never reproduce the race.

Thanks,
Seth