2009-03-09 15:50:55

by Jeff Moyer

Subject: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Hi,

Believe it or not, I get numerous questions from customers about the
suggested tuning value of aio-max-nr. aio-max-nr limits the total
number of io events that can be reserved, system wide, for aio
completions. Each time io_setup is called, a ring buffer is allocated
that can hold nr_events I/O completions. That ring buffer is then
mapped into the process' address space, and the pages are pinned in
memory. So, the reason for this upper limit (I believe) is to keep a
malicious user from pinning all of kernel memory. Now, this sounds like
a much better job for the memlock rlimit to me, hence the following
patch.

It's worth noting that, by default, aio-max-nr was set to 65536
requests. By default, the memlock rlimit is set to 64kb per process.
With this patch in place, a single process can specify 2045 for the
nr_events parameter of io_setup. Or, for the worst case scenario, a
process can only create 16 io contexts, each with a single event (since
the minimum ring size is a page).

I created a simple program that sets the process' memlock rlimit and
then drops the CAP_IPC_LOCK capability (the memlock rlimit is a command
line parameter, so you can see how the allowable nr_events changes with
different limits in place). Then, it calls io_setup/io_destroy in a
loop, increasing nr_events until it fails. Finally, it calls io_setup
in a loop with a single event to see how many iterations it can perform
before receiving -EAGAIN. I ran the test on a patched kernel and the
results are in line with expectations.

I also ran the aio-dio-regress regression tests, and all passed but
one. The failing test expected a return code of -ENOMEM but instead got
-EAGAIN; part of the patch below returns a proper error code from
aio_setup_ring. So, I'm calling this a test issue, but we can debate
this nuance if it is deemed important.

Further, as a result of this exercise, I noticed that there are numerous
places in the kernel that test to see if locking memory will exceed the
maximum amount of allowed locked memory. I've factored out that bit of
code in a follow-on patch that I will post.

Finally, I updated the aio man pages to reflect this change (and fix
references to aio_context_t that should be io_context_t).

You can find a copy of the test program I used at:
http://people.redhat.com/jmoyer/aio/ioctx_alloc.c
I'll apologize in advance for the crudity of the test code.

For those reviewing the below patch, it may be worth looking at:

commit d55b5fdaf40846221d543937b786956e27837fda
Author: Zach Brown <[email protected]>
Date: Mon Nov 7 00:59:31 2005 -0800

[PATCH] aio: remove aio_max_nr accounting race

The below patch basically reverts the above commit. There is no
accounting race now, because we are charging per-process rlimits instead
of a system-wide maximum number of aio requests. This has the added
benefit of reducing some code complexity.

Comments and suggestions are encouraged. I'll follow up with the other
two patches I mentioned shortly.

Thanks!
Jeff

diff --git a/fs/aio.c b/fs/aio.c
index 8fa77e2..3bbda9d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -43,9 +43,8 @@
#endif

/*------ sysctl variables----*/
-static DEFINE_SPINLOCK(aio_nr_lock);
-unsigned long aio_nr; /* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+/* current system wide number of aio requests */
+atomic_t aio_nr = ATOMIC_INIT(0);
/*----end sysctl variables---*/

static struct kmem_cache *kiocb_cachep;
@@ -90,6 +89,7 @@ static void aio_free_ring(struct kioctx *ctx)
if (info->mmap_size) {
down_write(&ctx->mm->mmap_sem);
do_munmap(ctx->mm, info->mmap_base, info->mmap_size);
+ ctx->mm->locked_vm -= info->nr_pages;
up_write(&ctx->mm->mmap_sem);
}

@@ -106,6 +106,8 @@ static int aio_setup_ring(struct kioctx *ctx)
unsigned nr_events = ctx->max_reqs;
unsigned long size;
int nr_pages;
+ unsigned long locked;
+ unsigned long lock_limit;

/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
@@ -130,6 +132,20 @@ static int aio_setup_ring(struct kioctx *ctx)
info->mmap_size = nr_pages * PAGE_SIZE;
dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
down_write(&ctx->mm->mmap_sem);
+ /*
+ * Check that the memory reserved for the completion ring does
+ * not exceed the memlock memory limit.
+ */
+ locked = nr_pages + ctx->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+ lock_limit >>= PAGE_SHIFT;
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ up_write(&ctx->mm->mmap_sem);
+ info->mmap_size = 0;
+ aio_free_ring(ctx);
+ return -EAGAIN;
+ }
+
info->mmap_base = do_mmap(NULL, 0, info->mmap_size,
PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE,
0);
@@ -144,6 +160,7 @@ static int aio_setup_ring(struct kioctx *ctx)
info->nr_pages = get_user_pages(current, ctx->mm,
info->mmap_base, nr_pages,
1, 0, info->ring_pages, NULL);
+ ctx->mm->locked_vm += info->nr_pages;
up_write(&ctx->mm->mmap_sem);

if (unlikely(info->nr_pages != nr_pages)) {
@@ -194,16 +211,8 @@ static int aio_setup_ring(struct kioctx *ctx)
static void ctx_rcu_free(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
- unsigned nr_events = ctx->max_reqs;

kmem_cache_free(kioctx_cachep, ctx);
-
- if (nr_events) {
- spin_lock(&aio_nr_lock);
- BUG_ON(aio_nr - nr_events > aio_nr);
- aio_nr -= nr_events;
- spin_unlock(&aio_nr_lock);
- }
}

/* __put_ioctx
@@ -217,6 +226,7 @@ static void __put_ioctx(struct kioctx *ctx)
cancel_delayed_work(&ctx->wq);
cancel_work_sync(&ctx->wq.work);
aio_free_ring(ctx);
+ atomic_sub(ctx->max_reqs, &aio_nr);
mmdrop(ctx->mm);
ctx->mm = NULL;
pr_debug("__put_ioctx: freeing %p\n", ctx);
@@ -240,7 +250,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
{
struct mm_struct *mm;
struct kioctx *ctx;
- int did_sync = 0;
+ int ret;

/* Prevent overflows */
if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
@@ -249,9 +259,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-EINVAL);
}

- if ((unsigned long)nr_events > aio_max_nr)
- return ERR_PTR(-EAGAIN);
-
ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
if (!ctx)
return ERR_PTR(-ENOMEM);
@@ -269,26 +276,11 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
INIT_LIST_HEAD(&ctx->run_list);
INIT_DELAYED_WORK(&ctx->wq, aio_kick_handler);

- if (aio_setup_ring(ctx) < 0)
+ if ((ret = aio_setup_ring(ctx)) < 0)
goto out_freectx;

- /* limit the number of system wide aios */
- do {
- spin_lock_bh(&aio_nr_lock);
- if (aio_nr + nr_events > aio_max_nr ||
- aio_nr + nr_events < aio_nr)
- ctx->max_reqs = 0;
- else
- aio_nr += ctx->max_reqs;
- spin_unlock_bh(&aio_nr_lock);
- if (ctx->max_reqs || did_sync)
- break;
-
- /* wait for rcu callbacks to have completed before giving up */
- synchronize_rcu();
- did_sync = 1;
- ctx->max_reqs = nr_events;
- } while (1);
+ /* update the number of system wide aios */
+ atomic_add(ctx->max_reqs, &aio_nr); /* undone by __put_ioctx */

if (ctx->max_reqs == 0)
goto out_cleanup;
@@ -309,7 +301,7 @@ out_cleanup:
out_freectx:
mmdrop(mm);
kmem_cache_free(kioctx_cachep, ctx);
- ctx = ERR_PTR(-ENOMEM);
+ ctx = ERR_PTR(ret);

dprintk("aio: error allocating ioctx %p\n", ctx);
return ctx;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b16a957..09ec393 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -233,7 +233,6 @@ static inline struct kiocb *list_kiocb(struct list_head *h)
}

/* for sysctl: */
-extern unsigned long aio_nr;
-extern unsigned long aio_max_nr;
+extern atomic_t aio_nr;

#endif /* __LINUX__AIO_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c5ef44f..32ced38 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1385,14 +1385,7 @@ static struct ctl_table fs_table[] = {
.data = &aio_nr,
.maxlen = sizeof(aio_nr),
.mode = 0444,
- .proc_handler = &proc_doulongvec_minmax,
- },
- {
- .procname = "aio-max-nr",
- .data = &aio_max_nr,
- .maxlen = sizeof(aio_max_nr),
- .mode = 0644,
- .proc_handler = &proc_doulongvec_minmax,
+ .proc_handler = &proc_dointvec,
},
#endif /* CONFIG_AIO */
#ifdef CONFIG_INOTIFY_USER


2009-03-09 15:55:31

by Jeff Moyer

Subject: [patch] factor out checks against the memlock rlimit

Hi,

There are several places in the kernel where the memlock rlimit is
checked, all duplicating code. I added another in a recent patch and it
made me feel dirty, so this patch factors all of that into a single
function, can_mlock_pages.

The infiniband implementation of the rlimit check was actually broken.
Weeding through changelogs showed that the initial implementation was
wrong, code was #if 0'd out because it didn't work, and instead of
fixing things, they just removed the code altogether. If my
assessment of that is wrong, please let me know! ;-)

Comments, as always, are encouraged.

Cheers,
Jeff

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 6f7c096..c4d4f1b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -135,14 +135,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,

down_write(&current->mm->mmap_sem);

- locked = npages + current->mm->locked_vm;
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
-
- if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ if (!can_mlock_pages(npages)) {
ret = -ENOMEM;
goto out;
}
-
+ locked = npages + current->mm->locked_vm;
cur_base = addr & PAGE_MASK;

ret = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index 0190edc..9ab17e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -58,10 +58,7 @@ static int __get_user_pages(unsigned long start_page, size_t num_pages,
size_t got;
int ret;

- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >>
- PAGE_SHIFT;
-
- if (num_pages > lock_limit) {
+ if (!can_mlock_pages(num_pages)) {
ret = -ENOMEM;
goto bail;
}
diff --git a/fs/aio.c b/fs/aio.c
index 3bbda9d..0ca64eb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -136,10 +136,7 @@ static int aio_setup_ring(struct kioctx *ctx)
* Check that the memory reserved for the completion ring does
* not exceed the memlock memory limit.
*/
- locked = nr_pages + ctx->mm->locked_vm;
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
- lock_limit >>= PAGE_SHIFT;
- if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ if (!can_mlock_pages(nr_pages)) {
up_write(&ctx->mm->mmap_sem);
info->mmap_size = 0;
aio_free_ring(ctx);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..79c1127 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -737,6 +737,7 @@ extern unsigned long shmem_get_unmapped_area(struct file *file,
#endif

extern int can_do_mlock(void);
+extern int can_mlock_pages(unsigned long);
extern int user_shm_lock(size_t, struct user_struct *);
extern void user_shm_unlock(size_t, struct user_struct *);

diff --git a/mm/mlock.c b/mm/mlock.c
index cbe9e05..4ec96c2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -31,6 +31,33 @@ int can_do_mlock(void)
}
EXPORT_SYMBOL(can_do_mlock);

+/**
+ * can_mlock_pages() - tell whether mlocking nr_pages is possible
+ * @nr_pages: Number of pages to be locked in memory
+ *
+ * Boolean function which tells whether the MEMLOCK rlimit will prevent
+ * locking nr_pages pages.
+ */
+int can_mlock_pages(unsigned long nr_pages)
+{
+ unsigned long locked, lock_limit;
+
+ if (capable(CAP_IPC_LOCK))
+ return 1;
+
+ locked = nr_pages;
+ locked += current->mm->locked_vm;
+
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+ lock_limit >>= PAGE_SHIFT;
+
+ /* check against resource limits */
+ if (locked <= lock_limit)
+ return 1;
+ return 0;
+}
+EXPORT_SYMBOL(can_mlock_pages);
+
#ifdef CONFIG_UNEVICTABLE_LRU
/*
* Mlocked pages are marked with PageMlocked() flag for efficient testing
@@ -505,14 +532,8 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;

- locked = len >> PAGE_SHIFT;
- locked += current->mm->locked_vm;
-
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
- lock_limit >>= PAGE_SHIFT;
-
/* check against resource limits */
- if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
+ if (can_mlock_pages(len >> PAGE_SHIFT))
error = do_mlock(start, len, 1);
up_write(&current->mm->mmap_sem);
return error;
diff --git a/mm/mmap.c b/mm/mmap.c
index 00ced3e..52d1c32 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -975,12 +975,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,

/* mlock MCL_FUTURE? */
if (vm_flags & VM_LOCKED) {
- unsigned long locked, lock_limit;
- locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
- lock_limit >>= PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ if (!can_mlock_pages(len >> PAGE_SHIFT))
return -EAGAIN;
}

@@ -2006,12 +2001,7 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
* mlock MCL_FUTURE?
*/
if (mm->def_flags & VM_LOCKED) {
- unsigned long locked, lock_limit;
- locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
- lock_limit >>= PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ if (!can_mlock_pages(len>>PAGE_SHIFT))
return -EAGAIN;
}

diff --git a/mm/mremap.c b/mm/mremap.c
index a39b7b9..6fb50e2 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -342,12 +342,8 @@ unsigned long do_mremap(unsigned long addr,
goto out;
}
if (vma->vm_flags & VM_LOCKED) {
- unsigned long locked, lock_limit;
- locked = mm->locked_vm << PAGE_SHIFT;
- lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
- locked += new_len - old_len;
ret = -EAGAIN;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ if (!can_mlock_pages((new_len - old_len)>>PAGE_SHIFT))
goto out;
}
if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT)) {

2009-03-09 15:59:40

by Jeff Moyer

Subject: [patch] man-pages: add documentation about the memlock implications of io_setup

Hi,

This patch, against man-pages-3.19, adds verbiage surrounding the newly
proposed accounting of pinned aio completion ring pages against the
memlock rlimit. It also fixes up a few references to aio_context_t
(which should be io_context_t). Sorry for conflating the two, I can
break it apart if needed.

Cheers,
Jeff

diff -up man-pages-3.19/man2/io_cancel.2.orig man-pages-3.19/man2/io_cancel.2
--- man-pages-3.19/man2/io_cancel.2.orig 2009-03-08 11:57:50.000000000 -0400
+++ man-pages-3.19/man2/io_cancel.2 2009-03-08 11:58:01.000000000 -0400
@@ -32,7 +32,7 @@ io_cancel \- cancel an outstanding async
.\"#include <linux/aio.h>
.sp
.\" .HP 16
-.BI "int io_cancel(aio_context_t " ctx_id ", struct iocb *" iocb ,
+.BI "int io_cancel(io_context_t " ctx_id ", struct iocb *" iocb ,
.BI " struct io_event *" result );
.\" .ad
.\" .hy
diff -up man-pages-3.19/man2/io_destroy.2.orig man-pages-3.19/man2/io_destroy.2
--- man-pages-3.19/man2/io_destroy.2.orig 2009-03-08 11:58:08.000000000 -0400
+++ man-pages-3.19/man2/io_destroy.2 2009-03-08 11:58:16.000000000 -0400
@@ -31,7 +31,7 @@ io_destroy \- destroy an asynchronous I/
.\" #include <linux/aio.h>
.sp
.\" .HP 17
-.BI "int io_destroy(aio_context_t " ctx );
+.BI "int io_destroy(io_context_t " ctx );
.\" .ad
.\" .hy
.sp
diff -up man-pages-3.19/man2/io_setup.2.orig man-pages-3.19/man2/io_setup.2
--- man-pages-3.19/man2/io_setup.2.orig 2009-03-08 11:53:39.000000000 -0400
+++ man-pages-3.19/man2/io_setup.2 2009-03-08 11:57:09.000000000 -0400
@@ -31,7 +31,7 @@ io_setup \- create an asynchronous I/O c
.\" #include <linux/aio.h>
.sp
.\" .HP 15
-.BI "int io_setup(unsigned " nr_events ", aio_context_t *" ctxp );
+.BI "int io_setup(unsigned " nr_events ", io_context_t *" ctxp );
.\" .ad
.\" .hy
.sp
@@ -45,7 +45,9 @@ at least \fInr_events\fP.
\fIctxp\fP must not point to an AIO context that already exists, and must
be initialized to 0 prior to the call.
On successful creation of the AIO context, \fI*ctxp\fP is filled in
-with the resulting handle.
+with the resulting handle. Memory is reserved by the kernel for the completion
+ring and, because this memory is pinned, it will be charged against the
+process's memlock rlimit.
.SH "RETURN VALUE"
On success,
.BR io_setup ()
@@ -54,7 +56,8 @@ For the failure return, see NOTES.
.SH "ERRORS"
.TP
.B EAGAIN
-The specified \fInr_events\fP exceeds the user's limit of available events.
+The specified \fInr_events\fP exceeds the process's limit of available events.
+Try increasing the memlock rlimit for the process.
.TP
.B EFAULT
An invalid pointer is passed for \fIctxp\fP.
diff -up man-pages-3.19/man2/io_submit.2.orig man-pages-3.19/man2/io_submit.2
--- man-pages-3.19/man2/io_submit.2.orig 2009-03-08 11:58:26.000000000 -0400
+++ man-pages-3.19/man2/io_submit.2 2009-03-08 11:58:34.000000000 -0400
@@ -31,7 +31,7 @@ io_submit \- submit asynchronous I/O blo
.\" #include <linux/aio.h>
.sp
.\" .HP 16
-.BI "int io_submit(aio_context_t " ctx_id ", long " nr \
+.BI "int io_submit(io_context_t " ctx_id ", long " nr \
", struct iocb **" iocbpp );
.\" .ad
.\" .hy
@@ -63,7 +63,7 @@ The file descriptor specified in the fir
One of the data structures points to invalid data.
.TP
.B EINVAL
-The \fIaio_context\fP specified by \fIctx_id\fP is invalid.
+The \fIio_context\fP specified by \fIctx_id\fP is invalid.
\fInr\fP is less than 0.
The \fIiocb\fP at *iocbpp[0] is not properly initialized,
or the operation specified is invalid for the file descriptor
diff -up man-pages-3.19/man7/capabilities.7.orig man-pages-3.19/man7/capabilities.7
--- man-pages-3.19/man7/capabilities.7.orig 2009-03-08 11:52:56.000000000 -0400
+++ man-pages-3.19/man7/capabilities.7 2009-03-08 11:53:14.000000000 -0400
@@ -129,7 +129,8 @@ Lock memory
.RB ( mlock (2),
.BR mlockall (2),
.BR mmap (2),
-.BR shmctl (2)).
+.BR shmctl (2),
+.BR io_setup (2)).
.TP
.B CAP_IPC_OWNER
Bypass permission checks for operations on System V IPC objects.

2009-03-09 16:18:40

by Avi Kivity

Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Jeff Moyer wrote:
> Hi,
>
> Believe it or not, I get numerous questions from customers about the
> suggested tuning value of aio-max-nr. aio-max-nr limits the total
> number of io events that can be reserved, system wide, for aio
> completions. Each time io_setup is called, a ring buffer is allocated
> that can hold nr_events I/O completions. That ring buffer is then
> mapped into the process' address space, and the pages are pinned in
> memory. So, the reason for this upper limit (I believe) is to keep a
> malicious user from pinning all of kernel memory. Now, this sounds like
> a much better job for the memlock rlimit to me, hence the following
> patch.
>

Is it not possible to get rid of the pinning entirely? Pinning
interferes with page migration which is important for NUMA, among other
issues.

--
error compiling committee.c: too many arguments to function

2009-03-09 16:45:22

by Michael Kerrisk

Subject: Re: [patch] man-pages: add documentation about the memlock implications of io_setup

[CC+=Vegard Nossum]

Hi Jeff,

Note that my address for man-pages is [email protected].

On Tue, Mar 10, 2009 at 4:59 AM, Jeff Moyer <[email protected]> wrote:
> Hi,
>
> This patch, against man-pages-3.19, adds verbiage surrounding the newly
> proposed accounting of pinned aio completion ring pages against the
> memlock rlimit. It also fixes up a few references to aio_context_t
> (which should be io_context_t). Sorry for conflating the two, I can
> break it apart if needed.

Breaking apart really would be better. Could you resubmit (to
mtk.manpages@, and CC linux-man@vger) please?

When you resubmit, could you please clarify the following:

1. For the "proposed accounting of pinned aio completion ring pages
against the memlock rlimit", can you tell me which mainline kernel
version this change (will) appear(s) in. (You say "proposed", so that
it's not clear when/if the change will be merged into mainline. A git
commit reference could be useful to me here.)

2. Regarding aio_context_t vs io_context_t, Vegard Nossum also raised
this issue a while back, and I replied that the aio_context_t type is
used because:

[[
These pages attempt to document the bare syscall API.
Take a look at the kernel sources: there the type *is*
aio_context_t.
]]

Perhaps I am/was being muddle-headed here, since, after all, the pages
do say "Link with -laio." (Once upon a time the pages actually *did*
use "io_context_t", but I changed that during a big clean-up of the
io_*.2 pages a year or two back.) Your thoughts? (Why do we even
have aio_context_t in the kernel versus io_context_t in libaio?)

Cheers,

Michael



2009-03-09 16:48:20

by Michael Kerrisk

Subject: Re: [patch] man-pages: add documentation about the memlock implications of io_setup

[RESEND: Really CC+=Vegard Nossum, this time, and += mtk.manpages@, as well]


2009-03-09 17:57:34

by Jeff Moyer

Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Avi Kivity <[email protected]> writes:

> Jeff Moyer wrote:
>> Hi,
>>
>> Believe it or not, I get numerous questions from customers about the
>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
>> number of io events that can be reserved, system wide, for aio
>> completions. Each time io_setup is called, a ring buffer is allocated
>> that can hold nr_events I/O completions. That ring buffer is then
>> mapped into the process' address space, and the pages are pinned in
>> memory. So, the reason for this upper limit (I believe) is to keep a
>> malicious user from pinning all of kernel memory. Now, this sounds like
>> a much better job for the memlock rlimit to me, hence the following
>> patch.
>>
>
> Is it not possible to get rid of the pinning entirely? Pinning
> interferes with page migration which is important for NUMA, among
> other issues.

aio_complete is called from interrupt handlers, so can't block faulting
in a page. Zach mentions there is a possibility of handing completions
off to a kernel thread, with all of the performance worries and extra
bookkeeping that go along with such a scheme (to help frame my concerns,
I often get lambasted over .5% performance regressions).

I'm happy to look into such a scheme, should anyone show me data that
points to this NUMA issue as an actual performance problem today. In
the absence of such data, I simply can't justify the work at the moment.

Thanks for taking a look!

-Jeff

2009-03-09 19:46:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Jeff Moyer wrote:
>> Is it not possible to get rid of the pinning entirely? Pinning
>> interferes with page migration which is important for NUMA, among
>> other issues.
>>
>
> aio_complete is called from interrupt handlers, so can't block faulting
> in a page. Zach mentions there is a possibility of handing completions
> off to a kernel thread, with all of the performance worries and extra
> bookkeeping that go along with such a scheme (to help frame my concerns,
> I often get lambasted over .5% performance regressions).
>

Or you could queue the completions somewhere, and only copy them to user
memory when io_getevents() is called. I think the plan was once to
allow events to be consumed opportunistically even without
io_getevents(), though.
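That deferred-copy approach is easy to sketch in user-space C. Everything below (event layout, queue size, function names) is invented purely to illustrate the idea, not the kernel's actual data structures:

```c
#include <assert.h>
#include <string.h>

/* Toy model of the deferred-copy idea: completions land in a small
 * kernel-owned queue (no user pages involved), and are copied to the
 * caller's buffer only when io_getevents() runs in process context. */

struct io_event { long data; long res; };

#define QUEUE_MAX 64

static struct io_event pending[QUEUE_MAX];  /* "kernel" memory */
static int npending;

/* Could run from "interrupt context": touches only the internal queue. */
static int complete_io(long data, long res)
{
    if (npending == QUEUE_MAX)
        return -1;                  /* queue full, caller must retry */
    pending[npending].data = data;
    pending[npending].res = res;
    npending++;
    return 0;
}

/* Runs in process context: safe to fault in user pages here. */
static int getevents(struct io_event *user_buf, int max)
{
    int n = npending < max ? npending : max;

    memcpy(user_buf, pending, (size_t)n * sizeof(*user_buf));
    memmove(pending, pending + n, (size_t)(npending - n) * sizeof(*pending));
    npending -= n;
    return n;
}
```

With this shape, nothing in the completion path depends on user memory being resident, at the cost of one extra copy per event.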


> I'm happy to look into such a scheme, should anyone show me data that
> points to this NUMA issue as an actual performance problem today. In
> the absence of such data, I simply can't justify the work at the moment.
>

Right now page migration is a dead duck. Outside HPC, there is no
support for triggering it or for getting the scheduler to prefer a
process's memory node. Only a minority of hosts are NUMA.

I think that will/should change in the near future. Nehalem-based
servers mean that NUMA will be commonplace. The larger core counts will
mean that hosts will run several unrelated applications (often through
virtualization); such partitioning can easily benefit from page migration.

> Thanks for taking a look!
>

Sorry, I didn't actually take a look at the patches. I only reacted to
the description - I am allergic to pinned memory.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-03-09 20:33:17

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Jeff Moyer wrote:
> Avi Kivity <[email protected]> writes:
>
>> Jeff Moyer wrote:
>>> Hi,
>>>
>>> Believe it or not, I get numerous questions from customers about the
>>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
>>> number of io events that can be reserved, system wide, for aio
>>> completions. Each time io_setup is called, a ring buffer is allocated
>>> that can hold nr_events I/O completions. That ring buffer is then
>>> mapped into the process' address space, and the pages are pinned in
>>> memory. So, the reason for this upper limit (I believe) is to keep a
>>> malicious user from pinning all of kernel memory. Now, this sounds like
>>> a much better job for the memlock rlimit to me, hence the following
>>> patch.
>>>
>> Is it not possible to get rid of the pinning entirely? Pinning
>> interferes with page migration which is important for NUMA, among
>> other issues.
>
> aio_complete is called from interrupt handlers, so can't block faulting
> in a page. Zach mentions there is a possibility of handing completions
> off to a kernel thread, with all of the performance worries and extra
> bookkeeping that go along with such a scheme (to help frame my concerns,
> I often get lambasted over .5% performance regressions).

This calling of aio_complete from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
instead of call_rcu() for "struct file" freeing.

http://lkml.org/lkml/2008/12/17/364

I would love it if we could get rid of this mess...

2009-03-09 20:36:56

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Avi Kivity wrote:
> Or you could queue the completions somewhere, and only copy them to
> user memory when io_getevents() is called. I think the plan was
> once to allow events to be consumed opportunistically even without
> io_getevents(), though.

Isn't that integrated (or to be integrated) with the vring buffer
thingy used by virtio?

If not, should it be?

> Sorry, I didn't actually take a look at the patches. I only reacted to
> the description - I am allergic to pinned memory.

While we're at it, I'm allergic to fixed-size ring buffers :-)

-- Jamie

2009-03-09 20:44:56

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] man-pages: add documentation about the memlock implications of io_setup

Michael Kerrisk <[email protected]> writes:

> [RESEND: Really CC+=Vegard Nossum, this time, and += mtk.manpages@, as well]
>
> Hi Jeff,
>
> Note that my address for man-pages is [email protected].

Sorry about that!

> On Tue, Mar 10, 2009 at 4:59 AM, Jeff Moyer <[email protected]> wrote:
>> Hi,
>>
>> This patch, against man-pages-3.19, adds verbiage surrounding the newly
>> proposed accounting of pinned aio completion ring pages against the
>> memlock rlimit. It also fixes up a few references to aio_context_t
>> (which should be io_context_t). Sorry for conflating the two, I can
>> break it apart if needed.
>
> Breaking apart really would be better. Could you resubmit (to
> mtk.manpages@, and CC linux-man@vger) please?

Sure thing. I'll address your question when I resubmit.

Thanks!

Jeff

2009-03-09 22:37:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

On Mon, 09 Mar 2009 11:49:57 -0400
Jeff Moyer <[email protected]> wrote:

> Hi,
>
> Believe it or not, I get numerous questions from customers about the
> suggested tuning value of aio-max-nr. aio-max-nr limits the total
> number of io events that can be reserved, system wide, for aio
> completions. Each time io_setup is called, a ring buffer is allocated
> that can hold nr_events I/O completions. That ring buffer is then
> mapped into the process' address space, and the pages are pinned in
> memory. So, the reason for this upper limit (I believe) is to keep a
> malicious user from pinning all of kernel memory. Now, this sounds like
> a much better job for the memlock rlimit to me, hence the following
> patch.
>
> It's worth noting that, by default, aio-max-nr was set to 65536
> requests. By default, the memlock rlimit is set to 64kb per process.
> With this patch in place, a single process can specify 2045 for the
> nr_events parameter of io_setup. Or, for the worst case scenario, a
> process can only create 16 io contexts, each with a single event (since
> the minimum ring size is a page).
>
> I created a simple program that sets the process' memlock rlimit and
> then drops the CAP_IPC_LOCK capability (the memlock rlimit is a command
> line parameter, so you can see how the nr_events allowable changes with
> different limits in place). Then, it calls io_setup/io_destroy in a
> loop, increasing nr_events until it fails. Finally, it calls io_setup
> in a loop with a single event to see how many iterations it can perform
> before receiving -EAGAIN. I ran the test on a patched kernel and the
> results are in line with expectations.
>
> I also ran the aio-dio-regress regression tests, and all passed but
> one. The one that failed only failed due to an expected return code of
> -ENOMEM, and instead got -EAGAIN. Part of the patch below returns a
> proper error code from aio_setup_ring. So, I'm calling this a test
> issue, but we can debate this nuance if it is deemed important.
>
> Further, as a result of this exercise, I noticed that there are numerous
> places in the kernel that test to see if locking memory will exceed the
> maximum amount of allowed locked memory. I've factored out that bit of
> code in a follow-on patch that I will post.
>
> Finally, I updated the aio man pages to reflect this change (and fix
> references to aio_context_t that should be io_context_t).
>
> You can find a copy of the test program I used at:
> http://people.redhat.com/jmoyer/aio/ioctx_alloc.c
> I'll apologize in advance for the crudity of the test code.
>
> For those reviewing the below patch, it may be worth looking at:
>
> commit d55b5fdaf40846221d543937b786956e27837fda
> Author: Zach Brown <[email protected]>
> Date: Mon Nov 7 00:59:31 2005 -0800
>
> [PATCH] aio: remove aio_max_nr accounting race
>
> The below patch basically reverts the above commit. There is no
> accounting race now, because we are charging per-process rlimits instead
> of a system-wide maximum number of aio requests. This has the added
> benefit of reducing some code complexity.

It's risky to simply remove an existing tunable. What if someone's
mission-critical startup script which is provided by a super-slow or
even no-longer-in-business vendor does

if (write(aio-max-nr, something) == error)
crash_and_burn()

?

It would be prudent to have a more cautious update scheme. Leave the
existing tunable in place. Keep it working if possible. If someone
uses it, blurt out a loud printk telling them that they're using a
deprecated interface and informing them how to update.

Then at some later time we can remove the old interface.
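Andrew's scheme — keep the tunable alive but complain loudly — might look roughly like this in user-space C (the handler name, message, and state variables are all hypothetical, not the real sysctl code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the cautious-deprecation scheme: the old aio-max-nr tunable
 * keeps accepting writes so legacy scripts do not break, but the first
 * use emits a loud warning pointing at the replacement. */

static int aio_max_nr_warned;
static char last_warning[128];

static int aio_max_nr_write(unsigned long value)
{
    if (!aio_max_nr_warned) {
        aio_max_nr_warned = 1;
        snprintf(last_warning, sizeof(last_warning),
                 "fs.aio-max-nr is deprecated; use the memlock rlimit "
                 "(RLIMIT_MEMLOCK) instead");
        fprintf(stderr, "warning: %s\n", last_warning);
    }
    (void)value;    /* accepted but no longer enforced */
    return 0;       /* success, so write() in old scripts keeps working */
}
```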

> /*------ sysctl variables----*/
> -static DEFINE_SPINLOCK(aio_nr_lock);
> -unsigned long aio_nr; /* current system wide number of aio requests */
> -unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
> +/* current system wide number of aio requests */
> +atomic_t aio_nr = ATOMIC_INIT(0);

Was it a conscious decision to downgrade this from a `long' type to an
`int' type? It _could_ have used atomic_long_t.

2009-03-10 08:36:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Jamie Lokier wrote:
> Avi Kivity wrote:
>
>> Or you could queue the completions somewhere, and only copy them to
>> user memory when io_getevents() is called. I think the plan was
>> once to allow events to be consumed opportunistically even without
>> io_getevents(), though.
>>
>
> Isn't that integrated (or to be integrated) with the vring buffer
> thingy used by virtio?
>
> If not, should it be?
>
>

vringfd has been abandoned, AFAICT. I think it's too similar to aio anyway.


--
error compiling committee.c: too many arguments to function

2009-03-10 13:44:24

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Andrew Morton <[email protected]> writes:

> It's risky to simply remove an existing tunable. What if someone's
> mission-critical startup script which is provided by a super-slow or
> even no-longer-in-business vendor does
>
> if (write(aio-max-nr, something) == error)
> crash_and_burn()
>
> ?
>
> It would be prudent to have a more cautious update scheme. Leave the
> existing tunable in place. Keep it working if possible. If someone
> uses it, blurt out a loud printk telling them that they're using a
> deprecated interface and informing them how to update.
>
> Then at some later time we can remove the old interface.

You are absolutely right. The more I think about this, the less
enthused I am about it. I still believe the change is the right way to
go, but there are years worth of deployments using the current scheme,
and there are existing documents detailing how to work within that
scheme. I hereby rescind this patch.

I will follow up with the memlock cleanup, though; let me know if you
have comments on that patch. Zach mentioned that he hates my new
function name (can_mlock_pages), so I guess at least that will change.
;)

Cheers,
Jeff

2009-03-12 02:41:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Eric Dumazet wrote:
> Jeff Moyer wrote:
>> Avi Kivity <[email protected]> writes:
>>
>>> Jeff Moyer wrote:
>>>> Hi,
>>>>
>>>> Believe it or not, I get numerous questions from customers about the
>>>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
>>>> number of io events that can be reserved, system wide, for aio
>>>> completions. Each time io_setup is called, a ring buffer is allocated
>>>> that can hold nr_events I/O completions. That ring buffer is then
>>>> mapped into the process' address space, and the pages are pinned in
>>>> memory. So, the reason for this upper limit (I believe) is to keep a
>>>> malicious user from pinning all of kernel memory. Now, this sounds like
>>>> a much better job for the memlock rlimit to me, hence the following
>>>> patch.
>>>>
>>> Is it not possible to get rid of the pinning entirely? Pinning
>>> interferes with page migration which is important for NUMA, among
>>> other issues.
>> aio_complete is called from interrupt handlers, so can't block faulting
>> in a page. Zach mentions there is a possibility of handing completions
>> off to a kernel thread, with all of the performance worries and extra
>> bookkeeping that go along with such a scheme (to help frame my concerns,
>> I often get lambasted over .5% performance regressions).
>
> This calling of aio_complete from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
> instead of call_rcu() for "struct file" freeing.
>
> http://lkml.org/lkml/2008/12/17/364
>
> I would love it if we could get rid of this mess...

Speaking of that, I tried to take a look at this aio stuff and have one question.

Assuming that __fput() cannot be called from interrupt context.
-> fput() should not be called from interrupt context as well.

How come we call fput(req->ki_eventfd) from really_put_req()
from interrupt context?

If the user program closes the eventfd, then in-flight AIO requests can trigger
a bug.

2009-03-12 03:11:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Jeff Moyer wrote:
>>> Avi Kivity <[email protected]> writes:
>>>
>>>> Jeff Moyer wrote:
>>>>> Hi,
>>>>>
>>>>> Believe it or not, I get numerous questions from customers about the
>>>>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
>>>>> number of io events that can be reserved, system wide, for aio
>>>>> completions. Each time io_setup is called, a ring buffer is allocated
>>>>> that can hold nr_events I/O completions. That ring buffer is then
>>>>> mapped into the process' address space, and the pages are pinned in
>>>>> memory. So, the reason for this upper limit (I believe) is to keep a
>>>>> malicious user from pinning all of kernel memory. Now, this sounds like
>>>>> a much better job for the memlock rlimit to me, hence the following
>>>>> patch.
>>>>>
>>>> Is it not possible to get rid of the pinning entirely? Pinning
>>>> interferes with page migration which is important for NUMA, among
>>>> other issues.
>>> aio_complete is called from interrupt handlers, so can't block faulting
>>> in a page. Zach mentions there is a possibility of handing completions
>>> off to a kernel thread, with all of the performance worries and extra
>>> bookkeeping that go along with such a scheme (to help frame my concerns,
>>> I often get lambasted over .5% performance regressions).
>> This calling of aio_complete from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
>> instead of call_rcu() for "struct file" freeing.
>>
>> http://lkml.org/lkml/2008/12/17/364
>>
>> I would love it if we could get rid of this mess...
>
> Speaking of that, I tried to take a look at this aio stuff and have one question.
>
> Assuming that __fput() cannot be called from interrupt context.
> -> fput() should not be called from interrupt context as well.
>
> How come we call fput(req->ki_eventfd) from really_put_req()
> from interrupt context?
>
> If the user program closes the eventfd, then in-flight AIO requests can trigger
> a bug.
>

The path could be:

1) Change fput() so that calling it from interrupt context is possible
(using a work queue to make sure __fput() is called from process context)

2) Change aio to use fput() as is (and zap its internal work queue and aio_fput_routine() stuff)

3) Once atomic_long_dec_and_test(&filp->f_count) is only performed in fput(),
SLAB_DESTROY_BY_RCU for "struct file" can come back :)

2009-03-12 03:17:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

On Thu, Mar 12, 2009 at 03:39:51AM +0100, Eric Dumazet wrote:
> Assuming that __fput() cannot be called from interrupt context.
> -> fput() should not be called from interrupt context as well.

It doesn't get called in irq context. The atomic dec is done in irq
context, and if that is the last user of the file descriptor, it's
placed on a work queue and then released in process context.

-ben
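Ben's description — decrement the count wherever the completion runs, but hand the real release off to a work queue only when the count hits zero — can be modeled in a few lines of user-space C (all names are invented; the real logic lives in fs/aio.c):

```c
#include <assert.h>

/* Toy model of the defer-on-last-ref pattern: the reference count may be
 * dropped from interrupt context, but if that drop is the last one, the
 * file is queued for release and the real __fput() work happens later
 * in process context. */

struct file { int refcount; };

static struct file *deferred[16];   /* stand-in for the fput_head list */
static int ndeferred;

/* Safe from "interrupt context": no blocking work, just queueing. */
static void put_file_irq(struct file *f)
{
    if (--f->refcount == 0)
        deferred[ndeferred++] = f;  /* queue_work(aio_wq, &fput_work) */
}

/* Work-queue handler, runs in process context: do the real release. */
static int drain_deferred(void)
{
    int released = ndeferred;

    ndeferred = 0;                  /* pretend each entry was __fput() */
    return released;
}
```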

2009-03-12 03:25:38

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Benjamin LaHaise a ?crit :
> On Thu, Mar 12, 2009 at 03:39:51AM +0100, Eric Dumazet wrote:
>> Assuming that __fput() cannot be called from interrupt context.
>> -> fput() should not be called from interrupt context as well.
>
> It doesn't get called in irq context. The atomic dec is done in irq
> context, and if that is the last user of the file descriptor, it's
> placed on a work queue and then released in process context.
>

really_put_req() itself can be called from interrupt context, unless I am wrong.
(it's 4:24 am here ;))

if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
spin_unlock(&fput_lock);
queue_work(aio_wq, &fput_work);
} else
really_put_req(ctx, req); /* from interrupt context */



static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
{
assert_spin_locked(&ctx->ctx_lock);

if (!IS_ERR(req->ki_eventfd))
fput(req->ki_eventfd); /* BANG : can be called from interrupt context */
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
kfree(req->ki_iovec);
kmem_cache_free(kiocb_cachep, req);
ctx->reqs_active--;

if (unlikely(!ctx->reqs_active && ctx->dead))
wake_up(&ctx->wait);
}

Thank you

2009-03-12 03:29:41

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

On Thu, Mar 12, 2009 at 04:24:42AM +0100, Eric Dumazet wrote:
> if (!IS_ERR(req->ki_eventfd))
> fput(req->ki_eventfd); /* BANG : can be called from interrupt context */
...
> Thank you

That's a bug in the eventfd code, not aio. Davide: please fix.

-ben

2009-03-12 03:34:45

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Benjamin LaHaise a ?crit :
> On Thu, Mar 12, 2009 at 04:24:42AM +0100, Eric Dumazet wrote:
>> if (!IS_ERR(req->ki_eventfd))
>> fput(req->ki_eventfd); /* BANG : can be called from interrupt context */
> ...
>> Thank you
>
> That's a bug in the eventfd code, not aio. Davide: please fix.
>

Hmm... what about fget_light()... is it Davide's fault too?

aio breaks the fget_light() concept too, if the process is single-threaded.

/*
* Lightweight file lookup - no refcnt increment if fd table isn't shared.
* You can use this only if it is guaranteed that the current task already
* holds a refcnt to that file. That check has to be done at fget() only
* and a flag is returned to be passed to the corresponding fput_light().
* There must not be a cloning between an fget_light/fput_light pair.
*/
struct file *fget_light(unsigned int fd, int *fput_needed)
{
struct file *file;
struct files_struct *files = current->files;

*fput_needed = 0;
if (likely((atomic_read(&files->count) == 1))) {
file = fcheck_files(files, fd);
} else {
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
if (atomic_long_inc_not_zero(&file->f_count))
*fput_needed = 1;
else
/* Didn't get the reference, someone's freed */
file = NULL;
}
rcu_read_unlock();
}

return file;
}


2009-03-12 03:36:25

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

On Thu, Mar 12, 2009 at 04:33:20AM +0100, Eric Dumazet wrote:
> aio breaks the fget_light() concept too, if the process is single-threaded.

AIO requests cannot use fget_light(). The user space program could very
well use the close() syscall to destroy the file descriptor while an io
request is in flight, so you cannot avoid the reference counting overhead.

-ben
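A minimal user-space model of Ben's point — the request's own reference, taken at submission time, is what makes a concurrent close() safe (names are illustrative only, not the kernel's):

```c
#include <assert.h>

/* Toy model of why an in-flight aio request must hold its own file
 * reference: close() drops the fd table's reference, but the reference
 * taken at io_submit time keeps the file alive until completion. */

struct file { int refcount; int freed; };

static void get_file(struct file *f) { f->refcount++; }

static void put_file(struct file *f)
{
    if (--f->refcount == 0)
        f->freed = 1;               /* stand-in for __fput() */
}

static void submit_request(struct file *f)   { get_file(f); }  /* fget() */
static void close_fd(struct file *f)         { put_file(f); }  /* close() */
static void complete_request(struct file *f) { put_file(f); }  /* aio done */
```

A lightweight fget_light()-style lookup would skip that per-request reference, which is exactly why aio cannot use it.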

2009-03-12 03:41:49

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

Benjamin LaHaise a ?crit :
> On Thu, Mar 12, 2009 at 04:33:20AM +0100, Eric Dumazet wrote:
>> aio breaks the fget_light() concept too, if the process is single-threaded.
>
> AIO requests cannot use fget_light(). The user space program could very
> well use the close() syscall to destroy the file descriptor while an io
> request is in flight, so you cannot avoid the reference counting overhead.

Yes, you are right, AIO doesn't break fget_light().

2009-03-12 05:19:55

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] fs: fput() can be called from interrupt context

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Eric Dumazet wrote:
>>> Jeff Moyer wrote:
>>>> Avi Kivity <[email protected]> writes:
>>>>
>>>>> Jeff Moyer wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Believe it or not, I get numerous questions from customers about the
>>>>>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
>>>>>> number of io events that can be reserved, system wide, for aio
>>>>>> completions. Each time io_setup is called, a ring buffer is allocated
>>>>>> that can hold nr_events I/O completions. That ring buffer is then
>>>>>> mapped into the process' address space, and the pages are pinned in
>>>>>> memory. So, the reason for this upper limit (I believe) is to keep a
>>>>>> malicious user from pinning all of kernel memory. Now, this sounds like
>>>>>> a much better job for the memlock rlimit to me, hence the following
>>>>>> patch.
>>>>>>
>>>>> Is it not possible to get rid of the pinning entirely? Pinning
>>>>> interferes with page migration which is important for NUMA, among
>>>>> other issues.
>>>> aio_complete is called from interrupt handlers, so can't block faulting
>>>> in a page. Zach mentions there is a possibility of handing completions
>>>> off to a kernel thread, with all of the performance worries and extra
>>>> bookkeeping that go along with such a scheme (to help frame my concerns,
>>>> I often get lambasted over .5% performance regressions).
>>> This calling of aio_complete from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
>>> instead of call_rcu() for "struct file" freeing.
>>>
>>> http://lkml.org/lkml/2008/12/17/364
>>>
>>> I would love it if we could get rid of this mess...
>> Speaking of that, I tried to take a look at this aio stuff and have one question.
>>
>> Assuming that __fput() cannot be called from interrupt context.
>> -> fput() should not be called from interrupt context as well.
>>
>> How come we call fput(req->ki_eventfd) from really_put_req()
>> from interrupt context ?
>>
>> If the user program closes the eventfd, then in-flight AIO requests can trigger
>> a bug.
>>
>
> The path could be:
>
> 1) Change fput() so that calling it from interrupt context is possible
> (using a work queue to make sure __fput() is called from process context)
>
> 2) Change aio to use fput() as is (and zap its internal work queue and aio_fput_routine() stuff)
>
> 3) Once atomic_long_dec_and_test(&filp->f_count) is only performed in fput(),
> SLAB_DESTROY_BY_RCU for "struct file" can come back :)
>

Please find the first patch, against linux-2.6, below.

The next patch (2) can clean up the aio code, but it can probably wait for linux-2.6.30.

Thank you

[PATCH] fs: fput() can be called from interrupt context

Current aio/eventfd code can call fput() from interrupt context, which is
not allowed.

In order to fix the problem and prepare for SLAB_DESTROY_BY_RCU use for "struct file"
allocation/freeing in 2.6.30, we extend the existing workqueue infrastructure and
allow fput() to be called from interrupt context.

This unfortunately adds a pointer to 'struct file'.

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/file.c | 55 ++++++++++++++++++++++++++------------
fs/file_table.c | 10 +++++-
include/linux/fdtable.h | 1
include/linux/fs.h | 1
4 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index f313314..8c819a8 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -24,6 +24,7 @@ struct fdtable_defer {
spinlock_t lock;
struct work_struct wq;
struct fdtable *next;
+ struct file *fhead;
};

int sysctl_nr_open __read_mostly = 1024*1024;
@@ -67,24 +68,53 @@ static void free_fdtable_work(struct work_struct *work)
struct fdtable_defer *f =
container_of(work, struct fdtable_defer, wq);
struct fdtable *fdt;
+ struct file *fhead;

- spin_lock_bh(&f->lock);
- fdt = f->next;
- f->next = NULL;
- spin_unlock_bh(&f->lock);
- while(fdt) {
+ spin_lock_irq(&f->lock);
+ fdt = f->next;
+ fhead = f->fhead;
+ f->next = NULL;
+ f->fhead = NULL;
+ spin_unlock_irq(&f->lock);
+
+ while (fdt) {
struct fdtable *next = fdt->next;
+
vfree(fdt->fd);
free_fdset(fdt);
kfree(fdt);
fdt = next;
}
+
+ while (fhead) {
+ struct file *next = fhead->f_next;
+
+ __fput(fhead);
+ fhead = next;
+ }
+}
+
+void fd_defer_queue(struct fdtable *fdt, struct file *file)
+{
+ struct fdtable_defer *fddef = &get_cpu_var(fdtable_defer_list);
+
+ spin_lock_irq(&fddef->lock);
+ if (fdt) {
+ fdt->next = fddef->next;
+ fddef->next = fdt;
+ }
+ if (file) {
+ file->f_next = fddef->fhead;
+ fddef->fhead = file;
+ }
+ spin_unlock_irq(&fddef->lock);
+ schedule_work(&fddef->wq);
+ put_cpu_var(fdtable_defer_list);
}

void free_fdtable_rcu(struct rcu_head *rcu)
{
struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
- struct fdtable_defer *fddef;

BUG_ON(!fdt);

@@ -101,16 +131,8 @@ void free_fdtable_rcu(struct rcu_head *rcu)
kfree(fdt->fd);
kfree(fdt->open_fds);
kfree(fdt);
- } else {
- fddef = &get_cpu_var(fdtable_defer_list);
- spin_lock(&fddef->lock);
- fdt->next = fddef->next;
- fddef->next = fdt;
- /* vmallocs are handled from the workqueue context */
- schedule_work(&fddef->wq);
- spin_unlock(&fddef->lock);
- put_cpu_var(fdtable_defer_list);
- }
+ } else
+ fd_defer_queue(fdt, NULL);
}

/*
@@ -410,6 +432,7 @@ static void __devinit fdtable_defer_list_init(int cpu)
spin_lock_init(&fddef->lock);
INIT_WORK(&fddef->wq, free_fdtable_work);
fddef->next = NULL;
+ fddef->fhead = NULL;
}

void __init files_defer_init(void)
diff --git a/fs/file_table.c b/fs/file_table.c
index bbeeac6..77f42d8 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -21,6 +21,7 @@
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
+#include <linux/interrupt.h>

#include <asm/atomic.h>

@@ -220,10 +221,15 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
}
EXPORT_SYMBOL(init_file);

+
void fput(struct file *file)
{
- if (atomic_long_dec_and_test(&file->f_count))
- __fput(file);
+ if (atomic_long_dec_and_test(&file->f_count)) {
+ if (unlikely(!in_interrupt()))
+ fd_defer_queue(NULL, file);
+ else
+ __fput(file);
+ }
}

EXPORT_SYMBOL(fput);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 09d6c5b..42e5e8e 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -35,6 +35,7 @@ struct fdtable {
struct fdtable *next;
};

+extern void fd_defer_queue(struct fdtable *fdt, struct file *file);
/*
* Open file table structure
*/
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..94beb0e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -872,6 +872,7 @@ struct file {
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
+ struct file *f_next;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);

2009-03-12 05:44:53

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] aio: fput() can be called from interrupt context

Eric Dumazet wrote:
>> The path could be:
>>
>> 1) Change fput() so that calling it from interrupt context is possible
>> (using a work queue to make sure __fput() is called from process context)
>>
>> 2) Change aio to use fput() as is (and zap its internal work queue and aio_fput_routine() stuff)
>>
>> 3) Once atomic_long_dec_and_test(&filp->f_count) is only performed in fput(),
>> SLAB_DESTROY_BY_RCU for "struct file" can come back :)
>>

Here is the second patch

Thank you

[PATCH] aio: cleanup, since fput() is IRQ safe

Once fput() is IRQ safe, we can clean up the aio code and delete its work queue.

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/aio.c | 52 ++--------------------------------------------------
1 files changed, 2 insertions(+), 50 deletions(-)


diff --git a/fs/aio.c b/fs/aio.c
index 8fa77e2..b0351a1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -53,13 +53,6 @@ static struct kmem_cache *kioctx_cachep;

static struct workqueue_struct *aio_wq;

-/* Used for rare fput completion. */
-static void aio_fput_routine(struct work_struct *);
-static DECLARE_WORK(fput_work, aio_fput_routine);
-
-static DEFINE_SPINLOCK(fput_lock);
-static LIST_HEAD(fput_head);
-
static void aio_kick_handler(struct work_struct *);
static void aio_queue_work(struct kioctx *);

@@ -469,15 +462,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;
- /* Handle a potential starvation case -- should be exceedingly rare as
- * requests will be stuck on fput_head only if the aio_fput_routine is
- * delayed and the requests were the last user of the struct file.
- */
req = __aio_get_req(ctx);
- if (unlikely(NULL == req)) {
- aio_fput_routine(NULL);
- req = __aio_get_req(ctx);
- }
return req;
}

@@ -498,30 +483,6 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
wake_up(&ctx->wait);
}

-static void aio_fput_routine(struct work_struct *data)
-{
- spin_lock_irq(&fput_lock);
- while (likely(!list_empty(&fput_head))) {
- struct kiocb *req = list_kiocb(fput_head.next);
- struct kioctx *ctx = req->ki_ctx;
-
- list_del(&req->ki_list);
- spin_unlock_irq(&fput_lock);
-
- /* Complete the fput */
- __fput(req->ki_filp);
-
- /* Link the iocb into the context's free list */
- spin_lock_irq(&ctx->ctx_lock);
- really_put_req(ctx, req);
- spin_unlock_irq(&ctx->ctx_lock);
-
- put_ioctx(ctx);
- spin_lock_irq(&fput_lock);
- }
- spin_unlock_irq(&fput_lock);
-}
-
/* __aio_put_req
* Returns true if this put was the last user of the request.
*/
@@ -540,17 +501,8 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
req->ki_cancel = NULL;
req->ki_retry = NULL;

- /* Must be done under the lock to serialise against cancellation.
- * Call this aio_fput as it duplicates fput via the fput_work.
- */
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
- get_ioctx(ctx);
- spin_lock(&fput_lock);
- list_add(&req->ki_list, &fput_head);
- spin_unlock(&fput_lock);
- queue_work(aio_wq, &fput_work);
- } else
- really_put_req(ctx, req);
+ fput(req->ki_filp);
+ really_put_req(ctx, req);
return 1;
}

2009-03-12 05:51:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Thu, 12 Mar 2009 06:18:26 +0100 Eric Dumazet <[email protected]> wrote:

> Eric Dumazet wrote:
> > Eric Dumazet wrote:
> >> Eric Dumazet wrote:
> >>> Jeff Moyer wrote:
> >>>> Avi Kivity <[email protected]> writes:
> >>>>
> >>>>> Jeff Moyer wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Believe it or not, I get numerous questions from customers about the
> >>>>>> suggested tuning value of aio-max-nr. aio-max-nr limits the total
> >>>>>> number of io events that can be reserved, system wide, for aio
> >>>>>> completions. Each time io_setup is called, a ring buffer is allocated
> >>>>>> that can hold nr_events I/O completions. That ring buffer is then
> >>>>>> mapped into the process' address space, and the pages are pinned in
> >>>>>> memory. So, the reason for this upper limit (I believe) is to keep a
> >>>>>> malicious user from pinning all of kernel memory. Now, this sounds like
> >>>>>> a much better job for the memlock rlimit to me, hence the following
> >>>>>> patch.
> >>>>>>
> >>>>> Is it not possible to get rid of the pinning entirely? Pinning
> >>>>> interferes with page migration which is important for NUMA, among
> >>>>> other issues.
> >>>> aio_complete is called from interrupt handlers, so can't block faulting
> >>>> in a page. Zach mentions there is a possibility of handing completions
> >>>> off to a kernel thread, with all of the performance worries and extra
> >>>> bookkeeping that go along with such a scheme (to help frame my concerns,
> >>>> I often get lambasted over .5% performance regressions).
> >>> This aio_completion from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
> >>> instead of call_rcu() for "struct file" freeing.
> >>>
> >>> http://lkml.org/lkml/2008/12/17/364
> >>>
> >>> I would love if we could get rid of this mess...
> >> Speaking of that, I tried to take a look at this aio stuff and have one question.
> >>
> >> Assuming that __fput() cannot be called from interrupt context,
> >> fput() should not be called from interrupt context either.
> >>
> >> How come we call fput(req->ki_eventfd) from really_put_req()
> >> from interrupt context?
> >>
> >> If user program closes eventfd, then inflight AIO requests can trigger
> >> a bug.
> >>
> >
> > Path could be :
> >
> > 1) fput() changes so that calling it from interrupt context is possible
> > (Using a working queue to make sure __fput() is called from process context)
> >
> > 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
> >
> > 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
> > SLAB_DESTROY_BY_RCU for "struct file" get back :)
> >
>
> Please find first patch against linux-2.6
>
> Next patch (2) can cleanup aio code, but it probably can wait linux-2.6.30
>
> Thank you
>
> [PATCH] fs: fput() can be called from interrupt context
>
> Current aio/eventfd code can call fput() from interrupt context, which is
> not allowed.

The changelog forgot to tell us where this happens, and under what
circumstances.

See, there might be other ways of fixing the bug,

> In order to fix the problem and prepare SLAB_DESTROY_BY_RCU use for "struct file"
> allocation/freeing in 2.6.30, we might extend existing workqueue infrastructure and
> allow fput() to be called from interrupt context.
>
> This unfortunately adds a pointer to 'struct file'.
>
> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/file.c | 55 ++++++++++++++++++++++++++------------
> fs/file_table.c | 10 +++++-
> include/linux/fdtable.h | 1
> include/linux/fs.h | 1
> 4 files changed, 49 insertions(+), 18 deletions(-)

which might not have some or all of the above problems.


I assume you're referring to really_put_req(), and commit
9c3060bedd84144653a2ad7bea32389f65598d40.

From the above email straggle I extract "If user program closes
eventfd, then inflight AIO requests can trigger a bug" and I don't
immediately see anything in there which would prevent this.

Did you reproduce the bug, and confirm that the patch fixes it?

Are there simpler ways of fixing it? Maybe sneak a call to
wait_for_all_aios() into the right place? I doubt if it's performance
critical, as nobody seems to have ever hit the bug.

Bear in mind that if the bug _is_ real then it's now out there, and
we would like a fix which is usable by 2.6.<two-years-worth>.

etcetera..

2009-03-12 06:12:35

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

Andrew Morton wrote:
> On Thu, 12 Mar 2009 06:18:26 +0100 Eric Dumazet <[email protected]> wrote:
>
>> Eric Dumazet wrote :
>>> Path could be :
>>>
>>> 1) fput() changes so that calling it from interrupt context is possible
>>> (Using a working queue to make sure __fput() is called from process context)
>>>
>>> 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
>>>
>>> 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
>>> SLAB_DESTROY_BY_RCU for "struct file" get back :)
>>>
>> Please find first patch against linux-2.6
>>
>> Next patch (2) can cleanup aio code, but it probably can wait linux-2.6.30
>>
>> Thank you
>>
>> [PATCH] fs: fput() can be called from interrupt context
>>
>> Current aio/eventfd code can call fput() from interrupt context, which is
>> not allowed.
>
> The changelog forgot to tell us where this happens, and under what
> circumstances.
>
> See, there might be other ways of fixing the bug,

Sure

>
>> In order to fix the problem and prepare SLAB_DESTROY_BY_RCU use for "struct file"
>> allocation/freeing in 2.6.30, we might extend existing workqueue infrastructure and
>> allow fput() to be called from interrupt context.
>>
>> This unfortunately adds a pointer to 'struct file'.
>>
>> Signed-off-by: Eric Dumazet <[email protected]>
>> ---
>> fs/file.c | 55 ++++++++++++++++++++++++++------------
>> fs/file_table.c | 10 +++++-
>> include/linux/fdtable.h | 1
>> include/linux/fs.h | 1
>> 4 files changed, 49 insertions(+), 18 deletions(-)
>
> which might not have some or all of the above problems.
>
>
> I assume you're referring to really_put_req(), and commit
> 9c3060bedd84144653a2ad7bea32389f65598d40.
>
> From the above email straggle I extract "If user program closes
> eventfd, then inflight AIO requests can trigger a bug" and I don't
> immediately see anything in there which would prevent this.
>
> Did you reproduce the bug, and confirm that the patch fixes it?

take Davide program : http://www.xmailserver.org/eventfd-aio-test.c

and add at line 318 :
close(afd);

It should produce the kernel bug...
>
> Are there simpler ways of fixing it? Maybe sneak a call to
> wait_for_all_aios() into the right place? I doubt if it's performance
> critical, as nobody seems to have ever hit the bug.

Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
(or read my 2nd patch, it should spot the thing)

If you want to add another kludge to properly fput(req->ki_eventfd),
be my guest :-(

>
> Bear in mind that if the bug _is_ real then it's now out there, and
> we would like a fix which is usable by 2.6.<two-years-worth>.
>
> etcetera..



2009-03-12 06:42:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Thu, 12 Mar 2009 07:10:38 +0100 Eric Dumazet <[email protected]> wrote:

> >
> > Did you reproduce the bug, and confirm that the patch fixes it?
>
> take Davide program : http://www.xmailserver.org/eventfd-aio-test.c
>
> and add at line 318 :
> close(afd);
>
> It should produce the kernel bug...

"should"?

> >
> > Are there simpler ways of fixing it? Maybe sneak a call to
> > wait_for_all_aios() into the right place? I doubt if it's performance
> > critical, as nobody seems to have ever hit the bug.
>
> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> (or read my 2nd patch, it should spot the thing)

Well yes, a kludge like that seems a bit safer.

It's somewhat encouraging that we're apparently already doing fput()
from within keventd (although how frequently?). There might be
problems with file locking, security code, etc from doing fput() from
an unexpected thread. And then there are all the usual weird problems
with using the keventd queues which take a long time to get discovered.


> If you want to add another kludge to properly fput(req->ki_eventfd),
> be my guest :-(
>
> >
> > Bear in mind that if the bug _is_ real then it's now out there, and
> > we would like a fix which is usable by 2.6.<two-years-worth>.

The patches are large and scary and it would be a real problem to merge
them into 2.6.29 at this stage, let alone 2.6.25, etc.

Especially as the code which you sent out appears to be untested:

> void fput(struct file *file)
> {
> - if (atomic_long_dec_and_test(&file->f_count))
> - __fput(file);
> + if (atomic_long_dec_and_test(&file->f_count)) {
> + if (unlikely(!in_interrupt()))

^

> + fd_defer_queue(NULL, file);
> + else
> + __fput(file);
> + }
> }

2009-03-12 13:39:57

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Wed, 11 Mar 2009, Andrew Morton wrote:

> > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > (or read my 2nd patch, it should spot the thing)
>
> Well yes, a kludge like that seems a bit safer.
>
> It's somewhat encouraging that we're apparently already doing fput()
> from within keventd (although how frequently?). There might be
> problems with file locking, security code, etc from doing fput() from
> an unexpected thread. And then there are all the usual weird problem
> with using the keventd queues which take a long time to get discovered.

Would it be a huge problem, performance-wise, to stop making ->f_count
tricks in __aio_put_req, and always offload to fput_work the task of
releasing the requests?
If that's a huge problem, IMO the lower impact fix would be to use
aio_fput_routine to loop into a second list, releasing the eventual
eventfd file*. There's no need, IMO, to turn the whole fput() into
IRQ-callable just for this case, when we can contain it into the
particular KAIO+eventfd usage.


- Davide

2009-03-12 19:26:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

Andrew Morton wrote:
> On Thu, 12 Mar 2009 07:10:38 +0100 Eric Dumazet <[email protected]> wrote:
>
>>> Did you reproduce the bug, and confirm that the patch fixes it?
>> take Davide program : http://www.xmailserver.org/eventfd-aio-test.c
>>
>> and add at line 318 :
>> close(afd);
>>
>> It should produce the kernel bug...
>
> "should"?

Sorry I had to run this morning, so I was a little bit lazy...

Maybe some kind of O_DIRECT / NFS / blockdev setup or something calls
aio_complete() from interrupt context? I am not familiar with this part,
but considering the spin_lock_irq()/spin_unlock_irq() calls in fs/aio.c,
there must be a reason for them.

>
>>> Are there simpler ways of fixing it? Maybe sneak a call to
>>> wait_for_all_aios() into the right place? I doubt if it's performance
>>> critical, as nobody seems to have ever hit the bug.
>> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
>> (or read my 2nd patch, it should spot the thing)
>
> Well yes, a kludge like that seems a bit safer.
>
> It's somewhat encouraging that we're apparently already doing fput()
> from within keventd (although how frequently?). There might be
> problems with file locking, security code, etc from doing fput() from
> an unexpected thread. And then there are all the usual weird problem
> with using the keventd queues which take a long time to get discovered.
>

Hum... Do you mean "if (in_interrupt())" is not the right test to
perform ?

>
>> If you want to add another kludge to properly fput(req->ki_eventfd),
>> be my guest :-(
>>
>>> Bear in mind that if the bug _is_ real then it's now out there, and
>>> we would like a fix which is usable by 2.6.<two-years-worth>.
>
> The patches are large and scary and it would be a real problem to merge
> them into 2.6.29 at this stage, let alone 2.6.25, etc.
>
> Especially as the code which you sent out appears to be untested:
>

I actually tested it and got a working kernel before sending the patch,
please trust me...

Oh yes, my dev machine always took the slow path then, which is
why I didn't notice.

Thank you for spotting this inverted test.

>> void fput(struct file *file)
>> {
>> - if (atomic_long_dec_and_test(&file->f_count))
>> - __fput(file);
>> + if (atomic_long_dec_and_test(&file->f_count)) {
>> + if (unlikely(!in_interrupt()))
>
> ^
>
>> + fd_defer_queue(NULL, file);
>> + else
>> + __fput(file);
>> + }
>> }
>

2009-03-12 20:25:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Thu, 12 Mar 2009 20:22:06 +0100
Eric Dumazet <[email protected]> wrote:

> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?). There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread. And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
> >
>
> Hum... Do you mean "if (in_interrupt())" is not the right test to
> perform ?

No, in_interrupt() seems appropriate.

It's a bit weak, because it means that the proposed code can't be used
for calling fput() inside spinlock. A better interface would be to
add a new fput_atomic() and let the caller decide which functions
should be called, depending upon what context the caller is running in.


2009-03-13 22:34:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Thu, 12 Mar 2009, Davide Libenzi wrote:

> On Wed, 11 Mar 2009, Andrew Morton wrote:
>
> > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > (or read my 2nd patch, it should spot the thing)
> >
> > Well yes, a kludge like that seems a bit safer.
> >
> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?). There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread. And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
>
> Would it be a huge problem, performance-wise, to stop making ->f_count
> tricks in __aio_put_req, and always offload to fput_work the task of
> releasing the requests?
> If that's a huge problem, IMO the lower impact fix would be to use
> aio_fput_routine to loop into a second list, releasing the eventual
> eventfd file*. There's no need, IMO, to turn the whole fput() into
> IRQ-callable just for this case, when we can contain it into the
> particular KAIO+eventfd usage.

Eric, are you working on this or should I? I missed where the conversation
between you and Andrew is, at this point, WRT this issue.


- Davide

2009-03-13 22:45:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

Davide Libenzi wrote:
> On Thu, 12 Mar 2009, Davide Libenzi wrote:
>
>> On Wed, 11 Mar 2009, Andrew Morton wrote:
>>
>>>> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
>>>> (or read my 2nd patch, it should spot the thing)
>>> Well yes, a kludge like that seems a bit safer.
>>>
>>> It's somewhat encouraging that we're apparently already doing fput()
>>> from within keventd (although how frequently?). There might be
>>> problems with file locking, security code, etc from doing fput() from
>>> an unexpected thread. And then there are all the usual weird problem
>>> with using the keventd queues which take a long time to get discovered.
>> Would it be a huge problem, performance-wise, to stop making ->f_count
>> tricks in __aio_put_req, and always offload to fput_work the task of
>> releasing the requests?
>> If that's a huge problem, IMO the lower impact fix would be to use
>> aio_fput_routine to loop into a second list, releasing the eventual
>> eventfd file*. There's no need, IMO, to turn the whole fput() into
>> IRQ-callable just for this case, when we can contain it into the
>> particular KAIO+eventfd usage.
>
> Eric, are you working on this or should I? I missed where the conversation
> between you and Andrew is, at this point, WRT this issue.

I wish I could, but I currently have too much stress from my day job.

BTW, I could not produce a crash, so this problem is hypothetical for the moment :)

2009-03-13 23:29:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> On Wed, 11 Mar 2009, Andrew Morton wrote:
>
> > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > (or read my 2nd patch, it should spot the thing)
> >
> > Well yes, a kludge like that seems a bit safer.
> >
> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?). There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread. And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
>
> Would it be a huge problem, performance-wise, to stop making ->f_count
> tricks in __aio_put_req, and always offload to fput_work the task of
> releasing the requests?
> If that's a huge problem, IMO the lower impact fix would be to use
> aio_fput_routine to loop into a second list, releasing the eventual
> eventfd file*. There's no need, IMO, to turn the whole fput() into
> IRQ-callable just for this case, when we can contain it into the
> particular KAIO+eventfd usage.
>

Do you really want to see keventd doing umounts and remote flock() calls?
This really needs to be run in a thread that can cope with __long__
waits, unavailable servers, etc...

Trond

2009-03-14 02:06:30

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Fri, 13 Mar 2009, Trond Myklebust wrote:

> On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> > On Wed, 11 Mar 2009, Andrew Morton wrote:
> >
> > > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > > (or read my 2nd patch, it should spot the thing)
> > >
> > > Well yes, a kludge like that seems a bit safer.
> > >
> > > It's somewhat encouraging that we're apparently already doing fput()
> > > from within keventd (although how frequently?). There might be
> > > problems with file locking, security code, etc from doing fput() from
> > > an unexpected thread. And then there are all the usual weird problem
> > > with using the keventd queues which take a long time to get discovered.
> >
> > Would it be a huge problem, performance-wise, to stop making ->f_count
> > tricks in __aio_put_req, and always offload to fput_work the task of
> > releasing the requests?
> > If that's a huge problem, IMO the lower impact fix would be to use
> > aio_fput_routine to loop into a second list, releasing the eventual
> > eventfd file*. There's no need, IMO, to turn the whole fput() into
> > IRQ-callable just for this case, when we can contain it into the
> > particular KAIO+eventfd usage.
> >
>
> Do you really want to see eventd doing umounts and remote flock() calls?
> This really needs to be run in a thread that can cope with __long__
> waits, unavailable servers, etc...

Did I miss something? This wouldn't be (eventually) on keventd's shoulders,
but on the aio work queue (aio_wq).
I guess we could do the same optimization we're already doing for ki_filp,
for ki_eventfd ...



- Davide


---
fs/aio.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c 2009-03-13 18:19:34.000000000 -0700
+++ linux-2.6.mod/fs/aio.c 2009-03-13 18:32:25.000000000 -0700
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
{
assert_spin_locked(&ctx->ctx_lock);

- if (!IS_ERR(req->ki_eventfd))
- fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
list_del(&req->ki_list);
spin_unlock_irq(&fput_lock);

- /* Complete the fput */
- __fput(req->ki_filp);
+ /* Complete the fput(s) */
+ if (req->ki_filp != NULL)
+ __fput(req->ki_filp);
+ if (!IS_ERR(req->ki_eventfd))
+ __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
+ int schedule_putreq = 0;
+
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

assert_spin_locked(&ctx->ctx_lock);

- req->ki_users --;
+ req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
req->ki_cancel = NULL;
req->ki_retry = NULL;

- /* Must be done under the lock to serialise against cancellation.
- * Call this aio_fput as it duplicates fput via the fput_work.
+ /*
+ * Try to optimize the aio and eventfd file* puts, by avoiding to
+ * schedule work in case it is not __fput() time. In normal cases,
+ * we would not be holding the last reference to the file*, so
+ * this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_filp = NULL;
+ if (unlikely(!IS_ERR(req->ki_eventfd))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_eventfd = ERR_PTR(-EINVAL);
+ }
+ if (unlikely(schedule_putreq)) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);

2009-03-14 04:03:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Fri, 2009-03-13 at 18:40 -0700, Davide Libenzi wrote:
> On Fri, 13 Mar 2009, Trond Myklebust wrote:
>
> > On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> > > On Wed, 11 Mar 2009, Andrew Morton wrote:
> > >
> > > > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > > > (or read my 2nd patch, it should spot the thing)
> > > >
> > > > Well yes, a kludge like that seems a bit safer.
> > > >
> > > > It's somewhat encouraging that we're apparently already doing fput()
> > > > from within keventd (although how frequently?). There might be
> > > > problems with file locking, security code, etc from doing fput() from
> > > > an unexpected thread. And then there are all the usual weird problem
> > > > with using the keventd queues which take a long time to get discovered.
> > >
> > > Would it be a huge problem, performance-wise, to stop making ->f_count
> > > tricks in __aio_put_req, and always offload to fput_work the task of
> > > releasing the requests?
> > > If that's a huge problem, IMO the lower impact fix would be to use
> > > aio_fput_routine to loop into a second list, releasing the eventual
> > > eventfd file*. There's no need, IMO, to turn the whole fput() into
> > > IRQ-callable just for this case, when we can contain it into the
> > > particular KAIO+eventfd usage.
> > >
> >
> > Do you really want to see eventd doing umounts and remote flock() calls?
> > This really needs to be run in a thread that can cope with __long__
> > waits, unavailable servers, etc...
>
> Did I miss something? This wouldn't be (eventually) on keventd shoulders,
> but on aio work queue (aio_wq).
> I guess we could do the same optimization we're already doing for ki_filp,
> for ki_eventfd ...

Last I checked, a call to schedule_work(&fddef->wq) would still run
fd_defer_queue() under keventd.

That said, even if you were to run it under aio_wq, the argument remains
the same: do you really want to add potentially long lasting sleeps onto
a work queue that was designed to service fast i/o requests?

Trond

2009-03-14 14:33:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] fs: fput() can be called from interrupt context

On Sat, 14 Mar 2009, Trond Myklebust wrote:

> > Did I miss something? This wouldn't be (eventually) on keventd shoulders,
> > but on aio work queue (aio_wq).
> > I guess we could do the same optimization we're already doing for ki_filp,
> > for ki_eventfd ...
>
> Last I checked, a call to schedule_work(&fddef->wq) would still run
> fd_defer_queue() under keventd.

Oh, you were commenting on Eric's patch-set. I thought you were talking
about the current status (which my reply to Andrew was hinting at), and was
wondering for a while where fd_defer_queue() came from.



> That said, even if you were to run it under aio_wq, the argument remains
> the same: do you really want to add potentially long lasting sleeps onto
> a work queue that was designed to service fast i/o requests?

No, and if you roll back, you couldn't have missed that mine was a
question about the f_count direct manipulation/optimization. We can do the
same optimization for eventfd, although both will fall into the aio kthread
in any case if we're holding the last instance of the file* (which I
guess we can consider an unlikely case).
Dunno how worthwhile it would be to add a new NCPU*thread(s) just to handle
delayed fputs.


- Davide

2009-03-15 01:38:16

by Davide Libenzi

[permalink] [raw]
Subject: [patch] eventfd - remove fput() call from possible IRQ context

The following patch removes a possible source of fput() calls from inside
IRQ context. Like Eric, I wasn't able to reproduce an fput() call
from IRQ context, but conceptually the bug is there.
This patch adds an optimization on ->ki_eventfd similar to the one we
already do on ->ki_filp. Playing with ->f_count directly is not pretty
in general, but the alternative here would be to add a brand new delayed
fput() infrastructure, which I'm not sure is worth it.


Signed-off-by: Davide Libenzi <[email protected]>


- Davide


---
fs/aio.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c 2009-03-13 18:19:34.000000000 -0700
+++ linux-2.6.mod/fs/aio.c 2009-03-13 18:32:25.000000000 -0700
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
{
assert_spin_locked(&ctx->ctx_lock);

- if (!IS_ERR(req->ki_eventfd))
- fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
list_del(&req->ki_list);
spin_unlock_irq(&fput_lock);

- /* Complete the fput */
- __fput(req->ki_filp);
+ /* Complete the fput(s) */
+ if (req->ki_filp != NULL)
+ __fput(req->ki_filp);
+ if (!IS_ERR(req->ki_eventfd))
+ __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
+ int schedule_putreq = 0;
+
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

assert_spin_locked(&ctx->ctx_lock);

- req->ki_users --;
+ req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
req->ki_cancel = NULL;
req->ki_retry = NULL;

- /* Must be done under the lock to serialise against cancellation.
- * Call this aio_fput as it duplicates fput via the fput_work.
+ /*
+ * Try to optimize the aio and eventfd file* puts, by avoiding to
+ * schedule work in case it is not __fput() time. In normal cases,
+ * we would not be holding the last reference to the file*, so
+ * this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_filp = NULL;
+ if (unlikely(!IS_ERR(req->ki_eventfd))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_eventfd = ERR_PTR(-EINVAL);
+ }
+ if (unlikely(schedule_putreq)) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);

2009-03-15 17:44:57

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context

On Sat, Mar 14, 2009 at 06:36:34PM -0700, Davide Libenzi wrote:
> The following patch removes a possible source of fput() calls from inside
> IRQ context. Like Eric, I wasn't able to reproduce an fput() call
> from IRQ context, but conceptually the bug is there.
> This patch adds an optimization on ->ki_eventfd similar to the one we
> already do on ->ki_filp. Playing with ->f_count directly is not pretty
> in general, but the alternative here would be to add a brand new delayed
> fput() infrastructure, which I'm not sure is worth it.

This looks reasonably sane; the only concern I have with it is that I think
it logically makes more sense to use the same convention for ki_filp and
ki_eventfd, as the difference between IS_ERR and checking for NULL is a bit
confusing. Aside from that, it looks like it should fix the problem
correctly.

-ben

2009-03-15 20:09:26

by Davide Libenzi

[permalink] [raw]
Subject: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

The following patch removes a possible source of fput() calls from inside
IRQ context. Like Eric, I wasn't able to reproduce an fput() call
from IRQ context, but conceptually the bug is there.
This patch adds an optimization on ->ki_eventfd similar to the one we
already do on ->ki_filp. Playing with ->f_count directly is not pretty
in general, but the alternative here would be to add a brand new delayed
fput() infrastructure, which I'm not sure is worth it.



On Sun, 15 Mar 2009, Benjamin LaHaise wrote:

> This looks reasonably sane; the only concern I have with it is that I think
> it logically makes more sense to use the same convention for ki_filp and
> ki_eventfd, as the difference between IS_ERR and checking for NULL is a bit
> confusing. Aside from that, it looks like it should fix the problem
> correctly.

Makes sense.



Signed-off-by: Davide Libenzi <[email protected]>


- Davide


---
fs/aio.c | 37 +++++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 10 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c 2009-03-14 09:24:12.000000000 -0700
+++ linux-2.6.mod/fs/aio.c 2009-03-15 12:54:10.000000000 -0700
@@ -443,7 +443,7 @@ static struct kiocb *__aio_get_req(struc
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
- req->ki_eventfd = ERR_PTR(-EINVAL);
+ req->ki_eventfd = NULL;

/* Check if the completion queue has enough free space to
* accept an event from this io.
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
{
assert_spin_locked(&ctx->ctx_lock);

- if (!IS_ERR(req->ki_eventfd))
- fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
list_del(&req->ki_list);
spin_unlock_irq(&fput_lock);

- /* Complete the fput */
- __fput(req->ki_filp);
+ /* Complete the fput(s) */
+ if (req->ki_filp != NULL)
+ __fput(req->ki_filp);
+ if (req->ki_eventfd != NULL)
+ __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
+ int schedule_putreq = 0;
+
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

assert_spin_locked(&ctx->ctx_lock);

- req->ki_users --;
+ req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
req->ki_cancel = NULL;
req->ki_retry = NULL;

- /* Must be done under the lock to serialise against cancellation.
- * Call this aio_fput as it duplicates fput via the fput_work.
+ /*
+ * Try to optimize the aio and eventfd file* puts, by avoiding to
+ * schedule work in case it is not __fput() time. In normal cases,
+ * we wouldn not be holding the last reference to the file*, so
+ * this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_filp = NULL;
+ if (unlikely(req->ki_eventfd != NULL)) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_eventfd = NULL;
+ }
+ if (unlikely(schedule_putreq)) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
@@ -1009,7 +1025,7 @@ int aio_complete(struct kiocb *iocb, lon
* eventfd. The eventfd_signal() function is safe to be called
* from IRQ context.
*/
- if (!IS_ERR(iocb->ki_eventfd))
+ if (iocb->ki_eventfd != NULL)
eventfd_signal(iocb->ki_eventfd, 1);

put_rq:
@@ -1608,6 +1624,7 @@ static int io_submit_one(struct kioctx *
req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventfd);
+ req->ki_eventfd = NULL;
goto out_put_req;
}
}

2009-03-16 17:25:43

by Jamie Lokier

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Davide Libenzi wrote:
> + if (unlikely(req->ki_eventfd != NULL)) {

Is this really unlikely? Having an eventfd seems to be the modern,
cool, recommended way to do AIO.

The other unlikelies look ok.

-- Jamie

2009-03-16 18:37:16

by Davide Libenzi

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

On Mon, 16 Mar 2009, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > + if (unlikely(req->ki_eventfd != NULL)) {
>
> Is this really unlikely? Having an eventfd seems to be the modern,
> cool, recommended way to do AIO.

I don't have statistics at hand, but I guess that nowadays AIO+eventfd is
still not very popular. It doesn't make much of a difference either way,
so we could drop the unlikely().



- Davide

2009-03-18 14:24:42

by Jeff Moyer

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Davide Libenzi <[email protected]> writes:

> The following patch remove a possible source of fput() call from inside
> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call
> from IRQ context, but conceptually the bug is there.

I've attached a test program which can reproduce the fput call in
interrupt context. It's a modified version of the eventfd test that
Rusty wrote for the libaio test harness. I verified that fput was in
fact being called in interrupt context by using systemtap to print out
the "thread_indent" of fput calls, and observing a "swapper(0)" in the
output. After applying your patch, I confirmed that __fput is no longer
called from interrupt context. Strangely enough, I never did get any
output from the might_sleep in __fput. I can't explain that.

I have some minor comments inlined below.

> This patch adds an optimization similar to the one we already do on
> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty
> in general, but the alternative here would be to add a brand new delayed
> fput() infrastructure, that I'm not sure is worth it.
>
> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>
>> This looks reasonably sane, the only concern I have with it is that I think
>> it logically makes more sense to use the same convention for fi_filp and
>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit
>> confusing. Aside from that, it looks like it should fix the problem
>> correctly.
>
> Makes sense.
>
> Signed-off-by: Davide Libenzi <[email protected]>
>
>
> - Davide
>
>
> ---
> fs/aio.c | 37 +++++++++++++++++++++++++++----------
> 1 file changed, 27 insertions(+), 10 deletions(-)
>
> Index: linux-2.6.mod/fs/aio.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/aio.c 2009-03-14 09:24:12.000000000 -0700
> +++ linux-2.6.mod/fs/aio.c 2009-03-15 12:54:10.000000000 -0700
...
> @@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
> */
> static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
> {
> + int schedule_putreq = 0;
> +
> dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
> req, atomic_long_read(&req->ki_filp->f_count));
>
> assert_spin_locked(&ctx->ctx_lock);
>
> - req->ki_users --;
> + req->ki_users--;
> BUG_ON(req->ki_users < 0);
> if (likely(req->ki_users))
> return 0;
> @@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
> req->ki_cancel = NULL;
> req->ki_retry = NULL;
>
> - /* Must be done under the lock to serialise against cancellation.
> - * Call this aio_fput as it duplicates fput via the fput_work.
> + /*
> + * Try to optimize the aio and eventfd file* puts, by avoiding to
> + * schedule work in case it is not __fput() time. In normal cases,
> + * we wouldn not be holding the last reference to the file*, so
^^^^^^^^^^
tyop

> + * this function will be executed w/out any aio kthread wakeup.
> */
> - if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
> + if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
> + schedule_putreq++;
> + else
> + req->ki_filp = NULL;
> + if (unlikely(req->ki_eventfd != NULL)) {
> + if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
> + schedule_putreq++;
> + else
> + req->ki_eventfd = NULL;
> + }

I agree with Jamie that you should get rid of the unlikely.

Thanks for taking care of this, Davide.

Signed-off-by: Jeff Moyer <[email protected]>

Cheers,
Jeff

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <errno.h>
#include <assert.h>
#include <sys/eventfd.h>
#include <libaio.h>

int
main(int argc, char **argv)
{
#define SIZE (256*1024*1024)
char *buf;
struct io_event io_event;
struct iocb iocb;
struct iocb *iocbs[] = { &iocb };
int rwfd, efd;
int res;
io_context_t io_ctx;

efd = eventfd(0, 0);
if (efd < 0) {
perror("eventfd");
exit(1);
}

rwfd = open("rwfile", O_RDWR|O_DIRECT); assert(rwfd != -1);
if (posix_memalign((void **)&buf, getpagesize(), SIZE) != 0) {
perror("posix_memalign");
exit(1);
}
memset(buf, 0x42, SIZE);

/* Write test. */
res = io_queue_init(1024, &io_ctx); assert(res == 0);
io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
io_set_eventfd(&iocb, efd);
res = io_submit(io_ctx, 1, iocbs); assert(res == 1);

/* Now close the eventfd so that AIO has the last reference */
close(efd);

/* Keep this process around so that the aio subsystem does not hold
* the last reference on the rwfd, otherwise the really_put_req will
* be called from process context */
res = io_getevents(io_ctx, 1, 1, &io_event, NULL);
if (res != 1) {
if (res < 0) {
errno = -res;
perror("io_getevents");
} else
printf("io_getevents did not return 1 event after "
"closing eventfd\n");
exit(1);
}
assert(io_event.res == SIZE);
printf("eventfd write test [SUCCESS]\n");

return 0;
}
/*
* Local variables:
* c-basic-offset: 8
* compile-command: "gcc -o eventfd-in-irq eventfd-in-irq.c -laio -g3"
* End:
*/

2009-03-18 14:47:38

by Davide Libenzi

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

On Wed, 18 Mar 2009, Jeff Moyer wrote:

> I agree with Jamie that you should get rid of the unlikely.

Ok, will repost today with typo and unlikely fixed.


- Davide

2009-03-18 15:00:27

by Eric Dumazet

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Jeff Moyer a écrit :
> Davide Libenzi <[email protected]> writes:
>
>> The following patch remove a possible source of fput() call from inside
>> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call
>> from IRQ context, but conceptually the bug is there.
>
> I've attached a test program which can reproduce the fput call in
> interrupt context. It's a modified version of the eventfd test that
> Rusty wrote for the libaio test harness. I verified that fput was in
> fact being called in interrupt context by using systemtap to print out
> the "thead_indent" of fput calls, and observing a "swapper(0)" in the
> output. After applying your patch, I confirmed that __fput is no longer
> called from interrupt context. Strangely enough, I never did get any
> output from the might_sleep in __fput. I can't explain that.
>
> I have some minor comments inlined below.
>
>> This patch adds an optimization similar to the one we already do on
>> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty
>> in general, but the alternative here would be to add a brand new delayed
>> fput() infrastructure, that I'm not sure is worth it.
>>
>> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>>
>>> This looks reasonably sane, the only concern I have with it is that I think
>>> it logically makes more sense to use the same convention for fi_filp and
>>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit
>>> confusing. Aside from that, it looks like it should fix the problem
>>> correctly.
>> Makes sense.
>>
>> Signed-off-by: Davide Libenzi <[email protected]>
>>
>>
>> - Davide
>>
>>
>> ---
>> fs/aio.c | 37 +++++++++++++++++++++++++++----------
>> 1 file changed, 27 insertions(+), 10 deletions(-)
>>
>> Index: linux-2.6.mod/fs/aio.c
>> ===================================================================
>> --- linux-2.6.mod.orig/fs/aio.c 2009-03-14 09:24:12.000000000 -0700
>> +++ linux-2.6.mod/fs/aio.c 2009-03-15 12:54:10.000000000 -0700
> ...
>> @@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
>> */
>> static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
>> {
>> + int schedule_putreq = 0;
>> +
>> dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
>> req, atomic_long_read(&req->ki_filp->f_count));
>>
>> assert_spin_locked(&ctx->ctx_lock);
>>
>> - req->ki_users --;
>> + req->ki_users--;
>> BUG_ON(req->ki_users < 0);
>> if (likely(req->ki_users))
>> return 0;
>> @@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
>> req->ki_cancel = NULL;
>> req->ki_retry = NULL;
>>
>> - /* Must be done under the lock to serialise against cancellation.
>> - * Call this aio_fput as it duplicates fput via the fput_work.
>> + /*
>> + * Try to optimize the aio and eventfd file* puts, by avoiding to
>> + * schedule work in case it is not __fput() time. In normal cases,
>> + * we wouldn not be holding the last reference to the file*, so
> ^^^^^^^^^^
> tyop
>
>> + * this function will be executed w/out any aio kthread wakeup.
>> */
>> - if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
>> + if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
>> + schedule_putreq++;
>> + else
>> + req->ki_filp = NULL;
>> + if (unlikely(req->ki_eventfd != NULL)) {
>> + if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
>> + schedule_putreq++;
>> + else
>> + req->ki_eventfd = NULL;
>> + }
>
> I agree with Jamie that you should get rid of the unlikely.
>
> Thanks for taking care of this, Davide.
>
> Signed-off-by: Jeff Moyer <[email protected]>
>
> Cheers,
> Jeff
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <string.h>
> #include <sys/types.h>
> #include <errno.h>
> #include <assert.h>
> #include <sys/eventfd.h>
> #include <libaio.h>
>
> int
> main(int argc, char **argv)
> {
> #define SIZE (256*1024*1024)
> char *buf;
> struct io_event io_event;
> struct iocb iocb;
> struct iocb *iocbs[] = { &iocb };
> int rwfd, efd;
> int res;
> io_context_t io_ctx;
>
> efd = eventfd(0, 0);
> if (efd < 0) {
> perror("eventfd");
> exit(1);
> }
>
> rwfd = open("rwfile", O_RDWR|O_DIRECT); assert(rwfd != -1);
> if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
> perror("posix_memalign");
> exit(1);
> }
> memset(buf, 0x42, SIZE);
>
> /* Write test. */
> res = io_queue_init(1024, &io_ctx); assert(res == 0);
> io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
> io_set_eventfd(&iocb, efd);
> res = io_submit(io_ctx, 1, iocbs); assert(res == 1);

Yes, but io_submit() is blocking, so your close(efd) will come after the release in fs/aio.c.

I suggest you start a thread just before io_submit() and give it this work
(passing &efd as the thread argument, since efd is local to main):

void *thread_work(void *arg)
{
	int efd = *(int *)arg;

	usleep(10000);
	close(efd);
	return NULL;
}

>
> /* Now close the eventfd so that AIO has the last reference */
> close(efd);
>
> /* Keep this process around so that the aio subsystem does not hold
> * the last reference on the rwfd, otherwise the really_put_req will
> * be called from process context */
> res = io_getevents(io_ctx, 1, 1, &io_event, NULL);
> if (res != 1) {
> if (res < 0) {
> errno = -res;
> perror("io_getevents");
> } else
> printf("io_getevents did not return 1 event after "
> "closing eventfd\n");
> exit(1);
> }
> assert(io_event.res == SIZE);
> printf("eventfd write test [SUCCESS]\n");
>
> return 0;
> }
> /*
> * Local variables:
> * c-basic-offset: 8
> * compile-command: "gcc -o eventfd-in-irq eventfd-in-irq.c -laio -g3"
> * End:
> */
>
>

2009-03-18 15:27:23

by Jeff Moyer

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Eric Dumazet <[email protected]> writes:


>> rwfd = open("rwfile", O_RDWR|O_DIRECT); assert(rwfd != -1);
>> if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>> perror("posix_memalign");
>> exit(1);
>> }
>> memset(buf, 0x42, SIZE);
>>
>> /* Write test. */
>> res = io_queue_init(1024, &io_ctx); assert(res == 0);
>> io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>> io_set_eventfd(&iocb, efd);
>> res = io_submit(io_ctx, 1, iocbs); assert(res == 1);
>
> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c

I'm not sure why you think io_submit is blocking. In my setup, I
preallocated the file, and the test code opens it with O_DIRECT. So,
io_submit should return after the dio is issued, and the I/O size is
large enough that it should still be outstanding when io_submit returns.

Cheers,
Jeff

2009-03-18 15:44:55

by Eric Dumazet

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Jeff Moyer a écrit :
> Eric Dumazet <[email protected]> writes:
>
>
>>> rwfd = open("rwfile", O_RDWR|O_DIRECT); assert(rwfd != -1);
>>> if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>>> perror("posix_memalign");
>>> exit(1);
>>> }
>>> memset(buf, 0x42, SIZE);
>>>
>>> /* Write test. */
>>> res = io_queue_init(1024, &io_ctx); assert(res == 0);
>>> io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>>> io_set_eventfd(&iocb, efd);
>>> res = io_submit(io_ctx, 1, iocbs); assert(res == 1);
>> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c
>
> I'm not sure why you think io_submit is blocking. In my setup, I
> preallocated the file, and the test code opens it with O_DIRECT. So,
> io_submit should return after the dio is issued, and the I/O size is
> large enough that it should still be outstanding when io_submit returns.

Hmm... io_submit() is a blocking syscall; that is how I understood fs/aio.c.

Then, using strace -tt -T on your program, I can confirm it is quite a long syscall (3.5 seconds,
about the time needed to write a 256 MB file on my disk ;) )

16:33:15.861301 eventfd(0) = 3 <0.000010>
16:33:15.861344 open("rwfile", O_RDWR|O_DIRECT) = 4 <0.000010>
16:33:15.861444 mmap(NULL, 268443648, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08eec55000 <0.000008>
16:33:16.100040 io_setup(1024, {139676618563584}) = 0 <0.000047>
16:33:16.100149 io_submit(139676618563584, 1, {...}) = 1 <3.558085>
16:33:19.658292 io_getevents(139676618563584, 1, 1, {...}NULL) = 1 <0.258104>
16:33:19.916486 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0 <0.000007>
16:33:19.916559 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08ff3b9000 <0.000009>
16:33:19.916607 write(1, "eventfd write test [SUCCESS]\n"..., 29eventfd write test [SUCCESS]
) = 29 <0.000008>
16:33:19.916653 exit_group(0) = ?

2009-03-18 16:14:38

by Jeff Moyer

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)

Eric Dumazet <[email protected]> writes:

> Jeff Moyer a écrit :
>> Eric Dumazet <[email protected]> writes:
>>
>>
>>>> rwfd = open("rwfile", O_RDWR|O_DIRECT); assert(rwfd != -1);
>>>> if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>>>> perror("posix_memalign");
>>>> exit(1);
>>>> }
>>>> memset(buf, 0x42, SIZE);
>>>>
>>>> /* Write test. */
>>>> res = io_queue_init(1024, &io_ctx); assert(res == 0);
>>>> io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>>>> io_set_eventfd(&iocb, efd);
>>>> res = io_submit(io_ctx, 1, iocbs); assert(res == 1);
>>> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c
>>
>> I'm not sure why you think io_submit is blocking. In my setup, I
>> preallocated the file, and the test code opens it with O_DIRECT. So,
>> io_submit should return after the dio is issued, and the I/O size is
>> large enough that it should still be outstanding when io_submit returns.
>
> Hmm.. io_submit() is a blocking syscall, this is how I understood fs/aio.c

Hi, Eric,

The whole point of io_submit is to allow you to submit I/O without
waiting for it. There are known cases where io_submit will block, of
course, such as when we run out of request descriptors. See the
io_submit.stp script for some examples.[1]

Now, I admit I was testing on an SSD, so I didn't actually notice the
time it took for the 256MB write (!!!). I tried the reproducer I posted
on my F9 box, and here is the output I get:

BUG: sleeping function called from invalid context at fs/file_table.c:262
in_atomic():1, irqs_disabled():1
Pid: 0, comm: swapper Not tainted 2.6.27.15-78.2.23.fc9.x86_64 #1

Call Trace:
<IRQ> [<ffffffff8103892e>] __might_sleep+0xe7/0xec
[<ffffffff810bfa86>] __fput+0x35/0x16d
[<ffffffff810bfbd3>] fput+0x15/0x17
[<ffffffff810d71bb>] really_put_req+0x34/0x9c
[<ffffffff810d72f0>] __aio_put_req+0xcd/0xda
[<ffffffff810d7f77>] aio_complete+0x15d/0x19f
[<ffffffff810e7016>] dio_bio_end_aio+0x8e/0xa0
[<ffffffff810e32ab>] bio_endio+0x2a/0x2c
[<ffffffff8113beae>] req_bio_endio+0x9d/0xba
[<ffffffff8113c073>] __end_that_request_first+0x1a8/0x2b5
[<ffffffff8113cb89>] blk_end_io+0x2f/0xa9
[<ffffffff8113cc2f>] blk_end_request+0xe/0x10
[<ffffffffa005d30b>] scsi_end_request+0x30/0x90 [scsi_mod]
[<ffffffffa005d9e9>] scsi_io_completion+0x1aa/0x3b3 [scsi_mod]
[<ffffffffa0057658>] scsi_finish_command+0xde/0xe7 [scsi_mod]
[<ffffffffa005de68>] scsi_softirq_done+0xe4/0xed [scsi_mod]
[<ffffffff8113baa8>] blk_done_softirq+0x7e/0x8e
[<ffffffff81045146>] __do_softirq+0x7e/0x10c
[<ffffffff81011bfc>] call_softirq+0x1c/0x28
[<ffffffff81012e06>] do_softirq+0x4d/0xb0
[<ffffffff81044d1b>] irq_exit+0x4e/0x9d
[<ffffffff81013122>] do_IRQ+0x147/0x169
[<ffffffff81010963>] ret_from_intr+0x0/0x2e
<EOI> [<ffffffff810173a9>] ? mwait_idle+0x3e/0x4f
[<ffffffff810173a0>] ? mwait_idle+0x35/0x4f
[<ffffffff8100f2a7>] ? cpu_idle+0xb2/0x10b
[<ffffffff812af35d>] ? rest_init+0x61/0x63

So, I think it is a valid reproducer as it stands.

> Then, using strace -tt -T on your program, I can confirm it is quite a long syscall (3.5 seconds,
> about time needed to write a 256 MB file on my disk ;) )

Did you preallocate the file?

Cheers,
Jeff

[1] http://sourceware.org/systemtap/wiki/ScriptsTools?action=AttachFile&do=view&target=io_submit.stp

2009-03-18 17:25:57

by Davide Libenzi

Subject: [patch] eventfd - remove fput() call from possible IRQ context (3rd rev)

The following patch removes a possible source of fput() calls from inside
IRQ context. Like Eric, I wasn't able to reproduce an fput() call
from IRQ context, but Jeff was able to with the attached test
program. Independently of this, the bug is conceptually there, so we
might be better off fixing it.
This patch adds an optimization similar to the one we already do on
->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty
in general, but the alternative here would be to add a brand new delayed
fput() infrastructure, that I'm not sure is worth it.


On Sun, 15 Mar 2009, Benjamin LaHaise wrote:

> This looks reasonably sane, the only concern I have with it is that I think
> it logically makes more sense to use the same convention for fi_filp and
> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit
> confusing. Aside from that, it looks like it should fix the problem
> correctly.

Makes sense.



Signed-off-by: Davide Libenzi <[email protected]>


- Davide


---
fs/aio.c | 37 +++++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 10 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c 2009-03-15 13:11:56.000000000 -0700
+++ linux-2.6.mod/fs/aio.c 2009-03-18 10:18:21.000000000 -0700
@@ -443,7 +443,7 @@ static struct kiocb *__aio_get_req(struc
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
- req->ki_eventfd = ERR_PTR(-EINVAL);
+ req->ki_eventfd = NULL;

/* Check if the completion queue has enough free space to
* accept an event from this io.
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
{
assert_spin_locked(&ctx->ctx_lock);

- if (!IS_ERR(req->ki_eventfd))
- fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
list_del(&req->ki_list);
spin_unlock_irq(&fput_lock);

- /* Complete the fput */
- __fput(req->ki_filp);
+ /* Complete the fput(s) */
+ if (req->ki_filp != NULL)
+ __fput(req->ki_filp);
+ if (req->ki_eventfd != NULL)
+ __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
+ int schedule_putreq = 0;
+
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

assert_spin_locked(&ctx->ctx_lock);

- req->ki_users --;
+ req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
req->ki_cancel = NULL;
req->ki_retry = NULL;

- /* Must be done under the lock to serialise against cancellation.
- * Call this aio_fput as it duplicates fput via the fput_work.
+ /*
+ * Try to optimize the aio and eventfd file* puts, by avoiding to
+ * schedule work in case it is not __fput() time. In normal cases,
+ * we would not be holding the last reference to the file*, so
+ * this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_filp = NULL;
+ if (req->ki_eventfd != NULL) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+ schedule_putreq++;
+ else
+ req->ki_eventfd = NULL;
+ }
+ if (unlikely(schedule_putreq)) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
@@ -1009,7 +1025,7 @@ int aio_complete(struct kiocb *iocb, lon
* eventfd. The eventfd_signal() function is safe to be called
* from IRQ context.
*/
- if (!IS_ERR(iocb->ki_eventfd))
+ if (iocb->ki_eventfd != NULL)
eventfd_signal(iocb->ki_eventfd, 1);

put_rq:
@@ -1608,6 +1624,7 @@ static int io_submit_one(struct kioctx *
req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventfd);
+ req->ki_eventfd = NULL;
goto out_put_req;
}
}


Attachments:
eventfd-irqctx-trigger.c (1.52 kB)

2009-03-18 17:35:44

by Jeff Moyer

Subject: Re: [patch] eventfd - remove fput() call from possible IRQ context (3rd rev)

Davide Libenzi <[email protected]> writes:

> The following patch remove a possible source of fput() call from inside
> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call
> from IRQ context, but Jeff said he was able to, with the attached test
> program. Independently from this, the bug is conceptually there, so we
> might be better off fixing it.
> This patch adds an optimization similar to the one we already do on
> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty
> in general, but the alternative here would be to add a brand new delayed
> fput() infrastructure, that I'm not sure is worth it.
>
>
> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>
>> This looks reasonably sane, the only concern I have with it is that I think
>> it logically makes more sense to use the same convention for fi_filp and
>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit
>> confusing. Aside from that, it looks like it should fix the problem
>> correctly.
>
> Makes sense.
>
>
>
> Signed-off-by: Davide Libenzi <[email protected]>

Signed-off-by: Jeff Moyer <[email protected]>