2006-12-27 15:34:45

by Suparna Bhattacharya

Subject: [RFC] Heads up on a series of AIO patchsets


Here is a quick attempt to summarize where we are heading with a bunch of
AIO patches that I'll be posting over the next few days. Because a few of
these patches have been hanging around for a while, going through bursts
of iterations from time to time and falling dormant in between, the intent
of this note is to help pull things together into a coherent picture, so
that folks can comment on the patches and we can arrive at a decision of
some sort.

Native linux aio (i.e. using libaio) is properly supported (in the sense of
being asynchronous) only for files opened with O_DIRECT, which actually
suffices for a major (and most visible) user of AIO, i.e. databases.

However, for other types of users, e.g. Samba and other applications which
use POSIX AIO, there have been several issues outstanding for a while:

(1) The filesystem AIO patchset attempts to address one part of the problem,
which is to make regular file IO (without O_DIRECT) asynchronous (mainly
the case of reads of uncached or partially cached files, and O_SYNC writes).

(2) Most of these other applications need the ability to process both
network events (epoll) and disk file AIO in the same loop. With POSIX AIO
they could at least sort of do this using signals (yeah, and all associated
issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
modifications from Jeff Moyer and me) addresses this problem for native
linux aio in a simple manner. Tridge has written a test harness to
try out the Samba4 event library modifications to use this. Jeff Moyer
has a modified version of pipetest for comparison. (A rough usage sketch
follows at the end of this list.)

(3) For glibc POSIX AIO to switch to using native AIO (instead of simulation
with threads) kernel changes are needed to ensure aio sigevent notification
and efficient listio support. Sebestian Dugue's patches for aio sigevent
notifications has undergone several review iterations and seems to be
in good shape now. His patch for lio_listio is pending discussion
on whether to implement it as a separate syscall rather than an additional
iocb command. Bharata B Rao has posted a patch with the syscall variation
for review.

(4) If glibc POSIX AIO switches completely to using native AIO then it
would need basic AIO support for various file types - including sockets,
pipes etc. Since it no longer will be simulating asynchronous behaviour
with threads, it expects the underlying implementation to be asynchronous.
This is still an issue with native linux AIO, but I now think the problem
is tractable without a lot of additional work. While (1) helps the case
for regular files, (2) now provides us an alternative infrastructure to
simulate this in kernel using async epoll and O_NONBLOCK for all pollable
fds, i.e. sockets, pipes etc. This should be good enough for working
POSIX AIO.

(5) That leaves just one more todo - implementing aio_fsync() in kernel.
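
To make the combined epoll + disk AIO loop from (2) concrete, here is a
rough userspace sketch. The plain libaio calls (io_setup, io_prep_pread,
io_submit, io_getevents) are standard; the IO_CMD_EPOLL_WAIT opcode and
the way its iocb is filled in reflect my reading of the proposed patch,
need the patched headers, and should be treated as illustrative rather
than final.

#include <libaio.h>
#include <string.h>
#include <sys/epoll.h>

#define MAXEV	64

void event_loop(int epfd, int datafd, void *buf, size_t len)
{
	io_context_t ctx = 0;
	struct epoll_event evs[MAXEV];
	struct iocb ep_iocb, rd_iocb, *iocbs[2];
	struct io_event done[2];
	int i, n;

	io_setup(8, &ctx);

	/* ordinary buffered read - no O_DIRECT needed with the fsaio patches */
	io_prep_pread(&rd_iocb, datafd, buf, len, 0);

	/* epoll wait submitted as an aio opcode (proposed interface) */
	memset(&ep_iocb, 0, sizeof(ep_iocb));
	ep_iocb.aio_fildes = epfd;
	ep_iocb.aio_lio_opcode = IO_CMD_EPOLL_WAIT;	/* from the patch */
	ep_iocb.u.c.buf = evs;		/* where ready events are returned */
	ep_iocb.u.c.nbytes = MAXEV;	/* maxevents */

	iocbs[0] = &ep_iocb;
	iocbs[1] = &rd_iocb;
	io_submit(ctx, 2, iocbs);

	/* a single loop collects both kinds of completions */
	n = io_getevents(ctx, 1, 2, done, NULL);
	for (i = 0; i < n; i++) {
		if (done[i].obj == &ep_iocb) {
			/* done[i].res ready epoll events are in evs[] */
		} else {
			/* the buffered read finished: done[i].res bytes */
		}
	}
	io_destroy(ctx);
}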

Please note that none of this work is in conflict with kevent development.
In fact it is my hope that progress made in getting these pieces of the
puzzle in place would also help us along the long term goal of eventual
convergence.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India


2006-12-27 16:25:41

by Christoph Hellwig

Subject: Re: [RFC] Heads up on a series of AIO patchsets

On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.

The real question here is which interface we want people to use for these
"combined" applications. Evgeny is heavily pushing kevent for this while
other seem to prefer integration epoll into the aio interface. (1)

I must admit that kevent seems to be the cleaner way to support this,
although I see some advantages for the aio variant. I do think however
that we should not actively promote two different interfaces long term.


(1) note that there is another problem with the current kevent interface,
and that is that it duplicates the event infrastructure for its
underlying subsystems instead of reusing existing code (e.g.
inotify, epoll, dio-aio). If we want kevent to be _the_ unified
event system for Linux we need people to help out with straightening
out these event providers, as Evgeniy seems to be unwilling/unable to
do the work himself, and the duplication is simply not acceptable.

2006-12-27 16:58:06

by Ingo Molnar

Subject: Re: [RFC] Heads up on a series of AIO patchsets


* Christoph Hellwig <[email protected]> wrote:

> The real question here is which interface we want people to use for
> these "combined" applications. Evgeny is heavily pushing kevent for
> this while other seem to prefer integration epoll into the aio
> interface. (1)
>
> I must admit that kevent seems to be the cleaner way to support this,
> although I see some advantages for the aio variant. I do think
> however that we should not actively promote two different interfaces
> long term.

i see no fundamental disadvantage from doing both. That way the 'market'
of applications will vote. (we have 2 other fundamental types available
as well: sync IO and poll() based IO - so it's not like we have the
choice between 2 or 1 variant, we have the choice between 4 or 3
variants)

> (1) note that there is another problem with the current kevent
> interface, and that is that it duplicates the event infrastructure
> for its underlying subsystems instead of reusing existing code
> (e.g. inotify, epoll, dio-aio). If we want kevent to be _the_
> unified event system for Linux we need people to help out with
> straightening out these event providers, as Evgeniy seems to be
> unwilling/unable to do the work himself and the duplication is
> simply not acceptable.

yeah. The internal machinery should be as unified as possible - but
different sets of APIs can be offered, to make it easy for people to
extend their existing apps in the most straightforward way.

(In fact i'd like to see all the 'poll table' code unified into
this as well, if possible - it does not really "poll" anything, it's an
event infrastructure as well, used via the naive select() and poll()
syscalls. We should fix that naming mistake.)

Ingo

2006-12-27 17:22:04

by Ingo Molnar

Subject: Re: [RFC] Heads up on a series of AIO patchsets


* Ingo Molnar <[email protected]> wrote:

> > unified event system for Linux we need people to help out with
> > straightening out these event providers, as Evgeniy seems to be
> > unwilling/unable to do the work himself and the duplication is
> > simply not acceptable.
>
> yeah. The internal machinery should be as unified as possible - but
> different sets of APIs can be offered, to make it easy for people to
> extend their existing apps in the most straightforward way.

just to expand on this: i don't think this should be an impediment to the
POSIX AIO patches. We should get some movement into this and should give
the capability to glibc and applications. Kernel-internal unification is
something we are pretty good at doing after the fact. (and if any of the
APIs dies or gets very uncommon we know in which direction to unify)

Ingo

2006-12-28 08:18:48

by Suparna Bhattacharya

Subject: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write


Currently native linux AIO is properly supported (in the sense of
actually being asynchronous) only for files opened with O_DIRECT.
While this suffices for a major (and most visible) user of AIO, i.e. databases,
other types of users like Samba require AIO support for regular file IO.
Also, for glibc POSIX AIO to be able to switch to using native AIO instead
of the current simulation using threads, it needs/expects asynchronous
behaviour for both O_DIRECT and buffered file AIO.

This patchset implements changes to make filesystem AIO read
and write asynchronous for the non O_DIRECT case. This is mainly
relevant in the case of reads of uncached or partially cached files, and
O_SYNC writes.

Instead of translating regular IO to [AIO + wait], it translates AIO
to [regular IO - blocking + retries]. The intent of implementing it
this way is to avoid modifying or slowing down normal usage, by keeping
it pretty much the way it is without AIO, while avoiding code duplication.
Instead we make AIO vs regular IO checks inside io_schedule(), i.e. at
the blocking points. The low-level unit of distinction is a wait queue
entry, which in the AIO case is contained in an iocb and in the
synchronous IO case is associated with the calling task.

The core idea is that we complete as much IO as we can in a non-blocking
fashion, and then continue the remaining part of the transfer again when
woken up asynchronously via a wait queue callback when pages are ready ...
thus each iteration progresses through more of the request until it is
completed. The interesting part here is that owing largely to the idempotence
in the way radix-tree page cache traversal happens, every iteration is simply
a smaller read/write. Almost all of the iocb manipulation and advancement
in the AIO case happens in the high-level AIO code, rather than in the
regular VFS/filesystem paths.
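
As a rough illustration of what one retry iteration looks like, here is a
simplified sketch (not the actual fs/aio.c code - the real logic lives in
aio_run_iocb() and the per-request retry method; run_one_retry() is an
invented name, and the iocb field names follow the patches later in this
series):

static ssize_t run_one_retry(struct kiocb *iocb,
			     ssize_t (*retry)(struct kiocb *))
{
	ssize_t ret;

	/*
	 * Blocking points check current->io_wait: when it points at the
	 * iocb's wait queue entry they queue that entry and return
	 * -EIOCBRETRY instead of calling io_schedule().
	 */
	current->io_wait = &iocb->ki_wait.wait;
	ret = retry(iocb);	/* simply a smaller read/write each pass */
	current->io_wait = NULL;

	if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED)
		aio_complete(iocb, ret, 0);	/* fully done, or hard error */
	/*
	 * Otherwise the wait queue callback kicks the iocb when pages
	 * become ready, and the next (smaller) iteration picks up where
	 * this one left off.
	 */
	return ret;
}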

The following is a sampling of comparative aio-stress results with the
patches (each run starts with uncached files):

---------------------------------------------

aio-stress throughput comparisons (in MB/s):

file size 1GB, record size 64KB, depth 64, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
4 way Pentium III SMP box, Adaptec AIC-7896/7 Ultra2 SCSI, 40 MB/s
Filesystem: ext2

----------------------------------------------------------------------------
                            Buffered (non O_DIRECT)        O_DIRECT
                            Vanilla     Patched         Vanilla    Patched
----------------------------------------------------------------------------
Random-Read                   10.08       23.91           18.91      18.98
Random-O_SYNC-Write            8.86       15.84           16.51      16.53
Sequential-Read               31.49       33.00           31.86      31.79
Sequential-O_SYNC-Write        8.68       32.60           31.45      32.44
Random-Write                  31.09 (19.65)               30.90 (19.65)
Sequential-Write              30.84 (28.94)               30.09 (28.39)

----------------------------------------------------------------------------

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:29:52

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine


Add a wait queue parameter to the action routine called by
__wait_on_bit to allow it to determine whether to block or
not.
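
For illustration, an action routine under the new signature can look at
the wait queue entry it is handed and decide whether to block. A minimal
sketch (the is_sync_wait()/-EIOCBRETRY behaviour shown here is what the
later patches in this series do in io_wait_schedule(); the routine name
is made up):

static int my_wait_action(void *word, wait_queue_t *wait)
{
	if (!is_sync_wait(wait))	/* async (AIO) context: don't block */
		return -EIOCBRETRY;	/* submitter retries on wakeup */
	io_schedule();			/* sync context: sleep as before */
	return 0;
}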

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/fs/buffer.c | 2 +-
linux-2.6.20-rc1-root/fs/inode.c | 2 +-
linux-2.6.20-rc1-root/fs/nfs/inode.c | 2 +-
linux-2.6.20-rc1-root/fs/nfs/nfs4proc.c | 2 +-
linux-2.6.20-rc1-root/fs/nfs/pagelist.c | 2 +-
linux-2.6.20-rc1-root/include/linux/sunrpc/sched.h | 3 ++-
linux-2.6.20-rc1-root/include/linux/wait.h | 18 ++++++++++++------
linux-2.6.20-rc1-root/include/linux/writeback.h | 2 +-
linux-2.6.20-rc1-root/kernel/wait.c | 14 ++++++++------
linux-2.6.20-rc1-root/mm/filemap.c | 2 +-
linux-2.6.20-rc1-root/net/sunrpc/sched.c | 5 +++--
11 files changed, 32 insertions(+), 22 deletions(-)

diff -puN fs/buffer.c~modify-wait-bit-action-args fs/buffer.c
--- linux-2.6.20-rc1/fs/buffer.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/buffer.c 2006-12-21 08:45:34.000000000 +0530
@@ -55,7 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e
bh->b_private = private;
}

-static int sync_buffer(void *word)
+static int sync_buffer(void *word, wait_queue_t *wait)
{
struct block_device *bd;
struct buffer_head *bh
diff -puN fs/inode.c~modify-wait-bit-action-args fs/inode.c
--- linux-2.6.20-rc1/fs/inode.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/inode.c 2006-12-21 08:45:34.000000000 +0530
@@ -1279,7 +1279,7 @@ void remove_dquot_ref(struct super_block

#endif

-int inode_wait(void *word)
+int inode_wait(void *word, wait_queue_t *wait)
{
schedule();
return 0;
diff -puN include/linux/wait.h~modify-wait-bit-action-args include/linux/wait.h
--- linux-2.6.20-rc1/include/linux/wait.h~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-28 09:32:57.000000000 +0530
@@ -145,11 +145,15 @@ void FASTCALL(__wake_up(wait_queue_head_
extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int));
-int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
-int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
+int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *,
+ int (*)(void *, wait_queue_t *), unsigned));
+int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+ int (*)(void *, wait_queue_t *), unsigned));
void FASTCALL(wake_up_bit(void *, int));
-int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned));
-int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *,
+ wait_queue_t *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *,
+ wait_queue_t *), unsigned));
wait_queue_head_t *FASTCALL(bit_waitqueue(void *, int));

#define wake_up(x) __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL)
@@ -427,7 +431,8 @@ int wake_bit_function(wait_queue_t *wait
* but has no intention of setting it.
*/
static inline int wait_on_bit(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
if (!test_bit(bit, word))
return 0;
@@ -451,7 +456,8 @@ static inline int wait_on_bit(void *word
* clear with the intention of setting it, and when done, clearing it.
*/
static inline int wait_on_bit_lock(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
if (!test_and_set_bit(bit, word))
return 0;
diff -puN include/linux/writeback.h~modify-wait-bit-action-args include/linux/writeback.h
--- linux-2.6.20-rc1/include/linux/writeback.h~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/writeback.h 2006-12-21 08:45:34.000000000 +0530
@@ -66,7 +66,7 @@ struct writeback_control {
*/
void writeback_inodes(struct writeback_control *wbc);
void wake_up_inode(struct inode *inode);
-int inode_wait(void *);
+int inode_wait(void *, wait_queue_t *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);

diff -puN kernel/wait.c~modify-wait-bit-action-args kernel/wait.c
--- linux-2.6.20-rc1/kernel/wait.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-28 09:32:57.000000000 +0530
@@ -159,14 +159,14 @@ EXPORT_SYMBOL(wake_bit_function);
*/
int __sched fastcall
__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *), unsigned mode)
{
int ret = 0;

do {
prepare_to_wait(wq, &q->wait, mode);
if (test_bit(q->key.bit_nr, q->key.flags))
- ret = (*action)(q->key.flags);
+ ret = (*action)(q->key.flags, &q->wait);
} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
finish_wait(wq, &q->wait);
return ret;
@@ -174,7 +174,8 @@ __wait_on_bit(wait_queue_head_t *wq, str
EXPORT_SYMBOL(__wait_on_bit);

int __sched fastcall out_of_line_wait_on_bit(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
wait_queue_head_t *wq = bit_waitqueue(word, bit);
DEFINE_WAIT_BIT(wait, word, bit);
@@ -185,14 +186,14 @@ EXPORT_SYMBOL(out_of_line_wait_on_bit);

int __sched fastcall
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *), unsigned mode)
{
int ret = 0;

do {
prepare_to_wait_exclusive(wq, &q->wait, mode);
if (test_bit(q->key.bit_nr, q->key.flags)) {
- if ((ret = (*action)(q->key.flags)))
+ if ((ret = (*action)(q->key.flags, &q->wait)))
break;
}
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
@@ -202,7 +203,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq
EXPORT_SYMBOL(__wait_on_bit_lock);

int __sched fastcall out_of_line_wait_on_bit_lock(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *wait),
+ unsigned mode)
{
wait_queue_head_t *wq = bit_waitqueue(word, bit);
DEFINE_WAIT_BIT(wait, word, bit);
diff -puN mm/filemap.c~modify-wait-bit-action-args mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:33:02.000000000 +0530
@@ -133,7 +133,7 @@ void remove_from_page_cache(struct page
write_unlock_irq(&mapping->tree_lock);
}

-static int sync_page(void *word)
+static int sync_page(void *word, wait_queue_t *wait)
{
struct address_space *mapping;
struct page *page;
diff -puN fs/nfs/inode.c~modify-wait-bit-action-args fs/nfs/inode.c
--- linux-2.6.20-rc1/fs/nfs/inode.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/nfs/inode.c 2006-12-21 08:45:34.000000000 +0530
@@ -380,7 +380,7 @@ void nfs_setattr_update_inode(struct ino
}
}

-static int nfs_wait_schedule(void *word)
+static int nfs_wait_schedule(void *word, wait_queue_t *wait)
{
if (signal_pending(current))
return -ERESTARTSYS;
diff -puN fs/nfs/nfs4proc.c~modify-wait-bit-action-args fs/nfs/nfs4proc.c
--- linux-2.6.20-rc1/fs/nfs/nfs4proc.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/nfs/nfs4proc.c 2006-12-21 08:45:34.000000000 +0530
@@ -2738,7 +2738,7 @@ nfs4_async_handle_error(struct rpc_task
return 0;
}

-static int nfs4_wait_bit_interruptible(void *word)
+static int nfs4_wait_bit_interruptible(void *word, wait_queue_t *wait)
{
if (signal_pending(current))
return -ERESTARTSYS;
diff -puN fs/nfs/pagelist.c~modify-wait-bit-action-args fs/nfs/pagelist.c
--- linux-2.6.20-rc1/fs/nfs/pagelist.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/nfs/pagelist.c 2006-12-21 08:45:34.000000000 +0530
@@ -183,7 +183,7 @@ nfs_release_request(struct nfs_page *req
nfs_page_free(req);
}

-static int nfs_wait_bit_interruptible(void *word)
+static int nfs_wait_bit_interruptible(void *word, wait_queue_t *wait)
{
int ret = 0;

diff -puN net/sunrpc/sched.c~modify-wait-bit-action-args net/sunrpc/sched.c
--- linux-2.6.20-rc1/net/sunrpc/sched.c~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/net/sunrpc/sched.c 2006-12-21 08:45:34.000000000 +0530
@@ -258,7 +258,7 @@ void rpc_init_wait_queue(struct rpc_wait
}
EXPORT_SYMBOL(rpc_init_wait_queue);

-static int rpc_wait_bit_interruptible(void *word)
+static int rpc_wait_bit_interruptible(void *word, wait_queue_t *wait)
{
if (signal_pending(current))
return -ERESTARTSYS;
@@ -294,7 +294,8 @@ static void rpc_mark_complete_task(struc
/*
* Allow callers to wait for completion of an RPC call
*/
-int __rpc_wait_for_completion_task(struct rpc_task *task, int (*action)(void *))
+int __rpc_wait_for_completion_task(struct rpc_task *task, int (*action)(
+ void *, wait_queue_t *wait))
{
if (action == NULL)
action = rpc_wait_bit_interruptible;
diff -puN include/linux/sunrpc/sched.h~modify-wait-bit-action-args include/linux/sunrpc/sched.h
--- linux-2.6.20-rc1/include/linux/sunrpc/sched.h~modify-wait-bit-action-args 2006-12-21 08:45:34.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/sunrpc/sched.h 2006-12-21 08:45:34.000000000 +0530
@@ -268,7 +268,8 @@ void * rpc_malloc(struct rpc_task *, si
void rpc_free(struct rpc_task *);
int rpciod_up(void);
void rpciod_down(void);
-int __rpc_wait_for_completion_task(struct rpc_task *task, int (*)(void *));
+int __rpc_wait_for_completion_task(struct rpc_task *task,
+ int (*)(void *, wait_queue_t *wait));
#ifdef RPC_DEBUG
void rpc_show_tasks(void);
#endif
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:31:58

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 2/8] Rename __lock_page to lock_page_slow


In order to allow for interruptible and asynchronous versions of
lock_page in conjunction with the wait_on_bit changes, we need to
define low-level lock page routines which take an additional
argument, i.e. a wait queue entry, and may return a non-zero status,
e.g. -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames
__lock_page to lock_page_slow, so that __lock_page and
__lock_page_slow can denote the versions which take a wait queue
parameter.

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/include/linux/pagemap.h | 4 ++--
linux-2.6.20-rc1-root/mm/filemap.c | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff -puN include/linux/pagemap.h~lock_page_slow include/linux/pagemap.h
--- linux-2.6.20-rc1/include/linux/pagemap.h~lock_page_slow 2006-12-21 08:45:40.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-28 09:32:39.000000000 +0530
@@ -133,7 +133,7 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

-extern void FASTCALL(__lock_page(struct page *page));
+extern void FASTCALL(lock_page_slow(struct page *page));
extern void FASTCALL(__lock_page_nosync(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));

@@ -144,7 +144,7 @@ static inline void lock_page(struct page
{
might_sleep();
if (TestSetPageLocked(page))
- __lock_page(page);
+ lock_page_slow(page);
}

/*
diff -puN mm/filemap.c~lock_page_slow mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~lock_page_slow 2006-12-21 08:45:40.000000000 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:32:39.000000000 +0530
@@ -556,7 +556,7 @@ void end_page_writeback(struct page *pag
EXPORT_SYMBOL(end_page_writeback);

/**
- * __lock_page - get a lock on the page, assuming we need to sleep to get it
+ * lock_page_slow - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
*
* Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
@@ -564,14 +564,14 @@ EXPORT_SYMBOL(end_page_writeback);
* chances are that on the second loop, the block layer's plug list is empty,
* so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
*/
-void fastcall __lock_page(struct page *page)
+void fastcall lock_page_slow(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
}
-EXPORT_SYMBOL(__lock_page);
+EXPORT_SYMBOL(lock_page_slow);

/*
* Variant of lock_page that does not require the caller to hold a reference
@@ -647,7 +647,7 @@ repeat:
page_cache_get(page);
if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
+ lock_page_slow(page);
read_lock_irq(&mapping->tree_lock);

/* Has the page been truncated while we slept? */
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:34:36

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 3/8] Routines to initialize and test a wait bit key


init_wait_bit_key() initializes the key field in an already
allocated wait bit structure, useful for async wait bit support.
Also separate out the wait bit test to a common routine which
can be used by different kinds of wakeup callbacks.
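
As a rough usage sketch (in the later patches the real pairing is
wait_on_page_bit() filling in the key and wake_bit_function() /
aio_wake_function() testing it; the function names below are
illustrative):

/* waiter side: key an already-allocated wait bit entry to a flag bit */
static void prepare_bit_wait(struct wait_bit_queue *wb, unsigned long *word,
			     int bit)
{
	init_wait_bit_key(wb, word, bit);
	/* ... then queue wb->wait on the bit's wait queue head ... */
}

/* wakeup side: only wake waiters whose key matches and whose bit cleared */
static int my_wake_function(wait_queue_t *wait, unsigned mode, int sync,
			    void *key)
{
	if (!test_wait_bit_key(wait, key))
		return 0;		/* not for this waiter */
	return autoremove_wake_function(wait, mode, sync, key);
}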

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/include/linux/wait.h | 24 ++++++++++++++++++++++++
linux-2.6.20-rc1-root/kernel/wait.c | 11 ++---------
2 files changed, 26 insertions(+), 9 deletions(-)

diff -puN include/linux/wait.h~init-wait-bit-key include/linux/wait.h
--- linux-2.6.20-rc1/include/linux/wait.h~init-wait-bit-key 2006-12-21 08:45:46.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-21 08:45:46.000000000 +0530
@@ -108,6 +108,17 @@ static inline int waitqueue_active(wait_
return !list_empty(&q->task_list);
}

+static inline int test_wait_bit_key(wait_queue_t *wait,
+ struct wait_bit_key *key)
+{
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);
+
+ return (wait_bit->key.flags == key->flags &&
+ wait_bit->key.bit_nr == key->bit_nr &&
+ !test_bit(key->bit_nr, key->flags));
+}
+
/*
* Used to distinguish between sync and async io wait context:
* sync i/o typically specifies a NULL wait queue entry or a wait
@@ -416,6 +427,19 @@ int wake_bit_function(wait_queue_t *wait
INIT_LIST_HEAD(&(wait)->task_list); \
} while (0)

+#define init_wait_bit_key(waitbit, word, bit) \
+ do { \
+ (waitbit)->key.flags = word; \
+ (waitbit)->key.bit_nr = bit; \
+ } while (0)
+
+#define init_wait_bit_task(waitbit, tsk) \
+ do { \
+ (waitbit)->wait.private = tsk; \
+ (waitbit)->wait.func = wake_bit_function; \
+ INIT_LIST_HEAD(&(waitbit)->wait.task_list); \
+ } while (0)
+
/**
* wait_on_bit - wait for a bit to be cleared
* @word: the word being waited on, a kernel virtual address
diff -puN kernel/wait.c~init-wait-bit-key kernel/wait.c
--- linux-2.6.20-rc1/kernel/wait.c~init-wait-bit-key 2006-12-21 08:45:46.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-28 09:32:44.000000000 +0530
@@ -139,16 +139,9 @@ EXPORT_SYMBOL(autoremove_wake_function);

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
- struct wait_bit_key *key = arg;
- struct wait_bit_queue *wait_bit
- = container_of(wait, struct wait_bit_queue, wait);
-
- if (wait_bit->key.flags != key->flags ||
- wait_bit->key.bit_nr != key->bit_nr ||
- test_bit(key->bit_nr, key->flags))
+ if (!test_wait_bit_key(wait, arg))
return 0;
- else
- return autoremove_wake_function(wait, mode, sync, key);
+ return autoremove_wake_function(wait, mode, sync, arg);
}
EXPORT_SYMBOL(wake_bit_function);

_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:35:28

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 4/8] Add a default io wait bit field in task struct


Allocates space for the default io wait queue entry (actually a wait
bit entry) in the task struct. Doing so simplifies the patches
for AIO wait page, allowing for a cleaner and more efficient
implementation, at the cost of 28 additional bytes in the task struct
vs on-demand on-stack allocation.

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/include/linux/sched.h | 11 +++++++----
linux-2.6.20-rc1-root/kernel/fork.c | 3 ++-
2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN include/linux/sched.h~tsk-default-io-wait include/linux/sched.h
--- linux-2.6.20-rc1/include/linux/sched.h~tsk-default-io-wait 2006-12-21 08:45:51.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:32:39.000000000 +0530
@@ -1006,11 +1006,14 @@ struct task_struct {

unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+
+/* Space for default IO wait bit entry used for synchronous IO waits */
+ struct wait_bit_queue __wait;
/*
- * current io wait handle: wait queue entry to use for io waits
- * If this thread is processing aio, this points at the waitqueue
- * inside the currently handled kiocb. It may be NULL (i.e. default
- * to a stack based synchronous wait) if its doing sync IO.
+ * Current IO wait handle: wait queue entry to use for IO waits
+ * If this thread is processing AIO, this points at the waitqueue
+ * inside the currently handled kiocb. Otherwise, points to the
+ * default IO wait field (i.e &__wait.wait above).
*/
wait_queue_t *io_wait;
/* i/o counters(bytes read/written, #syscalls */
diff -puN kernel/fork.c~tsk-default-io-wait kernel/fork.c
--- linux-2.6.20-rc1/kernel/fork.c~tsk-default-io-wait 2006-12-21 08:45:51.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/fork.c 2006-12-21 08:45:51.000000000 +0530
@@ -1056,7 +1056,8 @@ static struct task_struct *copy_process(
do_posix_clock_monotonic_gettime(&p->start_time);
p->security = NULL;
p->io_context = NULL;
- p->io_wait = NULL;
+ init_wait_bit_task(&p->__wait, p);
+ p->io_wait = &p->__wait.wait;
p->audit_context = NULL;
cpuset_fork(p);
#ifdef CONFIG_NUMA
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:36:33

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 5/8] Enable wait bit based filtered wakeups to work for AIO


Enable wait bit based filtered wakeups to work for AIO.
Replaces the wait queue entry in the kiocb with a wait bit
structure, to allow enough space for the wait bit key.
This adds an extra level of indirection in references to the
wait queue entry in the iocb. Also, an extra check had to be
added in aio_wake_function to allow for other kinds of waiters
which do not require wait bit, based on the assumption that
the key passed in would be NULL in such cases.

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>

---

linux-2.6.20-rc1-root/fs/aio.c | 21 ++++++++++++++-------
linux-2.6.20-rc1-root/include/linux/aio.h | 7 ++++---
linux-2.6.20-rc1-root/kernel/wait.c | 17 ++++++++++++++---
3 files changed, 32 insertions(+), 13 deletions(-)

diff -puN fs/aio.c~aio-wait-bit fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-wait-bit 2006-12-21 08:45:57.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:32:27.000000000 +0530
@@ -719,13 +719,13 @@ static ssize_t aio_run_iocb(struct kiocb
* cause the iocb to be kicked for continuation (through
* the aio_wake_function callback).
*/
- BUG_ON(current->io_wait != NULL);
- current->io_wait = &iocb->ki_wait;
+ BUG_ON(!is_sync_wait(current->io_wait));
+ current->io_wait = &iocb->ki_wait.wait;
ret = retry(iocb);
current->io_wait = NULL;

if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
- BUG_ON(!list_empty(&iocb->ki_wait.task_list));
+ BUG_ON(!list_empty(&iocb->ki_wait.wait.task_list));
aio_complete(iocb, ret, 0);
}
out:
@@ -883,7 +883,7 @@ static void try_queue_kicked_iocb(struct
* than retry has happened before we could queue the iocb. This also
* means that the retry could have completed and freed our iocb, no
* good. */
- BUG_ON((!list_empty(&iocb->ki_wait.task_list)));
+ BUG_ON((!list_empty(&iocb->ki_wait.wait.task_list)));

spin_lock_irqsave(&ctx->ctx_lock, flags);
/* set this inside the lock so that we can't race with aio_run_iocb()
@@ -1519,7 +1519,13 @@ static ssize_t aio_setup_iocb(struct kio
static int aio_wake_function(wait_queue_t *wait, unsigned mode,
int sync, void *key)
{
- struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);
+ struct kiocb *iocb = container_of(wait_bit, struct kiocb, ki_wait);
+
+ /* Assumes that a non-NULL key implies wait bit filtering */
+ if (key && !test_wait_bit_key(wait, key))
+ return 0;

list_del_init(&wait->task_list);
kick_iocb(iocb);
@@ -1574,8 +1580,9 @@ int fastcall io_submit_one(struct kioctx
req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;
- init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
- INIT_LIST_HEAD(&req->ki_wait.task_list);
+ init_waitqueue_func_entry(&req->ki_wait.wait, aio_wake_function);
+ INIT_LIST_HEAD(&req->ki_wait.wait.task_list);
+ req->ki_run_list.next = req->ki_run_list.prev = NULL;

ret = aio_setup_iocb(req);

diff -puN include/linux/aio.h~aio-wait-bit include/linux/aio.h
--- linux-2.6.20-rc1/include/linux/aio.h~aio-wait-bit 2006-12-21 08:45:57.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:32:27.000000000 +0530
@@ -102,7 +102,7 @@ struct kiocb {
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
- wait_queue_t ki_wait;
+ struct wait_bit_queue ki_wait;
loff_t ki_pos;

atomic_t ki_bio_count; /* num bio used for this iocb */
@@ -135,7 +135,7 @@ struct kiocb {
(x)->ki_dtor = NULL; \
(x)->ki_obj.tsk = tsk; \
(x)->ki_user_data = 0; \
- init_wait((&(x)->ki_wait)); \
+ init_wait_bit_task((&(x)->ki_wait), current);\
} while (0)

#define AIO_RING_MAGIC 0xa10a10a1
@@ -237,7 +237,8 @@ do { \
} \
} while (0)

-#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait)
+#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \
+ struct wait_bit_queue, wait), struct kiocb, ki_wait)

#include <linux/aio_abi.h>

diff -puN kernel/wait.c~aio-wait-bit kernel/wait.c
--- linux-2.6.20-rc1/kernel/wait.c~aio-wait-bit 2006-12-21 08:45:57.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-21 08:45:57.000000000 +0530
@@ -139,7 +139,8 @@ EXPORT_SYMBOL(autoremove_wake_function);

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
- if (!test_wait_bit_key(wait, arg))
+ /* Assumes that a non-NULL key implies wait bit filtering */
+ if (arg && !test_wait_bit_key(wait, arg))
return 0;
return autoremove_wake_function(wait, mode, sync, arg);
}
@@ -161,7 +162,12 @@ __wait_on_bit(wait_queue_head_t *wq, str
if (test_bit(q->key.bit_nr, q->key.flags))
ret = (*action)(q->key.flags, &q->wait);
} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
- finish_wait(wq, &q->wait);
+ /*
+ * AIO retries require the wait queue entry to remain queued
+ * for async notification
+ */
+ if (ret != -EIOCBRETRY)
+ finish_wait(wq, &q->wait);
return ret;
}
EXPORT_SYMBOL(__wait_on_bit);
@@ -190,7 +196,12 @@ __wait_on_bit_lock(wait_queue_head_t *wq
break;
}
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
- finish_wait(wq, &q->wait);
+ /*
+ * AIO retries require the wait queue entry to remain queued
+ * for async notification
+ */
+ if (ret != -EIOCBRETRY)
+ finish_wait(wq, &q->wait);
return ret;
}
EXPORT_SYMBOL(__wait_on_bit_lock);
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:37:25

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page


Define low-level page wait and lock page routines which take a
wait queue entry pointer as an additional parameter and
return status (which may be non-zero when the wait queue
parameter signifies an asynchronous wait, typically during
AIO).

Synchronous IO waits become a special case where the wait
queue parameter is the running task's default io wait context.
Asynchronous IO waits happen when the wait queue parameter
is the io wait context of a kiocb. Code paths which choose to
execute synchronous or asynchronous behaviour depending on the
calling context specify the current io wait context (which points
to a sync or async context, as the case may be) as the wait
parameter.
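
So a code path that wants to serve both kinds of callers simply passes
current->io_wait down and propagates a retry status instead of blocking.
Roughly (an illustrative sketch mirroring what the read path does in the
next patch; the function name is invented):

static int do_something_with_page(struct page *page)
{
	int error;

	/*
	 * For a sync task, io_wait is its default wait bit entry and this
	 * blocks as usual; for AIO it is the iocb's entry and we get
	 * -EIOCBRETRY back instead of sleeping.
	 */
	error = __lock_page(page, current->io_wait);
	if (error)
		return error;	/* propagate -EIOCBRETRY (or -EINTR etc.) */

	/* ... operate on the locked page ... */
	unlock_page(page);
	return 0;
}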

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/include/linux/pagemap.h | 30 ++++++++++++++++++-------
linux-2.6.20-rc1-root/include/linux/sched.h | 1
linux-2.6.20-rc1-root/kernel/sched.c | 14 +++++++++++
linux-2.6.20-rc1-root/mm/filemap.c | 31 +++++++++++++++-----------
4 files changed, 55 insertions(+), 21 deletions(-)

diff -puN include/linux/pagemap.h~aio-wait-page include/linux/pagemap.h
--- linux-2.6.20-rc1/include/linux/pagemap.h~aio-wait-page 2006-12-21 08:46:02.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-21 08:46:02.000000000 +0530
@@ -133,20 +133,24 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

-extern void FASTCALL(lock_page_slow(struct page *page));
+extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
extern void FASTCALL(__lock_page_nosync(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));

/*
* lock_page may only be called if we have the page's inode pinned.
*/
-static inline void lock_page(struct page *page)
+static inline int __lock_page(struct page *page, wait_queue_t *wait)
{
might_sleep();
if (TestSetPageLocked(page))
- lock_page_slow(page);
+ return __lock_page_slow(page, wait);
+ return 0;
}

+#define lock_page(page) __lock_page(page, &current->__wait.wait)
+#define lock_page_slow(page) __lock_page_slow(page, &current->__wait.wait)
+
/*
* lock_page_nosync should only be used if we can't pin the page's inode.
* Doesn't play quite so well with block device plugging.
@@ -162,7 +166,8 @@ static inline void lock_page_nosync(stru
* This is exported only for wait_on_page_locked/wait_on_page_writeback.
* Never use this directly!
*/
-extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr));
+extern int FASTCALL(wait_on_page_bit(struct page *page, int bit_nr,
+ wait_queue_t *wait));

/*
* Wait for a page to be unlocked.
@@ -171,21 +176,30 @@ extern void FASTCALL(wait_on_page_bit(st
* ie with increased "page->count" so that the page won't
* go away during the wait..
*/
-static inline void wait_on_page_locked(struct page *page)
+static inline int __wait_on_page_locked(struct page *page, wait_queue_t *wait)
{
if (PageLocked(page))
- wait_on_page_bit(page, PG_locked);
+ return wait_on_page_bit(page, PG_locked, wait);
+ return 0;
}

+#define wait_on_page_locked(page) \
+ __wait_on_page_locked(page, &current->__wait.wait)
+
/*
* Wait for a page to complete writeback
*/
-static inline void wait_on_page_writeback(struct page *page)
+static inline int __wait_on_page_writeback(struct page *page,
+ wait_queue_t *wait)
{
if (PageWriteback(page))
- wait_on_page_bit(page, PG_writeback);
+ return wait_on_page_bit(page, PG_writeback, wait);
+ return 0;
}

+#define wait_on_page_writeback(page) \
+ __wait_on_page_writeback(page, &current->__wait.wait)
+
extern void end_page_writeback(struct page *page);

/*
diff -puN include/linux/sched.h~aio-wait-page include/linux/sched.h
--- linux-2.6.20-rc1/include/linux/sched.h~aio-wait-page 2006-12-21 08:46:02.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:26:27.000000000 +0530
@@ -216,6 +216,7 @@ extern void show_stack(struct task_struc

void io_schedule(void);
long io_schedule_timeout(long timeout);
+int io_wait_schedule(wait_queue_t *wait);

extern void cpu_init (void);
extern void trap_init(void);
diff -puN kernel/sched.c~aio-wait-page kernel/sched.c
--- linux-2.6.20-rc1/kernel/sched.c~aio-wait-page 2006-12-21 08:46:02.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/sched.c 2006-12-21 08:46:02.000000000 +0530
@@ -4743,6 +4743,20 @@ long __sched io_schedule_timeout(long ti
return ret;
}

+/*
+ * Sleep only if the wait context passed is not async,
+ * otherwise return so that a retry can be issued later.
+ */
+int __sched io_wait_schedule(wait_queue_t *wait)
+{
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
+ io_schedule();
+ return 0;
+}
+
+EXPORT_SYMBOL(io_wait_schedule);
+
/**
* sys_sched_get_priority_max - return maximum RT priority.
* @policy: scheduling class.
diff -puN mm/filemap.c~aio-wait-page mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-wait-page 2006-12-21 08:46:02.000000000 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:32:27.000000000 +0530
@@ -165,8 +165,7 @@ static int sync_page(void *word, wait_qu
mapping = page_mapping(page);
if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
mapping->a_ops->sync_page(page);
- io_schedule();
- return 0;
+ return io_wait_schedule(wait);
}

/**
@@ -478,7 +477,7 @@ struct page *__page_cache_alloc(gfp_t gf
EXPORT_SYMBOL(__page_cache_alloc);
#endif

-static int __sleep_on_page_lock(void *word)
+static int __sleep_on_page_lock(void *word, wait_queue_t *wait)
{
io_schedule();
return 0;
@@ -506,13 +505,17 @@ static inline void wake_up_page(struct p
__wake_up_bit(page_waitqueue(page), &page->flags, bit);
}

-void fastcall wait_on_page_bit(struct page *page, int bit_nr)
+int fastcall wait_on_page_bit(struct page *page, int bit_nr,
+ wait_queue_t *wait)
{
- DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
-
- if (test_bit(bit_nr, &page->flags))
- __wait_on_bit(page_waitqueue(page), &wait, sync_page,
+ if (test_bit(bit_nr, &page->flags)) {
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);
+ init_wait_bit_key(wait_bit, &page->flags, bit_nr);
+ return __wait_on_bit(page_waitqueue(page), wait_bit, sync_page,
TASK_UNINTERRUPTIBLE);
+ }
+ return 0;
}
EXPORT_SYMBOL(wait_on_page_bit);

@@ -556,7 +559,7 @@ void end_page_writeback(struct page *pag
EXPORT_SYMBOL(end_page_writeback);

/**
- * lock_page_slow - get a lock on the page, assuming we need to sleep to get it
+ * __lock_page_slow: get a lock on the page, assuming we need to wait to get it
* @page: the page to lock
*
* Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
@@ -564,14 +567,16 @@ EXPORT_SYMBOL(end_page_writeback);
* chances are that on the second loop, the block layer's plug list is empty,
* so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
*/
-void fastcall lock_page_slow(struct page *page)
+int fastcall __lock_page_slow(struct page *page, wait_queue_t *wait)
{
- DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);

- __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
+ init_wait_bit_key(wait_bit, &page->flags, PG_locked);
+ return __wait_on_bit_lock(page_waitqueue(page), wait_bit, sync_page,
TASK_UNINTERRUPTIBLE);
}
-EXPORT_SYMBOL(lock_page_slow);
+EXPORT_SYMBOL(__lock_page_slow);

/*
* Variant of lock_page that does not require the caller to hold a reference
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:38:36

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 7/8] Filesystem AIO read


Converts the wait for a page to become uptodate (lock page)
after readahead/readpage (in do_generic_mapping_read) to a retry
exit, to make buffered filesystem AIO reads actually asynchronous.

The patch avoids exclusive wakeups with AIO, a problem originally
spotted by Chris Mason, though the reasoning for why it is an
issue is now much clearer (see explanation in the comment below
in aio.c), and the solution is perhaps slightly simpler.

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

linux-2.6.20-rc1-root/fs/aio.c | 11 ++++++++++-
linux-2.6.20-rc1-root/include/linux/aio.h | 5 +++++
linux-2.6.20-rc1-root/mm/filemap.c | 19 ++++++++++++++++---
3 files changed, 31 insertions(+), 4 deletions(-)

diff -puN fs/aio.c~aio-fs-read fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-fs-read 2006-12-21 08:46:13.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:26:30.000000000 +0530
@@ -1529,7 +1529,16 @@ static int aio_wake_function(wait_queue_

list_del_init(&wait->task_list);
kick_iocb(iocb);
- return 1;
+ /*
+ * Avoid exclusive wakeups with retries since an exclusive wakeup
+ * may involve implicit expectations of waking up the next waiter
+ * and there is no guarantee that the retry will take a path that
+ * would do so. For example if a page has become up-to-date, then
+ * a retried read may end up straightaway performing a copyout
+ * and not go through a lock_page - unlock_page that would have
+ * passed the baton to the next waiter.
+ */
+ return 0;
}

int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
diff -puN mm/filemap.c~aio-fs-read mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-read 2006-12-21 08:46:13.000000000 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:31:48.000000000 +0530
@@ -909,6 +909,11 @@ void do_generic_mapping_read(struct addr
if (!isize)
goto out;

+ if (in_aio()) {
+ /* Avoid repeat readahead */
+ if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+ next_index = last_index;
+ }
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -978,7 +983,10 @@ page_ok:

page_not_up_to_date:
/* Get exclusive access to the page ... */
- lock_page(page);
+
+ if ((error = __lock_page(page, current->io_wait))) {
+ goto readpage_error;
+ }

/* Did it get truncated before we got the lock? */
if (!page->mapping) {
@@ -1006,7 +1014,8 @@ readpage:
}

if (!PageUptodate(page)) {
- lock_page(page);
+ if ((error = __lock_page(page, current->io_wait)))
+ goto readpage_error;
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
/*
@@ -1052,7 +1061,11 @@ readpage:
goto page_ok;

readpage_error:
- /* UHHUH! A synchronous read error occurred. Report it */
+ /* We don't have uptodate data in the page yet */
+ /* Could be due to an error or because we need to
+ * retry when we get an async i/o notification.
+ * Report the reason.
+ */
desc->error = error;
page_cache_release(page);
goto out;
diff -puN include/linux/aio.h~aio-fs-read include/linux/aio.h
--- linux-2.6.20-rc1/include/linux/aio.h~aio-fs-read 2006-12-21 08:46:13.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:26:27.000000000 +0530
@@ -34,21 +34,26 @@ struct kioctx;
/* #define KIF_LOCKED 0 */
#define KIF_KICKED 1
#define KIF_CANCELLED 2
+#define KIF_RESTARTED 3

#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbTryRestart(iocb) test_and_set_bit(KIF_RESTARTED, &(iocb)->ki_flags)

#define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbSetRestarted(iocb) set_bit(KIF_RESTARTED, &(iocb)->ki_flags)

#define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearRestarted(iocb) clear_bit(KIF_RESTARTED, &(iocb)->ki_flags)

#define kiocbIsLocked(iocb) test_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbIsRestarted(iocb) test_bit(KIF_RESTARTED, &(iocb)->ki_flags)

/* is there a better place to document function pointer methods? */
/**
_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:40:24

by Suparna Bhattacharya

Subject: [FSAIO][PATCH 8/8] AIO O_SYNC filesystem write


AIO support for O_SYNC buffered writes, built over O_SYNC-speedup.
It uses tagged radix tree lookups to write out just the pages
pertaining to this request, and retries instead of blocking
for writeback to complete on the same range. All the writeout is
issued at the time of io submission, and there is a check to make
sure that retries skip straight to wait_on_page_writeback_range.

Limitations: Extending file writes or hole overwrites with O_SYNC may
still block because we have yet to convert generic_osync_inode to be
asynchronous. For non O_SYNC writes, writeout happens in the background
and so typically appears async to the caller except for memory throttling
and non-block aligned writes involving read-modify-write.

Signed-off-by: Suparna Bhattacharya <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---

include/linux/aio.h | 0
linux-2.6.20-rc1-root/include/linux/fs.h | 13 +++++-
linux-2.6.20-rc1-root/mm/filemap.c | 61 +++++++++++++++++++++----------
3 files changed, 54 insertions(+), 20 deletions(-)

diff -puN include/linux/aio.h~aio-fs-write include/linux/aio.h
diff -puN mm/filemap.c~aio-fs-write mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-write 2006-12-21 08:46:21.000000000 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-21 08:46:21.000000000 +0530
@@ -239,10 +239,11 @@ EXPORT_SYMBOL(filemap_flush);
* @end: ending page index
*
* Wait for writeback to complete against pages indexed by start->end
- * inclusive
+ * inclusive. In AIO context, this may queue an async notification
+ * and retry callback and return, instead of blocking the caller.
*/
-int wait_on_page_writeback_range(struct address_space *mapping,
- pgoff_t start, pgoff_t end)
+int __wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end, wait_queue_t *wait)
{
struct pagevec pvec;
int nr_pages;
@@ -254,20 +255,20 @@ int wait_on_page_writeback_range(struct

pagevec_init(&pvec, 0);
index = start;
- while ((index <= end) &&
+ while (!ret && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;

- for (i = 0; i < nr_pages; i++) {
+ for (i = 0; !ret && (i < nr_pages); i++) {
struct page *page = pvec.pages[i];

/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;

- wait_on_page_writeback(page);
+ ret = __wait_on_page_writeback(page, wait);
if (PageError(page))
ret = -EIO;
}
@@ -303,18 +304,27 @@ int sync_page_range(struct inode *inode,
{
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
- int ret;
+ int ret = 0;

if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+ if (in_aio()) {
+ /* Already issued writeouts for this iocb ? */
+ if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+ goto do_wait; /* just need to check if done */
+ }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
- if (ret == 0) {
+
+ if (ret >= 0) {
mutex_lock(&inode->i_mutex);
ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
mutex_unlock(&inode->i_mutex);
}
- if (ret == 0)
- ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+ if (ret >= 0) {
+ ret = __wait_on_page_writeback_range(mapping, start, end,
+ current->io_wait);
+ }
return ret;
}
EXPORT_SYMBOL(sync_page_range);
@@ -335,15 +345,23 @@ int sync_page_range_nolock(struct inode
{
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
- int ret;
+ int ret = 0;

if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+ if (in_aio()) {
+ /* Already issued writeouts for this iocb ? */
+ if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+ goto do_wait; /* just need to check if done */
+ }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
- if (ret == 0)
+ if (ret >= 0)
ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
- if (ret == 0)
- ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+ if (ret >= 0) {
+ ret = __wait_on_page_writeback_range(mapping, start, end,
+ current->io_wait);
+ }
return ret;
}
EXPORT_SYMBOL(sync_page_range_nolock);
@@ -2216,7 +2234,7 @@ zero_length_segment:
*/
if (likely(status >= 0)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
- if (!a_ops->writepage || !is_sync_kiocb(iocb))
+ if (!a_ops->writepage)
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
@@ -2268,7 +2286,10 @@ __generic_file_aio_write_nolock(struct k
ocount -= iv->iov_len; /* This segment is no good */
break;
}
-
+ if (!is_sync_kiocb(iocb) && kiocbIsRestarted(iocb)) {
+ /* nothing to transfer, may just need to sync data */
+ return ocount;
+ }
count = ocount;
pos = *ppos;

@@ -2368,8 +2389,10 @@ ssize_t generic_file_aio_write_nolock(st
ssize_t err;

err = sync_page_range_nolock(inode, mapping, pos, ret);
- if (err < 0)
+ if (err < 0) {
ret = err;
+ iocb->ki_pos = pos;
+ }
}
return ret;
}
@@ -2394,8 +2417,10 @@ ssize_t generic_file_aio_write(struct ki
ssize_t err;

err = sync_page_range(inode, mapping, pos, ret);
- if (err < 0)
+ if (err < 0) {
ret = err;
+ iocb->ki_pos = pos;
+ }
}
return ret;
}
diff -puN include/linux/fs.h~aio-fs-write include/linux/fs.h
--- linux-2.6.20-rc1/include/linux/fs.h~aio-fs-write 2006-12-21 08:46:21.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/fs.h 2006-12-21 08:46:21.000000000 +0530
@@ -279,6 +279,7 @@ extern int dir_notify_enable;
#include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/pid.h>
+#include <linux/sched.h>
#include <linux/mutex.h>

#include <asm/atomic.h>
@@ -1588,8 +1589,16 @@ extern int filemap_fdatawait(struct addr
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
-extern int wait_on_page_writeback_range(struct address_space *mapping,
- pgoff_t start, pgoff_t end);
+extern int __wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end, wait_queue_t *wait);
+
+static inline int wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ return __wait_on_page_writeback_range(mapping, start, end,
+ &current->__wait.wait);
+}
+
extern int __filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end, int sync_mode);

_
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 08:41:49

by Suparna Bhattacharya

Subject: Re: [FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine

Sorry this should have read [PATCH 1/8] instead of [PATCH 1/6]

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 09:54:40

by Ingo Molnar

Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write


* Suparna Bhattacharya <[email protected]> wrote:

> The following is a sampling of comparative aio-stress results with the
> patches (each run starts with uncached files):
>
> ---------------------------------------------
>
> aio-stress throughput comparisons (in MB/s):
>
> file size 1GB, record size 64KB, depth 64, ios per iteration 8
> max io_submit 8, buffer alignment set to 4KB
> 4 way Pentium III SMP box, Adaptec AIC-7896/7 Ultra2 SCSI, 40 MB/s
> Filesystem: ext2
>
> ----------------------------------------------------------------------------
>                             Buffered (non O_DIRECT)        O_DIRECT
>                             Vanilla     Patched         Vanilla    Patched
> ----------------------------------------------------------------------------
> Random-Read                   10.08       23.91           18.91      18.98
> Random-O_SYNC-Write            8.86       15.84           16.51      16.53
> Sequential-Read               31.49       33.00           31.86      31.79
> Sequential-O_SYNC-Write        8.68       32.60           31.45      32.44
> Random-Write                  31.09 (19.65)               30.90 (19.65)
> Sequential-Write              30.84 (28.94)               30.09 (28.39)

the numbers look very convincing to me!

Ingo

2006-12-28 11:42:45

by Evgeniy Polyakov

Subject: Re: [RFC] Heads up on a series of AIO patchsets

[ I'm only subscribed to linux-fsdevel@ from above Cc list, please keep this
list in Cc: for AIO related stuff. ]

On Wed, Dec 27, 2006 at 04:25:30PM +0000, Christoph Hellwig ([email protected]) wrote:
> (1) note that there is another problem with the current kevent interface,
> and that is that it duplicates the event infrastructure for its
> underlying subsystems instead of reusing existing code (e.g.
> inotify, epoll, dio-aio). If we want kevent to be _the_ unified
> event system for Linux we need people to help out with straightening
> out these event providers, as Evgeniy seems to be unwilling/unable to
> do the work himself and the duplication is simply not acceptable.

I would rewrite inotify/epoll to use kevent, but I would strongly prefer
that it be done by the people who created the original interfaces - it is
a political decision, not a technical one - I do not want to be blamed on
every corner that I killed other people's work :)

FS and network AIO kevent-based stuff was dropped from the kevent tree in
favour of an upcoming project (description below).

Regarding AIO - my personal opinion is that AIO should be designed
asynchronously in all aspects. Here is a brief note on how I plan to
implement it (I plan to start in about a week, after the New Year vacations).

===

All existing AIO - both mainline and kevent based - lacks a major feature:
it is not fully asynchronous, i.e. it requires a synchronous set of
steps, some of which could be asynchronous. For example aio_sendfile() [1]
requires opening the file descriptor and only then the aio_sendfile() call.
The same applies to mainline AIO and read/write calls.

My idea is to create real asynchronous IO - i.e. some entity which
describes a set of tasks to be performed asynchronously (from the user's
point of view, although read and write obviously must be done after open
and before close). For example, a syscall which gets as parameters a
destination socket and a local filename (with optional offset and length
fields), and which - asynchronously from the user's point of view - opens
the file, transfers the requested part to the destination socket and then
returns the opened file descriptor (or closes it if requested). A similar
mechanism can be provided for read/write calls.

This approach, like asynchronous IO in general, requires access to user
memory from a kernel thread or even from an interrupt handler (that is
where kevent based AIO completes its requests) - it can be done in a way
similar to the existing kevent ring buffer implementation, and it can also
use a dedicated kernel thread or workqueue to copy data into process
memory.

It is a very interesting task and should greatly speed up the workloads of
busy web/ftp and other servers, which work with a huge number of files and
a huge number of clients.
I've put it on my TODO list.

Someone, please stop the time for several days, so I could create some
really good things for the universe.

1. Network AIO
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

--
Evgeniy Polyakov

2006-12-28 11:55:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

On Thu, Dec 28, 2006 at 02:11:49PM +0530, Suparna Bhattacharya wrote:
> -extern void FASTCALL(lock_page_slow(struct page *page));
> +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
> extern void FASTCALL(__lock_page_nosync(struct page *page));
> extern void FASTCALL(unlock_page(struct page *page));
>
> /*
> * lock_page may only be called if we have the page's inode pinned.
> */
> -static inline void lock_page(struct page *page)
> +static inline int __lock_page(struct page *page, wait_queue_t *wait)
> {
> might_sleep();
> if (TestSetPageLocked(page))
> - lock_page_slow(page);
> + return __lock_page_slow(page, wait);
> + return 0;
> }
>
> +#define lock_page(page) __lock_page(page, &current->__wait.wait)
> +#define lock_page_slow(page) __lock_page_slow(page, &current->__wait.wait)

Can we please simply kill your lock_page_slow wrapper and rename the
argument-taking __lock_page_slow to lock_page_slow? That many
variants of the locking functions aren't all that useful and there are
very few users.

Similarly I don't really think __lock_page is an all that useful name here.
What about lock_page_wq? Or aio_lock_page to denote it has special
meaning in aio context? Then again, because of these special semantics
we need a bunch of really verbose kerneldoc comments for this function
family.

2006-12-28 11:57:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read

> + if (in_aio()) {
> + /* Avoid repeat readahead */
> + if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
> + next_index = last_index;
> + }

Every place we use kiocbTryRestart in this and the next patch it's in
this form, so we should add a little helper for it:

int aio_try_restart(void)
{
	wait_queue_t *wq = current->io_wait;

	if (!is_sync_wait(wq) && kiocbTryRestart(io_wait_to_kiocb(wq)))
		return 1;
	return 0;
}

with a big kerneldoc comment explaining this idiom (and possibly a better
name for the function ;-))

> +
> + if ((error = __lock_page(page, current->io_wait))) {
> + goto readpage_error;
> + }

This should be

error = __lock_page(page, current->io_wait);
if (error)
goto readpage_error;

Plus possible naming updates discussed in the last mail. Also, do we
really need to pass current->io_wait here? Isn't the waitqueue in
the kiocb always guaranteed to be the same? Now that all pagecache
I/O goes through the ->aio_read/->aio_write routines I'd prefer to
get rid of the task_struct field kludges and pass all this around in
the kiocb.

2006-12-28 14:15:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read

> Pluse possible naming updates discussed in the last mail. Also do we
> really need to pass current->io_wait here? Isn't the waitqueue in
> the kiocb always guaranteed to be the same? Now that all pagecache
> I/O goes through the ->aio_read/->aio_write routines I'd prefer to
> get rid of the task_struct field cludges and pass all this around in
> the kiocb.

Btw, in current mainline task_struct.io_wait is not used at all! The patch
below would remove it vs mainline, although I don't think it should go
in as-is as it would create quite a bit of a mess for your patchset.

2006-12-28 14:42:59

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

On Thu, Dec 28, 2006 at 11:55:10AM +0000, Christoph Hellwig wrote:
> On Thu, Dec 28, 2006 at 02:11:49PM +0530, Suparna Bhattacharya wrote:
> > -extern void FASTCALL(lock_page_slow(struct page *page));
> > +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
> > extern void FASTCALL(__lock_page_nosync(struct page *page));
> > extern void FASTCALL(unlock_page(struct page *page));
> >
> > /*
> > * lock_page may only be called if we have the page's inode pinned.
> > */
> > -static inline void lock_page(struct page *page)
> > +static inline int __lock_page(struct page *page, wait_queue_t *wait)
> > {
> > might_sleep();
> > if (TestSetPageLocked(page))
> > - lock_page_slow(page);
> > + return __lock_page_slow(page, wait);
> > + return 0;
> > }
> >
> > +#define lock_page(page) __lock_page(page, &current->__wait.wait)
> > +#define lock_page_slow(page) __lock_page_slow(page, &current->__wait.wait)
>
> Can we please simply kill your lock_page_slow wrapper and rename the
> arguments taking __lock_page_slow to lock_page_slow? All too many
> variants of the locking functions aren't all that useful and there's
> very few users.

OK.

>
> Similarly I don't really think __lock_page is an all that useful name here.
> What about lock_page_wq? or aio_lock_page to denote it has special

I am really bad with names :( I tried using the _wq suffixes earlier and
that seemed confusing to some, but if no one else objects I'm happy to use
that. I thought aio_lock_page() might be misleading because it is
synchronous if a regular wait queue entry is passed in, but again it may not
be too bad.

What's your preference ? Does anything more intuitive come to mind ?

> meaning in aio contect? Then again because of these special sematics
> we need a bunch of really verbose kerneldoc comments for this function
> famility.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 15:14:12

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read

On Thu, Dec 28, 2006 at 11:57:47AM +0000, Christoph Hellwig wrote:
> > + if (in_aio()) {
> > + /* Avoid repeat readahead */
> > + if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
> > + next_index = last_index;
> > + }
>
> Every place we use kiocbTryRestart in this and the next patch it's in
> this from, so we should add a little helper for it:
>
> int aio_try_restart(void)
> {
> struct wait_queue_head_t *wq = current->io_wait;
>
> if (!is_sync_wait(wq) && kiocbTryRestart(io_wait_to_kiocb(wq)))
> return 1;
> return 0;
> }

Yes, we can do that -- how about aio_restarted() as an alternate name ?

>
> with a big kerneldoc comment explaining this idiom (and possible a better
> name for the function ;-))
>
> > +
> > + if ((error = __lock_page(page, current->io_wait))) {
> > + goto readpage_error;
> > + }
>
> This should be
>
> error = __lock_page(page, current->io_wait);
> if (error)
> goto readpage_error;
>
> Pluse possible naming updates discussed in the last mail. Also do we
> really need to pass current->io_wait here? Isn't the waitqueue in
> the kiocb always guaranteed to be the same? Now that all pagecache

We don't have the kiocb available to this routine. Using current->io_wait
avoids the need to pass the iocb down to deeper levels just for the sync vs
async checks, and also allows such routines to be shared by other code which
does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read
->do_generic_mapping_read) without having to set up dummy iocbs.

Does that clarify ? We could abstract this away within a lock page wrapper,
but I don't know if that makes a difference.

> I/O goes through the ->aio_read/->aio_write routines I'd prefer to
> get rid of the task_struct field cludges and pass all this around in
> the kiocb.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-12-28 16:31:35

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read


On Dec 28 2006 11:57, Christoph Hellwig wrote:
>
>> +
>> + if ((error = __lock_page(page, current->io_wait))) {
>> + goto readpage_error;
>> + }
>
>This should be
>
> error = __lock_page(page, current->io_wait);
> if (error)
> goto readpage_error;

That's effectively the same. Essentially a style thing, and I see that
if ((err = xyz)) is not uncommon in the kernel tree.


-`J'
--

2006-12-28 16:59:25

by Randy Dunlap

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read

On Thu, 28 Dec 2006 17:22:07 +0100 (MET) Jan Engelhardt wrote:

>
> On Dec 28 2006 11:57, Christoph Hellwig wrote:
> >
> >> +
> >> + if ((error = __lock_page(page, current->io_wait))) {
> >> + goto readpage_error;
> >> + }
> >
> >This should be
> >
> > error = __lock_page(page, current->io_wait);
> > if (error)
> > goto readpage_error;
>
> That's effectively the same. Essentially a style thing, and I see if((err =
> xyz)) not being uncommon in the kernel tree.

The combined if/assignment has been known to contain coding errors
that are legal C, just not what was intended.

Since the latter replacement shouldn't cause any code efficiency
problems and it's more readable, it's becoming preferred.
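
For example (purely illustrative):

	if ((error = __lock_page(page, wait)))	/* what was intended */
	if ((error == __lock_page(page, wait)))	/* also legal C, but it only
						   compares and leaves error
						   unassigned */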

---
~Randy

2006-12-28 22:43:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 3/8] Routines to initialize and test a wait bit key

On Thu, 28 Dec 2006 14:09:00 +0530
Suparna Bhattacharya <[email protected]> wrote:

> +#define init_wait_bit_key(waitbit, word, bit) \
> + do { \
> + (waitbit)->key.flags = word; \
> + (waitbit)->key.bit_nr = bit; \
> + } while (0)
> +
> +#define init_wait_bit_task(waitbit, tsk) \
> + do { \
> + (waitbit)->wait.private = tsk; \
> + (waitbit)->wait.func = wake_bit_function; \
> + INIT_LIST_HEAD(&(waitbit)->wait.task_list); \
> + } while (0)

Can we convert these to C functions (inlined or regular, probably inlined
would be better)?
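
Something along these lines, perhaps (untested sketch, assuming the
argument is a struct wait_bit_queue * as in the patch):

static inline void init_wait_bit_key(struct wait_bit_queue *wait,
				     void *word, int bit)
{
	wait->key.flags = word;
	wait->key.bit_nr = bit;
}

static inline void init_wait_bit_task(struct wait_bit_queue *wait,
				      struct task_struct *tsk)
{
	wait->wait.private = tsk;
	wait->wait.func = wake_bit_function;
	INIT_LIST_HEAD(&wait->wait.task_list);
}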

2006-12-28 22:54:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, 28 Dec 2006 13:53:08 +0530
Suparna Bhattacharya <[email protected]> wrote:

> This patchset implements changes to make filesystem AIO read
> and write asynchronous for the non O_DIRECT case.

I did s/lock_page_slow/lock_page_blocking/g then merged all these
into -mm, thanks.

2007-01-02 14:26:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

On Thu, Dec 28, 2006 at 08:17:17PM +0530, Suparna Bhattacharya wrote:
> I am really bad with names :( I tried using the _wq suffixes earlier and
> that seemed confusing to some, but if no one else objects I'm happy to use
> that. I thought aio_lock_page() might be misleading because it is
> synchronous if a regular wait queue entry is passed in, but again it may not
> be too bad.
>
> What's your preference ? Does anything more intuitive come to mind ?

Being bad about naming seems to be a disease; at least I suffer from it
as well. I wouldn't mind either the _wq or aio_ naming - _wq describes
the way it's called and aio_ describes that it's a special case for aio,
similar to how ->aio_read/->aio_write can be used for synchronous I/O
as well.

2007-01-02 14:29:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 7/8] Filesystem AIO read

On Thu, Dec 28, 2006 at 08:48:30PM +0530, Suparna Bhattacharya wrote:
> Yes, we can do that -- how about aio_restarted() as an alternate name ?

Sounds fine to me.

> > Pluse possible naming updates discussed in the last mail. Also do we
> > really need to pass current->io_wait here? Isn't the waitqueue in
> > the kiocb always guaranteed to be the same? Now that all pagecache
>
> We don't have have the kiocb available to this routine. Using current->io_wait
> avoids the need to pass the iocb down to deeper levels just for the sync vs
> async checks, also allowing such routines to be shared by other code which
> does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read
> ->do_generic_mapping_read) without having to set up dummy iocbs.

We really want to switch sendfile to kiocbs btw - for one thing to
allow an aio_sendfile implementation, and second to make it share more
with all the other I/O path code so we can avoid special cases in the
fs code. So I'm not convinced by that argument. But again, we don't
need to put the io_wait removal into your patchkit. I'll try to
hack on it once I get a little spare time.

2007-01-02 21:38:18

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

On 12/28/06, Evgeniy Polyakov <[email protected]> wrote:
> [ I'm only subscribed to linux-fsdevel@ from above Cc list, please keep this
> list in Cc: for AIO related stuff. ]
>
> On Wed, Dec 27, 2006 at 04:25:30PM +0000, Christoph Hellwig ([email protected]) wrote:
> > (1) note that there is another problem with the current kevent interface,
> > and that is that it duplicates the event infrastructure for it's
> > underlying subsystems instead of reusing existing code (e.g.
> > inotify, epoll, dio-aio). If we want kevent to be _the_ unified
> > event system for Linux we need people to help out with straightening
> > out these even provides as Evgeny seems to be unwilling/unable to
> > do the work himself and the duplication is simply not acceptable.
>
> I would rewrite inotify/epoll to use kevent, but I would strongly prefer
> that it would be done by peopl who created original interfaces - it is
> politic decision, not techinical - I do not want to be blamed on each
> corner that I killed other people work :)
>
> FS and network AIO kevent based stuff was dropped from kevent tree in
> favour of upcoming project (description below).
>
> According do AIO - my personal opinion is that AIO should be designed
> asynchronously in all aspects. Here is brief note on how I plan to
> iplement it (I plan to start in about a week after New Year vacations).
>
> ===
>
> All existing AIO - both mainline and kevent based lack major feature -
> they are not fully asyncronous, i.e. they require synchronous set of
> steps, some of which can be asynchronous. For example aio_sendfile() [1]
> requires open of the file descriptor and only then aio_sendfile() call.
> The same applies to mainline AIO and read/write calls.
>
> My idea is to create real asyncronous IO - i.e. some entity which will
> describe set of tasks which should be performed asynchronously (from
> user point of view, although read and write obviously must be done after
> open and before close), for example syscall which gets as parameter
> destination socket and local filename (with optional offset and length
> fields), which will asynchronously from user point of view open a file
> and transfer requested part to the destination socket and then return
> opened file descriptor (or it can be closed if requested). Similar
> mechanism can be done for read/write calls.
>
> This approach as long as asynchronous IO at all requires access to user
> memory from kernels thread or even interrupt handler (that is where
> kevent based AIO completes its requests) - it can be done in the way
> similar to how existing kevent ring buffer implementation and also can
> use dedicated kernel thread or workqueue to copy data into process
> memory.
>
Would you have time to comment on the approach I have taken to
implement a standard asynchronous memcpy interface? It seems it would
be a good complement to what you are proposing. The entity that
describes the aio operation could take advantage of asynchronous
engines to carry out copies or other transforms (maybe an acrypto tie
in as well).

Here is the posting for 2.6.19. There have since been updates for
2.6.20, but the overall approach remains the same.
intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

Regards,

Dan

2007-01-02 23:56:13

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset, attempts to address one part of
> the problem
> which is to make regular file IO, (without O_DIRECT)
> asynchronous (mainly
> the case of reads of uncached or partially cached files, and
> O_SYNC writes).

One of the properties of the currently implemented EIOCBRETRY aio
path is that ->mm is the only field in current which matches the
submitting task_struct while inside the retry path.

It looks like a retry-based aio write path would be broken because of
this. generic_write_checks() could run in the aio thread and get its
task_struct instead of that of the submitter. The wrong rlimit will
be tested and SIGXFSZ won't be raised. remove_suid() could check the
capabilities of the aio thread instead of those of the submitter.
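
To make that concrete, the rlimit check in generic_write_checks() goes
roughly like this (paraphrased, not an exact quote of mm/filemap.c):

	unsigned long limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;

	if (limit != RLIM_INFINITY) {
		if (*pos >= limit) {
			send_sig(SIGXFSZ, current, 0);
			return -EFBIG;
		}
		if (*count > limit - (typeof(limit))*pos)
			*count = limit - (typeof(limit))*pos;
	}

Both the rlimit lookup and the signal delivery go through current.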

I don't think EIOCBRETRY is the way to go because of this increased
(and subtle!) complexity. What are the chances that we would have
ever found those bugs outside code review? How do we make sure that
current references don't sneak back in after having initially audited
the paths?

Take the io_cmd_epoll_wait patch..

> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach
> Brown with
> modifications from Jeff Moyer and me) addresses this problem
> for native
> linux aio in a simple manner.

It's simple looking, sure. This current flipping didn't even occur
to me while throwing the patch together!

But that patch ends up calling ->poll (and poll_table->qproc) and
writing to userspace (so potentially calling ->nopage) from the aio
threads. Are we sure that none of them will behave surprisingly
because current changed under them?

It might be safe now, but that isn't really the point. I'd rather we
didn't have yet one more subtle invariant to audit and maintain.

At the risk of making myself vulnerable to the charge of mentioning
vapourware, I will admit that I've been working on a (slightly mad)
implementation of async syscalls. I've been quiet about it because I
don't want to whip up complicated discussion without being able to
show code that works, even if barely. I mention it now only to make
it clear that I want to be constructive, not just critical :).

- z

2007-01-03 01:18:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

> > Any details?
>
> Well, one path I tried I couldn't help but post a blog entry about
> for my friends. I'm not sure it's the direction I'll take with linux-
> kernel, but the fundamentals are there: the api should be the
> syscall interface, and there should be no difference between sync and
> async behaviour.
>
> http://www.zabbo.net/?p=72

Any code you're willing to let people play with? I could at least have
real test cases, and a library to go along with it as it gets
finished.

Another pie in the sky idea:
One thing that's been bugging me lately (working on a 9p server) is that
sendfile is hard to use in practice because you need packet headers
and such, and they need to go out at the same time.

Sendfile listio support would fix this, but it's not a general
solution. What would be really useful is a way to say that a certain
batch of async ops either all succeed or all fail, and happen
atomically; i.e., transactions for syscalls.

Probably even harder to do than general async syscalls, but it'd be
the best thing since sliced bread... and hey, it seems the logical
next step.

2007-01-03 04:58:54

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

>
> >(1) The filesystem AIO patchset, attempts to address one part of
> >the problem
> > which is to make regular file IO, (without O_DIRECT)
> >asynchronous (mainly
> > the case of reads of uncached or partially cached files, and
> >O_SYNC writes).
>
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to
operate on the caller's address space for copy_from/to_user.

The actual io setup and associated checks are expected to have been
handled at submission time.
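
(Roughly, the retry worker in fs/aio.c borrows the submitter's address
space around the retries - a simplified sketch, details elided:

	use_mm(ctx->mm);	/* switch to the submitter's mm */
	__aio_run_iocbs(ctx);	/* retries may copy_to_user() into it */
	unuse_mm(ctx->mm);

so user copies in the retry path land in the right address space even
though current is an aio kernel thread.)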

>
> It looks like a retry-based aio write path would be broken because of
> this. generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter. The wrong rlimit will
> be tested and SIGXFSZ won't be raised. remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() is done in the submission path and not repeated during
retries, so such checks are not intended to run in the aio thread.

Did I miss something here ?

>
> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity. What are the chances that we would have
> ever found those bugs outside code review? How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used blindly.
It is just one alternative for implementing an aio operation by splitting up
responsibility between the submitter and the aio threads, where the aio
threads can run in the caller's address space.

>
> Take the io_cmd_epoll_wait patch..
>
> > issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach
> >Brown with
> > modifications from Jeff Moyer and me) addresses this problem
> >for native
> > linux aio in a simple manner.
>
> It's simple looking, sure. This current flipping didn't even occur
> to me while throwing the patch together!
>
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ?
The copy_from/to_user/put_user constructs are designed to handle soft failures,
and we are already using the caller's ->mm. Do you see a need for any
additional asserts() ?

If there is something that is needed by ->nopage etc which is not abstracted
out within the ->mm, then we would need to fix that instead, for correctness
anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code and the
effect of these would be to hold up / delay subsequent queued aio operations;
which is an efficiency issue, but not a correctness concern.

> threads. Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that
what manifests relatively more easily with AIO is also a subtle problem
in other cases.

>
> It might be safe now, but that isn't really the point. I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
>
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls. I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely. I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that
whatever implementation you come up with will have a different interface
from current linux aio -- i.e. a next generation aio model that will
integrate easily with kevents etc.

Which takes me back to Ingo's point - let's have the new evolve in parallel
with the old, if we can, and not hold up the patches for POSIX AIO to
start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you
are basically implementing Linus' microthreads scheduling approach -
one year since we had that discussion. Glad to see that you found a way
to make it workable ... (I'm guessing that you are copying over the part
of the stack in use at the time of every switch, is that correct ? At what
point do you do the allocation of the saved stacks ? Sorry I should hold
off all these questions till your patch comes out)

Regards
Suparna

>
> - z

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-03 07:19:21

by Suparna Bhattacharya

[permalink] [raw]
Subject: [PATCHSET 2][PATCH 1/1] Combining epoll and disk file AIO

On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.
>


Enable epoll wait to be unified with io_getevents

From: Zach Brown, Jeff Moyer, Suparna Bhattacharya

Previously there have been (complicated and scary) attempts to funnel
individual aio events down epoll or vice versa. This instead lets one
issue an entire sys_epoll_wait() as an aio op. You'd set up epoll as
usual and then issue epoll_wait aio ops which would complete once epoll
events had been copied. This will enable a single io_getevents() event
loop to process both disk AIO and epoll notifications.

From an application standpoint a typical flow works like this:
- Use epoll_ctl as usual to add/remove epoll registrations
- Instead of issuing sys_epoll_wait, setup an iocb using
io_prep_epoll_wait (see examples below) specifying the epoll
events buffer to fill up with epoll notifications. Submit the iocb
using io_submit
- Now io_getevents can be used to wait for both epoll waits and
disk aio completion. If the returned AIO event is of type
IO_CMD_EPOLL_WAIT, the corresponding result value indicates the
number of epoll notifications in the iocb's event buffer, which
can now be processed just like one would process results from a
sys_epoll_wait() (a small illustrative sketch follows this list)
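
For illustration, an event loop built on this could look roughly as below
(a sketch only: io_prep_epoll_wait() is the assumed helper from the sample
code referenced below, filling the iocb as described above, and
handle_epoll_events() stands in for the application's own processing):

	io_context_t ctx = 0;
	struct epoll_event ep_events[64];
	struct io_event aio_events[16];
	struct iocb iocb, *iocbp = &iocb;
	int i, n, epfd = epoll_create(10);

	io_queue_init(128, &ctx);
	/* register fds with epoll_ctl() as usual ... */

	io_prep_epoll_wait(&iocb, epfd, ep_events, 64, -1 /* no timeout */);
	io_submit(ctx, 1, &iocbp);

	for (;;) {
		n = io_getevents(ctx, 1, 16, aio_events, NULL);
		for (i = 0; i < n; i++) {
			if (aio_events[i].obj == &iocb) {
				/* res = number of epoll notifications */
				handle_epoll_events(ep_events,
						    aio_events[i].res);
				/* resubmit for the next batch */
				io_submit(ctx, 1, &iocbp);
			} else {
				/* a completed disk aio request */
			}
		}
	}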

There are a couple of sample applications:
- Andrew Tridgell has implemented a little test harness using an aio events
library implementation intended for samba4
http://samba.org/~tridge/etest
(The -e aio option uses aio epoll wait and can issue disk aio as well)
- An updated version of pipetest from Jeff Moyer has a --aio-epoll option
http://people.redhat.com/jmoyer/aio/epoll/pipetest.c

There is obviously a little overhead compared to using sys_epoll_wait(), due
to the extra step of submitting the epoll wait iocb, most noticeable when
there are very few events processed per loop. However, the goal here is not
to build an epoll alternative but merely to allow network and disk I/O to
be processed in the same event loop, which is where the efficiencies really
come from. Picking up more epoll events in each loop can amortize the
overhead across many operations and mitigate the impact.

Thanks to Arjan van de Ven for helping figure out how to resolve the
lockdep complaints. Both ctx->lock and ep->lock can be held in certain
wait queue callback routines, thus being nested inside q->lock. However, this
excludes the ctx->wait and ep->wq wait queues, which can safely be nested
inside ctx->lock or ep->lock respectively. So we teach lockdep to recognize
these as distinct classes.

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
Signed-off-by: Suparna Bhattacharya <[email protected]>

---

linux-2.6.20-rc1-root/fs/aio.c | 54 +++++++++++++
linux-2.6.20-rc1-root/fs/eventpoll.c | 95 +++++++++++++++++++++---
linux-2.6.20-rc1-root/include/linux/aio.h | 2
linux-2.6.20-rc1-root/include/linux/aio_abi.h | 1
linux-2.6.20-rc1-root/include/linux/eventpoll.h | 31 +++++++
linux-2.6.20-rc1-root/include/linux/sched.h | 2
linux-2.6.20-rc1-root/kernel/timer.c | 21 +++++
7 files changed, 196 insertions(+), 10 deletions(-)

diff -puN fs/aio.c~aio-epoll-wait fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c 2007-01-03 11:45:40.000000000 +0530
@@ -30,6 +30,7 @@
#include <linux/highmem.h>
#include <linux/workqueue.h>
#include <linux/security.h>
+#include <linux/eventpoll.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -193,6 +194,8 @@ static int aio_setup_ring(struct kioctx
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
} while(0)

+static struct lock_class_key ioctx_wait_queue_head_lock_key;
+
/* ioctx_alloc
* Allocates and initializes an ioctx. Returns an ERR_PTR if it failed.
*/
@@ -224,6 +227,8 @@ static struct kioctx *ioctx_alloc(unsign
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(&ctx->wait);
+ /* Teach lockdep to recognize this lock as a different class */
+ lockdep_set_class(&ctx->wait.lock, &ioctx_wait_queue_head_lock_key);

INIT_LIST_HEAD(&ctx->active_reqs);
INIT_LIST_HEAD(&ctx->run_list);
@@ -1401,6 +1406,42 @@ static ssize_t aio_setup_single_vector(s
return 0;
}

+/* Uses iocb->private */
+void aio_free_iocb_timer(struct kiocb *iocb)
+{
+ struct timer_list *timer = (struct timer_list *)iocb->private;
+
+ if (timer) {
+ del_timer(timer);
+ kfree(timer);
+ iocb->private = NULL;
+ }
+}
+
+/* Uses iocb->private */
+int aio_setup_iocb_timer(struct kiocb *iocb, unsigned long expires,
+ void (*function)(unsigned long))
+{
+ struct timer_list *timer;
+
+ if (iocb->private)
+ return -EEXIST;
+
+ timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+ if (!timer)
+ return -ENOMEM;
+
+ init_timer(timer);
+ timer->function = function;
+ timer->data = (unsigned long)iocb;
+ timer->expires = expires;
+
+ iocb->private = timer;
+ iocb->ki_dtor = aio_free_iocb_timer;
+ return 0;
+}
+
+
/*
* aio_setup_iocb:
* Performs the initial checks and aio retry method
@@ -1486,6 +1527,19 @@ static ssize_t aio_setup_iocb(struct kio
if (file->f_op->aio_fsync)
kiocb->ki_retry = aio_fsync;
break;
+ case IOCB_CMD_EPOLL_WAIT:
+ /*
+ * Note that we unconditionally allocate a timer, but we
+ * only use it if a timeout was specified. Otherwise, it
+ * is just a holder for the "infinite" value.
+ */
+ ret = aio_setup_iocb_timer(kiocb, ep_relative_ms_to_jiffies(
+ kiocb->ki_pos), eventpoll_aio_timer);
+ if (unlikely(ret))
+ break;
+ kiocb->ki_retry = eventpoll_aio_wait;
+ kiocb->ki_cancel = eventpoll_aio_cancel;
+ break;
default:
dprintk("EINVAL: io_submit: no operation provided\n");
ret = -EINVAL;
diff -puN fs/eventpoll.c~aio-epoll-wait fs/eventpoll.c
--- linux-2.6.20-rc1/fs/eventpoll.c~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/eventpoll.c 2006-12-28 14:22:52.000000000 +0530
@@ -35,6 +35,7 @@
#include <linux/mount.h>
#include <linux/bitops.h>
#include <linux/mutex.h>
+#include <linux/aio.h>
#include <asm/uaccess.h>
#include <asm/system.h>
#include <asm/io.h>
@@ -642,6 +643,75 @@ eexit_1:
return error;
}

+/*
+ * Called when the eventpoll timer expires or a cancellation
+ * occurs for an aio_epoll_wait. It is enough for this function to
+ * trigger a wakeup on the eventpoll waitqueue. The aio_wake_function()
+ * callback will pull out the wait queue entry and kick the iocb so that
+ * the rest gets taken care of in aio_run_iocb->aio_epoll_wait which
+ * can recognize the cancelled state or timeout expiration and do
+ * the right thing.
+ */
+void eventpoll_aio_timer(unsigned long data)
+{
+ struct kiocb *iocb = (struct kiocb *)data;
+ struct timer_list *timer = iocb_timer(iocb);
+ struct file *file = iocb->ki_filp;
+ struct eventpoll *ep = (struct eventpoll *)file->private_data;
+ unsigned long flags;
+
+ if (timer)
+ del_timer(timer);
+ write_lock_irqsave(&ep->lock, flags);
+ /* because ep->lock also protects ep->wq */
+ __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);
+ write_unlock_irqrestore(&ep->lock, flags);
+}
+
+
+int eventpoll_aio_cancel(struct kiocb *iocb, struct io_event *event)
+{
+ /* to wakeup the iocb, so actual cancellation happens aio_run_iocb */
+ eventpoll_aio_timer((unsigned long)iocb);
+
+ event->res = event->res2 = 0;
+ /* drop the cancel reference */
+ aio_put_req(iocb);
+
+ return 0;
+}
+
+/*
+ * iocb->ki_nbytes -- number of events
+ * iocb->ki_pos -- relative timeout in milliseconds
+ * iocb->private -- timer with absolute timeout in jiffies
+ */
+ssize_t eventpoll_aio_wait(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+ unsigned long expires;
+ struct timer_list *timer = iocb_timer(iocb);
+
+ if (!is_file_epoll(file) || iocb->ki_nbytes > EP_MAX_EVENTS ||
+ iocb->ki_nbytes <= 0)
+ return -EINVAL;
+
+ expires = timer->expires;
+ ret = ep_poll(file->private_data,
+ (struct epoll_event __user *)iocb->ki_buf,
+ iocb->ki_nbytes, ep_jiffies_to_relative_ms(expires));
+
+ /*
+ * If a timeout was specified, ep_poll returned retry, and we have
+ * not yet registered a timer, go ahead and register one.
+ */
+ if (ret == -EIOCBRETRY) {
+ mod_timer(timer, expires);
+ }
+
+ return ret;
+}

/*
* Implement the event wait interface for the eventpoll file. It is the kernel
@@ -824,6 +894,7 @@ eexit_1:
return error;
}

+static struct lock_class_key eventpoll_wait_queue_head_lock_key;

static int ep_alloc(struct eventpoll **pep)
{
@@ -835,6 +906,9 @@ static int ep_alloc(struct eventpoll **p
rwlock_init(&ep->lock);
init_rwsem(&ep->sem);
init_waitqueue_head(&ep->wq);
+ /* Teach lockdep to recognize this lock as a different class */
+ lockdep_set_class(&ep->wq.lock, &eventpoll_wait_queue_head_lock_key);
+
init_waitqueue_head(&ep->poll_wait);
INIT_LIST_HEAD(&ep->rdllist);
ep->rbr = RB_ROOT;
@@ -1549,7 +1623,7 @@ static int ep_poll(struct eventpoll *ep,
int res, eavail;
unsigned long flags;
long jtimeout;
- wait_queue_t wait;
+ wait_queue_t *wait = current->io_wait;

/*
* Calculate the timeout by checking for the "infinite" value ( -1 )
@@ -1569,16 +1643,13 @@ retry:
* We need to sleep here, and we will be wake up by
* ep_poll_callback() when events will become available.
*/
- init_waitqueue_entry(&wait, current);
- __add_wait_queue(&ep->wq, &wait);
-
for (;;) {
/*
* We don't want to sleep if the ep_poll_callback() sends us
* a wakeup in between. That's why we set the task state
* to TASK_INTERRUPTIBLE before doing the checks.
*/
- set_current_state(TASK_INTERRUPTIBLE);
+ prepare_to_wait(&ep->wq, wait, TASK_INTERRUPTIBLE);
if (!list_empty(&ep->rdllist) || !jtimeout)
break;
if (signal_pending(current)) {
@@ -1587,12 +1658,16 @@ retry:
}

write_unlock_irqrestore(&ep->lock, flags);
- jtimeout = schedule_timeout(jtimeout);
+ if ((jtimeout = schedule_timeout_wait(jtimeout, wait))
+ < 0) {
+ if ((res = jtimeout) == -EIOCBRETRY)
+ goto out;
+ }
+ if (res < 0)
+ break;
write_lock_irqsave(&ep->lock, flags);
}
- __remove_wait_queue(&ep->wq, &wait);
-
- set_current_state(TASK_RUNNING);
+ finish_wait(&ep->wq, wait);
}

/* Is it worth to try to dig for events ? */
@@ -1608,7 +1683,7 @@ retry:
if (!res && eavail &&
!(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
goto retry;
-
+out:
return res;
}

diff -puN include/linux/aio_abi.h~aio-epoll-wait include/linux/aio_abi.h
--- linux-2.6.20-rc1/include/linux/aio_abi.h~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio_abi.h 2006-12-28 14:22:52.000000000 +0530
@@ -43,6 +43,7 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_EPOLL_WAIT = 9,
};

/* read() from /dev/aio returns these structures. */
diff -puN include/linux/aio.h~aio-epoll-wait include/linux/aio.h
--- linux-2.6.20-rc1/include/linux/aio.h~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 14:22:52.000000000 +0530
@@ -244,6 +244,8 @@ do { \

#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \
struct wait_bit_queue, wait), struct kiocb, ki_wait)
+#define iocb_timer(iocb) ((struct timer_list *)((iocb)->private))
+

#include <linux/aio_abi.h>

diff -puN include/linux/eventpoll.h~aio-epoll-wait include/linux/eventpoll.h
--- linux-2.6.20-rc1/include/linux/eventpoll.h~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/eventpoll.h 2006-12-28 14:22:52.000000000 +0530
@@ -48,6 +48,33 @@ struct epoll_event {
/* Forward declarations to avoid compiler errors */
struct file;

+/* Maximum msec timeout value storeable in a long int */
+#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ)
+
+static inline int ep_jiffies_to_relative_ms(unsigned long expires)
+{
+ int relative_ms = 0;
+ unsigned long now = jiffies;
+
+ if (expires == MAX_SCHEDULE_TIMEOUT)
+ relative_ms = EP_MAX_MSTIMEO;
+ else if (time_before(now, expires))
+ relative_ms = jiffies_to_msecs(expires - now);
+
+ return relative_ms;
+}
+
+static inline unsigned long ep_relative_ms_to_jiffies(int relative_ms)
+{
+ unsigned long expires;
+
+ if (relative_ms < 0 || relative_ms >= EP_MAX_MSTIMEO)
+ expires = MAX_SCHEDULE_TIMEOUT;
+ else
+ expires = jiffies + msecs_to_jiffies(relative_ms);
+ return expires;
+}
+

#ifdef CONFIG_EPOLL

@@ -90,6 +117,10 @@ static inline void eventpoll_release(str
eventpoll_release_file(file);
}

+extern void eventpoll_aio_timer(unsigned long data);
+extern int eventpoll_aio_cancel(struct kiocb *iocb, struct io_event *event);
+extern ssize_t eventpoll_aio_wait(struct kiocb *iocb);
+
#else

static inline void eventpoll_init_file(struct file *file) {}
diff -puN include/linux/sched.h~aio-epoll-wait include/linux/sched.h
--- linux-2.6.20-rc1/include/linux/sched.h~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 14:22:52.000000000 +0530
@@ -247,6 +247,8 @@ extern int in_sched_functions(unsigned l

#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
+extern signed long FASTCALL(schedule_timeout_wait(signed long timeout,
+ wait_queue_t *wait));
extern signed long schedule_timeout_interruptible(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
diff -puN kernel/timer.c~aio-epoll-wait kernel/timer.c
--- linux-2.6.20-rc1/kernel/timer.c~aio-epoll-wait 2006-12-28 14:22:52.000000000 +0530
+++ linux-2.6.20-rc1-root/kernel/timer.c 2006-12-28 14:22:52.000000000 +0530
@@ -1369,6 +1369,27 @@ fastcall signed long __sched schedule_ti
EXPORT_SYMBOL(schedule_timeout);

/*
+ * Same as schedule_timeout, except that it checks the wait queue context
+ * passed in, and in case of an asynchronous waiter it does not sleep,
+ * but returns -EIOCBRETRY to allow the operation to be retried later when
+ * notified, unless it has been cancelled in which case it returns -EINTR
+ */
+fastcall signed long __sched schedule_timeout_wait(signed long timeout,
+ wait_queue_t *wait)
+{
+ struct kiocb *iocb;
+ if (is_sync_wait(wait))
+ return schedule_timeout(timeout);
+
+ iocb = io_wait_to_kiocb(wait);
+ if (kiocbIsCancelled(iocb))
+ return -EINTR;
+
+ return -EIOCBRETRY;
+}
+
+
+/*
* We can use __set_current_state() here because schedule_timeout() calls
* schedule() unconditionally.
*/
_


--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-03 13:36:46

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

On Tue, Jan 02, 2007 at 02:38:13PM -0700, Dan Williams ([email protected]) wrote:
> Would you have time to comment on the approach I have taken to
> implement a standard asynchronous memcpy interface? It seems it would
> be a good complement to what you are proposing. The entity that
> describes the aio operation could take advantage of asynchronous
> engines to carry out copies or other transforms (maybe an acrypto tie
> in as well).
>
> Here is the posting for 2.6.19. There has since been updates for
> 2.6.20, but the overall approach remains the same.
> intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
> async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

My first impression is that it has too many lists :)

Looks good, but IMHO there are further steps to implement.
I have not found any kind of scheduler there - what if the system has two
async engines? What if a sync engine is faster than an async one in some
cases (and it is indeed the case for small buffers), and should be selected
then? What if you want to add additional transformations for some
devices, like crypto processing or checksumming?

I would just create a driver for the low-level engine and export its
functionality - iop3_async_copy(), iop3_async_checksum(), iop3_async_crypto_1(),
iop3_async_crypto_2() and so on.

There will be a lot of potential users of exactly that functionality,
not just strictly hardcoded higher layer operations like raidX.

A more generic solution must be used to select the appropriate device.
We had a very brief discussion about the asynchronous crypto layer (acrypto)
and how its ideas could be used for async dma engines - the user should not
even know how his data has been transferred - he calls async_copy(),
which selects an appropriate device (and sync copy is just an additional,
usual device in that case) from the list of devices that have exported their
functionality. The selection can be done in millions of different ways, from
getting the first one from the list (this is essentially how your
approach is implemented right now) to using special (including run-time
updated) heuristics (as is done in acrypto).

Thinking further, async_copy() is just one case of the async class of
operations, so the same logic as above must be applied at this layer too.

But 'layers are the way to design protocols, not implement them'.
David Miller on netchannels

So, the user should not even know about layers - he should just say 'copy
data from pointer A to pointer B', or 'copy data from pointer A to
socket B', or even 'copy it from file "/tmp/file" to "192.168.0.1:80:tcp"',
without ever knowing that there are sockets and/or memcpy() calls involved,
and if the user requests it to be performed asynchronously, he must be
notified later (one might expect that I will prefer to use kevent :)
The same approach can thus be used by NFS/SAMBA/CIFS and other users.

That is how I plan to start implementing AIO (it looks like it is becoming
popular); a rough illustrative sketch follows the list below:
1. the system exports the set of operations it supports (send, receive, copy,
crypto, ....)
2. each operation has an associated set of suboptions (different crypto
types, for example)
3. each operation has a set of low-level drivers which support it (with
optional performance or other parameters)
4. each driver, when loaded, publishes its capabilities (async copy with
speed A, XOR and so on)
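
A purely illustrative sketch of the kind of registration/selection
interface this implies (invented names; none of this exists anywhere yet):

	struct async_op_driver {
		const char	*name;	/* "iop3_dma", "sync_memcpy", ... */
		u32		ops;	/* ASYNC_OP_COPY | ASYNC_OP_XOR | ... */
		u32		speed;	/* crude hint for the scheduler */
		int		(*submit)(void *req);
		struct list_head list;
	};

	/* step 4: a driver publishes its capabilities when it loads */
	int async_register_driver(struct async_op_driver *drv);

	/* the core picks a driver per request: first match, heuristics, ... */
	struct async_op_driver *async_select_driver(u32 op, size_t len);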

From the user's point of view, aio_sendfile() or async_copy() will look
like the following:
1. call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
1. call aio_schedule_file_socket(source='/tmp/file', dest='socket')
1. call aio_schedule_file_addr(source='/tmp/file',
dest='192.168.0.1:80:tcp')

or any other similar call

then wait for the returned descriptor in kevent_get_events() or provide your
own cookie in each call.

Each request is then converted into a FIFO of smaller requests like 'open
file', 'open socket', 'get in user pages' and so on, each of which should be
handled by an appropriate device (hardware or software); completion of
each request starts processing of the next one.

Reading the microthreading design notes, I recall a comparison of the NPTL
and Erlang threading models on the Debian site - they are _completely_
different models. NPTL creates real threads, which is what I suppose (and
hope NOT) the microthreading design will do too. It is slow.
(Or is it not, Zach? We are intrigued :)
It's damn bloody slow to create a thread compared to a proper non-blocking
state machine. The TUX state machine is similar to what I had in my first
kevent based FS and network AIO patchset, and to what I will use for the
current async processing work.


A bit of empty words actually, but it can provide some food for
thought.

> Regards,
>
> Dan

--
Evgeniy Polyakov

2007-01-03 22:16:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, 28 Dec 2006 13:53:08 +0530
Suparna Bhattacharya <[email protected]> wrote:

> This patchset implements changes to make filesystem AIO read
> and write asynchronous for the non O_DIRECT case.

Unfortunately the unplugging changes in Jens's block tree have trashed these
patches to a degree that I'm not confident in my repair attempts. So I'll
drop the fsaio patches from -mm.

Zach's observations regarding this code's reliance upon things at *current
sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Plus Jens's unplugging changes add more reliance upon context inside
*current, for the plugging and unplugging operations. I expect that the
fsaio patches will need to be aware of the protocol which those proposed
changes add.

2007-01-04 04:52:53

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 13:53:08 +0530
> Suparna Bhattacharya <[email protected]> wrote:
>
> > This patchset implements changes to make filesystem AIO read
> > and write asynchronous for the non O_DIRECT case.
>
> Unfortunately the unplugging changes in Jen's block tree have trashed these
> patches to a degree that I'm not confident in my repair attempts. So I'll
> drop the fasio patches from -mm.

I took a quick look and the conflicts seem pretty minor to me; the unplugging
changes mostly touch nearby code. Please let me know how you want this fixed
up.

From what I can tell, the comments in the unplug patches seem to say that
they need more work and testing, so perhaps a separate fixup patch may be
a better idea rather than making the fsaio patchset dependent on this.

>
> Zach's observations regarding this code's reliance upon things at *current
> sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Not really, at least nothing that I can see needing a change.
As I mentioned, there is no reliance on *current that we need to worry
about in the code that runs in the aio threads.

The generic_write_checks() etc. that Zach was referring to all happen in the
context of the submitting process, not in the retry context. The model is to
perform all validation at the time of io submission. And of course, things
like copy_to_user() are already taken care of by use_mm().

Let's look at it this way - the kernel already has the ability to do
background writeout on behalf of a task from a kernel thread, and likewise
to read(ahead) pages that may be consumed by another task. There is also the
ability to operate on another task's address space (as used by ptrace).

So there is nothing groundbreaking here.

In fact on most occasions, all the IO is initiated in the context of the
submitting task, so the aio threads mainly deal with checking for completion
and transferring completed data to user space.

>
> Plus Jens's unplugging changes add more reliance upon context inside
> *current, for the plugging and unplugging operations. I expect that the
> fsaio patches will need to be aware of the protocol which those proposed
> changes add.

Whatever logic applies to background writeout etc. should also just apply
as-is to aio worker threads, shouldn't it ? At least at a quick glance I
don't see anything special that needs to be done for fsaio, but it's good
to be aware of this anyway, thanks !

Regards
Suparna

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-04 05:52:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

Suparna Bhattacharya wrote:
> On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:

>>Plus Jens's unplugging changes add more reliance upon context inside
>>*current, for the plugging and unplugging operations. I expect that the
>>fsaio patches will need to be aware of the protocol which those proposed
>>changes add.
>
>
> Whatever logic applies to background writeout etc should also just apply
> as is to aio worker threads, shouldn't it ? At least at a quick glance I
> don't see anything special that needs to be done for fsaio, but its good
> to be aware of this anyway, thanks !

The submitting process plugs itself, submits all its IO, then unplugs
itself (ie. so the plug is now on the process, rather than the block
device).

So long as AIO threads do the same, there would be no problem (plugging
is optional, of course).

This (is supposed to) give a number of improvements over the traditional
plugging (although some downsides too). Most notably for me, the VM gets
cleaner ;)

However AIO could be an interesting case to test for explicit plugging
because of the way they interact. What kind of improvements do you see
with samba and do you have any benchmark setups?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 06:21:51

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> >On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
>
> >>Plus Jens's unplugging changes add more reliance upon context inside
> >>*current, for the plugging and unplugging operations. I expect that the
> >>fsaio patches will need to be aware of the protocol which those proposed
> >>changes add.
> >
> >
> >Whatever logic applies to background writeout etc should also just apply
> >as is to aio worker threads, shouldn't it ? At least at a quick glance I
> >don't see anything special that needs to be done for fsaio, but its good
> >to be aware of this anyway, thanks !
>
> The submitting process plugs itself, submits all its IO, then unplugs
> itself (ie. so the plug is now on the process, rather than the block
> device).
>
> So long as AIO threads do the same, there would be no problem (plugging
> is optional, of course).

Yup, the AIO threads run the same code as for regular IO, i.e. in the rare
situations where they actually end up submitting IO, so there should
be no problem. And you have already added plug/unplug at the appropriate
places in those paths, so things should just work.

>
> This (is supposed to) give a number of improvements over the traditional
> plugging (although some downsides too). Most notably for me, the VM gets
> cleaner ;)
>
> However AIO could be an interesting case to test for explicit plugging
> because of the way they interact. What kind of improvements do you see
> with samba and do you have any benchmark setups?

I think aio-stress would be a good way to test/benchmark this sort of
stuff, at least for a start.
Samba (if I understand this correctly based on my discussions with Tridge)
is less likely to generate the kind of io patterns that could benefit from
explicit plugging (because the file server has no way to tell what the next
request is going to be, it ends up submitting each independently instead of
batching iocbs).

In the future there may be optimization possibilities to consider when
submitting batches of iocbs, i.e. on the io submission path. Maybe
AIO with O_DIRECT would be interesting to play with first in this regard ?

Regards
Suparna

>
> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-04 06:51:27

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

Suparna Bhattacharya wrote:
> On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:

>>So long as AIO threads do the same, there would be no problem (plugging
>>is optional, of course).
>
>
> Yup, the AIO threads run the same code as for regular IO, i.e in the rare
> situations where they actually end up submitting IO, so there should
> be no problem. And you have already added plug/unplug at the appropriate
> places in those path, so things should just work.

Yes I think it should.

>>This (is supposed to) give a number of improvements over the traditional
>>plugging (although some downsides too). Most notably for me, the VM gets
>>cleaner ;)
>>
>>However AIO could be an interesting case to test for explicit plugging
>>because of the way they interact. What kind of improvements do you see
>>with samba and do you have any benchmark setups?
>
>
> I think aio-stress would be a good way to test/benchmark this sort of
> stuff, at least for a start.
> Samba (if I understand this correctly based on my discussions with Tridge)
> is less likely to generate the kind of io patterns that could benefit from
> explicit plugging (because the file server has no way to tell what the next
> request is going to be, it ends up submitting each independently instead of
> batching iocbs).

OK, but I think that after IO submission, you do not run sync_page to
unplug the block device, like the normal IO path would (via lock_page,
before the explicit plug patches).

However, with explicit plugging, AIO requests will be started immediately.
Maybe this won't be noticeable if the device is always busy, but I would
like to know there isn't a regression.

> In future there may be optimization possibilities to consider when
> submitting batches of iocbs, i.e. on the io submission path. Maybe
> AIO - O_DIRECT would be interesting to play with first in this regardi ?

Well, I've got some simple per-process batching in there now: each process
has a list of pending requests. Request merging is done locklessly against
the last request added, and submission at unplug time is batched under a
single block device lock.

I'm sure more merging or batching could be done, but also consider that
most programs will not ever make use of any added complexity.

Regarding your patches, I've just had a quick look and have a question --
what do you do about blocking in page reclaim and dirty balancing? Aren't
those major points of blocking with buffered IO? Did your test cases
dirty enough to start writeout or cause a lot of reclaim? (admittedly,
blocking in reclaim will now be much less common since the dirty mapping
accounting).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 06:51:46

by Nick Piggin

[permalink] [raw]
Subject: Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

Christoph Hellwig wrote:
> On Thu, Dec 28, 2006 at 08:17:17PM +0530, Suparna Bhattacharya wrote:
>
>>I am really bad with names :( I tried using the _wq suffixes earlier and
>>that seemed confusing to some, but if no one else objects I'm happy to use
>>that. I thought aio_lock_page() might be misleading because it is
>>synchronous if a regular wait queue entry is passed in, but again it may not
>>be too bad.
>>
>>What's your preference ? Does anything more intuitive come to mind ?
>
>
> Being bad about naming seems to be a disease, at least I suffer from it
> as well. I wouldn't mind either the _wq or aio_ naming - _wq describes
> the way it's called and aio_ describes that it's a special case for aio.
> Similarly to how ->aio_read/->aio_write can be used for synchronous I/O
> as well.

What about lock_page_async? A synchronous lock_page is the normal case,
and for that guy it makes no sense to explicitly pass in a waitqueue, so
it kind of falls into place?

--
SUSE Labs, Novell Inc.

2007-01-04 09:20:26

by Bharata B Rao

[permalink] [raw]
Subject: [PATCHSET 3][PATCH 0/5][AIO] - AIO completion signal notification v4

Hi

Here is a repost of Sebastien's AIO completion signal notification v4
patches along with the syscall based listio support patch. The goal
of this patchset is to improve the POSIX AIO support in the kernel.

While the first 4 patches provide the AIO completion signal notification
support, the 5th one provides the listio support.

Sebastien's original patchset had a different listio support patch
(patch number 5) based on overloading the io_submit() with a new
aio_lio_opcode (IOCB_CMD_GROUP). But here listio support is provided
by a separate system call.

As mentioned, this set consists of 5 patches:

1. rework-compat-sys-io-submit: cleanup the sys_io_submit() compat
layer, making it more efficient and laying out the base for the
following patches

2. aio-header-fix-includes: fixes the double inclusion of uio.h in aio.h

3. export-good_sigevent: move good_sigevent into signal.c and make it
non-static

4. aio-notify-sig: the AIO completion signal notification

5. listio: adds listio support

Descriptions are in the individual patches.

Original v4 post is present at http://lkml.org/lkml/2006/11/30/223

Changes from v3:
All changes following comments from Zach Brown and Christoph Hellwig

- added justification for the compat_sys_io_submit() cleanup
- more cleanups in compat_sys_io_submit() to put it in line with
sys_io_submit()
- Changed "Export good_sigevent()" patch name to "Make good_sigevent()
non-static" to better describe what it does.=20
- Reworked good_sigevent() to make it more readable.
- Simplified the use of the SIGEV_* constants in signal notification
- Take a reference on the target task both for the SIGEV_THREAD_ID and
SIGEV_SIGNAL cases.

Changes from v2:
- rebased to 2.6.19-rc6-mm2
- reworked the sys_io_submit() compat layer as suggested by Zach Brown
- fixed saving of a pointer to a task struct in aio-notify-sig as
pointed out by Andrew Morton

Changes from v1:
- cleanups suggested by Christoph Hellwig, Badari Pulavarty and
Zach Brown
- added lisio patch

Regards,
Bharata.

2007-01-04 09:23:32

by Bharata B Rao

[permalink] [raw]
Subject: [PATCHSET 3][PATCH 1/5][AIO] - Rework compat_sys_io_submit


compat_sys_io_submit() cleanup

Cleanup compat_sys_io_submit by duplicating some of the native syscall
logic in the compat layer and directly calling io_submit_one() instead
of fooling the syscall into thinking it is called from a native 64-bit
caller.

This eliminates:

- the overhead of copying the nr iocb pointers on the userspace stack

- the PAGE_SIZE/(sizeof(void *)) limit on the number of iocbs that
can be submitted.

This is also needed for the completion notification patch to avoid having
to rewrite each iocb on the caller stack for io_submit_one() to find the
sigevents.
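
For illustration only (this is not the attached patch), the reworked compat
path could look roughly like the sketch below: it walks the 32-bit iocb
pointer array directly and feeds each entry to io_submit_one(). The helpers
lookup_ioctx(), put_ioctx() and io_submit_one() here refer to the existing
fs/aio.c internals, and error handling is abbreviated.

asmlinkage long compat_sys_io_submit(aio_context_t ctx_id, int nr,
				     u32 __user *iocbpp)
{
	struct kioctx *ctx;
	long ret = 0;
	int i;

	if (unlikely(nr < 0))
		return -EINVAL;

	ctx = lookup_ioctx(ctx_id);
	if (unlikely(!ctx))
		return -EINVAL;

	for (i = 0; i < nr; i++) {
		compat_uptr_t p32;
		struct iocb __user *user_iocb;
		struct iocb tmp;

		/* fetch the 32-bit pointer and widen it */
		if (get_user(p32, iocbpp + i)) {
			ret = -EFAULT;
			break;
		}
		user_iocb = compat_ptr(p32);

		/* copy the iocb once, then submit it directly */
		if (copy_from_user(&tmp, user_iocb, sizeof(tmp))) {
			ret = -EFAULT;
			break;
		}
		ret = io_submit_one(ctx, user_iocb, &tmp);
		if (ret)
			break;
	}
	put_ioctx(ctx);
	return i ? i : ret;
}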


Attachments:
(No filename) (619.00 B)
rework-compat-sys-io-submit.patch (2.55 kB)
Download all attachments

2007-01-04 09:27:18

by Bharata B Rao

[permalink] [raw]
Subject: [PATCHSET 3][PATCH 3/5][AIO] - Make good_sigevent non-static

Make good_sigevent() non-static

Move good_sigevent() from posix-timers.c to signal.c where it belongs,
and make it non-static so that it can be used by other subsystems.
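
For illustration, the resulting declaration would be on the order of the
line below (which header it lands in is the patch's choice; the signature
matches the existing posix-timers helper):

	/* previously static in kernel/posix-timers.c, now shared */
	struct task_struct *good_sigevent(sigevent_t *event);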


Attachments:
(No filename) (195.00 B)
export-good_sigevent.patch (3.11 kB)
Download all attachments

2007-01-04 09:31:21

by Bharata B Rao

[permalink] [raw]
Subject: [PATCHSET 3][PATCH 4/5][AIO] - AIO completion signal notification

AIO completion signal notification

The current 2.6 kernel does not support notification of user space via
an RT signal upon an asynchronous IO completion. The POSIX specification
states that when an AIO request completes, a signal can be delivered to
the application as notification.

This patch adds a struct sigevent *aio_sigeventp to the iocb.
The relevant fields (pid, signal number and value) are stored in the kiocb
for use when the request completes.

That sigevent structure is filled by the application as part of the AIO
request preparation. Upon request completion, the kernel notifies the
application using those sigevent parameters. If SIGEV_NONE has been specified,
then the old behaviour is retained and the application must rely on polling
the completion queue using io_getevents().
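
As a rough illustration (not part of the patch) of the intended
application-side usage -- the raw io_submit() wrapper and the exact type of
the new aio_sigeventp field are assumptions here:

#include <linux/aio_abi.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* minimal raw syscall wrapper, assuming no libaio */
static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

static void submit_read_with_signal(aio_context_t ctx, int fd,
				    void *buf, size_t len)
{
	struct sigevent sev;
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;	/* or SIGEV_THREAD_ID / SIGEV_NONE */
	sev.sigev_signo = SIGRTMIN;
	sev.sigev_value.sival_ptr = buf;	/* cookie handed back in si_value */

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes = fd;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_sigeventp = &sev;		/* new field added by this patch */

	/* completion raises SIGRTMIN carrying sev.sigev_value */
	io_submit(ctx, 1, cbs);
}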


A struct sigevent *aio_sigeventp field is added to struct iocb in
include/linux/aio_abi.h

A struct aio_notify containing the sigevent parameters is defined in aio.h:

struct aio_notify {
	struct task_struct	*target;
	__u16			signo;
	__u16			notify;
	sigval_t		value;
};

A struct aio_notify ki_notify is added to struct kiocb in include/linux/aio.h

In io_submit_one(), if the application provided a sigevent then
setup_sigevent() is called which does the following:

- check access to the user sigevent and make a local copy

- if the requested notification is SIGEV_NONE, then nothing to do

- fill in the kiocb->ki_notify fields (notify, signo, value)

- check sigevent consistency, get the signal target task and
save it in kiocb->ki_notify.target

- preallocate a sigqueue for this event using sigqueue_alloc()

Upon request completion, in aio_complete(), if notification is needed for
this request (iocb->ki_notify.notify != SIGEV_NONE), then
aio_send_signal is called to signal the target task as follows (a rough
sketch appears after this list):

- fill in the siginfo struct to be sent to the task

- if notify is SIGEV_THREAD_ID then send signal to specific task
using send_sigqueue()

- else send signal to task group using send_group_sigqueue()
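
Illustrative sketch of that completion-side path (not the actual code, which
is in aio-notify-sig.patch; the sigq field holding the preallocated sigqueue
is an assumed name):

static void aio_send_signal(struct kiocb *iocb)
{
	struct aio_notify *n = &iocb->ki_notify;
	struct sigqueue *sigq = n->sigq;	/* preallocated at submit time */

	/* fill in the siginfo carried by the preallocated sigqueue */
	sigq->info.si_signo = n->signo;
	sigq->info.si_errno = 0;
	sigq->info.si_code = SI_ASYNCIO;
	sigq->info.si_value = n->value;

	if (n->notify == SIGEV_THREAD_ID)
		send_sigqueue(n->signo, sigq, n->target);	/* specific thread */
	else
		send_group_sigqueue(n->signo, sigq, n->target);	/* task group */
}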



Notes concerning sigqueue preallocation:

To ensure reliable delivery of completion notification, the sigqueue is
preallocated in the submission path so that there is no chance it can fail
in the completion path.

Unlike the posix-timers case (currently the single other user of sigqueue
preallocation), where the sigqueue is allocated for the lifetime of the
timer and freed at timer destruction time, the aio case is a bit more tricky
due to the async nature of the whole thing.

In the aio case, the sigqueue exists for the lifetime of the request,
therefore it must be freed only once the signal for the request completion
has been delivered. This involves changing __sigqueue_free() to free the
sigqueue when the signal is collected if si_code is SI_ASYNCIO, even if it
was preallocated, as well as explicitly calling sigqueue_free() in the
submission and completion error paths.


Attachments:
(No filename) (2.90 kB)
aio-notify-sig.patch (11.73 kB)
Download all attachments

2007-01-04 09:33:34

by Bharata B Rao

[permalink] [raw]
Subject: [PATCHSET 3][PATCH 5/5][AIO] - Add listio support

listio support through a system call (lio_submit)

This builds on the infrastructure provided by Sebastien's AIO
completion signal notification and listio patches to provide listio
support via a new system call.

long lio_submit(aio_context_t ctx_id, int mode, long nr,
		struct iocb __user * __user *iocbpp, struct sigevent __user *event)

More details about the system call appear within the patch itself.
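
For illustration, a user-space caller would presumably reach it through a
raw syscall until a libc wrapper exists; __NR_lio_submit below is a
placeholder, and the mode argument is assumed to mirror lio_listio()'s
LIO_WAIT/LIO_NOWAIT semantics:

#include <linux/aio_abi.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_lio_submit
#define __NR_lio_submit 0	/* placeholder: the real number comes with the patch */
#endif

static long lio_submit(aio_context_t ctx, int mode, long nr,
		       struct iocb **iocbpp, struct sigevent *ev)
{
	return syscall(__NR_lio_submit, ctx, mode, nr, iocbpp, ev);
}

/*
 * e.g. submit a whole list and ask for a single notification once all
 * of it has completed:
 *
 *	lio_submit(ctx, mode, nr, iocbs, &sev);
 */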

Sebastien had taken the approach of overloading io_submit() with a
new aio_lio_opcode (IOCB_CMD_GROUP) to support the listio behaviour.
This is an alternative approach for supporting listio.

Would this system call approach be agreeable ?

Would it make sense to add another argument to support partial
completion notification ? i.e., generate a notification when a minimum
number of ios complete. Would such a feature be desirable ? Is it ok
to add one more argument to the system call for this purpose ?

This patch, along with the previous 4 AIO completion signal notification
patches, has been tested using the libposix-aio library. Tests have been
done on x86 and x86_64 boxes. The compat syscall changes have been tested
on an x86_64 system.


Attachments:
(No filename) (1.13 kB)
aio-listio-support.patch (15.63 kB)
Download all attachments

2007-01-04 11:20:31

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, Jan 04, 2007 at 05:50:11PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> >On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
>
> >>So long as AIO threads do the same, there would be no problem (plugging
> >>is optional, of course).
> >
> >
> >Yup, the AIO threads run the same code as for regular IO, i.e. in the rare
> >situations where they actually end up submitting IO, so there should
> >be no problem. And you have already added plug/unplug at the appropriate
> >places in those paths, so things should just work.
>
> Yes I think it should.
>
> >>This (is supposed to) give a number of improvements over the traditional
> >>plugging (although some downsides too). Most notably for me, the VM gets
> >>cleaner ;)
> >>
> >>However AIO could be an interesting case to test for explicit plugging
> >>because of the way they interact. What kind of improvements do you see
> >>with samba and do you have any benchmark setups?
> >
> >
> >I think aio-stress would be a good way to test/benchmark this sort of
> >stuff, at least for a start.
> >Samba (if I understand this correctly based on my discussions with Tridge)
> >is less likely to generate the kind of io patterns that could benefit from
> >explicit plugging (because the file server has no way to tell what the next
> >request is going to be, it ends up submitting each independently instead of
> >batching iocbs).
>
> OK, but I think that after IO submission, you do not run sync_page to
> unplug the block device, like the normal IO path would (via lock_page,
> before the explicit plug patches).

In the buffered AIO case, we do run sync_page like normal IO ... just
that we don't block in io_schedule(), everything else is pretty much
similar.

In the case of AIO-DIO, the path is just like non-AIO DIO: there
is a call to blk_run_address_space() after submission.

>
> However, with explicit plugging, AIO requests will be started immediately.
> Maybe this won't be noticeable if the device is always busy, but I would
> like to know there isn't a regression.
>
> >In future there may be optimization possibilities to consider when
> >submitting batches of iocbs, i.e. on the io submission path. Maybe
> >AIO - O_DIRECT would be interesting to play with first in this regard ?
>
> Well I've got some simple per-process batching in there now, each process
> has a list of pending requests. Request merging is done locklessly against
> the last request added; and submission at unplug time is batched under a
> single block device lock.
>
> I'm sure more merging or batching could be done, but also consider that
> most programs will not ever make use of any added complexity.

I guess I didn't express myself well - by batching I meant being able to
surround submission of a batch of iocbs with explicit plug/unplug instead
of explicit plug/unplug for each iocb separately. Of course there is no
easy way to do that, since at the io_submit() level we do not know about
the block device (each iocb could be directed to a different fd and not
just block devices). So it may not be worth thinking about.

>
> Regarding your patches, I've just had a quick look and have a question --
> what do you do about blocking in page reclaim and dirty balancing? Aren't
> those major points of blocking with buffered IO? Did your test cases
> dirty enough to start writeout or cause a lot of reclaim? (admittedly,
> blocking in reclaim will now be much less common since the dirty mapping
> accounting).

In my earlier versions of patches I actually had converted these waits to
be async retriable, but then I came to the conclusion that the additional
complexity wasn't worth it. For one it didn't seem to make a difference
compared to the other bigger cases, and I was looking primarily at handling
the gross blocking points (say to enable an application to keep device queues
busy) and not making everything asynchronous; for another we had a long
discussion thread way back about not making AIO submitters exempt from
throttling or memory availability waits.

Regards
Suparna

>
> --
> SUSE Labs, Novell Inc.

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-04 17:03:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, 4 Jan 2007 10:26:21 +0530
Suparna Bhattacharya <[email protected]> wrote:

> On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > On Thu, 28 Dec 2006 13:53:08 +0530
> > Suparna Bhattacharya <[email protected]> wrote:
> >
> > > This patchset implements changes to make filesystem AIO read
> > > and write asynchronous for the non O_DIRECT case.
> >
> > Unfortunately the unplugging changes in Jens' block tree have trashed these
> > patches to a degree that I'm not confident in my repair attempts. So I'll
> > drop the fsaio patches from -mm.
>
> I took a quick look and the conflicts seem pretty minor to me, the unplugging
> changes mostly touch nearby code.

Well... the conflicts (both mechanical and conceptual) are such that a
round of retesting is needed.

> Please let me know how you want this fixed up.
>
> From what I can tell the comments in the unplug patches seem to say that
> it needs more work and testing, so perhaps a separate fixup patch may be
> a better idea rather than make the fsaio patchset dependent on this.

Patches against next -mm would be appreciated, please. Sorry about that.

I _assume_ Jens is targeting 2.6.21?

2007-01-04 17:49:23

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, Jan 04 2007, Andrew Morton wrote:
> > Please let me know how you want this fixed up.
> >
> > From what I can tell the comments in the unplug patches seem to say that
> > it needs more work and testing, so perhaps a separate fixup patch may be
> > a better idea rather than make the fsaio patchset dependent on this.
>
> Patches against next -mm would be appreciated, please. Sorry about that.
>
> I _assume_ Jens is targeting 2.6.21?

Only if everything works perfectly, 2.6.22 is also a viable target.

--
Jens Axboe

2007-01-04 20:33:46

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

On Tue 2007-01-02 16:18:40, Kent Overstreet wrote:
> >> Any details?
> >
> >Well, one path I tried I couldn't help but post a blog entry about
> >for my friends. I'm not sure it's the direction I'll take with
> >linux-kernel, but the fundamentals are there: the api should be the
> >syscall interface, and there should be no difference between sync
> >and async behaviour.
> >
> >http://www.zabbo.net/?p=72
>
> Any code you're willing to let people play with? I could at least
> have real test cases, and a library to go along with it as it gets
> finished.
>
> Another pie in the sky idea:
> One thing that's been bugging me lately (working on a 9p server) is
> that sendfile is hard to use in practice because you need packet
> headers and such, and they need to go out at the same time.

splice()?
Pavel

--
Thanks for all the (sleeping) penguins.

2007-01-05 00:36:22

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC] Heads up on a series of AIO patchsets

> generic_write_checks() are done in the submission path, not
> repeated during
> retries, so such types of checks are not intended to run in the aio
> thread.

Ah, I see, I was missing the short-cut which avoids re-running parts
of the write path if we're in a retry.

	if (!is_sync_kiocb(iocb) && kiocbIsRestarted(iocb)) {
		/* nothing to transfer, may just need to sync data */
		return ocount;
	}

It's pretty subtle that this has to be placed before the first
significant current reference and that nothing in the path can return
-EIOCBRETRY until after all of the significant current references.

In totally unrelated news, I noticed that current->io_wait is set to
NULL instead of &current->__wait after having run the iocb. I wonder
if it shouldn't be saved and restored instead. And maybe update the
comment over is_sync_wait()? Just an observation.

> That is great and I look forward to it :) I am, however assuming that
> whatever implementation you come up will have a different interface
> from current linux aio -- i.e. a next generation aio model, that
> will be
> easily integratable with kevents etc.

Yeah, that's the hope.

> Which takes me back to Ingo's point - lets have the new evolve
> parallely
> with the old, if we can, and not hold up the patches for POSIX AIO to
> start using kernel AIO, or for epoll to integrate with AIO.

Sure, though there are only so many hours in a day :).

> OK, I just took a quick look at your blog and I see that you
> are basically implementing Linus' microthreads scheduling approach -
> one year since we had that discussion.

Yeah. I wanted to see what it would look like.

> Glad to see that you found a way to make it workable ...

Wellll, that remains to be seen. If nothing else we'll at least have
code to point at when discussing it. If we all agree it's not the
right way and dismiss the notion, fine, that's progress :).

> (I'm guessing that you are copying over the part
> of the stack in use at the time of every switch, is that correct ?

That was my first pass, yeah. It turned the knob a little too far
towards the "invasive but efficient" direction for my taste. I'm now
giving it a try by having full stacks for each blocked op, we'll see
how that goes.

> At what
> point do you do the allocation of the saved stacks ?

I was allocating at block-time to keep memory consumption down, but I
think my fiddling around with it convinced me that isn't workable.

- z

2007-01-05 04:57:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

Suparna Bhattacharya wrote:
> On Thu, Jan 04, 2007 at 05:50:11PM +1100, Nick Piggin wrote:

>>OK, but I think that after IO submission, you do not run sync_page to
>>unplug the block device, like the normal IO path would (via lock_page,
>>before the explicit plug patches).
>
>
> In the buffered AIO case, we do run sync_page like normal IO ... just
> that we don't block in io_schedule(), everything else is pretty much
> similar.

You do? OK I must have misread it. Ignore that, then ;)

>>I'm sure more merging or batching could be done, but also consider that
>>most programs will not ever make use of any added complexity.
>
>
> I guess I didn't express myself well - by batching I meant being able to
> surround submission of a batch of iocbs with explicit plug/unplug instead
> of explicit plug/unplug for each iocb separately. Of course there is no
> easy way to do that, since at the io_submit() level we do not know about
> the block device (each iocb could be directed to a different fd and not
> just block devices). So it may not be worth thinking about.

Well we currently _could_ do that, because the block device plugging code
will detect if the request queue changes, and flush built up requests...

However, I think we may want to make the plug operations a callback rather
than hardcoded block device plugging, so that will make it harder... but
you have a good point about increasing the scope of the plugging, it would
be a win if we can do it.

>>Regarding your patches, I've just had a quick look and have a question --
>>what do you do about blocking in page reclaim and dirty balancing? Aren't
>>those major points of blocking with buffered IO? Did your test cases
>>dirty enough to start writeout or cause a lot of reclaim? (admittedly,
>>blocking in reclaim will now be much less common since the dirty mapping
>>accounting).
>
>
> In my earlier versions of patches I actually had converted these waits to
> be async retriable, but then I came to the conclusion that the additional
> complexity wasn't worth it. For one it didn't seem to make a difference
> compared to the other bigger cases, and I was looking primarily at handling
> the gross blocking points (say to enable an application to keep device queues
> busy) and not making everything asynchronous; for another we had a long
> discussion thread way back about not making AIO submitters exempt from
> throttling or memory availability waits.

OK, I was just curious. For keeping queues busy, your patchset should work
well (sleeping for more memory should be pretty uncommon). But for
overlapping computation with IO, it may not work so well if it encounters
throttling.

--
SUSE Labs, Novell Inc.

2007-01-05 05:29:09

by Suparna Bhattacharya

[permalink] [raw]
Subject: [PATCHSET 4][PATCH 1/1] AIO fallback for pipes, sockets and pollable fds


As glibc POSIX AIO switches over completely to using native AIO it needs
basic AIO support for various file types - including sockets, pipes etc.
Since userland will no longer be simulating asynchronous behaviour
with threads, it expects the underlying implementation to be asynchronous.
Which is still an issue with native linux AIO.

One (not so appealing) alternative that has been considered in the past is
a fallback path that spawns a kernel thread per AIO request. This in some
sense amounts to pushing the problem down from user to kernel space.
Fortunately we can do better. We can effectively simulate AIO in kernel
using async poll and O_NONBLOCK for all pollable fds, i.e. sockets, pipes
etc.

With this scheme in place, all that needs to be done to add AIO support
for any pollable file type is to make sure that the corresponding
f_op->aio_read/aio_write implements O_NONBLOCK behaviour if called in
aio context, i.e. with an async kiocb. The high level common AIO code
takes care of the rest, by enabling retries for completing the rest of
the IO to be initiated directly via poll wait notifications.

This fallback option should be good enough to get us to working POSIX AIO,
now that filesystem AIO already takes care of ISREG files which do not
support O_NONBLOCK. I have tested this with modified pipetest runs, also
using sockets instead of pipes.


---

linux-2.6.20-rc1-root/fs/aio.c | 54 +++++++++++++++++++++++++++++++++++++
linux-2.6.20-rc1-root/fs/pipe.c | 17 +++++++----
linux-2.6.20-rc1-root/net/socket.c | 6 ++--
3 files changed, 69 insertions(+), 8 deletions(-)

diff -puN fs/aio.c~aio-fallback-nonblock fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-fallback-nonblock 2007-01-03 19:16:36.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c 2007-01-05 10:29:52.000000000 +0530
@@ -30,6 +30,7 @@
#include <linux/highmem.h>
#include <linux/workqueue.h>
#include <linux/security.h>
+#include <linux/poll.h>
#include <linux/eventpoll.h>

#include <asm/kmap_types.h>
@@ -1315,6 +1316,42 @@ static void aio_advance_iovec(struct kio
BUG_ON(ret > 0 && iocb->ki_left == 0);
}

+/* Wrapper structure used by poll queuing */
+struct aio_pqueue {
+ poll_table pt;
+ struct kiocb *iocb;
+};
+
+static int aio_cancel_wait(struct kiocb *iocb, struct io_event *event)
+{
+ wait_queue_head_t *wq = (wait_queue_head_t *)iocb->private;
+ if (wq)
+ wake_up(wq);
+ event->res = iocb->ki_nbytes - iocb->ki_left;
+ event->res2 = 0;
+ /* drop the cancel reference */
+ aio_put_req(iocb);
+ return 0;
+}
+
+/* Sets things up for a readiness event to trigger the iocb's retry */
+static void aio_poll_table_queue_proc(struct file *file,
+ wait_queue_head_t *whead, poll_table *pt)
+{
+ struct kiocb *iocb = container_of(pt, struct aio_pqueue, pt)->iocb;
+
+ if (unlikely(iocb->private && iocb->ki_dtor)) {
+ /* FIXME: We really shouldn't have to do this */
+ /* the siocb allocation in socket.c is unused AFAIK */
+ iocb->ki_dtor(iocb);
+ iocb->ki_dtor = NULL;
+ }
+
+ iocb->private = whead;
+ iocb->ki_cancel = aio_cancel_wait;
+ prepare_to_wait(whead, &iocb->ki_wait.wait, 0);
+}
+
static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
@@ -1334,6 +1371,7 @@ static ssize_t aio_rw_vect_retry(struct
opcode = IOCB_CMD_PWRITEV;
}

+ready:
do {
ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
iocb->ki_nr_segs - iocb->ki_cur_seg,
@@ -1352,6 +1390,22 @@ static ssize_t aio_rw_vect_retry(struct
if ((ret == 0) || (iocb->ki_left == 0))
ret = iocb->ki_nbytes - iocb->ki_left;

+ if (ret == -EAGAIN && file->f_op->poll) {
+ /* This means fop->aio_read implements O_NONBLOCK behaviour */
+ /* Let us try to simulate aio retries using ->poll */
+ struct aio_pqueue pollq = {.iocb = iocb};
+ int events = (opcode == IOCB_CMD_PWRITEV) ?
+ POLLOUT | POLLERR | POLLHUP :
+ POLLIN | POLLERR | POLLHUP;
+
+ init_poll_funcptr(&pollq.pt, aio_poll_table_queue_proc);
+ ret = file->f_op->poll(file, &pollq.pt);
+ if (ret >= 0) {
+ if (ret & events)
+ goto ready;
+ ret = -EIOCBRETRY;
+ }
+ }
return ret;
}

diff -puN net/socket.c~aio-fallback-nonblock net/socket.c
--- linux-2.6.20-rc1/net/socket.c~aio-fallback-nonblock 2007-01-03 19:16:36.000000000 +0530
+++ linux-2.6.20-rc1-root/net/socket.c 2007-01-03 19:16:36.000000000 +0530
@@ -701,7 +701,8 @@ static ssize_t do_sock_read(struct msghd
msg->msg_controllen = 0;
msg->msg_iov = (struct iovec *)iov;
msg->msg_iovlen = nr_segs;
- msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+ msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+ ? MSG_DONTWAIT : 0;

return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags);
}
@@ -741,7 +742,8 @@ static ssize_t do_sock_write(struct msgh
msg->msg_controllen = 0;
msg->msg_iov = (struct iovec *)iov;
msg->msg_iovlen = nr_segs;
- msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+ msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+ ? MSG_DONTWAIT : 0;
if (sock->type == SOCK_SEQPACKET)
msg->msg_flags |= MSG_EOR;

diff -puN fs/pipe.c~aio-fallback-nonblock fs/pipe.c
--- linux-2.6.20-rc1/fs/pipe.c~aio-fallback-nonblock 2007-01-03 19:16:36.000000000 +0530
+++ linux-2.6.20-rc1-root/fs/pipe.c 2007-01-03 19:16:36.000000000 +0530
@@ -226,14 +226,16 @@ pipe_read(struct kiocb *iocb, const stru
struct pipe_inode_info *pipe;
int do_wakeup;
ssize_t ret;
- struct iovec *iov = (struct iovec *)_iov;
+ struct iovec iov_array[nr_segs];
+ struct iovec *iov = iov_array;
size_t total_len;

- total_len = iov_length(iov, nr_segs);
+ total_len = iov_length(_iov, nr_segs);
/* Null read succeeds. */
if (unlikely(total_len == 0))
return 0;

+ memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
do_wakeup = 0;
ret = 0;
mutex_lock(&inode->i_mutex);
@@ -302,7 +304,8 @@ redo:
*/
if (ret)
break;
- if (filp->f_flags & O_NONBLOCK) {
+ if (filp->f_flags & O_NONBLOCK ||
+ !is_sync_kiocb(iocb)) {
ret = -EAGAIN;
break;
}
@@ -339,15 +342,17 @@ pipe_write(struct kiocb *iocb, const str
struct pipe_inode_info *pipe;
ssize_t ret;
int do_wakeup;
- struct iovec *iov = (struct iovec *)_iov;
+ struct iovec iov_array[nr_segs];
+ struct iovec *iov = iov_array;
size_t total_len;
ssize_t chars;

- total_len = iov_length(iov, nr_segs);
+ total_len = iov_length(_iov, nr_segs);
/* Null write succeeds. */
if (unlikely(total_len == 0))
return 0;

+ memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
do_wakeup = 0;
ret = 0;
mutex_lock(&inode->i_mutex);
@@ -473,7 +478,7 @@ redo2:
}
if (bufs < PIPE_BUFFERS)
continue;
- if (filp->f_flags & O_NONBLOCK) {
+ if (filp->f_flags & O_NONBLOCK || !is_sync_kiocb(iocb)) {
if (!ret)
ret = -EAGAIN;
break;
_

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-05 06:24:10

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[email protected]> wrote:
>
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > Suparna Bhattacharya <[email protected]> wrote:
> > >
> > > > This patchset implements changes to make filesystem AIO read
> > > > and write asynchronous for the non O_DIRECT case.
> > >
> > > Unfortunately the unplugging changes in Jens' block tree have trashed these
> > > patches to a degree that I'm not confident in my repair attempts. So I'll
> > > drop the fsaio patches from -mm.
> >
> > I took a quick look and the conflicts seem pretty minor to me, the unplugging
> > changes mostly touch nearby code.
>
> Well... the conflicts (both mechanical and conceptual) are such that a
> round of retesting is needed.
>
> > Please let me know how you want this fixed up.
> >
> > From what I can tell the comments in the unplug patches seem to say that
> > it needs more work and testing, so perhaps a separate fixup patch may be
> > a better idea rather than make the fsaio patchset dependent on this.
>
> Patches against next -mm would be appreciated, please. Sorry about that.
>
> I _assume_ Jens is targeting 2.6.21?

When is the next -mm likely to be out ?

I was considering regenerating the blk unplug patches against the
fsaio changes instead of the other way around, if Jens were willing to accept
that. But if the next -mm is just around the corner then it's not an issue.

Regards
Suparna

>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-05 07:02:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > On Thu, 4 Jan 2007 10:26:21 +0530
> > Suparna Bhattacharya <[email protected]> wrote:
> >
> > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > Suparna Bhattacharya <[email protected]> wrote:
> > > >
> > > > > This patchset implements changes to make filesystem AIO read
> > > > > and write asynchronous for the non O_DIRECT case.
> > > >
> > > > Unfortunately the unplugging changes in Jens' block tree have trashed these
> > > > patches to a degree that I'm not confident in my repair attempts. So I'll
> > > > drop the fsaio patches from -mm.
> > >
> > > I took a quick look and the conflicts seem pretty minor to me, the unplugging
> > > changes mostly touch nearby code.
> >
> > Well... the conflicts (both mechanical and conceptual) are such that a
> > round of retesting is needed.
> >
> > > Please let me know how you want this fixed up.
> > >
> > > From what I can tell the comments in the unplug patches seem to say that
> > > it needs more work and testing, so perhaps a separate fixup patch may be
> > > a better idea rather than make the fsaio patchset dependent on this.
> >
> > Patches against next -mm would be appreciated, please. Sorry about that.
> >
> > I _assume_ Jens is targeting 2.6.21?
>
> When is the next -mm likely to be out ?
>
> I was considering regenerating the blk unplug patches against the
> fsaio changes instead of the other way around, if Jens were willing to
> accept that. But if the next -mm is just around the corner then its
> not an issue.

I don't really care much, but I work against mainline and anything but
occasional one-off generations of a patch against a different base is
not very likely.

The -mm order should just reflect the merge order of the patches, what
is the fsaio target?

--
Jens Axboe

2007-01-05 08:03:54

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Fri, Jan 05, 2007 at 08:02:33AM +0100, Jens Axboe wrote:
> On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[email protected]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > > Suparna Bhattacharya <[email protected]> wrote:
> > > > >
> > > > > > This patchset implements changes to make filesystem AIO read
> > > > > > and write asynchronous for the non O_DIRECT case.
> > > > >
> > > > > Unfortunately the unplugging changes in Jens' block tree have trashed these
> > > > > patches to a degree that I'm not confident in my repair attempts. So I'll
> > > > > drop the fsaio patches from -mm.
> > > >
> > > > I took a quick look and the conflicts seem pretty minor to me, the unplugging
> > > > changes mostly touch nearby code.
> > >
> > > Well... the conflicts (both mechanical and conceptual) are such that a
> > > round of retesting is needed.
> > >
> > > > Please let me know how you want this fixed up.
> > > >
> > > > From what I can tell the comments in the unplug patches seem to say that
> > > > it needs more work and testing, so perhaps a separate fixup patch may be
> > > > a better idea rather than make the fsaio patchset dependent on this.
> > >
> > > Patches against next -mm would be appreciated, please. Sorry about that.
> > >
> > > I _assume_ Jens is targeting 2.6.21?
> >
> > When is the next -mm likely to be out ?
> >
> > I was considering regenerating the blk unplug patches against the
> > fsaio changes instead of the other way around, if Jens were willing to
> > accept that. But if the next -mm is just around the corner then its
> > not an issue.
>
> I don't really care much, but I work against mainline and anything but
> occasional one-off generations of a patch against a different base is
> not very likely.
>
> The -mm order should just reflect the merge order of the patches, what
> is the fsaio target?

2.6.21 was what I had in mind, to enable the glibc folks to proceed with
conversion to native AIO.

Regenerating my patches against the unplug stuff is not a problem, I only
worry about being queued up behind something that may take longer to
stabilize and is likely to change ... If that is not the case, I don't
mind.

Regards
Suparna

>
> --
> Jens Axboe
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-05 08:32:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> On Fri, Jan 05, 2007 at 08:02:33AM +0100, Jens Axboe wrote:
> > On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> > > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > > Suparna Bhattacharya <[email protected]> wrote:
> > > >
> > > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > > > Suparna Bhattacharya <[email protected]> wrote:
> > > > > >
> > > > > > > This patchset implements changes to make filesystem AIO read
> > > > > > > and write asynchronous for the non O_DIRECT case.
> > > > > >
> > > > > > Unfortunately the unplugging changes in Jens' block tree have trashed these
> > > > > > patches to a degree that I'm not confident in my repair attempts. So I'll
> > > > > > drop the fsaio patches from -mm.
> > > > >
> > > > > I took a quick look and the conflicts seem pretty minor to me, the unplugging
> > > > > changes mostly touch nearby code.
> > > >
> > > > Well... the conflicts (both mechanical and conceptual) are such that a
> > > > round of retesting is needed.
> > > >
> > > > > Please let me know how you want this fixed up.
> > > > >
> > > > > From what I can tell the comments in the unplug patches seem to say that
> > > > > it needs more work and testing, so perhaps a separate fixup patch may be
> > > > > a better idea rather than make the fsaio patchset dependent on this.
> > > >
> > > > Patches against next -mm would be appreciated, please. Sorry about that.
> > > >
> > > > I _assume_ Jens is targeting 2.6.21?
> > >
> > > When is the next -mm likely to be out ?
> > >
> > > I was considering regenerating the blk unplug patches against the
> > > fsaio changes instead of the other way around, if Jens were willing to
> > > accept that. But if the next -mm is just around the corner then its
> > > not an issue.
> >
> > I don't really care much, but I work against mainline and anything but
> > occasional one-off generations of a patch against a different base is
> > not very likely.
> >
> > The -mm order should just reflect the merge order of the patches, what
> > is the fsaio target?
>
> 2.6.21 was what I had in mind, to enable the glibc folks to proceed with
> conversion to native AIO.
>
> Regenerating my patches against the unplug stuff is not a problem, I only
> worry about being queued up behind something that may take longer to
> stabilize and is likely to change ... If that is not the case, I don't
> mind.

Same here, hence the suggestion to base them in merging order. If your
target is 2.6.21, then I think fsaio should be first. While I think the
plug changes are safe and as such mergeable, we still need to see lots of
results and do more testing.

--
Jens Axboe

2007-01-10 05:40:12

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[email protected]> wrote:
>
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
>
> Patches against next -mm would be appreciated, please. Sorry about that.

I have updated the patchset against 2620-rc3-mm1, incorporated various
cleanups suggested during last review. Please let me know if I have missed
anything:

It should show up at
http://www.kernel.org:/pub/linux/kernel/people/suparna/aio/2620-rc3-mm1

Brief changelog:
- Reworked against the block layer unplug changes
- Switched from defines to inlines for init_wait_bit* etc (per akpm)
- Better naming: __lock_page to lock_page_async (per hch, npiggin)
- Kill lock_page_slow wrapper and rename __lock_page_slow to lock_page_slow
(per hch)
- Use a helper function aio_restarted() (per hch)
- Replace combined if/assignment (per hch)
- fix resetting of current->io_wait after ->retry in aio_run_iocb (per zab)

I have run my usual aio-stress variations script
(http://www.kernel.org:/pub/linux/kernel/people/suparna/aio/aio-results.sh)

Regards
Suparna


--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-11 01:09:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Wed, 10 Jan 2007 11:14:19 +0530
Suparna Bhattacharya <[email protected]> wrote:

> On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > On Thu, 4 Jan 2007 10:26:21 +0530
> > Suparna Bhattacharya <[email protected]> wrote:
> >
> > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> >
> > Patches against next -mm would be appreciated, please. Sorry about that.
>
> I have updated the patchset against 2620-rc3-mm1, incorporated various
> cleanups suggested during last review. Please let me know if I have missed
> anything:

The s/lock_page_slow/lock_page_blocking/ got lost. I redid it.

For the record, patches-via-http are very painful. Please always always
email them.

As a result, these patches ended up with titles which are derived from their
filenames, which are cryptic.

2007-01-11 03:08:55

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Wed, Jan 10, 2007 at 05:08:29PM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2007 11:14:19 +0530
> Suparna Bhattacharya <[email protected]> wrote:
>
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[email protected]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > >
> > > Patches against next -mm would be appreciated, please. Sorry about that.
> >
> > I have updated the patchset against 2620-rc3-mm1, incorporated various
> > cleanups suggested during last review. Please let me know if I have missed
> > anything:
>
> The s/lock_page_slow/lock_page_blocking/ got lost. I redid it.

I thought the lock_page_blocking was an alternative you had suggested
to the __lock_page vs lock_page_async discussion which got resolved later.
That is why I didn't make the change in this patchset.
The call does not block in the async case, hence the choice of
the _slow suffix (like in fs/buffer.c). But if lock_page_blocking()
sounds more intuitive to you, that's OK.

>
> For the record, patches-via-http are very painful. Please always always
> email them.
>
> As a result, these patches ended up with titles which are derived from their
> filenames, which are cryptic.

Sorry about that - I wanted to ask if you'd prefer my resending them to the
list, but missed doing so. Some people have found it easier to download the
series as a whole when they intend to apply it, so I ended up maintaining it
that way all this while.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-01-11 04:54:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

On Thu, 11 Jan 2007 08:43:36 +0530
Suparna Bhattacharya <[email protected]> wrote:

> > The s/lock_page_slow/lock_page_blocking/ got lost. I redid it.
>
> I thought the lock_page_blocking was an alternative you had suggested
> to the __lock_page vs lock_page_async discussion which got resolved later.
> That is why I didn't make the change in this patchset.
> The call does not block in the async case, hence the choice of
> the _slow suffix (like in fs/buffer.c). But if lock_page_blocking()
> sounds more intuitive to you, that's OK.

I thought people didn't like the "lock_page_slow" name.