2006-09-11 09:42:24

by Sébastien Dugué

Subject: [PATCH AIO 2/2] Add listio support


POSIX listio support

This patch adds POSIX listio completion notification support. It builds
on support provided by the aio event patch and adds an IOCB_CMD_GROUP
command to io_submit().

The purpose of IOCB_CMD_GROUP is to group together all the requests that
follow it in the list, up to the end of the list submitted to io_submit().

As io_submit already accepts an array of iocbs, as part of listio submission,
the user process prepends to a list of requests an empty special aiocb with
an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.
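For illustration, here is a minimal userspace sketch of such a submission.
It assumes the companion aio event patch is applied (it provides the iocb
aio_sigeventp field used below); the raw syscall wrapper and the
submit_read_group() helper are local to this example, not part of the patch:

#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

/* Submit 'nr' reads on 'fd' as one group, notified through 'sev' (may be NULL). */
static long submit_read_group(aio_context_t ctx, int fd, char **bufs,
			      size_t len, int nr, struct sigevent *sev)
{
	struct iocb group, reqs[nr];
	struct iocb *list[nr + 1];
	int i;

	/* Marker iocb: groups every request that follows it in the array. */
	memset(&group, 0, sizeof(group));
	group.aio_lio_opcode = IOCB_CMD_GROUP;
	group.aio_sigeventp  = (unsigned long)sev;	/* NULL => LIO_WAIT */
	list[0] = &group;

	/* The actual requests; only the marker carries the grouping state. */
	for (i = 0; i < nr; i++) {
		memset(&reqs[i], 0, sizeof(reqs[i]));
		reqs[i].aio_lio_opcode = IOCB_CMD_PREAD;
		reqs[i].aio_fildes     = fd;
		reqs[i].aio_buf        = (unsigned long)bufs[i];
		reqs[i].aio_nbytes     = len;
		reqs[i].aio_offset     = (long long)i * len;
		list[i + 1] = &reqs[i];
	}

	/* With sev == NULL, io_submit() blocks until the whole group completes. */
	return io_submit(ctx, nr + 1, list);
}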



An IOCB_CMD_GROUP value is added to the IOCB_CMD_* enum in include/linux/aio_abi.h

A struct lio_event is added in include/linux/aio.h

A struct lio_event *ki_lio field is added to struct kiocb in include/linux/aio.h

In io_submit(), upon detecting such an IOCB_CMD_GROUP marker iocb,
lio_create() allocates an lio_event which contains the information needed
for signaling a thread (signal number, pid, notify type and value) along
with a count of the requests attached to this event.

The following depicts the lio_event structure:

struct lio_event {
	atomic_t		lio_users;
	int			lio_wait;
	struct aio_notify	lio_notify;
};

lio_users holds an atomic counter of the number of requests attached to this
lio. It is incremented with each request submitted and decremented at each
request completion. When the counter reaches 0, we send the notification.

Each subsequently submitted request is attached to this lio_event by setting
the request kiocb->ki_lio to that lio_event (in io_submit_one()) and
incrementing the lio_users count.

In aio_complete(), if the request is attached to an lio (ki_lio != NULL),
then lio_check() is called to decrement the lio_users count and eventually
signal the user process when all the requests in the group have completed.


The IOCB_CMD_GROUP semantics are as follows (see the sketch after this list):

- if the associated sigevent is NULL then we want to group
requests for the purpose of blocking on the group completion
(LIO_WAIT sync behavior) and lio_event->lio_wait is set to 1.

- if the associated sigevent is valid (not NULL), then we want to
group requests for the purpose of being notified upon completion of
that group of requests (LIO_NOWAIT async behavior).
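A rough sketch of how userspace would choose between the two modes (again
assuming the companion aio event patch and its aio_sigeventp field; the
helper names below are made up for this example):

#include <string.h>
#include <signal.h>
#include <linux/aio_abi.h>

/* Fill in the IOCB_CMD_GROUP marker iocb for either grouping mode. */
static void init_group_marker(struct iocb *group, const struct sigevent *sev)
{
	memset(group, 0, sizeof(*group));
	group->aio_lio_opcode = IOCB_CMD_GROUP;
	/*
	 * sev == NULL  -> LIO_WAIT:   io_submit() blocks in lio_wait() until
	 *                             every request in the group completes.
	 * sev != NULL  -> LIO_NOWAIT: the notification described by sev fires
	 *                             when the last request completes.
	 */
	group->aio_sigeventp = (unsigned long)sev;
}

/* LIO_NOWAIT example: deliver SIGUSR1 carrying a user cookie on completion. */
static void init_nowait_sigevent(struct sigevent *sev, void *cookie)
{
	memset(sev, 0, sizeof(*sev));
	sev->sigev_notify = SIGEV_SIGNAL;
	sev->sigev_signo = SIGUSR1;
	sev->sigev_value.sival_ptr = cookie;
}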





fs/aio.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/aio.h | 14 ++++-
include/linux/aio_abi.h | 1
3 files changed, 127 insertions(+), 6 deletions(-)

Signed-off-by: Laurent Vivier <[email protected]>
Signed-off-by: Sébastien Dugué <[email protected]>

Index: linux-2.6.18-rc6/fs/aio.c
===================================================================
--- linux-2.6.18-rc6.orig/fs/aio.c 2006-09-06 16:38:10.000000000 +0200
+++ linux-2.6.18-rc6/fs/aio.c 2006-09-06 16:49:24.000000000 +0200
@@ -413,6 +413,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_dtor = NULL;
+ req->ki_lio = NULL;
req->private = NULL;
INIT_LIST_HEAD(&req->ki_run_list);

@@ -1016,6 +1017,69 @@ static long setup_sigevent(struct aio_no
return error;
}

+static void lio_wait(struct kioctx *ctx, struct lio_event *lio)
+{
+ struct task_struct *tsk = current;
+ DECLARE_WAITQUEUE(wait, tsk);
+
+ add_wait_queue(&ctx->wait, &wait);
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+
+ while ( atomic_read(&lio->lio_users) ) {
+ schedule();
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ }
+ __set_task_state(tsk, TASK_RUNNING);
+ remove_wait_queue(&ctx->wait, &wait);
+
+ /* Now free the lio */
+ kfree(lio);
+}
+
+static inline void lio_check(struct lio_event *lio)
+{
+ int ret;
+
+ ret = atomic_dec_and_test(&(lio->lio_users));
+
+ if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) {
+ /* last one -> notify process */
+ aio_send_signal(&lio->lio_notify);
+ kfree(lio);
+ }
+}
+
+static int lio_create(struct lio_event **lio, struct sigevent __user *user_event)
+{
+ int ret = 0;
+
+ if (*lio) {
+ printk (KERN_DEBUG "lio_create: already have an lio\n");
+ return 0;
+ }
+
+ *lio = kmalloc(sizeof(*lio), GFP_KERNEL);
+
+ if (!*lio)
+ return -EAGAIN;
+
+ memset (*lio, 0, sizeof(struct lio_event));
+
+ (*lio)->lio_notify.notify = SIGEV_NONE;
+
+ if (user_event)
+ /* User specified an event for this lio,
+ * he wants to be notified upon lio completion.
+ */
+ ret = setup_sigevent(&((*lio)->lio_notify), user_event);
+
+ else
+ /* No sigevent specified, it's an LIO_WAIT operation */
+ (*lio)->lio_wait = 1;
+
+ return ret;
+}
+
static void __aio_write_evt(struct kioctx *ctx, struct io_event *event)
{
struct aio_ring_info *info;
@@ -1110,6 +1174,9 @@ int fastcall aio_complete(struct kiocb *
if (iocb->ki_notify.notify != SIGEV_NONE)
aio_send_signal(&iocb->ki_notify);

+ if (iocb->ki_lio)
+ lio_check(iocb->ki_lio);
+
pr_debug("%ld retries: %d of %d\n", iocb->ki_retried,
iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
put_rq:
@@ -1580,7 +1647,7 @@ static int aio_wake_function(wait_queue_
}

int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
- struct iocb *iocb)
+ struct iocb *iocb, struct lio_event *lio)
{
struct kiocb *req;
struct file *file;
@@ -1642,6 +1709,9 @@ int fastcall io_submit_one(struct kioctx
goto out_put_req;
}

+ /* Attach this iocb to its lio */
+ req->ki_lio = lio;
+
ret = aio_setup_iocb(req);

if (ret)
@@ -1680,6 +1750,7 @@ asmlinkage long sys_io_submit(aio_contex
struct iocb __user * __user *iocbpp)
{
struct kioctx *ctx;
+ struct lio_event *lio = NULL;
long ret = 0;
int i;

@@ -1713,11 +1784,50 @@ asmlinkage long sys_io_submit(aio_contex
break;
}

- ret = io_submit_one(ctx, user_iocb, &tmp);
- if (ret)
- break;
+ if (tmp.aio_lio_opcode == IOCB_CMD_GROUP) {
+
+ /* this command means that all following IO commands
+ * are in the same group.
+ *
+ * Userspace either wants to be notified upon or block until
+ * completion of all the requests in the group.
+ */
+ ret = lio_create(&lio,
+ (struct sigevent __user *)(unsigned long)
+ tmp.aio_sigeventp);
+ if (ret)
+ break;
+
+ continue;
+ }
+
+ if (lio && ((tmp.aio_lio_opcode == IOCB_CMD_PREAD) ||
+ (tmp.aio_lio_opcode == IOCB_CMD_PWRITE))) {
+
+ atomic_inc(&lio->lio_users);
+ ret = io_submit_one(ctx, user_iocb, &tmp, lio);
+
+ /*
+ * If a request failed, just decrement the users count,
+ * but go on submitting subsequent requests.
+ *
+ */
+ if (ret)
+ atomic_dec(&lio->lio_users);
+
+ continue;
+
+ } else {
+ ret = io_submit_one(ctx, user_iocb, &tmp, NULL);
+ if (ret)
+ break;
+ }
}

+ /* User wants to block until list completion */
+ if (lio && lio->lio_wait)
+ lio_wait(ctx, lio);
+
put_ioctx(ctx);
return i ? i : ret;
}
Index: linux-2.6.18-rc6/include/linux/aio_abi.h
===================================================================
--- linux-2.6.18-rc6.orig/include/linux/aio_abi.h 2006-09-06 16:32:17.000000000 +0200
+++ linux-2.6.18-rc6/include/linux/aio_abi.h 2006-09-06 16:38:17.000000000 +0200
@@ -41,6 +41,7 @@ enum {
* IOCB_CMD_POLL = 5,
*/
IOCB_CMD_NOOP = 6,
+ IOCB_CMD_GROUP = 7,
};

/* read() from /dev/aio returns these structures. */
Index: linux-2.6.18-rc6/include/linux/aio.h
===================================================================
--- linux-2.6.18-rc6.orig/include/linux/aio.h 2006-09-06 16:32:17.000000000 +0200
+++ linux-2.6.18-rc6/include/linux/aio.h 2006-09-06 16:38:17.000000000 +0200
@@ -56,6 +56,12 @@ struct aio_notify {
sigval_t value;
};

+struct lio_event {
+ atomic_t lio_users;
+ int lio_wait;
+ struct aio_notify lio_notify;
+};
+
/* is there a better place to document function pointer methods? */
/**
* ki_retry - iocb forward progress callback
@@ -111,6 +117,9 @@ struct kiocb {
wait_queue_t ki_wait;
loff_t ki_pos;

+ /* lio this iocb might be attached to */
+ struct lio_event *ki_lio;
+
void *private;
/* State that we remember to be able to restart/retry */
unsigned short ki_opcode;
@@ -216,12 +225,13 @@ struct mm_struct;
extern void FASTCALL(exit_aio(struct mm_struct *mm));
extern struct kioctx *lookup_ioctx(unsigned long ctx_id);
extern int FASTCALL(io_submit_one(struct kioctx *ctx,
- struct iocb __user *user_iocb, struct iocb *iocb));
+ struct iocb __user *user_iocb, struct iocb *iocb,
+ struct lio_event *lio));

/* semi private, but used by the 32bit emulations: */
struct kioctx *lookup_ioctx(unsigned long ctx_id);
int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
- struct iocb *iocb));
+ struct iocb *iocb, struct lio_event *lio));

#define get_ioctx(kioctx) do { \
BUG_ON(unlikely(atomic_read(&(kioctx)->users) <= 0)); \



-----------------------------------------------------

Sébastien Dugué BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix
http://sourceforge.net/projects/paiol

-----------------------------------------------------


2006-09-13 00:42:26

by Zach Brown

Subject: Re: [PATCH AIO 2/2] Add listio support


> As io_submit already accepts an array of iocbs, as part of listio submission,
> the user process prepends to a list of requests an empty special aiocb with
> an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.

Hmm, overloading an iocb as the _GROUP indication doesn't make much
sense to me, other than being an expedient way to shoe-horn the
semantics into the existing API. That we have to worry about multiple
_GROUP iocbs in the array (ignoring them currently) seems like
complexity that we shouldn't even have to consider.

Am I just missing the discussion that led us to avoid a syscall? It
seems like we're getting a pretty funky API in return.

I guess I could also see an iocb command whose buf/bytes pointed to an
array of iocb pointers that it should issue and wait for completion on.
You could call it, uh, IOCB_CMD_IO_SUBMIT :). I'm only sort of kidding
here :). With a sys_io_wait_for_one_iocb() that might not be entirely
ridiculous. Then you could get an io_getevents() completion of a group
instead of only being able to get a signal or waiting on sync
completion. It makes it less of a special case.

In any case, this current approach doesn't seem complete. It doesn't
deal with the case where canceled iocbs don't come through
aio_complete() and so don't drop their lio_users count.

Also, it seems that the lio_submit() interface allows NULL iocb pointers
by skipping them. sys_io_submit() looks like it'd fault on them. It
sounds unpleasant to have an app allow nulls in the array and have the
library copy and compress the members, or something. Maybe we could fix
up that mismatch in an API that explicitly took the sigevent struct? Am
I making up this problem?

more inline..

> +static void lio_wait(struct kioctx *ctx, struct lio_event *lio)
> +{
> + struct task_struct *tsk = current;
> + DECLARE_WAITQUEUE(wait, tsk);
> +
> + add_wait_queue(&ctx->wait, &wait);
> + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +
> + while ( atomic_read(&lio->lio_users) ) {
> + schedule();
> + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + }
> + __set_task_state(tsk, TASK_RUNNING);
> + remove_wait_queue(&ctx->wait, &wait);

wait_event(ctx->wait, !atomic_read(&lio->lio_users));

> +static inline void lio_check(struct lio_event *lio)
> +{
> + int ret;
> +
> + ret = atomic_dec_and_test(&(lio->lio_users));
> +
> + if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) {
> + /* last one -> notify process */
> + aio_send_signal(&lio->lio_notify);
> + kfree(lio);

Since io_submit() doesn't hold a ref on lio_users this can race during
submission to send the signal and free the lio before all the operations
have been submitted.

> + if (*lio) {
> + printk (KERN_DEBUG "lio_create: already have an lio\n");

We should get rid of the printk so that the user doesn't have a trivial
way to flood the kernel message log.

> + *lio = kmalloc(sizeof(*lio), GFP_KERNEL);
> +
> + if (!*lio)
> + return -EAGAIN;
> +
> + memset (*lio, 0, sizeof(struct lio_event));

kzalloc(), and I imagine it would be a lot cleaner to return the lio or
ERR_PTR() instead of passing in the pointer to the struct and filling it in.

And is it legal to memset(&atomic_t, 0, )? I would have expected an
atomic_set(&lio->lio_users, 0). Maybe I'm stuck in the bad old days of
24 bit atomic_t and embedded locks :).

> + if (lio && ((tmp.aio_lio_opcode == IOCB_CMD_PREAD) ||
> + (tmp.aio_lio_opcode == IOCB_CMD_PWRITE))) {

It doesn't make sense to me to restrict which ops are allowed to be in a
list or not. Why not PREADV? An iocb is an iocb, I think.

> + IOCB_CMD_GROUP = 7,

7 is already taken in -mm by IOCB_CMD_PREADV .

> +struct lio_event {
> + atomic_t lio_users;
> + int lio_wait;
> + struct aio_notify lio_notify;

I'm pretty sure you don't need lio_wait here, given that it's only
tested in the submission path.

- z

2006-09-14 14:43:53

by Sébastien Dugué

Subject: Re: [PATCH AIO 2/2] Add listio support

On Tue, 2006-09-12 at 17:41 -0700, Zach Brown wrote:
> > As io_submit already accepts an array of iocbs, as part of listio submission,
> > the user process prepends to a list of requests an empty special aiocb with
> > an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.
>
> Hmm, overloading an iocb as the _GROUP indication doesn't make much
> sense to me, other than being an expedient way to shoe-horn the
> semantics into the existing API. That we have to worry about multiple
> _GROUP iocbs in the array (ignoring them currently) seems like
> complexity that we shouldn't even have to consider.
>
> Am I just missing the discussion that led us to avoid a syscall? It
> seems like we're getting a pretty funky API in return.

The original intent was to use the io_submit() interface, which already
accepts a list of iocbs, without having to resort to yet another syscall.
But I agree that it's kludgy and a new syscall might end up being much
cleaner.

Discussion on this issue would be most welcome.

>
> I guess I could also see an iocb command whose buf/bytes pointed to an
> array of iocb pointers that it should issue and wait for completion on.
> You could call it, uh, IOCB_CMD_IO_SUBMIT :). I'm only sort of kidding
> here :). With a sys_io_wait_for_one_iocb() that might not be entirely
> ridiculous. Then you could get an io_getevents() completion of a group
> instead of only being able to get a signal or waiting on sync
> completion. It makes it less of a special case.
>
> In any case, this current approach doesn't seem complete. It doesn't
> deal with the case where canceled iocbs don't come through
> aio_complete() and so don't drop their lio_users count.

Yep, will have to think a bit more about this.

>
> Also, it seems that the lio_submit() interface allows NULL iocb pointers
> by skipping them. sys_io_submit() looks like it'd fault on them. It
> sounds unpleasant to have an app allow nulls in the array and have the
> library copy and compress the members, or something. Maybe we could fix
> up that mismatch in an API that explicitly took the sigevent struct? Am
> I making up this problem?

Indeed, right now the NULL pointers have to be handled by the library,
and the new syscall approach would allow us to solve this in an elegant
way.

>
> more inline..
>
> > +static void lio_wait(struct kioctx *ctx, struct lio_event *lio)
> > +{
> > + struct task_struct *tsk = current;
> > + DECLARE_WAITQUEUE(wait, tsk);
> > +
> > + add_wait_queue(&ctx->wait, &wait);
> > + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> > +
> > + while ( atomic_read(&lio->lio_users) ) {
> > + schedule();
> > + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> > + }
> > + __set_task_state(tsk, TASK_RUNNING);
> > + remove_wait_queue(&ctx->wait, &wait);
>
> wait_event(ctx->wait, !atomic_read(&lio->lio_users));

Nice shortcut, thanks.

>
> > +static inline void lio_check(struct lio_event *lio)
> > +{
> > + int ret;
> > +
> > + ret = atomic_dec_and_test(&(lio->lio_users));
> > +
> > + if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) {
> > + /* last one -> notify process */
> > + aio_send_signal(&lio->lio_notify);
> > + kfree(lio);
>
> Since io_submit() doesn't hold a ref on lio_users this can race during
> submission to send the signal and free the lio before all the operations
> have been submitted.

Right.

>
> > + if (*lio) {
> > + printk (KERN_DEBUG "lio_create: already have an lio\n");
>
> We should get rid of the printk so that the user doesn't have a trivial
> way to flood the kernel message log.

OK.

>
> > + *lio = kmalloc(sizeof(*lio), GFP_KERNEL);
> > +
> > + if (!*lio)
> > + return -EAGAIN;
> > +
> > + memset (*lio, 0, sizeof(struct lio_event));
>
> kzalloc(), and I imagine it would be a lot cleaner to return the lio or
> ERR_PTR() instead of passing in the pointer to the struct and filling it in.

Makes sense.

>
> And is it legal to memset(&atomic_t, 0, )? I would have expected an
> atomic_set(&lio->lio_users, 0). Maybe I'm stuck in the bad old days of
> 24 bit atomic_t and embedded locks :).

Well, it's just an allocation, so lio->lio_users is not even ready to
be used. So I don't think it makes any difference whether this field
is cleared by the memset (or kzalloc) or by an explicit atomic_set().
Or are there any architectures still out there that stuff state bits
in an atomic_t?

>
> > + if (lio && ((tmp.aio_lio_opcode == IOCB_CMD_PREAD) ||
> > + (tmp.aio_lio_opcode == IOCB_CMD_PWRITE))) {
>
> It doesn't make sense to me to restrict which ops are allowed to be in a
> list or not. Why not PREADV? An iocb is an iocb, I think.

Right, I wrongly focused on the POSIX aspects, but we should be more
versatile here.

>
> > + IOCB_CMD_GROUP = 7,
>
> 7 is already taken in -mm by IOCB_CMD_PREADV .

Argh, forgot to check -mm. Sorry.

>
> > +struct lio_event {
> > + atomic_t lio_users;
> > + int lio_wait;
> > + struct aio_notify lio_notify;
>
> I'm pretty sure you don't need lio_wait here, given that it's only
> tested in the submission path.
>

Yes, we can do without.

Thanks,

Sébastien.

--
-----------------------------------------------------

Sébastien Dugué BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix
http://sourceforge.net/projects/paiol

-----------------------------------------------------

2006-09-21 16:47:00

by Suparna Bhattacharya

Subject: Re: [PATCH AIO 2/2] Add listio support

On Tue, Sep 12, 2006 at 05:41:37PM -0700, Zach Brown wrote:
>
> > As io_submit already accepts an array of iocbs, as part of listio submission,
> > the user process prepends to a list of requests an empty special aiocb with
> > an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.
>
> Hmm, overloading an iocb as the _GROUP indication doesn't make much
> sense to me, other than being an expedient way to shoe-horn the
> semantics into the existing API. That we have to worry about multiple
> _GROUP iocbs in the array (ignoring them currently) seems like
> complexity that we shouldn't even have to consider.

I think it would just start a new lio group if it encounters another _GROUP
iocb. (The code probably has a bug in that, for LIO_WAIT, only the latest
group seems to be taken into account in such a case, but that should be easy
to fix ... in fact maybe we could do the waiting using sys_io_wait, as
suggested, in which case it is no longer an issue.)

>
> Am I just missing the discussion that led us to avoid a syscall? It
> seems like we're getting a pretty funky API in return.

There was some discussion way back on linux-aio - we seemed to be going
back and forth between introducing new syscalls and overloading what we
have. There was another more recent discussion with Ulrich on the API
(if you look up the POSIX AIO discussion thread, you should find that one).

>
> I guess I could also see an iocb command whose buf/bytes pointed to an
> array of iocb pointers that it should issue and wait for completion on.
> You could call it, uh, IOCB_CMD_IO_SUBMIT :). I'm only sort of kidding
> here :). With a sys_io_wait_for_one_iocb() that might not be entirely
> ridiculous. Then you could get an io_getevents() completion of a group
> instead of only being able to get a signal or waiting on sync
> completion. It makes it less of a special case.

Hmm, it would be nice to implement a sys_io_wait for the IOCB_CMD_GROUP
iocb instead of performing the lio_wait() within io_submit(). But what
should the semantics of sys_io_wait be for other (regular) iocb types
where there is no lio association (that seems non-trivial to implement)?
Should we just return -EINVAL in those cases?

>
> In any case, this current approach doesn't seem complete. It doesn't
> deal with the case where canceled iocbs don't come through
> aio_complete() and so don't drop their lio_users count.
>
> Also, it seems that the lio_submit() interface allows NULL iocb pointers
> by skipping them. sys_io_submit() looks like it'd fault on them. It
> sounds unpleasant to have an app allow nulls in the array and have the
> library copy and compress the members, or something. Maybe we could fix
> up that mismatch in an API that explicitly took the sigevent struct? Am
> I making up this problem?

The iocbs have to be built dynamically by glibc anyway, so does it
matter?

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

