2006-11-20 14:17:07

by Sébastien Dugué

[permalink] [raw]
Subject: [PATCH -mm 0/4][AIO] - AIO completion signal notification v2



Hi

Here is the latest rework of the AIO completion signal notification patches.

This set consists in 4 patches:

1. aio-header-fix-includes: fixes the double inclusion of uio.h in aio.h

2. export-good_sigevent: move good_sigevent into signal.c and export it

3. aio-notify-sig: the AIO completion signal notification

4. listio: adds listio support

Description are in the individual patches.


Changes from v1:
- cleanups suggested by Christoph Hellwig, Badari Pulavarty and Zach Brown
- added lisio patch


Comments are welcome as usual.

Thanks,

S?bastien.


2006-11-20 14:22:47

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 2/4][AIO] - export good_sigevent()



Export good_sigevent()


Move good_sigevent() from posix-timers.c to signal.c where it belongs,
and export it so that it can be used by other subsystems.


include/linux/signal.h | 2 ++
kernel/posix-timers.c | 17 -----------------
kernel/signal.c | 23 +++++++++++++++++++++++
3 files changed, 25 insertions(+), 17 deletions(-)

Signed-off-by: S?bastien Dugu? <[email protected]>


Index: linux-2.6.19-rc5-mm2/include/linux/signal.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/signal.h 2006-11-17
11:20:08.000000000 +0100 +++ linux-2.6.19-rc5-mm2/include/linux/signal.h
2006-11-17 11:20:31.000000000 +0100 @@ -241,6 +241,8 @@ extern int
sigprocmask(int, sigset_t *, struct pt_regs;
extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction
*return_ka, struct pt_regs *regs, void *cookie);
+extern struct task_struct * good_sigevent(sigevent_t *);
+
#endif /* __KERNEL__ */

#endif /* _LINUX_SIGNAL_H */
Index: linux-2.6.19-rc5-mm2/kernel/posix-timers.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/kernel/posix-timers.c 2006-11-17
11:20:08.000000000 +0100 +++ linux-2.6.19-rc5-mm2/kernel/posix-timers.c
2006-11-17 11:20:31.000000000 +0100 @@ -367,23 +367,6 @@ static enum
hrtimer_restart posix_timer_ return ret;
}

-static struct task_struct * good_sigevent(sigevent_t * event)
-{
- struct task_struct *rtn = current->group_leader;
-
- if ((event->sigev_notify & SIGEV_THREAD_ID ) &&
- (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) ||
- rtn->tgid != current->tgid ||
- (event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL))
- return NULL;
-
- if (((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) &&
- ((event->sigev_signo <= 0) || (event->sigev_signo > SIGRTMAX)))
- return NULL;
-
- return rtn;
-}
-
void register_posix_clock(const clockid_t clock_id, struct k_clock *new_clock)
{
if ((unsigned) clock_id >= MAX_CLOCKS) {
Index: linux-2.6.19-rc5-mm2/kernel/signal.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/kernel/signal.c 2006-11-17
11:20:08.000000000 +0100 +++ linux-2.6.19-rc5-mm2/kernel/signal.c
2006-11-17 11:20:31.000000000 +0100 @@ -1189,6 +1189,29 @@ int
group_send_sig_info(int sig, struct return ret;
}

+/***
+ * good_sigevent - check and get target task from a sigevent.
+ * @event: the sigevent to be checked
+ *
+ * This function must be called with tasklist_lock held for reading.
+ */
+struct task_struct * good_sigevent(sigevent_t * event)
+{
+ struct task_struct *rtn = current->group_leader;
+
+ if ((event->sigev_notify & SIGEV_THREAD_ID ) &&
+ (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) ||
+ rtn->tgid != current->tgid ||
+ (event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL))
+ return NULL;
+
+ if (((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) &&
+ ((event->sigev_signo <= 0) || (event->sigev_signo > SIGRTMAX)))
+ return NULL;
+
+ return rtn;
+}
+
/*
* kill_pgrp_info() sends a signal to a process group: this is what the tty
* control characters do (^C, ^Z etc)

2006-11-20 14:23:17

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 4/4][AIO] - Listio support


POSIX listio support

This patch adds POSIX listio completion notification support. It builds
on support provided by the aio signal notification patch and adds an
IOCB_CMD_GROUP command to io_submit().

The purpose of IOCB_CMD_GROUP is to group together the following requests in
the list up to the end of the list sumbitted to io_submit.

As io_submit already accepts an array of iocbs, as part of listio submission,
the user process prepends to a list of requests an empty special aiocb with
an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.


An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h

A struct lio_event is added in include/linux/aio.h

A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h

In io_submit(), upon detecting such an IOCB_CMD_GROUP marker iocb, an
lio_event is created in lio_create() which contains the necessary information
for signaling a thread (signal number, pid, notify type and value) along with
a count of requests attached to this event.

The following depicts the lio_event structure:

struct lio_event {
atomic_t lio_users;
struct aio_notify lio_notify;
};

lio_users holds an atomic counter of the number of requests attached to this
lio. It is incremented with each request submitted and decremented at each
request completion. When the counter reaches 0, we send the notification.

Each subsequent submitted request is attached to this lio_event by setting
the request kiocb->ki_lio to that lio_event (in io_submit_one()) and
incrementing the lio_users count.

In aio_complete(), if the request is attached to an lio (ki_lio <> 0),
then lio_check() is called to decrement the lio_users count and eventually
signal the user process when all the requests in the group have completed.


The IOCB_CMD_GROUP semantic is as follows:

- if the associated sigevent is NULL then we want to group
requests for the purpose of blocking on the group completion
(LIO_WAIT sync behavior).

- if the associated sigevent is valid (not NULL) then we want to
group requests for the purpose of being notified upon that
group of requests completion (LIO_NOWAIT async behaviour).




fs/aio.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/aio.h | 13 ++++-
include/linux/aio_abi.h | 1
3 files changed, 131 insertions(+), 6 deletions(-)


Signed-off-by: S?bastien Dugu? <[email protected]>
Signed-off-by: Laurent Vivier <[email protected]>

Index: linux-2.6.19-rc5-mm2/fs/aio.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/fs/aio.c 2006-11-17 11:20:32.000000000
+0100 +++ linux-2.6.19-rc5-mm2/fs/aio.c 2006-11-17 16:22:00.000000000
+0100 @@ -414,6 +414,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_dtor = NULL;
+ req->ki_lio = NULL;
req->private = NULL;
req->ki_iovec = NULL;
req->ki_notify.sigq = NULL;
@@ -1010,6 +1011,53 @@ out_unlock:
return -EINVAL;
}

+static inline void lio_check(struct lio_event *lio)
+{
+ int ret;
+
+ ret = atomic_dec_and_test(&lio->lio_users);
+
+ if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) {
+ /* last one -> notify process */
+ aio_send_signal(&lio->lio_notify);
+ kfree(lio);
+ }
+}
+
+static struct lio_event *lio_create(struct sigevent __user *user_event)
+{
+ int ret = 0;
+ struct lio_event *lio = NULL;
+
+ lio = kzalloc(sizeof(*lio), GFP_KERNEL);
+
+ if (!lio)
+ return ERR_PTR(-EAGAIN);
+
+ /*
+ * Grab an initial ref on the lio to avoid races between
+ * submission and completion.
+ */
+ atomic_set(&lio->lio_users, 1);
+
+ lio->lio_notify.notify = SIGEV_NONE;
+
+ if (user_event) {
+ /*
+ * User specified an event for this lio,
+ * he wants to be notified upon lio completion.
+ */
+ ret = aio_setup_sigevent(&lio->lio_notify, user_event);
+
+ if (ret) {
+ kfree(lio);
+ return ERR_PTR(ret);
+ }
+ }
+
+ return lio;
+}
+
/* aio_complete
* Called when the io request on the given iocb is complete.
* Returns true if this is the last user of the request. The
@@ -1058,8 +1106,12 @@ int fastcall aio_complete(struct kiocb *
* when the event got cancelled.
*/
if (kiocbIsCancelled(iocb)) {
+ if (iocb->ki_lio)
+ lio_check(iocb->ki_lio);
+
if (iocb->ki_notify.sigq)
sigqueue_free(iocb->ki_notify.sigq);
+
goto put_rq;
}

@@ -1100,6 +1152,9 @@ int fastcall aio_complete(struct kiocb *
sigqueue_free(iocb->ki_notify.sigq);
}

+ if (iocb->ki_lio)
+ lio_check(iocb->ki_lio);
+
pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
put_rq:
@@ -1634,7 +1689,7 @@ static int aio_wake_function(wait_queue_
}

int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
- struct iocb *iocb)
+ struct iocb *iocb, struct lio_event *lio)
{
struct kiocb *req;
struct file *file;
@@ -1696,6 +1751,9 @@ int fastcall io_submit_one(struct kioctx
goto out_put_req;
}

+ /* Attach this iocb to its lio */
+ req->ki_lio = lio;
+
ret = aio_setup_iocb(req);

if (ret)
@@ -1739,6 +1797,8 @@ asmlinkage long sys_io_submit(aio_contex
struct iocb __user * __user *iocbpp)
{
struct kioctx *ctx;
+ struct lio_event *lio = NULL;
+ int lio_wait = 0;
long ret = 0;
int i;

@@ -1772,11 +1832,66 @@ asmlinkage long sys_io_submit(aio_contex
break;
}

- ret = io_submit_one(ctx, user_iocb, &tmp);
- if (ret)
- break;
+ if (tmp.aio_lio_opcode == IOCB_CMD_GROUP) {
+
+ /* this command means that all following IO commands
+ * are in the same group.
+ *
+ * Userspace either wants to be notified upon or block
until
+ * completion of all the requests in the group.
+ */
+ /*
+ * Ignore an IOCB_CMD_GROUP request if we are already
+ * processing one. This means only one listio per
+ * io_submit call.
+ */
+ if (lio)
+ continue;
+
+ lio = lio_create((struct sigevent __user *)(unsigned
long)
+ tmp.aio_sigeventp);
+
+ ret = PTR_ERR(lio);
+
+ if (IS_ERR(lio))
+ goto out_put_ctx;
+
+ if (!tmp.aio_sigeventp)
+ lio_wait = 1;
+ } else {
+ if (lio)
+ atomic_inc(&lio->lio_users);
+
+ ret = io_submit_one(ctx, user_iocb, &tmp, lio);
+
+ if (ret) {
+ if (lio) {
+ /*
+ * If a request failed, just decrement
+ * the users count, but go on
submitting
+ * subsequent requests.
+ */
+ atomic_dec(&lio->lio_users);
+ } else
+ break;
+ }
+ }
+ }
+
+ if (lio) {
+ /*
+ * Drop extra ref on the lio now that we're done submitting
+ * requests
+ */
+ lio_check(lio);
+
+ if (lio_wait) {
+ wait_event(ctx->wait, atomic_read(&lio->lio_users)==0);
+ kfree(lio);
+ }
}

+out_put_ctx:
put_ioctx(ctx);
return i ? i : ret;
}
Index: linux-2.6.19-rc5-mm2/include/linux/aio_abi.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/aio_abi.h 2006-11-17
11:20:32.000000000 +0100 +++
linux-2.6.19-rc5-mm2/include/linux/aio_abi.h 2006-11-17
11:21:31.000000000 +0100 @@ -43,6 +43,7 @@ enum { IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_GROUP = 9,
};

/* read() from /dev/aio returns these structures. */
Index: linux-2.6.19-rc5-mm2/include/linux/aio.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/aio.h 2006-11-17
11:20:32.000000000 +0100 +++ linux-2.6.19-rc5-mm2/include/linux/aio.h
2006-11-17 11:21:31.000000000 +0100 @@ -58,6 +58,11 @@ struct aio_notify {
struct sigqueue *sigq;
};

+struct lio_event {
+ atomic_t lio_users;
+ struct aio_notify lio_notify;
+};
+
/* is there a better place to document function pointer methods? */
/**
* ki_retry - iocb forward progress callback
@@ -113,6 +118,9 @@ struct kiocb {
wait_queue_t ki_wait;
loff_t ki_pos;

+ /* lio this iocb might be attached to */
+ struct lio_event *ki_lio;
+
void *private;
/* State that we remember to be able to restart/retry */
unsigned short ki_opcode;
@@ -220,12 +228,13 @@ struct mm_struct;
extern void FASTCALL(exit_aio(struct mm_struct *mm));
extern struct kioctx *lookup_ioctx(unsigned long ctx_id);
extern int FASTCALL(io_submit_one(struct kioctx *ctx,
- struct iocb __user *user_iocb, struct iocb *iocb));
+ struct iocb __user *user_iocb, struct iocb
*iocb,
+ struct lio_event *lio));

/* semi private, but used by the 32bit emulations: */
struct kioctx *lookup_ioctx(unsigned long ctx_id);
int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
- struct iocb *iocb));
+ struct iocb *iocb, struct lio_event *lio));

#define get_ioctx(kioctx) do { \
BUG_ON(atomic_read(&(kioctx)->users) <= 0); \

2006-11-20 14:22:59

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification


AIO completion signal notification

The current 2.6 kernel does not support notification of user space via
an RT signal upon an asynchronous IO completion. The POSIX specification
states that when an AIO request completes, a signal can be delivered to
the application as notification.

This patch adds a struct sigevent *aio_sigeventp to the iocb.
The relevant fields (pid, signal number and value) are stored in the kiocb
for use when the request completes.

That sigevent structure is filled by the application as part of the AIO
request preparation. Upon request completion, the kernel notifies the
application using those sigevent parameters. If SIGEV_NONE has been specified,
then the old behaviour is retained and the application must rely on polling
the completion queue using io_getevents().


A struct sigevent *aio_sigeventp field is added to struct iocb in
include/linux/aio_abi.h

A struct aio_notify containing the sigevent parameters is defined in aio.h:

struct aio_notify {
struct task_struct *target;
__u16 signo;
__u16 notify;
sigval_t value;
};

A struct aio_notify ki_notify is added to struct kiocb in include/linux/aio.h

In io_submit_one(), if the application provided a sigevent then
setup_sigevent() is called which does the following:

- check access to the user sigevent and make a local copy

- if the requested notification is SIGEV_NONE, then nothing to do

- fill in the kiocb->ki_notify fields (notify, signo, value)

- check sigevent consistency, get the signal target task and
save it in kiocb->ki_notify.target

- preallocate a sigqueue for this event using sigqueue_alloc()

Upon request completion, in aio_complete(), if notification is needed for
this request (iocb->ki_notify.notify != SIGEV_NONE), then aio_send_signal()
is called to signal the target task as follows:

- fill in the siginfo struct to be sent to the task

- if notify is SIGEV_THREAD_ID then send signal to specific task
using send_sigqueue()

- else send signal to task group using send_5group_sigqueue()



Notes concerning sigqueue preallocation:

To ensure reliable delivery of completion notification, the sigqueue is
preallocated in the submission path so that there is no chance it can fail
in the completion path.

Unlike the posix-timers case (currently the single other user of sigqueue
preallocation), where the sigqueue is allocated for the lifetime of the timer
and freed at timer destruction time, the aio case is a bit more tricky due to
the async nature of the whole thing.

In the aio case, the sigqueue exists for the lifetime of the request,
therefore it must be freed only once the signal for the request completion has
been delivered. This involves changing __sigqueue_free() to free the sigqueue
when the signal is collected if si_code is SI_ASYNCIO even if it was
preallocated as well as explicitly calling sigqueue_free() in submission and
completion error paths.


fs/aio.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/aio.h | 12 ++++
include/linux/aio_abi.h | 3 -
kernel/signal.c | 2
4 files changed, 127 insertions(+), 6 deletions(-)


Signed-off-by: S?bastien Dugu? <[email protected]>
Signed-off-by: Laurent Vivier <[email protected]>

Index: linux-2.6.19-rc5-mm2/fs/aio.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/fs/aio.c 2006-11-17 11:20:08.000000000
+0100 +++ linux-2.6.19-rc5-mm2/fs/aio.c 2006-11-17 11:20:32.000000000
+0100 @@ -416,6 +416,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_dtor = NULL;
req->private = NULL;
req->ki_iovec = NULL;
+ req->ki_notify.sigq = NULL;
INIT_LIST_HEAD(&req->ki_run_list);

/* Check if the completion queue has enough free space to
@@ -463,6 +464,11 @@ static inline void really_put_req(struct
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
kfree(req->ki_iovec);
+
+ /* Release task ref */
+ if (req->ki_notify.notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
+ put_task_struct(req->ki_notify.target);
+
kmem_cache_free(kiocb_cachep, req);
ctx->reqs_active--;

@@ -929,6 +935,81 @@ void fastcall kick_iocb(struct kiocb *io
}
EXPORT_SYMBOL(kick_iocb);

+static int aio_send_signal(struct aio_notify *notify)
+{
+ struct sigqueue *sigq = notify->sigq;
+ struct siginfo *info = &sigq->info;
+ int ret;
+
+ memset(info, 0, sizeof(struct siginfo));
+
+ info->si_signo = notify->signo;
+ info->si_errno = 0;
+ info->si_code = SI_ASYNCIO;
+ info->si_pid = 0;
+ info->si_uid = 0;
+ info->si_value = notify->value;
+
+ if (notify->notify & SIGEV_THREAD_ID)
+ ret = send_sigqueue(notify->signo, sigq, notify->target);
+ else
+ ret = send_group_sigqueue(notify->signo, sigq, notify->target);
+
+ return ret;
+}
+
+static long aio_setup_sigevent(struct aio_notify *notify,
+ struct sigevent __user *user_event)
+{
+ sigevent_t event;
+ struct task_struct *target;
+
+ if (copy_from_user(&event, user_event, sizeof (event)))
+ return -EFAULT;
+
+ if (event.sigev_notify == SIGEV_NONE)
+ return 0;
+
+ notify->notify = event.sigev_notify;
+ notify->signo = event.sigev_signo;
+ notify->value = event.sigev_value;
+
+ read_lock(&tasklist_lock);
+ target = good_sigevent(&event);
+
+ if (unlikely(!target || (target->flags & PF_EXITING)))
+ goto out_unlock;
+
+ notify->target = target;
+
+ if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
+ /*
+ * This reference will be dropped in really_put_req() when
+ * we're done with the request.
+ */
+ get_task_struct(target);
+ }
+
+ read_unlock(&tasklist_lock);
+
+ /*
+ * NOTE: we cannot free the sigqueue in the completion path as
+ * the signal may not have been delivered to the target task.
+ * Therefore it has to be freed in __sigqueue_free() when the
+ * signal is collected if si_code is SI_ASYNCIO.
+ */
+ notify->sigq = sigqueue_alloc();
+
+ if (unlikely(!notify->sigq))
+ return -EAGAIN;
+
+ return 0;
+
+out_unlock:
+ read_unlock(&tasklist_lock);
+ return -EINVAL;
+}
+
/* aio_complete
* Called when the io request on the given iocb is complete.
* Returns true if this is the last user of the request. The
@@ -976,8 +1057,11 @@ int fastcall aio_complete(struct kiocb *
* cancelled requests don't get events, userland was given one
* when the event got cancelled.
*/
- if (kiocbIsCancelled(iocb))
+ if (kiocbIsCancelled(iocb)) {
+ if (iocb->ki_notify.sigq)
+ sigqueue_free(iocb->ki_notify.sigq);
goto put_rq;
+ }

ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);

@@ -1008,6 +1092,14 @@ int fastcall aio_complete(struct kiocb *

pr_debug("added to ring %p at [%lu]\n", iocb, tail);

+ if (iocb->ki_notify.notify != SIGEV_NONE) {
+ ret = aio_send_signal(&iocb->ki_notify);
+
+ /* If signal generation failed, release the sigqueue */
+ if (ret)
+ sigqueue_free(iocb->ki_notify.sigq);
+ }
+
pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
put_rq:
@@ -1549,8 +1641,7 @@ int fastcall io_submit_one(struct kioctx
ssize_t ret;

/* enforce forwards compatibility on users */
- if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
- iocb->aio_reserved3)) {
+ if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved3)) {
pr_debug("EINVAL: io_submit: reserve field set\n");
return -EINVAL;
}
@@ -1559,6 +1650,7 @@ int fastcall io_submit_one(struct kioctx
if (unlikely(
(iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
(iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
+ (iocb->aio_sigeventp != (unsigned long)iocb->aio_sigeventp) ||
((ssize_t)iocb->aio_nbytes < 0)
)) {
pr_debug("EINVAL: io_submit: overflow check\n");
@@ -1593,10 +1685,21 @@ int fastcall io_submit_one(struct kioctx
INIT_LIST_HEAD(&req->ki_wait.task_list);
req->ki_retried = 0;

+ /* handle setting up the sigevent for POSIX AIO signals */
+ req->ki_notify.notify = SIGEV_NONE;
+
+ if (iocb->aio_sigeventp) {
+ ret = aio_setup_sigevent(&req->ki_notify,
+ (struct sigevent __user *)(unsigned
long)
+ iocb->aio_sigeventp);
+ if (ret)
+ goto out_put_req;
+ }
+
ret = aio_setup_iocb(req);

if (ret)
- goto out_put_req;
+ goto out_sigqfree;

spin_lock_irq(&ctx->ctx_lock);
aio_run_iocb(req);
@@ -1609,6 +1712,11 @@ int fastcall io_submit_one(struct kioctx
aio_put_req(req); /* drop extra ref to req */
return 0;

+out_sigqfree:
+ /* Undo the sigqueue alloc if someting went bad */
+ if (req->ki_notify.sigq)
+ sigqueue_free(req->ki_notify.sigq);
+
out_put_req:
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
Index: linux-2.6.19-rc5-mm2/include/linux/aio_abi.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/aio_abi.h 2006-11-17
11:20:08.000000000 +0100 +++
linux-2.6.19-rc5-mm2/include/linux/aio_abi.h 2006-11-17
11:20:32.000000000 +0100 @@ -82,8 +82,9 @@ struct iocb { __u64
aio_nbytes; __s64 aio_offset;

+ __u64 aio_sigeventp; /* pointer to struct sigevent */
+
/* extra parameters */
- __u64 aio_reserved2; /* TODO: use this for a (struct
sigevent *) */ __u64 aio_reserved3;
}; /* 64 bytes */

Index: linux-2.6.19-rc5-mm2/include/linux/aio.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/aio.h 2006-11-17
11:20:27.000000000 +0100 +++ linux-2.6.19-rc5-mm2/include/linux/aio.h
2006-11-17 11:20:32.000000000 +0100 @@ -7,6 +7,7 @@
#include <linux/uio.h>

#include <asm/atomic.h>
+#include <asm/siginfo.h>

#define AIO_MAXSEGS 4
#define AIO_KIOGRP_NR_ATOMIC 8
@@ -49,6 +50,14 @@ struct kioctx;
#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED,
&(iocb)->ki_flags)
+struct aio_notify {
+ struct task_struct *target;
+ __u16 signo;
+ __u16 notify;
+ sigval_t value;
+ struct sigqueue *sigq;
+};
+
/* is there a better place to document function pointer methods? */
/**
* ki_retry - iocb forward progress callback
@@ -118,6 +127,9 @@ struct kiocb {

struct list_head ki_list; /* the aio core uses this
* for cancellation */
+
+ /* to notify a process on I/O event */
+ struct aio_notify ki_notify;
};

#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
Index: linux-2.6.19-rc5-mm2/kernel/signal.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/kernel/signal.c 2006-11-17
11:20:31.000000000 +0100 +++ linux-2.6.19-rc5-mm2/kernel/signal.c
2006-11-17 11:20:32.000000000 +0100 @@ -297,7 +297,7 @@ static struct sigqueue
*__sigqueue_alloc
static void __sigqueue_free(struct sigqueue *q)
{
- if (q->flags & SIGQUEUE_PREALLOC)
+ if (q->flags & SIGQUEUE_PREALLOC && q->info.si_code != SI_ASYNCIO)
return;
atomic_dec(&q->user->sigpending);
free_uid(q->user);

2006-11-20 14:22:38

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 1/4][AIO] - fix aio.h includes


Fix the double inclusion of linux/uio.h in linux/aio.h



aio.h | 1 -
1 file changed, 1 deletion(-)

Signed-off-by: S?bastien Dugu? <[email protected]>

Index: linux-2.6.19-rc5-mm2/include/linux/aio.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/aio.h 2006-11-17
11:20:08.000000000 +0100 +++ linux-2.6.19-rc5-mm2/include/linux/aio.h
2006-11-17 11:20:27.000000000 +0100 @@ -7,7 +7,6 @@
#include <linux/uio.h>

#include <asm/atomic.h>
-#include <linux/uio.h>

#define AIO_MAXSEGS 4
#define AIO_KIOGRP_NR_ATOMIC 8

2006-11-20 21:37:15

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH -mm 1/4][AIO] - fix aio.h includes

Sébastien Dugué wrote:
> Fix the double inclusion of linux/uio.h in linux/aio.h

Sure seems reasonable to me.

> Signed-off-by: Sébastien Dugué <[email protected]>

Signed-off-by: Zach Brown <[email protected]>

- z

2006-11-20 21:53:26

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH -mm 2/4][AIO] - export good_sigevent()

Zach Brown wrote:
> I wonder if "good_sigevent()" isn't a very strong name for something
> that is now up in signal.h. Maybe "sigevent_find_task()"?

That's not the purpose, it's just one of the checks which need to be
done to check that the sigevent value is "good". I think the name is fine.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-20 21:55:21

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH -mm 2/4][AIO] - export good_sigevent()


> +extern struct task_struct * good_sigevent(sigevent_t *);

I wonder if "good_sigevent()" isn't a very strong name for something
that is now up in signal.h. Maybe "sigevent_find_task()"?

- z

2006-11-20 22:03:46

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH -mm 2/4][AIO] - export good_sigevent()

Ulrich Drepper wrote:
> Zach Brown wrote:
>> I wonder if "good_sigevent()" isn't a very strong name for something
>> that is now up in signal.h. Maybe "sigevent_find_task()"?
>
> That's not the purpose, it's just one of the checks which need to be
> done to check that the sigevent value is "good".

Hmm? It returns a task_struct *, potentially after finding it with
find_task_by_pid().

That said, I don't particularly care much. It just jumped out as I was
reading it.

- z

2006-11-20 23:16:27

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

Sébastien Dugué wrote:
> AIO completion signal notification

This is looking a lot better, thanks for keeping at it.

> +static long aio_setup_sigevent(struct aio_notify *notify,
> + struct sigevent __user *user_event)
> +{
> + sigevent_t event;
> + struct task_struct *target;
> +
> + if (copy_from_user(&event, user_event, sizeof (event)))
> + return -EFAULT;

Last time we talked about this needing to call get_compat_sigevent(). I
think it still needs to.

I think we should avoid the examples set by the current
compat_sys_io_submit() and get_compat_sigevent() callers. They copy
translated data on to the userspace stack and pass it to the syscalls.
That will get crazy for compat_sys_io_submit() because it would have to
rewrite the iocb and the pointer to the iocb to get sys_io_submit() to
find a copied sigevent on the stack.

I think the model is compat_do_readv_writev(). Hoist some of the
syscall logic up into the compat layer so that one copying and
translating pass is made instead of trying to fool the syscall logic
into thinking that it's being called from a native word size caller.

So io_submit_one() should be given the kernel copies of the userspace
structures it needs. sys_io_submit() will pass it the copies it made
for native word size callers. compat_sys_io_submit() will pass in the
copies it made after translating from 32bit arguments. io_submit_one()
and lookup_kioctx() will have to be made available to kernel/compat.c
(via linux/aio.h, surely.). aio_setup_sigevent() will be called from
*_io_submit() and given the kernel sigevent, not a userspace pointer.

Reworking things this way should have the added benefit of making 32/64
sys_io_submit() more efficient than it is today.

- z

2006-11-21 10:32:10

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH -mm 4/4][AIO] - Listio support

Sebestian,

Thanks for taking this forward ! I assume that you have addressed all the
comments (including missing counts in cancel path etc), except for the
interface debate (current vs syscall vs sys_io_wait_for_kiocb), is
that correct ?

Zach,
You had suggested a couple of alternative interfaces.
If we go the separate syscall way, there are a couple of things to figure
out/decide:
(1) Do we want to keep the LIO_WAIT semantics in the kernel, or
should we try to expose the lio to user-space for an explicit wait to
be issued ?
(2) We would have to address compat alternatives for the new syscall as well,
hopefully it should be possible to structure the code to reuse most of
the logic.
If we go the way of associating the array of iocbs as a parameter to
IOCB_CMD_GROUP, and make LIO_WAIT happen through a sys_io_wait for one iocb
on the group iocb, then
(1) Compat handling could get a little more complicated
(2) We have to think through the semantics of the completion path in various
situations. Will it be possible to cancel the whole group in one go ?

Regards
Suparna

On Mon, Nov 20, 2006 at 03:23:07PM +0100, S?bastien Dugu? wrote:
>
> POSIX listio support
>
> This patch adds POSIX listio completion notification support. It builds
> on support provided by the aio signal notification patch and adds an
> IOCB_CMD_GROUP command to io_submit().
>
> The purpose of IOCB_CMD_GROUP is to group together the following requests in
> the list up to the end of the list sumbitted to io_submit.
>
> As io_submit already accepts an array of iocbs, as part of listio submission,
> the user process prepends to a list of requests an empty special aiocb with
> an aio_lio_opcode of IOCB_CMD_GROUP, filling only the aio_sigevent fields.
>
>
> An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
>
> A struct lio_event is added in include/linux/aio.h
>
> A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h
>
> In io_submit(), upon detecting such an IOCB_CMD_GROUP marker iocb, an
> lio_event is created in lio_create() which contains the necessary information
> for signaling a thread (signal number, pid, notify type and value) along with
> a count of requests attached to this event.
>
> The following depicts the lio_event structure:
>
> struct lio_event {
> atomic_t lio_users;
> struct aio_notify lio_notify;
> };
>
> lio_users holds an atomic counter of the number of requests attached to this
> lio. It is incremented with each request submitted and decremented at each
> request completion. When the counter reaches 0, we send the notification.
>
> Each subsequent submitted request is attached to this lio_event by setting
> the request kiocb->ki_lio to that lio_event (in io_submit_one()) and
> incrementing the lio_users count.
>
> In aio_complete(), if the request is attached to an lio (ki_lio <> 0),
> then lio_check() is called to decrement the lio_users count and eventually
> signal the user process when all the requests in the group have completed.
>
>
> The IOCB_CMD_GROUP semantic is as follows:
>
> - if the associated sigevent is NULL then we want to group
> requests for the purpose of blocking on the group completion
> (LIO_WAIT sync behavior).
>
> - if the associated sigevent is valid (not NULL) then we want to
> group requests for the purpose of being notified upon that
> group of requests completion (LIO_NOWAIT async behaviour).
>
>
>
>
> fs/aio.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/aio.h | 13 ++++-
> include/linux/aio_abi.h | 1
> 3 files changed, 131 insertions(+), 6 deletions(-)
>
>
> Signed-off-by: S?bastien Dugu? <[email protected]>
> Signed-off-by: Laurent Vivier <[email protected]>
>
> Index: linux-2.6.19-rc5-mm2/fs/aio.c
> ===================================================================
> --- linux-2.6.19-rc5-mm2.orig/fs/aio.c 2006-11-17 11:20:32.000000000
> +0100 +++ linux-2.6.19-rc5-mm2/fs/aio.c 2006-11-17 16:22:00.000000000
> +0100 @@ -414,6 +414,7 @@ static struct kiocb fastcall *__aio_get_
> req->ki_cancel = NULL;
> req->ki_retry = NULL;
> req->ki_dtor = NULL;
> + req->ki_lio = NULL;
> req->private = NULL;
> req->ki_iovec = NULL;
> req->ki_notify.sigq = NULL;
> @@ -1010,6 +1011,53 @@ out_unlock:
> return -EINVAL;
> }
>
> +static inline void lio_check(struct lio_event *lio)
> +{
> + int ret;
> +
> + ret = atomic_dec_and_test(&lio->lio_users);
> +
> + if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) {
> + /* last one -> notify process */
> + aio_send_signal(&lio->lio_notify);
> + kfree(lio);
> + }
> +}
> +
> +static struct lio_event *lio_create(struct sigevent __user *user_event)
> +{
> + int ret = 0;
> + struct lio_event *lio = NULL;
> +
> + lio = kzalloc(sizeof(*lio), GFP_KERNEL);
> +
> + if (!lio)
> + return ERR_PTR(-EAGAIN);
> +
> + /*
> + * Grab an initial ref on the lio to avoid races between
> + * submission and completion.
> + */
> + atomic_set(&lio->lio_users, 1);
> +
> + lio->lio_notify.notify = SIGEV_NONE;
> +
> + if (user_event) {
> + /*
> + * User specified an event for this lio,
> + * he wants to be notified upon lio completion.
> + */
> + ret = aio_setup_sigevent(&lio->lio_notify, user_event);
> +
> + if (ret) {
> + kfree(lio);
> + return ERR_PTR(ret);
> + }
> + }
> +
> + return lio;
> +}
> +
> /* aio_complete
> * Called when the io request on the given iocb is complete.
> * Returns true if this is the last user of the request. The
> @@ -1058,8 +1106,12 @@ int fastcall aio_complete(struct kiocb *
> * when the event got cancelled.
> */
> if (kiocbIsCancelled(iocb)) {
> + if (iocb->ki_lio)
> + lio_check(iocb->ki_lio);
> +
> if (iocb->ki_notify.sigq)
> sigqueue_free(iocb->ki_notify.sigq);
> +
> goto put_rq;
> }
>
> @@ -1100,6 +1152,9 @@ int fastcall aio_complete(struct kiocb *
> sigqueue_free(iocb->ki_notify.sigq);
> }
>
> + if (iocb->ki_lio)
> + lio_check(iocb->ki_lio);
> +
> pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
> iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
> put_rq:
> @@ -1634,7 +1689,7 @@ static int aio_wake_function(wait_queue_
> }
>
> int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
> - struct iocb *iocb)
> + struct iocb *iocb, struct lio_event *lio)
> {
> struct kiocb *req;
> struct file *file;
> @@ -1696,6 +1751,9 @@ int fastcall io_submit_one(struct kioctx
> goto out_put_req;
> }
>
> + /* Attach this iocb to its lio */
> + req->ki_lio = lio;
> +
> ret = aio_setup_iocb(req);
>
> if (ret)
> @@ -1739,6 +1797,8 @@ asmlinkage long sys_io_submit(aio_contex
> struct iocb __user * __user *iocbpp)
> {
> struct kioctx *ctx;
> + struct lio_event *lio = NULL;
> + int lio_wait = 0;
> long ret = 0;
> int i;
>
> @@ -1772,11 +1832,66 @@ asmlinkage long sys_io_submit(aio_contex
> break;
> }
>
> - ret = io_submit_one(ctx, user_iocb, &tmp);
> - if (ret)
> - break;
> + if (tmp.aio_lio_opcode == IOCB_CMD_GROUP) {
> +
> + /* this command means that all following IO commands
> + * are in the same group.
> + *
> + * Userspace either wants to be notified upon or block
> until
> + * completion of all the requests in the group.
> + */
> + /*
> + * Ignore an IOCB_CMD_GROUP request if we are already
> + * processing one. This means only one listio per
> + * io_submit call.
> + */
> + if (lio)
> + continue;
> +
> + lio = lio_create((struct sigevent __user *)(unsigned
> long)
> + tmp.aio_sigeventp);
> +
> + ret = PTR_ERR(lio);
> +
> + if (IS_ERR(lio))
> + goto out_put_ctx;
> +
> + if (!tmp.aio_sigeventp)
> + lio_wait = 1;
> + } else {
> + if (lio)
> + atomic_inc(&lio->lio_users);
> +
> + ret = io_submit_one(ctx, user_iocb, &tmp, lio);
> +
> + if (ret) {
> + if (lio) {
> + /*
> + * If a request failed, just decrement
> + * the users count, but go on
> submitting
> + * subsequent requests.
> + */
> + atomic_dec(&lio->lio_users);
> + } else
> + break;
> + }
> + }
> + }
> +
> + if (lio) {
> + /*
> + * Drop extra ref on the lio now that we're done submitting
> + * requests
> + */
> + lio_check(lio);
> +
> + if (lio_wait) {
> + wait_event(ctx->wait, atomic_read(&lio->lio_users)==0);
> + kfree(lio);
> + }
> }
>
> +out_put_ctx:
> put_ioctx(ctx);
> return i ? i : ret;
> }
> Index: linux-2.6.19-rc5-mm2/include/linux/aio_abi.h
> ===================================================================
> --- linux-2.6.19-rc5-mm2.orig/include/linux/aio_abi.h 2006-11-17
> 11:20:32.000000000 +0100 +++
> linux-2.6.19-rc5-mm2/include/linux/aio_abi.h 2006-11-17
> 11:21:31.000000000 +0100 @@ -43,6 +43,7 @@ enum { IOCB_CMD_NOOP = 6,
> IOCB_CMD_PREADV = 7,
> IOCB_CMD_PWRITEV = 8,
> + IOCB_CMD_GROUP = 9,
> };
>
> /* read() from /dev/aio returns these structures. */
> Index: linux-2.6.19-rc5-mm2/include/linux/aio.h
> ===================================================================
> --- linux-2.6.19-rc5-mm2.orig/include/linux/aio.h 2006-11-17
> 11:20:32.000000000 +0100 +++ linux-2.6.19-rc5-mm2/include/linux/aio.h
> 2006-11-17 11:21:31.000000000 +0100 @@ -58,6 +58,11 @@ struct aio_notify {
> struct sigqueue *sigq;
> };
>
> +struct lio_event {
> + atomic_t lio_users;
> + struct aio_notify lio_notify;
> +};
> +
> /* is there a better place to document function pointer methods? */
> /**
> * ki_retry - iocb forward progress callback
> @@ -113,6 +118,9 @@ struct kiocb {
> wait_queue_t ki_wait;
> loff_t ki_pos;
>
> + /* lio this iocb might be attached to */
> + struct lio_event *ki_lio;
> +
> void *private;
> /* State that we remember to be able to restart/retry */
> unsigned short ki_opcode;
> @@ -220,12 +228,13 @@ struct mm_struct;
> extern void FASTCALL(exit_aio(struct mm_struct *mm));
> extern struct kioctx *lookup_ioctx(unsigned long ctx_id);
> extern int FASTCALL(io_submit_one(struct kioctx *ctx,
> - struct iocb __user *user_iocb, struct iocb *iocb));
> + struct iocb __user *user_iocb, struct iocb
> *iocb,
> + struct lio_event *lio));
>
> /* semi private, but used by the 32bit emulations: */
> struct kioctx *lookup_ioctx(unsigned long ctx_id);
> int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
> - struct iocb *iocb));
> + struct iocb *iocb, struct lio_event *lio));
>
> #define get_ioctx(kioctx) do { \
> BUG_ON(atomic_read(&(kioctx)->users) <= 0); \

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2006-11-21 10:40:25

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification


Zach Brown <[email protected]> wrote:

> S?bastien Dugu? wrote:
> > AIO completion signal notification
>
> This is looking a lot better, thanks for keeping at it.
>
> > +static long aio_setup_sigevent(struct aio_notify *notify,
> > + struct sigevent __user *user_event)
> > +{
> > + sigevent_t event;
> > + struct task_struct *target;
> > +
> > + if (copy_from_user(&event, user_event, sizeof (event)))
> > + return -EFAULT;
>
> Last time we talked about this needing to call get_compat_sigevent(). I
> think it still needs to.

Yes, I could not find an elegant way to tackle this, so I left it
float for a while.

>
> I think we should avoid the examples set by the current
> compat_sys_io_submit() and get_compat_sigevent() callers. They copy
> translated data on to the userspace stack and pass it to the syscalls.
> That will get crazy for compat_sys_io_submit() because it would have to
> rewrite the iocb and the pointer to the iocb to get sys_io_submit() to
> find a copied sigevent on the stack.

Right, that's what I really wanted to avoid at all costs.

>
> I think the model is compat_do_readv_writev(). Hoist some of the
> syscall logic up into the compat layer so that one copying and
> translating pass is made instead of trying to fool the syscall logic
> into thinking that it's being called from a native word size caller.

Yes, but you end up duplicating a lot of code, but if it's the price to
pay, I'm all for it.

>
> So io_submit_one() should be given the kernel copies of the userspace
> structures it needs. sys_io_submit() will pass it the copies it made
> for native word size callers. compat_sys_io_submit() will pass in the
> copies it made after translating from 32bit arguments. io_submit_one()
> and lookup_kioctx() will have to be made available to kernel/compat.c
> (via linux/aio.h, surely.). aio_setup_sigevent() will be called from
> *_io_submit() and given the kernel sigevent, not a userspace pointer.
>
> Reworking things this way should have the added benefit of making 32/64
> sys_io_submit() more efficient than it is today.

I completely aggree. I will look into this.

Thanks,

S?bastien.

2006-11-22 01:03:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Mon, 20 Nov 2006 15:22:52 +0100
S__bastien Dugu__ <[email protected]> wrote:

> +static long aio_setup_sigevent(struct aio_notify *notify,
> + struct sigevent __user *user_event)
> +{
> + sigevent_t event;
> + struct task_struct *target;
> +
> + if (copy_from_user(&event, user_event, sizeof (event)))
> + return -EFAULT;
> +
> + if (event.sigev_notify == SIGEV_NONE)
> + return 0;
> +
> + notify->notify = event.sigev_notify;
> + notify->signo = event.sigev_signo;
> + notify->value = event.sigev_value;
> +
> + read_lock(&tasklist_lock);
> + target = good_sigevent(&event);
> +
> + if (unlikely(!target || (target->flags & PF_EXITING)))
> + goto out_unlock;
> +
> + notify->target = target;
> +
> + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> + /*
> + * This reference will be dropped in really_put_req() when
> + * we're done with the request.
> + */
> + get_task_struct(target);
> + }

It worries me that this function can save away a task_struct* without
having taken a reference against it.


> + read_unlock(&tasklist_lock);
> +
> + /*
> + * NOTE: we cannot free the sigqueue in the completion path as
> + * the signal may not have been delivered to the target task.
> + * Therefore it has to be freed in __sigqueue_free() when the
> + * signal is collected if si_code is SI_ASYNCIO.
> + */
> + notify->sigq = sigqueue_alloc();
> +
> + if (unlikely(!notify->sigq))
> + return -EAGAIN;
> +
> + return 0;
> +
> +out_unlock:
> + read_unlock(&tasklist_lock);
> + return -EINVAL;
> +}

2006-11-23 08:29:41

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Tue, 21 Nov 2006 17:02:28 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 20 Nov 2006 15:22:52 +0100
> S__bastien Dugu__ <[email protected]> wrote:
>
> > +static long aio_setup_sigevent(struct aio_notify *notify,
> > + struct sigevent __user *user_event)
> > +{
> > + sigevent_t event;
> > + struct task_struct *target;
> > +
> > + if (copy_from_user(&event, user_event, sizeof (event)))
> > + return -EFAULT;
> > +
> > + if (event.sigev_notify == SIGEV_NONE)
> > + return 0;
> > +
> > + notify->notify = event.sigev_notify;
> > + notify->signo = event.sigev_signo;
> > + notify->value = event.sigev_value;
> > +
> > + read_lock(&tasklist_lock);
> > + target = good_sigevent(&event);
> > +
> > + if (unlikely(!target || (target->flags & PF_EXITING)))
> > + goto out_unlock;
> > +
> > +
> > +
> > + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> > + /*
> > + * This reference will be dropped in really_put_req() when
> > + * we're done with the request.
> > + */
> > + get_task_struct(target);
> > + }
>
> It worries me that this function can save away a task_struct* without
> having taken a reference against it.
>

OK. Does moving 'notify->target = target;' after the get_task_struct() will
do, or am I missing something more subtle?

Thanks,

S?bastien.

2006-11-23 08:42:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Thu, 23 Nov 2006 09:28:05 +0100
S?bastien Dugu? <[email protected]> wrote:

> > > + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> > > + /*
> > > + * This reference will be dropped in really_put_req() when
> > > + * we're done with the request.
> > > + */
> > > + get_task_struct(target);
> > > + }
> >
> > It worries me that this function can save away a task_struct* without
> > having taken a reference against it.
> >
>
> OK. Does moving 'notify->target = target;' after the get_task_struct() will
> do, or am I missing something more subtle?

Well it's your code - you tell me ;)

It is unsafe (and rather pointless) to be saving the address of some structure
which can be freed at any time.

2006-11-23 09:47:57

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Thu, 23 Nov 2006 00:40:53 -0800
Andrew Morton <[email protected]> wrote:

> On Thu, 23 Nov 2006 09:28:05 +0100
> S?bastien Dugu? <[email protected]> wrote:
>
> > > > + target = good_sigevent(&event);
> > > > +
> > > > + if (unlikely(!target || (target->flags & PF_EXITING)))
> > > > + goto out_unlock;
> > > > +
> > > > +
> > > > +
> > > > + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> > > > + /*
> > > > + * This reference will be dropped in really_put_req() when
> > > > + * we're done with the request.
> > > > + */
> > > > + get_task_struct(target);
> > > > + }
> > >
> > > It worries me that this function can save away a task_struct* without
> > > having taken a reference against it.
> > >
> >
> > OK. Does moving 'notify->target = target;' after the get_task_struct() will
> > do, or am I missing something more subtle?
>
> Well it's your code - you tell me ;)
>
> It is unsafe (and rather pointless) to be saving the address of some structure
> which can be freed at any time.

Sorry, I expressed myself quite badly. What I wanted to know is whether you
are worried with the task been freed between saving its pointer and getting a
ref on it (which is trivial to fix) or you are thinking of something deeper.

Thanks,

S?bastien.

2006-11-23 10:19:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Thu, 23 Nov 2006 10:47:55 +0100
S?bastien Dugu? <[email protected]> wrote:

> On Thu, 23 Nov 2006 00:40:53 -0800
> Andrew Morton <[email protected]> wrote:
>
> > On Thu, 23 Nov 2006 09:28:05 +0100
> > S?bastien Dugu? <[email protected]> wrote:
> >
> > > > > + target = good_sigevent(&event);
> > > > > +
> > > > > + if (unlikely(!target || (target->flags & PF_EXITING)))
> > > > > + goto out_unlock;
> > > > > +
> > > > > +
> > > > > +
> > > > > + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> > > > > + /*
> > > > > + * This reference will be dropped in really_put_req() when
> > > > > + * we're done with the request.
> > > > > + */
> > > > > + get_task_struct(target);
> > > > > + }
> > > >
> > > > It worries me that this function can save away a task_struct* without
> > > > having taken a reference against it.
> > > >
> > >
> > > OK. Does moving 'notify->target = target;' after the get_task_struct() will
> > > do, or am I missing something more subtle?
> >
> > Well it's your code - you tell me ;)
> >
> > It is unsafe (and rather pointless) to be saving the address of some structure
> > which can be freed at any time.
>
> Sorry, I expressed myself quite badly. What I wanted to know is whether you
> are worried with the task been freed between saving its pointer and getting a
> ref on it (which is trivial to fix) or you are thinking of something deeper.
>

Look:

> + notify->target = target;
> +
> + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> + /*
> + * This reference will be dropped in really_put_req() when
> + * we're done with the request.
> + */
> + get_task_struct(target);

If that test fails, we've saved a pointer to the task_struct without having
taken a refreence on it.

2006-11-23 10:27:52

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH -mm 3/4][AIO] - AIO completion signal notification

On Thu, 23 Nov 2006 02:14:37 -0800
Andrew Morton <[email protected]> wrote:

[snip]

> Look:
>
> > + notify->target = target;
> > +
> > + if (notify->notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) {
> > + /*
> > + * This reference will be dropped in really_put_req() when
> > + * we're done with the request.
> > + */
> > + get_task_struct(target);
>
> If that test fails, we've saved a pointer to the task_struct without having
> taken a refreence on it.
>

Aggreed, just wanted to make sure.

Thanks,

S?bastien.

2006-11-23 10:57:14

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] fs : reorder some 'struct inode' fields to speedup i_size manipulations

--- linux-2.6.19-rc6/include/linux/fs.h 2006-11-23 10:47:53.000000000 +0100
+++ linux-2.6.19-rc6-ed/include/linux/fs.h 2006-11-23 11:13:36.000000000 +0100
@@ -548,12 +548,15 @@ struct inode {
uid_t i_uid;
gid_t i_gid;
dev_t i_rdev;
+ unsigned long i_version;
loff_t i_size;
+#ifdef __NEED_I_SIZE_ORDERED
+ seqcount_t i_size_seqcount;
+#endif
struct timespec i_atime;
struct timespec i_mtime;
struct timespec i_ctime;
unsigned int i_blkbits;
- unsigned long i_version;
blkcnt_t i_blocks;
unsigned short i_bytes;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
@@ -598,9 +601,6 @@ struct inode {
void *i_security;
#endif
void *i_private; /* fs or device private pointer */
-#ifdef __NEED_I_SIZE_ORDERED
- seqcount_t i_size_seqcount;
-#endif
};

/*


Attachments:
(No filename) (766.00 B)
move_i_size_seqcount.patch (816.00 B)
Download all attachments

2006-11-27 13:39:32

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH -mm 4/4][AIO] - Listio support

On 11/20/06, S?bastien Dugu? <[email protected]> wrote:
>
> POSIX listio support
>
> This patch adds POSIX listio completion notification support. It builds
> on support provided by the aio signal notification patch and adds an
> IOCB_CMD_GROUP command to io_submit().
>

I have altered this patch to provide the listio support through a
separate syscall.

The code is in RFC stage at the moment and is only compile-tested on
i386. I have refrained from modifying unistd.h for now in order to
avoid touching too many files. I plan to do actual tests (which
involves user space changes to aio libraries) only if there is an
agreement with this syscall approach.

More details about the interface are in the patch itself.

Regards,
Bharata.


Attachments:
(No filename) (772.00 B)
aio-listio-support.patch (11.77 kB)
Download all attachments

2006-11-27 21:38:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs : reorder some 'struct inode' fields to speedup i_size manipulations

On Thu, 23 Nov 2006 11:57:29 +0100
Eric Dumazet <[email protected]> wrote:

> On 32bits SMP platforms, 64bits i_size is protected by a seqcount
> (i_size_seqcount).
>
> When i_size is read or written, i_size_seqcount is read/written as well, so it
> make sense to group these two fields together in the same cache line.
>
> Before this patch, accessing i_size needed 3 cache lines (2 for i_size, one
> for i_size_seqcount). After, only one cache line is needed/ (dirtied on a
> i_size change).

I didn't understand that paragraph at all, really, so I took it out.

At present an i_size change will dirty one, two or three cachelines, most
likely one or two.

After your patch an i_size change will dirty one or two cachelines, most
likely one.

yes?

> This patch moves i_size_seqcount next to i_size, and also moves i_version to
> let offsetof(struct inode, i_size) being 0x40 instead of 0x3c (for 32bits
> platforms).
>
> For 64 bits platforms, i_size_seqcount doesnt exist, and the move of a 'long
> i_version' should not introduce a new hole because of padding.
>
> Signed-off-by: Eric Dumazet <[email protected]>

2006-11-27 21:52:04

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs : reorder some 'struct inode' fields to speedup i_size manipulations

Andrew Morton a ?crit :
> On Thu, 23 Nov 2006 11:57:29 +0100
> Eric Dumazet <[email protected]> wrote:
>
>> On 32bits SMP platforms, 64bits i_size is protected by a seqcount
>> (i_size_seqcount).
>>
>> When i_size is read or written, i_size_seqcount is read/written as well, so it
>> make sense to group these two fields together in the same cache line.
>>
>> Before this patch, accessing i_size needed 3 cache lines (2 for i_size, one
>> for i_size_seqcount). After, only one cache line is needed/ (dirtied on a
>> i_size change).
>
> I didn't understand that paragraph at all, really, so I took it out.
>
> At present an i_size change will dirty one, two or three cachelines, most
> likely one or two.
>
> After your patch an i_size change will dirty one or two cachelines, most
> likely one.
>
> yes?

nope

Before :
---------
offsetof(i_size) = 0x3C

i_size is 8 bytes, so i_size spans 2 cache lines (if 64 or 32 bytes cache lines)

and offsetof(i_size_seqcount) = 0x160, so a read of i_size (coupled with a
read of seqcount) needed 3 cache lines. A change of i_size dirtied 2 or 3
cache lines.


After :
--------
offsetof(i_size) = 0x40
offsetof(i_size_seqcount) = 0x48

One cache line 'only', reading or writing.


2006-11-27 22:01:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs : reorder some 'struct inode' fields to speedup i_size manipulations

On Mon, 27 Nov 2006 22:52:04 +0100
Eric Dumazet <[email protected]> wrote:

> > I didn't understand that paragraph at all, really, so I took it out.
> >
> > At present an i_size change will dirty one, two or three cachelines, most
> > likely one or two.
> >
> > After your patch an i_size change will dirty one or two cachelines, most
> > likely one.
> >
> > yes?
>
> nope
>
> Before :
> ---------
> offsetof(i_size) = 0x3C
>
> i_size is 8 bytes, so i_size spans 2 cache lines (if 64 or 32 bytes cache lines)

This all depends on the offset of the inode, and you don't know what that is.

offsetof(ext3_inode_info, vfs_inode) != offsetof(nfs_inode, vfs_inode), etc.

2006-11-28 20:31:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs : reorder some 'struct inode' fields to speedup i_size manipulations

Andrew Morton a ?crit :

> This all depends on the offset of the inode, and you don't know what that is.
>
> offsetof(ext3_inode_info, vfs_inode) != offsetof(nfs_inode, vfs_inode), etc.

Doh... yes you are absolutly right :) I feel dumb now :(