2017-06-09 09:49:45

by Kirill Tkhai

Subject: [PATCH] aio: Add command to wait completion of all requests

While implementing aio support for the Checkpoint/Restore In
Userspace (CRIU) project, we found that the current kernel
functionality does not allow waiting for completion of all
submitted requests. Checkpoint software cannot determine whether
there are any in-flight requests, i.e. whether the process's aio
ring memory is ready for dumping.

This patch solves that problem. The interface it introduces may
also be useful to other userspace callers. It adds a new
IOCB_CMD_WAIT_ALL command, the idea of which is to make the
caller wait until the sum of completed and available requests
becomes equal to (ctx->nr_events - 1). When they are equal,
there are no in-flight aio requests in the system.
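
For illustration, a checkpointing tool could invoke the new
command like any other aio opcode. The following is a minimal
sketch using the raw syscall (libaio has no wrapper for the
proposed opcode; error handling is elided):

    #include <linux/aio_abi.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Wait until every in-flight request on @ctx has completed.
     * io_submit() reports the iocb as consumed on success and
     * fails (e.g. with EBUSY) if another waiter already set
     * ctx->dead. */
    static long wait_all_aio(aio_context_t ctx)
    {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };

        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_WAIT_ALL; /* proposed command */

        return syscall(SYS_io_submit, ctx, 1, cbs);
    }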

Since available requests are accounted per cpu, we need
synchronization with their other readers and writers. The
variable ctx->dead is used for that; we set it to 1 for the
duration of the synchronization. Now let's look at how we make
sure that concurrent users on other cpus see it.

There are two cases. When the task is the only user of its
memory, it races with aio_complete() only. That function takes
ctx->completion_lock, so we simply take the same lock while
reading the per-cpu counters.

When there are more users of current->mm, we rely on the fact
that put_reqs_available() and get_reqs_available() already run
with interrupts disabled. The primitives local_irq_disable() and
local_irq_enable() act as rcu_read_lock_sched() and
rcu_read_unlock_sched() barriers, and the waiter uses
synchronize_sched() to propagate the visibility of ctx->dead to
other cpus. This RCU primitive acts as a full memory barrier, so
nobody will use the per-cpu reqs_available after we return from
it, and the waiting then comes down to the same steps as in the
first case. A good point is that the solution does not affect
the performance of the existing code in any way.
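
To make the pairing explicit, here is the synchronization
pattern distilled from the patch below (an illustrative sketch,
not a diff hunk):

    /* Readers: get_reqs_available() / put_reqs_available() */
    local_irq_save(flags);        /* acts as rcu_read_lock_sched() */
    if (atomic_read(&ctx->dead))
        goto out;                 /* per-cpu counters are off limits */
    /* ... touch this_cpu_ptr(ctx->cpu)->reqs_available ... */
    out:
    local_irq_restore(flags);     /* acts as rcu_read_unlock_sched() */

    /* Waiter: aio_wait_all() */
    atomic_xchg(&ctx->dead, 1);   /* publish the flag */
    synchronize_sched();          /* every irq-off section that could
                                   * have missed the flag has finished */
    /* now it is safe to drain the per-cpu reqs_available counters */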

A couple more words about checkpoint/restore. We need exactly
this functionality, and we are not going to dump in-flight
requests for now, because experiments show that if dumping of a
container fails for reasons in another subsystem, it is not
always possible to queue cancelled requests back (file
descriptors may already be closed, there may be memory
pressure). So there is nothing about dumping here.

It seems this functionality will be useful not only to us, and
others may use the new command like the other standard aio
commands.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/aio.c | 71 +++++++++++++++++++++++++++++++++++++++---
include/uapi/linux/aio_abi.h | 1 +
2 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..447fa4283c7c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -899,7 +899,11 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
struct kioctx_cpu *kcpu;
unsigned long flags;

- local_irq_save(flags);
+ local_irq_save(flags); /* Implies rcu_read_lock_sched() */
+ if (atomic_read(&ctx->dead)) {
+ atomic_add(nr, &ctx->reqs_available);
+ goto unlock;
+ }
kcpu = this_cpu_ptr(ctx->cpu);
kcpu->reqs_available += nr;

@@ -907,8 +911,8 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
kcpu->reqs_available -= ctx->req_batch;
atomic_add(ctx->req_batch, &ctx->reqs_available);
}
-
- local_irq_restore(flags);
+unlock:
+ local_irq_restore(flags); /* Implies rcu_read_unlock_sched() */
}

static bool get_reqs_available(struct kioctx *ctx)
@@ -917,7 +921,9 @@ static bool get_reqs_available(struct kioctx *ctx)
bool ret = false;
unsigned long flags;

- local_irq_save(flags);
+ local_irq_save(flags); /* Implies rcu_read_lock_sched() */
+ if (atomic_read(&ctx->dead))
+ goto out;
kcpu = this_cpu_ptr(ctx->cpu);
if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);
@@ -937,7 +943,7 @@ static bool get_reqs_available(struct kioctx *ctx)
ret = true;
kcpu->reqs_available--;
out:
- local_irq_restore(flags);
+ local_irq_restore(flags); /* Implies rcu_read_unlock_sched() */
return ret;
}

@@ -1533,6 +1539,58 @@ static ssize_t aio_write(struct kiocb *req, struct iocb *iocb, bool vectored,
return ret;
}

+static bool reqs_completed(struct kioctx *ctx)
+{
+ unsigned available;
+ spin_lock_irq(&ctx->completion_lock);
+ available = atomic_read(&ctx->reqs_available) + ctx->completed_events;
+ spin_unlock_irq(&ctx->completion_lock);
+
+ return (available == ctx->nr_events - 1);
+}
+
+static int aio_wait_all(struct kioctx *ctx)
+{
+ unsigned users, reqs = 0;
+ struct kioctx_cpu *kcpu;
+ int cpu, ret;
+
+ if (atomic_xchg(&ctx->dead, 1))
+ return -EBUSY;
+
+ users = atomic_read(&current->mm->mm_users);
+ if (users > 1) {
+ /*
+ * Wait till concurrent threads and aio_complete() see
+ * dead flag. Implies full memory barrier on all cpus.
+ */
+ synchronize_sched();
+ } else {
+ /*
+ * Sync with aio_complete() to be sure it puts reqs_available,
+ * when dead flag is already seen.
+ */
+ spin_lock_irq(&ctx->completion_lock);
+ }
+
+ for_each_possible_cpu(cpu) {
+ kcpu = per_cpu_ptr(ctx->cpu, cpu);
+ reqs += kcpu->reqs_available;
+ kcpu->reqs_available = 0;
+ }
+
+ if (users == 1)
+ spin_unlock_irq(&ctx->completion_lock);
+
+ atomic_add(reqs, &ctx->reqs_available);
+
+ ret = wait_event_interruptible(ctx->wait, reqs_completed(ctx));
+
+ atomic_set(&ctx->dead, 0);
+
+ return ret;
+}
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, bool compat)
{
@@ -1556,6 +1614,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return -EINVAL;
}

+ if (iocb->aio_lio_opcode == IOCB_CMD_WAIT_ALL)
+ return aio_wait_all(ctx);
+
req = aio_get_req(ctx);
if (unlikely(!req))
return -EAGAIN;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..29291946d4f1 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,7 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_WAIT_ALL = 9,
};

/*


2017-06-13 08:14:17

by Cyrill Gorcunov

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Fri, Jun 09, 2017 at 12:49:34PM +0300, Kirill Tkhai wrote:
...
> +static int aio_wait_all(struct kioctx *ctx)
> +{
> + unsigned users, reqs = 0;
> + struct kioctx_cpu *kcpu;
> + int cpu, ret;
> +
> + if (atomic_xchg(&ctx->dead, 1))
> + return -EBUSY;
> +
> + users = atomic_read(&current->mm->mm_users);
> + if (users > 1) {
> + /*
> + * Wait till concurrent threads and aio_complete() see
> + * dead flag. Implies full memory barrier on all cpus.
> + */
> + synchronize_sched();
> + } else {
> + /*
> + * Sync with aio_complete() to be sure it puts reqs_available,
> + * when dead flag is already seen.
> + */
> + spin_lock_irq(&ctx->completion_lock);
> + }
> +
> + for_each_possible_cpu(cpu) {
> + kcpu = per_cpu_ptr(ctx->cpu, cpu);
> + reqs += kcpu->reqs_available;

I'm not that familiar with AIO internals, but this snippet worries me:
reqs_available is an unsigned int, and reqs is an unsigned int as well,
but it is used as an accumulator over ALL cpus. Can't it overflow and
give a modulo result? Should it not be an unsigned long or something?

> + kcpu->reqs_available = 0;
> + }
> +
> + if (users == 1)
> + spin_unlock_irq(&ctx->completion_lock);
> +
> + atomic_add(reqs, &ctx->reqs_available);
> +
> + ret = wait_event_interruptible(ctx->wait, reqs_completed(ctx));
> +
> + atomic_set(&ctx->dead, 0);
> +
> + return ret;
> +}

2017-06-13 09:46:06

by Kirill Tkhai

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On 13.06.2017 11:14, Cyrill Gorcunov wrote:
> On Fri, Jun 09, 2017 at 12:49:34PM +0300, Kirill Tkhai wrote:
> ...
>> +static int aio_wait_all(struct kioctx *ctx)
>> +{
>> + unsigned users, reqs = 0;
>> + struct kioctx_cpu *kcpu;
>> + int cpu, ret;
>> +
>> + if (atomic_xchg(&ctx->dead, 1))
>> + return -EBUSY;
>> +
>> + users = atomic_read(&current->mm->mm_users);
>> + if (users > 1) {
>> + /*
>> + * Wait till concurrent threads and aio_complete() see
>> + * dead flag. Implies full memory barrier on all cpus.
>> + */
>> + synchronize_sched();
>> + } else {
>> + /*
>> + * Sync with aio_complete() to be sure it puts reqs_available,
>> + * when dead flag is already seen.
>> + */
>> + spin_lock_irq(&ctx->completion_lock);
>> + }
>> +
>> + for_each_possible_cpu(cpu) {
>> + kcpu = per_cpu_ptr(ctx->cpu, cpu);
>> + reqs += kcpu->reqs_available;
>
> I'm not that familiar with AIO internals, but this snippet worries me:
> reqs_available is an unsigned int, and reqs is an unsigned int as well,
> but it is used as an accumulator over ALL cpus. Can't it overflow and
> give a modulo result? Should it not be an unsigned long or something?

All available reqs are initially contained in kioctx::reqs_available,
which is an atomic_t, and only then are they distributed over the
per-cpu counters. The per-cpu sum therefore can never exceed that
initial int-sized value, so this is OK.

>> + kcpu->reqs_available = 0;
>> + }
>> +
>> + if (users == 1)
>> + spin_unlock_irq(&ctx->completion_lock);
>> +
>> + atomic_add(reqs, &ctx->reqs_available);
>> +
>> + ret = wait_event_interruptible(ctx->wait, reqs_completed(ctx));
>> +
>> + atomic_set(&ctx->dead, 0);
>> +
>> + return ret;
>> +}

2017-06-13 10:32:32

by Cyrill Gorcunov

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Tue, Jun 13, 2017 at 12:45:39PM +0300, Kirill Tkhai wrote:
> >
> > I'm not that familiar with AIO internals, but this snippet worries me:
> > reqs_available is an unsigned int, and reqs is an unsigned int as well,
> > but it is used as an accumulator over ALL cpus. Can't it overflow and
> > give a modulo result? Should it not be an unsigned long or something?
>
> All available reqs are initially contained in kioctx::reqs_available,
> which is an atomic_t, and only then are they distributed over the
> per-cpu counters. The per-cpu sum therefore can never exceed that
> initial int-sized value, so this is OK.

Thanks for the explanation! Looks OK to me. But I'm far from being
an aio expert; if someone else doesn't beat me to it, I'll take a
closer look in a day or a couple.

2017-06-13 15:09:28

by Benjamin LaHaise

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Fri, Jun 09, 2017 at 12:49:34PM +0300, Kirill Tkhai wrote:
> While implementing aio support for the Checkpoint/Restore In
> Userspace (CRIU) project, we found that the current kernel
> functionality does not allow waiting for completion of all
> submitted requests. Checkpoint software cannot determine whether
> there are any in-flight requests, i.e. whether the process's aio
> ring memory is ready for dumping.

Simply put: no. There is no way this command will ever complete under
various conditions - if another thread submits i/o, if the in-flight
i/o is waiting on non-disk i/o, etc. If you want to wait until all
aios of a task are complete, kill the process and wait for the zombie
to be reaped. This is a simplistic approach to checkpointing that
does not solve all issues related to checkpointing aio.

-ben

--
"Thought is the essence of where you are now."

2017-06-13 15:11:28

by Kirill Tkhai

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On 13.06.2017 17:42, Benjamin LaHaise wrote:
> On Fri, Jun 09, 2017 at 12:49:34PM +0300, Kirill Tkhai wrote:
>> While implementing aio support for the Checkpoint/Restore In
>> Userspace (CRIU) project, we found that the current kernel
>> functionality does not allow waiting for completion of all
>> submitted requests. Checkpoint software cannot determine whether
>> there are any in-flight requests, i.e. whether the process's aio
>> ring memory is ready for dumping.
>
> Simply put: no. There is no way this command will ever complete under
> various conditions - if another thread submits i/o, if the in-flight
> i/o is waiting on non-disk i/o, etc.

Another thread can't submit a request in parallel, as aio_wait_all()
sets the dead flag at the beginning of its work, and it waits until
the other threads see it. After that, get_reqs_available() can't get
any request. Have you seen the comments I wrote? There is a
synchronize_sched() in the (current->mm->mm_users > 1) branch.

> If you want to wait until all aios of a
> task are complete, kill the process and wait for the zombie to be reaped.
> This is a simplistic approach to checkpointing that does not solve all
> issues related to checkpointing aio.

The functionality I implemented grew from real need and experience.
We try to avoid kernel modifications where possible, but in-flight
aio requests are not a case where that works.

Checkpointing a live system is not as easy as it seems. The way you
suggest makes a whole class of live-system snapshots impossible. You
can't just kill a process, wait for the zombie, and then restart the
process again: processes are connected in complicated topologies of
entities. You need to restore the pgid and sid of the process, the
namespaces, shared files (CLONE_FILES and CLONE_FS). All of this has
to be created in a certain order, and there are a lot of rules and
limitations. You can't just recreate the same process in the same
place: it's not merely hard, it's simply impossible. Your suggestion
kills a big class of use cases and is not suitable in any way. You
may refer to the criu project site if you are interested (criu.org).

Benjamin, please, could you look at this once again? We really need
this functionality; it's not an idle wish. Let's discuss the way we
should implement it, if you don't like the patch.

There is a lot of functionality in the kernel supporting the concept
I described. Check out the MSG_PEEK flag for receiving from a socket
(see unix_dgram_recvmsg()), for example. AIO is now one of the last
barriers to full snapshot support in criu.
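
For instance, here is a minimal sketch of the kind of
non-destructive read that checkpointing relies on (the
peek_datagram() helper is hypothetical, added only for
illustration):

    #include <sys/socket.h>

    /* Peek at a queued datagram on socket @sk without consuming
     * it, so the queue contents can be dumped while the
     * application still sees the data on its next recv(). */
    static ssize_t peek_datagram(int sk, char *buf, size_t len)
    {
        return recv(sk, buf, len, MSG_PEEK | MSG_DONTWAIT);
    }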

Kirill

2017-06-13 15:26:03

by Benjamin LaHaise

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
...
> The functionality I implemented grew from real need and experience.
> We try to avoid kernel modifications where possible, but in-flight
> aio requests are not a case where that works.

What you've done only works for *your* use-case, but not in general. Like
in other subsystems, you need to provide hooks on a per file descriptor
basis for quiescing different kinds of file descriptors. Your current
patch set completely ignores things like usb gadget. You need to create
infrastructure for restarting i/os after your checkpointing occurs, which
you haven't put any thought into in this patchset. If you want to discuss
how to do that, fine, but the approach in this patchset simply does not
work in general. What happens when an aio doesn't complete or takes hours
to complete?

> Checkpointing a live system is not as easy as it seems. The way you
> suggest makes a whole class of live-system snapshots impossible. You
> can't just kill a process, wait for the zombie, and then restart the
> process again: processes are connected in complicated topologies of
> entities. You need to restore the pgid and sid of the process, the
> namespaces, shared files (CLONE_FILES and CLONE_FS). All of this has
> to be created in a certain order, and there are a lot of rules and
> limitations. You can't just recreate the same process in the same
> place: it's not merely hard, it's simply impossible. Your suggestion
> kills a big class of use cases and is not suitable in any way. You
> may refer to the criu project site if you are interested (criu.org).

Point.

> Benjamin, please, could you look at this once again? We really need
> this functionality; it's not an idle wish. Let's discuss the way we
> should implement it, if you don't like the patch.
>
> There is a lot of functionality in the kernel supporting the concept
> I described. Check out the MSG_PEEK flag for receiving from a socket
> (see unix_dgram_recvmsg()), for example. AIO is now one of the last
> barriers to full snapshot support in criu.
...

Then please start looking at the big picture and think about things other
than short-lived disk i/o. Without some design in the infrastructure to
handle those cases, your solution is incomplete and will potentially leave
us with complex and unsupportable semantics that don't actually solve the
problem you're trying to solve.

Some of the things to think about: you need infrastructure to restart an
aio, which means you need some way of dumping aios that remain in flight,
as otherwise your application will see aios cancelled during checkpointing
that should not have been. You need to actually cancel aios. These
details need to be addressed if checkpointing is going to be a robust
feature that works for more than toy use-cases.

-ben
--
"Thought is the essence of where you are now."

2017-06-13 16:18:16

by Kirill Tkhai

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On 13.06.2017 18:26, Benjamin LaHaise wrote:
> On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
> ...
>> The functionality I implemented grew from real need and experience.
>> We try to avoid kernel modifications where possible, but in-flight
>> aio requests are not a case where that works.
>
> What you've done only works for *your* use-case, but not in general. Like
> in other subsystems, you need to provide hooks on a per file descriptor
> basis for quiescing different kinds of file descriptors.

Which hooks do you suggest? It's possible that no file descriptor is
open any more after a request is submitted. Do you suggest an
interface to reopen a struct file?

> Your current
> patch set completely ignores things like usb gadget. You need to create
> infrastructure for restarting i/os after your checkpointing occurs, which
> you haven't put any thought into in this patchset. If you want to discuss
> how to do that, fine, but the approach in this patchset simply does not
> work in general. What happens when an aio doesn't complete or takes hours
> to complete?

The patch uses wait_event_interruptible(), but it's possible to convert
it to wait_event_interruptible_hrtimeout(), like read_events() does.
It's not a fatal flaw of the patch. read_events() simply waits with a
timeout; can't we do the same?
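
Something like the following, if a bound is wanted (a sketch only;
"until" would be a caller-supplied ktime_t deadline, exactly as in
read_events()):

    /* Bounded variant of the wait in aio_wait_all(), mirroring
     * read_events(); returns -ETIME if the deadline expires first. */
    ret = wait_event_interruptible_hrtimeout(ctx->wait,
                                             reqs_completed(ctx),
                                             until);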

>> Checkpointing a live system is not as easy as it seems. The way you
>> suggest makes a whole class of live-system snapshots impossible. You
>> can't just kill a process, wait for the zombie, and then restart the
>> process again: processes are connected in complicated topologies of
>> entities. You need to restore the pgid and sid of the process, the
>> namespaces, shared files (CLONE_FILES and CLONE_FS). All of this has
>> to be created in a certain order, and there are a lot of rules and
>> limitations. You can't just recreate the same process in the same
>> place: it's not merely hard, it's simply impossible. Your suggestion
>> kills a big class of use cases and is not suitable in any way. You
>> may refer to the criu project site if you are interested (criu.org).
>
> Point.
>
>> Benjamin, please, could you look at this once again? We really need
>> this functionality; it's not an idle wish. Let's discuss the way we
>> should implement it, if you don't like the patch.
>>
>> There is a lot of functionality in the kernel supporting the concept
>> I described. Check out the MSG_PEEK flag for receiving from a socket
>> (see unix_dgram_recvmsg()), for example. AIO is now one of the last
>> barriers to full snapshot support in criu.
> ...
>
> Then please start looking at the big picture and think about things other
> than short-lived disk i/o. Without some design in the infrastructure to
> handle those cases, your solution is incomplete and will potentially leave
> us with complex and unsupportable semantics that don't actually solve the
> problem you're trying to solve.
>
> Some of the things to think about: you need infrastructure to restart an
> aio, which means you need some way of dumping aios that remain in flight,
> as otherwise your application will see aios cancelled during checkpointing
> that should not have been. You need to actually cancel aios. These
> details need to be addressed if checkpointing is going to be a robust
> feature that works for more than toy use-cases.

Could you please describe how cancelling aio requests will help to
wait for their completion? Is there a guarantee that they can easily
be queued back? I suppose not, because there may be a memory limit or
some low-level driver limitations, depending on internal conditions.

Also, it does not seem good to overload aio with functionality for
obtaining the closed file descriptors of submitted requests.

Do you mean it this way, or did I misunderstand you? Could you please
make your idea concrete?

In my view, cancelling requests does not allow implementing the need
I described. If we can't queue a request back, it breaks snapshotting
and the user application in general.

Thanks,
Kirill

2017-06-13 23:26:47

by Benjamin LaHaise

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Tue, Jun 13, 2017 at 07:17:43PM +0300, Kirill Tkhai wrote:
> On 13.06.2017 18:26, Benjamin LaHaise wrote:
> > On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
> > ...
> >> The functionality I implemented grew from real need and experience.
> >> We try to avoid kernel modifications where possible, but in-flight
> >> aio requests are not a case where that works.
> >
> > What you've done only works for *your* use-case, but not in general. Like
> > in other subsystems, you need to provide hooks on a per file descriptor
> > basis for quiescing different kinds of file descriptors.
>
> Which hooks do you suggest? It's possible that no file descriptor is
> open any more after a request is submitted. Do you suggest an
> interface to reopen a struct file?

That's what you have to contend with. AIO keeps the struct file pinned
for the life of the request. This is one of the complex issues you need
to address.

> > Your current
> > patch set completely ignores things like usb gadget. You need to create
> > infrastructure for restarting i/os after your checkpointing occurs, which
> > you haven't put any thought into in this patchset. If you want to discuss
> > how to do that, fine, but the approach in this patchset simply does not
> > work in general. What happens when an aio doesn't complete or takes hours
> > to complete?
>
> The patch uses wait_event_interruptible(), but it's possible to convert
> it to wait_event_interruptible_hrtimeout(), like read_events() does.
> It's not a fatal flaw of the patch. read_events() simply waits with a
> timeout; can't we do the same?

Nope. An aio may not complete in a timely fashion, in which case your
wait for all aios to complete will simply wait forever. I take it this is
not the desired behaviour of checkpointing.

> Could you please describe how cancelling aio requests will help to
> wait for their completion? Is there a guarantee that they can easily
> be queued back? I suppose not, because there may be a memory limit or
> some low-level driver limitations, depending on internal conditions.

This is the problem you're facing. Quiescing aios is complicated!

> Also, it does not seem good to overload aio with functionality for
> obtaining the closed file descriptors of submitted requests.
>
> Do you mean it this way, or did I misunderstand you? Could you please
> make your idea concrete?

Yes, this is what I'm talking about.

> In my view, cancelling requests does not allow implementing the need
> I described. If we can't queue a request back, it breaks snapshotting
> and the user application in general.

This is what you have to figure out. This is why your patch is incomplete
and cannot be accepted. You can't punt the complexity your feature
requires onto other maintainers -- your implementation has to be reasonably
complete at the time of patch inclusion. Can you see now why your patch
can't be included as-is? The assumption you made that aios complete in a
timely fashion is incorrect. Everything else stems from that.

-ben

> Thanks,
> Kirill
>

--
"Thought is the essence of where you are now."

2017-06-14 09:12:29

by Kirill Tkhai

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On 14.06.2017 02:26, Benjamin LaHaise wrote:
> On Tue, Jun 13, 2017 at 07:17:43PM +0300, Kirill Tkhai wrote:
>> On 13.06.2017 18:26, Benjamin LaHaise wrote:
>>> On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
>>> ...
>>>> The functionality I implemented grew from real need and experience.
>>>> We try to avoid kernel modifications where possible, but in-flight
>>>> aio requests are not a case where that works.
>>>
>>> What you've done only works for *your* use-case, but not in general. Like
>>> in other subsystems, you need to provide hooks on a per file descriptor
>>> basis for quiescing different kinds of file descriptors.
>>
>> Which hooks do you suggest? It's possible that no file descriptor is
>> open any more after a request is submitted. Do you suggest an
>> interface to reopen a struct file?
>
> That's what you have to contend with. AIO keeps the struct file pinned
> for the life of the request. This is one of the complex issues you need
> to address.
>
>>> Your current
>>> patch set completely ignores things like usb gadget. You need to create
>>> infrastructure for restarting i/os after your checkpointing occurs, which
>>> you haven't put any thought into in this patchset. If you want to discuss
>>> how to do that, fine, but the approach in this patchset simply does not
>>> work in general. What happens when an aio doesn't complete or takes hours
>>> to complete?
>>
>> The patch uses wait_event_interruptible(), but it's possible to convert
>> it to wait_event_interruptible_hrtimeout(), like read_events() does.
>> It's not a fatal flaw of the patch. read_events() simply waits with a
>> timeout; can't we do the same?
>
> Nope. An aio may not complete in a timely fashion, in which case your
> wait for all aios to complete will simply wait forever. I take it this is
> not the desired behaviour of checkpointing.

But read_events() does it. What is the difference between them?

>> Could you please describe how cancelling aio requests will help to
>> wait for their completion? Is there a guarantee that they can easily
>> be queued back? I suppose not, because there may be a memory limit or
>> some low-level driver limitations, depending on internal conditions.
>
> This is the problem you're facing. Quiescing aios is complicated!

Those are very generic words that may refer to anything. And it's a
problem that is not connected to the small aio engine, because it's
impossible to guarantee the availability of kernel resources from
there. Imagine a parallel task eats the memory of your just-cancelled
request: then you simply can't queue the request back.

This argument makes your suggestion not rock solid for all cases:
it will break user applications in such situations.

>> Also, it does not seem good to overload aio with functionality for
>> obtaining the closed file descriptors of submitted requests.
>>
>> Do you mean it this way, or did I misunderstand you? Could you please
>> make your idea concrete?
>
> Yes, this is what I'm talking about.
>
>> In my view, cancelling requests does not allow implementing the need
>> I described. If we can't queue a request back, it breaks snapshotting
>> and the user application in general.
>
> This is what you have to figure out. This is why your patch is incomplete
> and cannot be accepted. You can't punt the complexity your feature
> requires onto other maintainers -- your implementation has to be reasonably
> complete at the time of patch inclusion. Can you see now why your patch
> can't be included as-is? The assumption you made that aios complete in a
> timely fashion is incorrect. Everything else stems from that.

I just can't see why read_events() simply waits for request completion
without regard to all of the above. Could you please clarify the
difference between these two situations?

2017-06-14 14:47:59

by Benjamin LaHaise

Subject: Re: [PATCH] aio: Add command to wait completion of all requests

On Wed, Jun 14, 2017 at 12:11:38PM +0300, Kirill Tkhai wrote:
> On 14.06.2017 02:26, Benjamin LaHaise wrote:
...
> > Nope. An aio may not complete in a timely fashion, in which case your
> > wait for all aios to complete will simply wait forever. I take it this is
> > not the desired behaviour of checkpointing.
>
> But read_events() does it. What is the difference between them?

read_events() does not wait for all aios to complete. The model here is
event-driven programming: the program waits for an event, wakes up, does
some work, and goes back to sleep until another event comes in. Some of
the aios in flight won't complete for significant amounts of time, but
that doesn't matter, since those which do complete will be processed and
cause state machines to make progress.
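
In code, the model looks roughly like this (a sketch in the style of
libaio's io_getevents(), where ctx is the application's existing aio
context and handle_completion() is a hypothetical callback):

    struct io_event ev[16];

    for (;;) {
        /* Block until at least one in-flight aio completes;
         * never wait for *all* of them. */
        int n = io_getevents(ctx, 1, 16, ev, NULL);

        for (int i = 0; i < n; i++)
            handle_completion(&ev[i]); /* advance a state machine */
    }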

> >> Could you please describe how cancelling aio requests will help to
> >> wait for their completion? Is there a guarantee that they can easily
> >> be queued back? I suppose not, because there may be a memory limit or
> >> some low-level driver limitations, depending on internal conditions.
> >
> > This is the problem you're facing. Quiescing aios is complicated!
>
> Those are very generic words that may refer to anything. And it's a
> problem that is not connected to the small aio engine, because it's
> impossible to guarantee the availability of kernel resources from
> there. Imagine a parallel task eats the memory of your just-cancelled
> request: then you simply can't queue the request back.
>
> This argument makes your suggestion not rock solid for all cases:
> it will break user applications in such situations.

Again, you're failing to understand what the programming model is for aio.

> >> Also, it does not seem good to overload aio with functionality for
> >> obtaining the closed file descriptors of submitted requests.
> >>
> >> Do you mean it this way, or did I misunderstand you? Could you please
> >> make your idea concrete?
> >
> > Yes, this is what I'm talking about.
> >
> >> In my view, cancelling requests does not allow implementing the need
> >> I described. If we can't queue a request back, it breaks snapshotting
> >> and the user application in general.
> >
> > This is what you have to figure out. This is why your patch is incomplete
> > and cannot be accepted. You can't punt the complexity your feature
> > requires onto other maintainers -- your implementation has to be reasonably
> > complete at the time of patch inclusion. Can you see now why your patch
> > can't be included as-is? The assumption you made that aios complete in a
> > timely fashion is incorrect. Everything else stems from that.
>
> I just can't see why read_events() simply waits for request completion
> without regard to all of the above. Could you please clarify the
> difference between these two situations?

Read my above comments about event-driven programming. Waiting for all
aios to complete is very different from waiting for a non-specific
completion event to come along.

-ben
--
"Thought is the essence of where you are now."