2022-11-04 21:53:27

by Seth Jenkins

[permalink] [raw]
Subject: [PATCH] aio: fix mremap after fork null-deref

Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
a null-deref if mremap is called on an old aio mapping after fork as
mm->ioctx_table will be set to NULL.

Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
Cc: [email protected]
Signed-off-by: Seth Jenkins <[email protected]>
---
fs/aio.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5b2ff20ad322..74eae7de7323 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
- for (i = 0; i < table->nr; i++) {
- struct kioctx *ctx;
-
- ctx = rcu_dereference(table->table[i]);
- if (ctx && ctx->aio_ring_file == file) {
- if (!atomic_read(&ctx->dead)) {
- ctx->user_id = ctx->mmap_base = vma->vm_start;
- res = 0;
+ if (table) {
+ for (i = 0; i < table->nr; i++) {
+ struct kioctx *ctx;
+
+ ctx = rcu_dereference(table->table[i]);
+ if (ctx && ctx->aio_ring_file == file) {
+ if (!atomic_read(&ctx->dead)) {
+ ctx->user_id = ctx->mmap_base = vma->vm_start;
+ res = 0;
+ }
+ break;
}
- break;
}
}

--
2.38.1.431.g37b22c650d-goog



2023-01-10 22:55:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

On Fri, 4 Nov 2022 17:25:19 -0400 Seth Jenkins <[email protected]> wrote:

> Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
> a null-deref if mremap is called on an old aio mapping after fork as
> mm->ioctx_table will be set to NULL.
>

Is this a theoretical thing, or has this oops actually been observed?

2023-01-11 19:50:00

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

Hi, Seth,

Seth Jenkins <[email protected]> writes:

> Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
> a null-deref if mremap is called on an old aio mapping after fork as
> mm->ioctx_table will be set to NULL.
>
> Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
> Cc: [email protected]
> Signed-off-by: Seth Jenkins <[email protected]>
> ---
> fs/aio.c | 20 +++++++++++---------
> 1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 5b2ff20ad322..74eae7de7323 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
> spin_lock(&mm->ioctx_lock);
> rcu_read_lock();
> table = rcu_dereference(mm->ioctx_table);
> - for (i = 0; i < table->nr; i++) {
> - struct kioctx *ctx;
> -
> - ctx = rcu_dereference(table->table[i]);
> - if (ctx && ctx->aio_ring_file == file) {
> - if (!atomic_read(&ctx->dead)) {
> - ctx->user_id = ctx->mmap_base = vma->vm_start;
> - res = 0;
> + if (table) {
> + for (i = 0; i < table->nr; i++) {
> + struct kioctx *ctx;
> +
> + ctx = rcu_dereference(table->table[i]);
> + if (ctx && ctx->aio_ring_file == file) {
> + if (!atomic_read(&ctx->dead)) {
> + ctx->user_id = ctx->mmap_base = vma->vm_start;
> + res = 0;
> + }
> + break;
> }
> - break;
> }
> }

I wonder if it would be better to not copy the ring mapping on fork.
Something like the below? I think that would be more in line with
expectations (the ring isn't available in the child process).

-Jeff

diff --git a/fs/aio.c b/fs/aio.c
index 562916d85cba..dbf3b0749cb4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -390,7 +390,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
{
- vma->vm_flags |= VM_DONTEXPAND;
+ vma->vm_flags |= VM_DONTEXPAND|VM_DONTCOPY;
vma->vm_ops = &aio_ring_vm_ops;
return 0;
}

2023-01-11 20:47:59

by Seth Jenkins

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

> I wonder if it would be better to not copy the ring mapping on fork.
> Something like the below? I think that would be more in line with
> expectations (the ring isn't available in the child process).

I like this idea a lot, but would this be viewed as subtly changing
the userland to kernel interface?

If we're okay with this change though, I think it makes sense.


On Wed, Jan 11, 2023 at 2:37 PM Jeff Moyer <[email protected]> wrote:
>
> Hi, Seth,
>
> Seth Jenkins <[email protected]> writes:
>
> > Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
> > a null-deref if mremap is called on an old aio mapping after fork as
> > mm->ioctx_table will be set to NULL.
> >
> > Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
> > Cc: [email protected]
> > Signed-off-by: Seth Jenkins <[email protected]>
> > ---
> > fs/aio.c | 20 +++++++++++---------
> > 1 file changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 5b2ff20ad322..74eae7de7323 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
> > spin_lock(&mm->ioctx_lock);
> > rcu_read_lock();
> > table = rcu_dereference(mm->ioctx_table);
> > - for (i = 0; i < table->nr; i++) {
> > - struct kioctx *ctx;
> > -
> > - ctx = rcu_dereference(table->table[i]);
> > - if (ctx && ctx->aio_ring_file == file) {
> > - if (!atomic_read(&ctx->dead)) {
> > - ctx->user_id = ctx->mmap_base = vma->vm_start;
> > - res = 0;
> > + if (table) {
> > + for (i = 0; i < table->nr; i++) {
> > + struct kioctx *ctx;
> > +
> > + ctx = rcu_dereference(table->table[i]);
> > + if (ctx && ctx->aio_ring_file == file) {
> > + if (!atomic_read(&ctx->dead)) {
> > + ctx->user_id = ctx->mmap_base = vma->vm_start;
> > + res = 0;
> > + }
> > + break;
> > }
> > - break;
> > }
> > }
>
> I wonder if it would be better to not copy the ring mapping on fork.
> Something like the below? I think that would be more in line with
> expectations (the ring isn't available in the child process).
>
> -Jeff
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 562916d85cba..dbf3b0749cb4 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -390,7 +390,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {
>
> static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> {
> - vma->vm_flags |= VM_DONTEXPAND;
> + vma->vm_flags |= VM_DONTEXPAND|VM_DONTCOPY;
> vma->vm_ops = &aio_ring_vm_ops;
> return 0;
> }
>

2023-01-12 21:50:12

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

Hi, Seth,

Seth Jenkins <[email protected]> writes:

>> I wonder if it would be better to not copy the ring mapping on fork.
>> Something like the below? I think that would be more in line with
>> expectations (the ring isn't available in the child process).
>
> I like this idea a lot, but would this be viewed as subtly changing
> the userland to kernel interface?

Yes, but...

The way things stand today, if you setup an io context in a process and
then fork a child, the child will be unable to use the aio system calls
to submit and reap I/Os. This is because the ioctx_table was cleared
during fork, which means that lookup_ioctx() will not find the io
context. However, the child still has access to the ring through the
memory mapping. As a result, the child can reap I/O completions
directly from the ring. That wasn't always the case. The aio ring used
to be a MAP_PRIVATE mapping. Commit 36bc08cc0170 ("fs/aio: Add support
to aio ring pages migration") changed it from a private to a shared
mapping, and I'm not sure why. (That patch was included in v3.12, so
it's been this way for quite some time.)

With the patch I proposed (flagging the ring buffer with VM_DONTCOPY),
the child process would still be unable to submit and reap I/Os via the
aio system calls. What changes is that the child process would now be
unable to reap completions via the shared ring buffer. In fact, because
the ring is no longer mapped in the child process, any attempt to access
that memory would result in a segmentation fault. However, I would be
very surprised if the interface was being used in this way.

> If we're okay with this change though, I think it makes sense.

My preference is to make the interface consistent. I think setting
VM_DONTCOPY on the mapping is the right way forward. I'd welcome other
opinions on whether the potential risk is worth it.

Cheers,
Jeff

>
>
> On Wed, Jan 11, 2023 at 2:37 PM Jeff Moyer <[email protected]> wrote:
>>
>> Hi, Seth,
>>
>> Seth Jenkins <[email protected]> writes:
>>
>> > Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
>> > a null-deref if mremap is called on an old aio mapping after fork as
>> > mm->ioctx_table will be set to NULL.
>> >
>> > Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
>> > Cc: [email protected]
>> > Signed-off-by: Seth Jenkins <[email protected]>
>> > ---
>> > fs/aio.c | 20 +++++++++++---------
>> > 1 file changed, 11 insertions(+), 9 deletions(-)
>> >
>> > diff --git a/fs/aio.c b/fs/aio.c
>> > index 5b2ff20ad322..74eae7de7323 100644
>> > --- a/fs/aio.c
>> > +++ b/fs/aio.c
>> > @@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
>> > spin_lock(&mm->ioctx_lock);
>> > rcu_read_lock();
>> > table = rcu_dereference(mm->ioctx_table);
>> > - for (i = 0; i < table->nr; i++) {
>> > - struct kioctx *ctx;
>> > -
>> > - ctx = rcu_dereference(table->table[i]);
>> > - if (ctx && ctx->aio_ring_file == file) {
>> > - if (!atomic_read(&ctx->dead)) {
>> > - ctx->user_id = ctx->mmap_base = vma->vm_start;
>> > - res = 0;
>> > + if (table) {
>> > + for (i = 0; i < table->nr; i++) {
>> > + struct kioctx *ctx;
>> > +
>> > + ctx = rcu_dereference(table->table[i]);
>> > + if (ctx && ctx->aio_ring_file == file) {
>> > + if (!atomic_read(&ctx->dead)) {
>> > + ctx->user_id = ctx->mmap_base = vma->vm_start;
>> > + res = 0;
>> > + }
>> > + break;
>> > }
>> > - break;
>> > }
>> > }
>>
>> I wonder if it would be better to not copy the ring mapping on fork.
>> Something like the below? I think that would be more in line with
>> expectations (the ring isn't available in the child process).
>>
>> -Jeff
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 562916d85cba..dbf3b0749cb4 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -390,7 +390,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {
>>
>> static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
>> {
>> - vma->vm_flags |= VM_DONTEXPAND;
>> + vma->vm_flags |= VM_DONTEXPAND|VM_DONTCOPY;
>> vma->vm_ops = &aio_ring_vm_ops;
>> return 0;
>> }
>>

2023-01-12 23:17:39

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

On Thu, Jan 12, 2023 at 04:32:42PM -0500, Jeff Moyer wrote:
> The way things stand today, if you setup an io context in a process and
> then fork a child, the child will be unable to use the aio system calls
> to submit and reap I/Os. This is because the ioctx_table was cleared
> during fork, which means that lookup_ioctx() will not find the io
> context. However, the child still has access to the ring through the
> memory mapping. As a result, the child can reap I/O completions
> directly from the ring. That wasn't always the case. The aio ring used
> to be a MAP_PRIVATE mapping. Commit 36bc08cc0170 ("fs/aio: Add support
> to aio ring pages migration") changed it from a private to a shared
> mapping, and I'm not sure why. (That patch was included in v3.12, so
> it's been this way for quite some time.)

It is necessary to make migration work. The pages have to be owned by
the backing file and to map the backing file pages into the process
without COW requires the mapping to be marked shared. If they're COWed,
events can be lost. Migration is rare and ugly.

> With the patch I proposed (flagging the ring buffer with VM_DONTCOPY),
> the child process would still be unable to submit and reap I/Os via the
> aio system calls. What changes is that the child process would now be
> unable to reap completions via the shared ring buffer. In fact, because
> the ring is no longer mapped in the child process, any attempt to access
> that memory would result in a segmentation fault. However, I would be
> very surprised if the interface was being used in this way.
>
> > If we're okay with this change though, I think it makes sense.
>
> My preference is to make the interface consistent. I think setting
> VM_DONTCOPY on the mapping is the right way forward. I'd welcome other
> opinions on whether the potential risk is worth it.

VM_DONTCOPY makes sense, but a SEGV is a pretty bad failure mode. Any
process reaping events in the child after fork() isn't going to be
consistent in behaviour, and is able to see partial completion of an I/O
and other inconsistencies, so they're going to be subtly broken at best.

Unfortunately, we have no way of knowing if this behaviour is exercised
anywhere without changing it and waiting for someone to holler.

-ben

> Cheers,
> Jeff
>
> >
> >
> > On Wed, Jan 11, 2023 at 2:37 PM Jeff Moyer <[email protected]> wrote:
> >>
> >> Hi, Seth,
> >>
> >> Seth Jenkins <[email protected]> writes:
> >>
> >> > Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
> >> > a null-deref if mremap is called on an old aio mapping after fork as
> >> > mm->ioctx_table will be set to NULL.
> >> >
> >> > Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
> >> > Cc: [email protected]
> >> > Signed-off-by: Seth Jenkins <[email protected]>
> >> > ---
> >> > fs/aio.c | 20 +++++++++++---------
> >> > 1 file changed, 11 insertions(+), 9 deletions(-)
> >> >
> >> > diff --git a/fs/aio.c b/fs/aio.c
> >> > index 5b2ff20ad322..74eae7de7323 100644
> >> > --- a/fs/aio.c
> >> > +++ b/fs/aio.c
> >> > @@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
> >> > spin_lock(&mm->ioctx_lock);
> >> > rcu_read_lock();
> >> > table = rcu_dereference(mm->ioctx_table);
> >> > - for (i = 0; i < table->nr; i++) {
> >> > - struct kioctx *ctx;
> >> > -
> >> > - ctx = rcu_dereference(table->table[i]);
> >> > - if (ctx && ctx->aio_ring_file == file) {
> >> > - if (!atomic_read(&ctx->dead)) {
> >> > - ctx->user_id = ctx->mmap_base = vma->vm_start;
> >> > - res = 0;
> >> > + if (table) {
> >> > + for (i = 0; i < table->nr; i++) {
> >> > + struct kioctx *ctx;
> >> > +
> >> > + ctx = rcu_dereference(table->table[i]);
> >> > + if (ctx && ctx->aio_ring_file == file) {
> >> > + if (!atomic_read(&ctx->dead)) {
> >> > + ctx->user_id = ctx->mmap_base = vma->vm_start;
> >> > + res = 0;
> >> > + }
> >> > + break;
> >> > }
> >> > - break;
> >> > }
> >> > }
> >>
> >> I wonder if it would be better to not copy the ring mapping on fork.
> >> Something like the below? I think that would be more in line with
> >> expectations (the ring isn't available in the child process).
> >>
> >> -Jeff
> >>
> >> diff --git a/fs/aio.c b/fs/aio.c
> >> index 562916d85cba..dbf3b0749cb4 100644
> >> --- a/fs/aio.c
> >> +++ b/fs/aio.c
> >> @@ -390,7 +390,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {
> >>
> >> static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> >> {
> >> - vma->vm_flags |= VM_DONTEXPAND;
> >> + vma->vm_flags |= VM_DONTEXPAND|VM_DONTCOPY;
> >> vma->vm_ops = &aio_ring_vm_ops;
> >> return 0;
> >> }
> >>
>
>

--
"Thought is the essence of where you are now."

2023-01-13 16:38:17

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

Seth Jenkins <[email protected]> writes:

> Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
> a null-deref if mremap is called on an old aio mapping after fork as
> mm->ioctx_table will be set to NULL.
>
> Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
> Cc: [email protected]
> Signed-off-by: Seth Jenkins <[email protected]>
> ---
> fs/aio.c | 20 +++++++++++---------
> 1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 5b2ff20ad322..74eae7de7323 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -361,16 +361,18 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
> spin_lock(&mm->ioctx_lock);
> rcu_read_lock();
> table = rcu_dereference(mm->ioctx_table);
> - for (i = 0; i < table->nr; i++) {
> - struct kioctx *ctx;
> -
> - ctx = rcu_dereference(table->table[i]);
> - if (ctx && ctx->aio_ring_file == file) {
> - if (!atomic_read(&ctx->dead)) {
> - ctx->user_id = ctx->mmap_base = vma->vm_start;
> - res = 0;
> + if (table) {
> + for (i = 0; i < table->nr; i++) {
> + struct kioctx *ctx;
> +
> + ctx = rcu_dereference(table->table[i]);
> + if (ctx && ctx->aio_ring_file == file) {
> + if (!atomic_read(&ctx->dead)) {
> + ctx->user_id = ctx->mmap_base = vma->vm_start;

This is now over 80 columns. I'd prefer it if you would just invert the
NULL pointer check and add a label:

if (!table)
goto out_unlock;

Other than that, it looks good to me.

Cheers,
Jeff

2023-01-13 17:00:04

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] aio: fix mremap after fork null-deref

Hi, Ben,

Thanks for taking the time to chime in.

Benjamin LaHaise <[email protected]> writes:

> On Thu, Jan 12, 2023 at 04:32:42PM -0500, Jeff Moyer wrote:
>> With the patch I proposed (flagging the ring buffer with VM_DONTCOPY),
>> the child process would still be unable to submit and reap I/Os via the
>> aio system calls. What changes is that the child process would now be
>> unable to reap completions via the shared ring buffer. In fact, because
>> the ring is no longer mapped in the child process, any attempt to access
>> that memory would result in a segmentation fault. However, I would be
>> very surprised if the interface was being used in this way.
>>
>> > If we're okay with this change though, I think it makes sense.
>>
>> My preference is to make the interface consistent. I think setting
>> VM_DONTCOPY on the mapping is the right way forward. I'd welcome other
>> opinions on whether the potential risk is worth it.
>
> VM_DONTCOPY makes sense, but a SEGV is a pretty bad failure mode. Any
> process reaping events in the child after fork() isn't going to be
> consistent in behaviour, and is able to see partial completion of an I/O
> and other inconsistencies, so they're going to be subtly broken at best.
>
> Unfortunately, we have no way of knowing if this behaviour is exercised
> anywhere without changing it and waiting for someone to holler.

OK, so it sounds like you would rather err on the side of caution.
That's fine with me.

Seth, I'll go back and review your patch.

Thanks!
Jeff