2014-12-14 20:23:46

by Benjamin LaHaise

[permalink] [raw]
Subject: [GIT PULL] aio: changes for 3.19

Hello Linus & everyone,

The following changes since commit b2776bf7149bddd1f4161f14f79520f17fc1d71d:

Linux 3.18 (2014-12-07 14:21:05 -0800)

are available in the git repository at:

git://git.kvack.org/~bcrl/aio-next.git master

for you to fetch changes up to 5f785de588735306ec4d7c875caf9d28481c8b21:

aio: Skip timer for io_getevents if timeout=0 (2014-12-13 17:50:20 -0500)

----------------------------------------------------------------
Fam Zheng (1):
aio: Skip timer for io_getevents if timeout=0

Pavel Emelyanov (1):
aio: Make it possible to remap aio ring

fs/aio.c | 33 +++++++++++++++++++++++++++++++--
include/linux/fs.h | 1 +
mm/mremap.c | 3 ++-
3 files changed, 34 insertions(+), 3 deletions(-)

Regards,

-ben
--
"Thought is the essence of where you are now."


2014-12-14 21:47:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <[email protected]> wrote:
>
> Pavel Emelyanov (1):
> aio: Make it possible to remap aio ring

So quite frankly, I think this should have had more acks from VM
people. The patch looks ok to me, but it took me by surprise, and I
don't see much any discussion about it on linux-mm either..

Linus

2014-12-14 21:52:29

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <[email protected]> wrote:
> >
> > Pavel Emelyanov (1):
> > aio: Make it possible to remap aio ring
>
> So quite frankly, I think this should have had more acks from VM
> people. The patch looks ok to me, but it took me by surprise, and I
> don't see much any discussion about it on linux-mm either..

Sadly, nobody responded. Maybe akpm can chime in on this change (included
below for ease of review and akpm added to the To:)?

-ben
--
"Thought is the essence of where you are now."

commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
Author: Pavel Emelyanov <[email protected]>
Date: Thu Sep 18 19:56:17 2014 +0400

aio: Make it possible to remap aio ring

There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.

What do you think?

Signed-off-by: Pavel Emelyanov <[email protected]>
Acked-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Benjamin LaHaise <[email protected]>

diff --git a/fs/aio.c b/fs/aio.c
index 14b9315..bfab556 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -286,12 +286,37 @@ static void aio_free_ring(struct kioctx *ctx)

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
{
+ vma->vm_flags |= VM_DONTEXPAND;
vma->vm_ops = &generic_file_vm_ops;
return 0;
}

+static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct kioctx_table *table;
+ int i;
+
+ spin_lock(&mm->ioctx_lock);
+ rcu_read_lock();
+ table = rcu_dereference(mm->ioctx_table);
+ for (i = 0; i < table->nr; i++) {
+ struct kioctx *ctx;
+
+ ctx = table->table[i];
+ if (ctx && ctx->aio_ring_file == file) {
+ ctx->user_id = ctx->mmap_base = vma->vm_start;
+ break;
+ }
+ }
+
+ rcu_read_unlock();
+ spin_unlock(&mm->ioctx_lock);
+}
+
static const struct file_operations aio_ring_fops = {
.mmap = aio_ring_mmap,
+ .mremap = aio_ring_remap,
};

#if IS_ENABLED(CONFIG_MIGRATION)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e..85f378c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1497,6 +1497,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*mremap)(struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/mm/mremap.c b/mm/mremap.c
index b147f66..c855922 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -288,7 +288,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
old_len = new_len;
old_addr = new_addr;
new_addr = -ENOMEM;
- }
+ } else if (vma->vm_file && vma->vm_file->f_op->mremap)
+ vma->vm_file->f_op->mremap(vma->vm_file, new_vma);

/* Conceal VM_ACCOUNT so old reservation is not undone */
if (vm_flags & VM_ACCOUNT) {

2014-12-14 22:10:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 1:52 PM, Benjamin LaHaise <[email protected]> wrote:
>
> Sadly, nobody responded.

Well, I did pull, because I found the patch ok, but in general I
really want to get a heads up for changes like this, and not be taken
by surprise by them.

Linus

2014-12-14 22:13:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, 14 Dec 2014 16:52:21 -0500 Benjamin LaHaise <[email protected]> wrote:

> On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> > On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <[email protected]> wrote:
> > >
> > > Pavel Emelyanov (1):
> > > aio: Make it possible to remap aio ring
> >
> > So quite frankly, I think this should have had more acks from VM
> > people. The patch looks ok to me, but it took me by surprise, and I
> > don't see much any discussion about it on linux-mm either..
>
> Sadly, nobody responded. Maybe akpm can chime in on this change (included
> below for ease of review and akpm added to the To:)?
>
> -ben
> --
> "Thought is the essence of where you are now."
>
> commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
> Author: Pavel Emelyanov <[email protected]>
> Date: Thu Sep 18 19:56:17 2014 +0400
>
> aio: Make it possible to remap aio ring

The patch appears to be a bugfix which coincidentally helps CRIU?

If it weren't for the bugfix part, I'd be asking "why not pass the
desired virtual address into io_setup()?".

> There are actually two issues this patch addresses. Let me start with
> the one I tried to solve in the beginning.
>
> So, in the checkpoint-restore project (criu) we try to dump tasks'
> state and restore one back exactly as it was. One of the tasks' state
> bits is rings set up with io_setup() call. There's (almost) no problems
> in dumping them, there's a problem restoring them -- if I dump a task
> with aio ring originally mapped at address A, I want to restore one
> back at exactly the same address A. Unfortunately, the io_setup() does
> not allow for that -- it mmaps the ring at whatever place mm finds
> appropriate (it calls do_mmap_pgoff() with zero address and without
> the MAP_FIXED flag).
>
> To make restore possible I'm going to mremap() the freshly created ring
> into the address A (under which it was seen before dump). The problem is
> that the ring's virtual address is passed back to the user-space as the
> context ID and this ID is then used as search key by all the other io_foo()
> calls. Reworking this ID to be just some integer doesn't seem to work, as
> this value is already used by libaio as a pointer using which this library
> accesses memory for aio meta-data.
>
> So, to make restore work we need to make sure that
>
> a) ring is mapped at desired virtual address
> b) kioctx->user_id matches this value
>
> Having said that, the patch makes mremap() on aio region update the
> kioctx's user_id and mmap_base values.
>
> Here appears the 2nd issue I mentioned in the beginning of this mail.
> If (regardless of the C/R dances I do) someone creates an io context
> with io_setup(), then mremap()-s the ring and then destroys the context,
> the kill_ioctx() routine will call munmap() on wrong (old) address.
> This will result in a) aio ring remaining in memory and b) some other
> vma get unexpectedly unmapped.
>
> What do you think?
>
> ...
>
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -286,12 +286,37 @@ static void aio_free_ring(struct kioctx *ctx)
>
> static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + vma->vm_flags |= VM_DONTEXPAND;

I don't think this was changelogged? A comment here would be best.

> vma->vm_ops = &generic_file_vm_ops;
> return 0;
> }
>
> +static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct kioctx_table *table;
> + int i;
> +
> + spin_lock(&mm->ioctx_lock);
> + rcu_read_lock();
> + table = rcu_dereference(mm->ioctx_table);
> + for (i = 0; i < table->nr; i++) {
> + struct kioctx *ctx;
> +
> + ctx = table->table[i];
> + if (ctx && ctx->aio_ring_file == file) {
> + ctx->user_id = ctx->mmap_base = vma->vm_start;
> + break;
> + }
> + }
> +
> + rcu_read_unlock();
> + spin_unlock(&mm->ioctx_lock);
> +}

Looks simple enough. Possibly so simple that it doesn't need commenting ;)

> static const struct file_operations aio_ring_fops = {
> .mmap = aio_ring_mmap,
> + .mremap = aio_ring_remap,
> };
>
> #if IS_ENABLED(CONFIG_MIGRATION)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..85f378c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1497,6 +1497,7 @@ struct file_operations {
> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
> long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> int (*mmap) (struct file *, struct vm_area_struct *);
> + void (*mremap)(struct file *, struct vm_area_struct *);

Documentation/filesystems/vfs.txt needs updating. Note that it says
"all methods are called without any locks being held, unless otherwise
noted", but ->mremap() is called under mmap_sem (at least) so otherwise
noting will be needed.

It might make sense for ->mremap() to return an errno, but we can
change that later if needed.

> int (*open) (struct inode *, struct file *);
> int (*flush) (struct file *, fl_owner_t id);
> int (*release) (struct inode *, struct file *);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index b147f66..c855922 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -288,7 +288,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> old_len = new_len;
> old_addr = new_addr;
> new_addr = -ENOMEM;
> - }
> + } else if (vma->vm_file && vma->vm_file->f_op->mremap)
> + vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
>

The patch overall is a no-op from an MM perspective and seems OK to me.

2014-12-14 22:39:44

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 02:13:36PM -0800, Andrew Morton wrote:
> The patch appears to be a bugfix which coincidentally helps CRIU?

Yes.

> If it weren't for the bugfix part, I'd be asking "why not pass the
> desired virtual address into io_setup()?".

It's entirely possible someone might have a need to mremap the event ring,
but nobody seems to have tried until now.

> The patch overall is a no-op from an MM perspective and seems OK to me.

How about the documentation/comment updates below?

I'll try to be more aggressive about getting signoffs on these wider changes
in the future.

-ben
--
"Thought is the essence of where you are now."

aio/mm: update documentation for mremap hook in commit e4a0d3e720e7e508749c1439b5ba3aff56c92976

Add a few more comments and documentation to explain the mremap hook introduced
in commit e4a0d3e720e7e508749c1439b5ba3aff56c92976.

Signed-off-by: Benjamin LaHaise <[email protected]>

Documentation/filesystems/vfs.txt | 4 ++++
fs/aio.c | 8 ++++++++
2 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 43ce050..a9f3df5 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -820,6 +820,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*mremap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
@@ -868,6 +869,9 @@ otherwise noted.

mmap: called by the mmap(2) system call

+ mremap: called by the mremap(2) system call. Called holding mmap_sem for
+ write.
+
open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file". It then calls the
open method for the newly allocated file structure. You might
diff --git a/fs/aio.c b/fs/aio.c
index 1b7893e..aba5385 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -286,11 +286,19 @@ static void aio_free_ring(struct kioctx *ctx)

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
{
+ /* Resizing the event ring is not supported; mark it so. */
vma->vm_flags |= VM_DONTEXPAND;
vma->vm_ops = &generic_file_vm_ops;
return 0;
}

+/* aio_ring_remap()
+ * Called when th aio event ring is being relocated within the process'
+ * address space. The primary purpose is to update the saved address of
+ * the aio event ring so that when the ioctx is detroyed, it gets removed
+ * from the correct userspace address. This is typically used when
+ * reloading a process back into memory by checkpoint-restore.
+ */
static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
{
struct mm_struct *mm = vma->vm_mm;

2014-12-14 23:02:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 02:13:36PM -0800, Andrew Morton wrote:
> On Sun, 14 Dec 2014 16:52:21 -0500 Benjamin LaHaise <[email protected]> wrote:
>
> > On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> > > On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <[email protected]> wrote:
> > > >
> > > > Pavel Emelyanov (1):
> > > > aio: Make it possible to remap aio ring
> > >
> > > So quite frankly, I think this should have had more acks from VM
> > > people. The patch looks ok to me, but it took me by surprise, and I
> > > don't see much any discussion about it on linux-mm either..
> >
> > Sadly, nobody responded. Maybe akpm can chime in on this change (included
> > below for ease of review and akpm added to the To:)?
> >
> > -ben
> > --
> > "Thought is the essence of where you are now."
> >
> > commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
> > Author: Pavel Emelyanov <[email protected]>
> > Date: Thu Sep 18 19:56:17 2014 +0400
> >
> > aio: Make it possible to remap aio ring
>
> The patch appears to be a bugfix which coincidentally helps CRIU?
>
> If it weren't for the bugfix part, I'd be asking "why not pass the
> desired virtual address into io_setup()?".

But it seems the problem is bigger than what the patch fixes. To me we are
too permisive on what vma can be remapped.

How can we know that it's okay to move vma around for random driver which
provide .mmap? Or I miss something obvious?

--
Kirill A. Shutemov

2014-12-14 23:08:13

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Mon, Dec 15, 2014 at 01:02:08AM +0200, Kirill A. Shutemov wrote:
> But it seems the problem is bigger than what the patch fixes. To me we are
> too permisive on what vma can be remapped.
>
> How can we know that it's okay to move vma around for random driver which
> provide .mmap? Or I miss something obvious?

Most drivers do not care if a vma is moved within the virtual address space
of a process. The aio ring buffer is special in that it gets unmapped when
userspace does an io_destroy(), and io_destroy() has to know what the address
is moved to in order to perform the unmap. Normal drivers don't perform the
unmap themselves.

-ben
--
"Thought is the essence of where you are now."

2014-12-14 23:11:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, Dec 14, 2014 at 3:02 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> How can we know that it's okay to move vma around for random driver which
> provide .mmap? Or I miss something obvious?

I do think that it would likely be a good idea to require an explicit
flag somewhere before we do "move_vma()". I agree that it's kind of
odd that we just assume everything is safe to move.

That said, drivers or other random mappings that know about the
virtual address should largely be considered buggy and broken anyway.
I'm not convinced it's a good idea for aio either, but it probably has
a better excuse than most.

Linus

2014-12-15 05:53:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [GIT PULL] aio: changes for 3.19

On Sun, 14 Dec 2014 17:39:36 -0500 Benjamin LaHaise <[email protected]> wrote:

> How about the documentation/comment updates below?

lgtm.

> ...
>
> +/* aio_ring_remap()
> + * Called when th aio event ring is being relocated within the process'

"the"

> + * address space. The primary purpose is to update the saved address of
> + * the aio event ring so that when the ioctx is detroyed, it gets removed
> + * from the correct userspace address. This is typically used when
> + * reloading a process back into memory by checkpoint-restore.
> + */
> static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
> {
> struct mm_struct *mm = vma->vm_mm;