LinuxLists.cc - [RFC PATCH 0/1] mm/mremap: add MREMAP

2017-07-06 16:17:53

Subject: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag

2017-07-06 16:19:27

Subject: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

The mremap system call has the ability to 'mirror' parts of an existing
mapping. To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address. This
functionality has existed since at least the 2.6 kernel.

This patch simply adds a new flag to mremap which will make this
functionality part of the API. It maintains backward compatibility with
the existing way of requesting mirroring (old_size == 0).

If this new MREMAP_MIRROR flag is specified, then new_size must equal
old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

Signed-off-by: Mike Kravetz <[email protected]>
---
include/uapi/linux/mman.h | 5 +++--
mm/mremap.c | 23 ++++++++++++++++-------
tools/include/uapi/linux/mman.h | 5 +++--
3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..6b3e0df 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@

#include <asm/mman.h>

-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED 2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED 0x02
+#define MREMAP_MIRROR 0x04

#define OVERCOMMIT_GUESS 0
#define OVERCOMMIT_ALWAYS 1
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..f18ab36 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
 	LIST_HEAD(uf_unmap);

-	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
 		return ret;

-	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
+	if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
+	    !(flags & MREMAP_MAYMOVE))
 		return ret;

 	if (offset_in_page(addr))
@@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	old_len = PAGE_ALIGN(old_len);
 	new_len = PAGE_ALIGN(new_len);

-	/*
-	 * We allow a zero old-len as a special case
-	 * for DOS-emu "duplicate shm area" thing. But
-	 * a zero new-len is nonsensical.
-	 */
+	/* A zero new-len is nonsensical. */
 	if (!new_len)
 		return ret;

+	/*
+	 * For backward compatibility, we allow a zero old-len to imply
+	 * mirroring. This was originally a special case for DOS-emu.
+	 */
+	if (!old_len)
+		flags |= MREMAP_MIRROR;
+	else if (flags & MREMAP_MIRROR) {
+		if (old_len != new_len)
+			return ret;
+		old_len = 0;
+	}
+
 	if (down_write_killable(&current->mm->mmap_sem))
 		return -EINTR;

diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index 81d8edf..069f7a5 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@

#include <uapi/asm/mman.h>

-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED 2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED 0x02
+#define MREMAP_MIRROR 0x04

#define OVERCOMMIT_GUESS 0
#define OVERCOMMIT_ALWAYS 1
--
2.7.5
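
The argument checks in the mm/mremap.c hunks above can be modelled in userspace. The following is only a sketch of that logic, not kernel code: the RFC_ prefixed names, the fixed 4096-byte page size, and the check_mremap_args() helper are assumptions made for illustration, and -EINVAL stands in for the initial value of ret in the syscall.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as defined by the RFC patch. The RFC_ prefix is hypothetical
 * and only avoids clashing with the real <sys/mman.h> macros. */
enum {
	RFC_MREMAP_MAYMOVE = 0x01,
	RFC_MREMAP_FIXED   = 0x02,
	RFC_MREMAP_MIRROR  = 0x04,	/* proposed by this RFC */
};

/* Assumption: 4 KiB pages, standing in for the kernel's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

static unsigned long sketch_page_align(unsigned long len)
{
	return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Mirrors the order of checks in the patched syscall entry.
 * Returns 0 and updates *flags, *old_len and *new_len, or -EINVAL.
 */
static int check_mremap_args(unsigned long *flags, unsigned long *old_len,
			     unsigned long *new_len)
{
	const unsigned long known = RFC_MREMAP_FIXED | RFC_MREMAP_MAYMOVE |
				    RFC_MREMAP_MIRROR;

	assert(flags && old_len && new_len);

	/* Unknown flag bits are rejected. */
	if (*flags & ~known)
		return -EINVAL;

	/* Both FIXED and MIRROR require MAYMOVE. */
	if ((*flags & (RFC_MREMAP_FIXED | RFC_MREMAP_MIRROR)) &&
	    !(*flags & RFC_MREMAP_MAYMOVE))
		return -EINVAL;

	*old_len = sketch_page_align(*old_len);
	*new_len = sketch_page_align(*new_len);

	/* A zero new-len is nonsensical. */
	if (!*new_len)
		return -EINVAL;

	if (!*old_len) {
		/* Legacy interface: zero old-len implies mirroring. */
		*flags |= RFC_MREMAP_MIRROR;
	} else if (*flags & RFC_MREMAP_MIRROR) {
		/* New interface: sizes must match, then reuse the legacy path. */
		if (*old_len != *new_len)
			return -EINVAL;
		*old_len = 0;
	}
	return 0;
}
```

Note that, as in the patch, a legacy call with old_len == 0 gains the MIRROR bit only after the MAYMOVE check has already run, so the existing behaviour of that path is unchanged.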

2017-07-07 08:20:04

by Anshuman Khandual


Subject: Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag

On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel [1]. A comment
> was added to the code to help preserve this feature.

Is this the comment? If yes, then it's not very clear.

	/*
	 * We allow a zero old-len as a special case
	 * for DOS-emu "duplicate shm area" thing. But
	 * a zero new-len is nonsensical.
	 */

>
> The Oracle JVM team has discovered this feature and used it while
> prototyping a new garbage collection model. This new model shows promise,
> and they are considering its use in a future release. However, since
> the only mention of this functionality is a single comment in the kernel,
> they are concerned about its future.
>
> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
> this functionality. The flag simply provides the same functionality as
> the existing undocumented 'old_size == 0' interface. As an alternative,
> we could simply document the 'old_size == 0' interface in the man page.
> In either case, man page modifications would be needed.

Right. Adding MREMAP_MIRROR sounds cleaner from application programming
point of view. But it extends the interface.

>
> Future Direction
>
> After more formally adding this to the API (either new flag or documenting
> existing interface), the mremap code could be enhanced to optimize this
> case. Currently, 'mirroring' only sets up the new mapping. It does not
> create page table entries for new mapping. This could be added as an
> enhancement.

Then how does it achieve mirroring? Both pointers should see the same
data, and that can only happen if the page table entries point to the
same pages, right?

2017-07-07 08:47:06

by Anshuman Khandual


Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-07 10:23:28

by Kirill A. Shutemov


Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

The patch breaks an important invariant: an anon page can be mapped into
a process only once.

What is going to happen to the mirrored mapping after CoW, for instance?

In my opinion, it shouldn't be allowed for anon/private mappings at least.
And with this limitation, I don't see much sense in the new interface --
just create mirror by mmap()ing the file again.

--
Kirill A. Shutemov
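
The alternative suggested above, mapping the same file a second time, could look roughly like the sketch below. It assumes a temporary file from tmpfile() as the backing object; the helper name mirror_by_second_mmap() is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if data written through the first mapping is visible through
 * the second, 0 if it is not, and -1 if a library or system call failed.
 */
static int mirror_by_second_mmap(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	FILE *f;
	char *a, *b;
	int fd, same = -1;

	assert(page > 0);
	len = (size_t)page;

	f = tmpfile();			/* anonymous temporary file */
	if (!f)
		return -1;
	fd = fileno(f);
	if (ftruncate(fd, (off_t)len) != 0)
		goto out;

	/* Two independent MAP_SHARED mappings of the same file range. */
	a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED)
		goto out;
	b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (b == MAP_FAILED) {
		munmap(a, len);
		goto out;
	}

	strcpy(a, "hello");
	same = (a != b) && strcmp(b, "hello") == 0;

	munmap(b, len);
	munmap(a, len);
out:
	fclose(f);
	return same;
}
```

This only works when a file descriptor for the backing object is still available, which is the limitation the rest of the thread goes on to discuss.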

2017-07-07 11:03:49

by Anshuman Khandual


On Thu, Jul 13, 2017 at 11:11:37AM -0700, Mike Kravetz wrote:
> Here is my understanding of how things work for old_len == 0 of anon
> mappings:
> - shared mappings
> - New vma is created at new virtual address
> - vma refers to the same underlying object/pages as old vma
> - after mremap, no page tables exist for new vma, they are
> created as pages are accessed/faulted
> - page at new_address is same as page at old_address

Yes, and this isn't backed by anon memory, it's backed by
shmem. "Shared anon mapping" is really synonymous with shmem; the fact
that it's not an mmap of a tmpfs file is purely an API detail.

> - private mappings
> - New vma is created at new virtual address
> - vma does not refer to same pages as old vma. It is a 'new'
> private anon mapping.
> - after mremap, no page tables exist for new vma. access to
> the range of the new vma will result in faults that allocate
> a new page.
> - page at new_address is different than page at old_address

Yes, for an anon private mapping (so backed by real anonymous memory)
no payload in the old vma could possibly go in the new vma.

> So, the result of mremap(old_len == 0) on a private mapping is that it
> simply creates a new private mapping. IMO, this is contrary to the purpose
> of mremap. mremap should return a mapping that is somehow related to
> the original mapping.

I agree there's no point to ever use the mremap(old_len == 0)
undocumented trick, to create a new anon private mmap, when you could
use mmap instead and the result would be the same.

So it's plausible that nobody uses it for that.

> Perhaps you are thinking about mremap of a private file mapping? I was
> not considering that case. I believe you are right. In this case a
> private COW mapping based on the original mapping would be created. So,
> this seems more in line with the intent of mremap. The new mapping is
> still related to the old mapping.

Yes my earlier example was all about filebacked private mappings, to
point out those also have a deterministic behavior with the old_len ==
0 trick and it could be still used because the IPC_RMID was executed
early on.

The point is that you could always use a plain new mmap instead of the
old_len == 0 trick, but that applies to shared mappings as well.

My argument is that if you keep it and document it for shared anon
mappings, I don't see anything fundamentally wrong with keeping it for
private filebacked mappings too, as the shmat ID may have been deleted
for those too.

> With this in mind, what about returning EINVAL only for the anon private
> mapping case?

The only case where there's no excuse to use mremap(old_len == 0) as a
replacement for a new mmap is the private anon mapping case. So while
returning -EINVAL may still break something (as opposed to a deprecation
warning), I guess the likelihood that somebody is using it is very low.

> However, if you have a fd (for a file mapping) then I can not see why
> someone would be using the old_len == 0 trick. It would be more
> straightforward to simply use mmap to create the additional mapping.

That applies to MAP_SHARED too, and that's why deprecating the whole
undocumented old_len == 0 trick sounded, and still sounds, attractive to
me. But doing it right away, without a deprecation warning cycle, sounds
too risky.

> > So an alternative would be to start by adding a WARN_ON_ONCE deprecation
> > warning instead of -EINVAL right away.
> >
> > The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side
> > effect of using the old_len == 0 trick looks like a bug, I guess it
> > should get fixed if we intend to keep old_len and document it for the
> > long term.
>
> Others seem to think we should keep old_len == 0 and document.

The only case where it makes sense is after IPC_RMID, but with
memfd_create there's no point anymore to use IPC_RMID.

tmpfs/hugetlbfs/realfs files can be unlinked while the fd is still
open so again no need of the mremap(old_len == 0) trick.

Which is why I'd find it attractive to deprecate it if we could, but I
assume we can't drop it even if undocumented, which is why I felt a
deprecation warning would be suitable in this case (similar to
deprecation warning of sysfs and then dropped via config option). I am
assuming here that nobody is using it because it's undocumented and it
has a bug in the VM_ACCOUNT code too. Without a deprecation warning
it'd be hard to tell if the assumption is correct.

> I assume you are concerned about the do_munmap call in move_vma? That

Yes exactly.

> does indeed look to be of concern. This happens AFTER setting up the
> new mapping. So, I'm thinking we should tear down the new mapping in
> the case do_munmap of the old mapping fails? That 'should' simply
> be a matter of:
> - moving page tables back to original mapping
> - remove/delete new vma

Yes.

> - I don't think we need to 'unmap' the new vma as there should be no
> associated pages.

The new vma doesn't require memory allocations to drop as it was just
created by copy_vma so there's no risk of further failures in the
unwind.

After the unwind it'll return -ENOMEM to userland (which we don't
right now).

> I'll look into doing this as well.

It's mostly theoretical, the chances of an allocation failure
triggering exactly in that split_vma are basically zero, but I think
it'd be more correct and safer.

> Just curious, do those userfaultfd callouts still work as desired in the
> case of map duplication (old_len == 0)?

old_len == 0 is fine with userfaultfd because, len == 0 returns
-EINVAL in do_munmap before userfaultfd_unmap_prep is called.

Still looking at the VM_ACCOUNT adjustments around do_munmap:

mremap:

	/* Conceal VM_ACCOUNT so old reservation is not undone */
	if (vm_flags & VM_ACCOUNT) {

do_munmap:

	if (uf) {
		int error = userfaultfd_unmap_prep(vma, start, end, uf);

		if (error)
			return error;
	}

	/*
	 * If we need to split any vma, do it now to save pain later.
	 *
	 * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
	 * unmapped vm_area_struct will remain in use: so lower split_vma
	 * places tmp vma above, and higher split_vma places tmp vma below.
	 */

I can't find where mremap actually relies on this assumption, i.e. that
on do_munmap failure the partially unmapped vma remains in use. In
fact it's not partially unmapped at all, it's only split at the
"start" address of the do_munmap but not unmapped.

mremap caller simply sets excess = 0 and assumes it's all still mapped
at the original vma as expected regardless of the order of the
__split_vma executed in do_munmap.

The whole VM_ACCOUNT logic in this place exists since the start of the
git history so I can't see the change originating the above comment,
but I assume the comment is wrong or simply confusing.

I don't see a problem in userfaultfd_unmap_prep failing with -ENOMEM
in relation to the VM_ACCOUNT logic above, before split_vma is called
(callee doesn't seem to make assumption).

However, unrelated to mremap old_len == 0 and purely internal to
do_munmap (and theoretical): if either of the two __split_vma calls
fails, there's no need to send an unmap event, and in fact it'd be wrong
to, so userfaultfd_unmap_prep should be moved after both split_vma calls
have succeeded.

Thanks,
Andrea