The mremap system call has the ability to 'mirror' parts of an existing
mapping. To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address. This
functionality has existed since at least the 2.6 kernel [1]. A comment
was added to the code to help preserve this feature.
The Oracle JVM team has discovered this feature and used it while
prototyping a new garbage collection model. This new model shows promise,
and they are considering its use in a future release. However, since
the only mention of this functionality is a single comment in the kernel,
they are concerned about its future.
I propose the addition of a new MREMAP_MIRROR flag to explicitly request
this functionality. The flag simply provides the same functionality as
the existing undocumented 'old_size == 0' interface. As an alternative,
we could simply document the 'old_size == 0' interface in the man page.
In either case, man page modifications would be needed.
Future Direction
After more formally adding this to the API (either new flag or documenting
existing interface), the mremap code could be enhanced to optimize this
case. Currently, 'mirroring' only sets up the new mapping. It does not
create page table entries for new mapping. This could be added as an
enhancement.
The JVM today has the option of using (static) huge pages. The mremap
system call does not fully support huge page mappings today. You can
use mremap to shrink the size of a huge page mapping, but it can not be
used to expand or mirror a mapping. Such support is fairly straight
forward.
[1] https://lkml.org/lkml/2004/1/12/260
Mike Kravetz (1):
mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
include/uapi/linux/mman.h | 5 +++--
mm/mremap.c | 23 ++++++++++++++++-------
tools/include/uapi/linux/mman.h | 5 +++--
3 files changed, 22 insertions(+), 11 deletions(-)
--
2.7.5
The mremap system call has the ability to 'mirror' parts of an existing
mapping. To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address. This
functionality has existed since at least the 2.6 kernel.
This patch simply adds a new flag to mremap which will make this
functionality part of the API. It maintains backward compatibility with
the existing way of requesting mirroring (old_size == 0).
If this new MREMAP_MIRROR flag is specified, then new_size must equal
old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
Signed-off-by: Mike Kravetz <[email protected]>
---
include/uapi/linux/mman.h | 5 +++--
mm/mremap.c | 23 ++++++++++++++++-------
tools/include/uapi/linux/mman.h | 5 +++--
3 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..6b3e0df 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
#include <asm/mman.h>
-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED 2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED 0x02
+#define MREMAP_MIRROR 0x04
#define OVERCOMMIT_GUESS 0
#define OVERCOMMIT_ALWAYS 1
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..f18ab36 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
LIST_HEAD(uf_unmap);
- if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+ if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
return ret;
- if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
+ if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
+ !(flags & MREMAP_MAYMOVE))
return ret;
if (offset_in_page(addr))
@@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
old_len = PAGE_ALIGN(old_len);
new_len = PAGE_ALIGN(new_len);
- /*
- * We allow a zero old-len as a special case
- * for DOS-emu "duplicate shm area" thing. But
- * a zero new-len is nonsensical.
- */
+ /* A zero new-len is nonsensical. */
if (!new_len)
return ret;
+ /*
+ * For backward compatibility, we allow a zero old-len to imply
+ * mirroring. This was originally a special case for DOS-emu.
+ */
+ if (!old_len)
+ flags |= MREMAP_MIRROR;
+ else if (flags & MREMAP_MIRROR) {
+ if (old_len != new_len)
+ return ret;
+ old_len = 0;
+ }
+
if (down_write_killable(¤t->mm->mmap_sem))
return -EINTR;
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index 81d8edf..069f7a5 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
#include <uapi/asm/mman.h>
-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED 2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED 0x02
+#define MREMAP_MIRROR 0x04
#define OVERCOMMIT_GUESS 0
#define OVERCOMMIT_ALWAYS 1
--
2.7.5
On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel [1]. A comment
> was added to the code to help preserve this feature.
Is this the comment ? If yes, then its not very clear.
/*
* We allow a zero old-len as a special case
* for DOS-emu "duplicate shm area" thing. But
* a zero new-len is nonsensical.
*/
>
> The Oracle JVM team has discovered this feature and used it while
> prototyping a new garbage collection model. This new model shows promise,
> and they are considering its use in a future release. However, since
> the only mention of this functionality is a single comment in the kernel,
> they are concerned about its future.
>
> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
> this functionality. The flag simply provides the same functionality as
> the existing undocumented 'old_size == 0' interface. As an alternative,
> we could simply document the 'old_size == 0' interface in the man page.
> In either case, man page modifications would be needed.
Right. Adding MREMAP_MIRROR sounds cleaner from application programming
point of view. But it extends the interface.
>
> Future Direction
>
> After more formally adding this to the API (either new flag or documenting
> existing interface), the mremap code could be enhanced to optimize this
> case. Currently, 'mirroring' only sets up the new mapping. It does not
> create page table entries for new mapping. This could be added as an
> enhancement.
Then how it achieves mirroring, both the pointers should see the same
data, that can happen with page table entries pointing to same pages,
right ?
On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
Yeah it all looks good. But why is this requirement that if
MREMAP_MAYMOVE is specified then old_size and new_size must
be equal.
On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
The patch breaks important invariant that anon page can be mapped into a
process only once.
What is going to happen to mirrored after CoW for instance?
In my opinion, it shouldn't be allowed for anon/private mappings at least.
And with this limitation, I don't see much sense in the new interface --
just create mirror by mmap()ing the file again.
--
Kirill A. Shutemov
On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel [1]. A comment
> was added to the code to help preserve this feature.
In mremap() implementation move_vma() attempts to do do_unmap() after
move_page_tables(). do_unmap() on the original VMA bails out because
the requested length being 0. Hence both the original VMA and the new
VMA remains after the page table migration. Seems like this whole
mirror function is by coincidence or it has been designed that way ?
On 07/07/2017 01:19 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel [1]. A comment
>> was added to the code to help preserve this feature.
>
>
> Is this the comment ? If yes, then its not very clear.
>
> /*
> * We allow a zero old-len as a special case
> * for DOS-emu "duplicate shm area" thing. But
> * a zero new-len is nonsensical.
> */
>
Yes, I believe that is the comment.
>>
>> The Oracle JVM team has discovered this feature and used it while
>> prototyping a new garbage collection model. This new model shows promise,
>> and they are considering its use in a future release. However, since
>> the only mention of this functionality is a single comment in the kernel,
>> they are concerned about its future.
>>
>> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
>> this functionality. The flag simply provides the same functionality as
>> the existing undocumented 'old_size == 0' interface. As an alternative,
>> we could simply document the 'old_size == 0' interface in the man page.
>> In either case, man page modifications would be needed.
>
> Right. Adding MREMAP_MIRROR sounds cleaner from application programming
> point of view. But it extends the interface.
Yes. That is the reason for the RFC. We currently have functionality
that is not clearly part of a programming interface. Application programmers
do not like to depend on something that is not part of an interface.
>>
>> Future Direction
>>
>> After more formally adding this to the API (either new flag or documenting
>> existing interface), the mremap code could be enhanced to optimize this
>> case. Currently, 'mirroring' only sets up the new mapping. It does not
>> create page table entries for new mapping. This could be added as an
>> enhancement.
>
> Then how it achieves mirroring, both the pointers should see the same
> data, that can happen with page table entries pointing to same pages,
> right ?
Correct.
In the code today, page tables for the new (mirrored) mapping are created
as needed via faults. The enhancement would be to create page table entries
for the new mapping.
--
Mike Kravetz
On 07/07/2017 04:03 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel [1]. A comment
>> was added to the code to help preserve this feature.
>
> In mremap() implementation move_vma() attempts to do do_unmap() after
> move_page_tables(). do_unmap() on the original VMA bails out because
> the requested length being 0. Hence both the original VMA and the new
> VMA remains after the page table migration. Seems like this whole
> mirror function is by coincidence or it has been designed that way ?
I honestly do not know. From what I can tell, the functionality existed
in 2.4. The email thread [1], exists because it was 'accidentally' removed
in 2.6. All of this is before git history (and my involvement).
My 'guess' is that this functionality was created by coincidence. Someone
noticed it and took advantage of it. When it was removed, their code broke.
The code was 'fixed' and a comment was added to the code in an attempt to
prevent removing the functionality in the future. Again, this is speculation
as I was not originally involved.
The point of this RFC is to consider adding the functionality to the API.
If we are carrying the functionality in the code, we should at least document
so that application programmers can take advantage of it.
[1] https://lkml.org/lkml/2004/1/12/260
--
Mike Kravetz
On 07/07/2017 01:45 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API. It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>
> Yeah it all looks good. But why is this requirement that if
> MREMAP_MAYMOVE is specified then old_size and new_size must
> be equal.
No real reason. I just wanted to clearly separate the new interface from
the old. On second thought, it would be better to require old_size == 0
as in the legacy interface.
--
Mike Kravetz
On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API. It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>
> The patch breaks important invariant that anon page can be mapped into a
> process only once.
Actually, the patch does not add any new functionality. It only provides
a new interface to existing functionality.
Is it not possible to have an anon page mapped twice into the same process
via system V shared memory? shmget(anon), shmat(), shmat.
Of course, those are shared rather than private anon pages.
>
> What is going to happen to mirrored after CoW for instance?
>
> In my opinion, it shouldn't be allowed for anon/private mappings at least.
> And with this limitation, I don't see much sense in the new interface --
> just create mirror by mmap()ing the file again.
The code today works for anon shared mappings. See simple program below.
You are correct in that it makes little or no sense for private mappings.
When looking closer at existing code, mremap() creates a new private
mapping in this case. This is most likely a bug.
Again, my intention is not to create new functionality but rather document
existing functionality as part of a programming interface.
--
Mike Kravetz
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#define __USE_GNU
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <time.h>
#define H_PAGESIZE (2 * 1024 * 1024)
int hugetlb = 0;
#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0x0UL)
/* #define FLAGS (MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB) */
#define FLAGS (MAP_SHARED|MAP_ANONYMOUS)
int main(int argc, char ** argv)
{
int fd, ret;
int i;
long long hpages, tpage;
void *addr;
void *addr2;
char foo;
if (argc == 2) {
if (!strcmp(argv[1], "hugetlb"))
hugetlb = 1;
}
hpages = 5;
printf("Reserving an address ...\n");
addr = mmap(ADDR, H_PAGESIZE * hpages * 2,
PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE|
(hugetlb ? MAP_HUGETLB : 0),
-1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit (1);
}
printf("\tgot address %p to %p\n",
(void *)addr, (void *)(addr + H_PAGESIZE * hpages * 2));
printf("mmapping %d 2MB huge pages\n", hpages);
addr = mmap(addr, H_PAGESIZE * hpages, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS|
(hugetlb ? MAP_HUGETLB : 0),
-1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit (1);
}
/* initialize data */
for (i = 0; i < hpages; i++)
*((char *)addr + (i * H_PAGESIZE)) = 'a';
printf("pages allocated and initialized at %p\n", (void *)addr);
addr2 = mremap(addr, 0, H_PAGESIZE * hpages,
MREMAP_MAYMOVE | MREMAP_FIXED,
addr + (H_PAGESIZE * hpages));
if (addr2 == MAP_FAILED) {
perror("mremap");
exit (1);
}
printf("mapping relocated to %p\n", (void *)addr2);
/* verify data */
printf("Verifying data at address %p\n", (void *)addr);
for (i = 0; i < hpages; i++) {
if (*((char *)addr + (i * H_PAGESIZE)) != 'a') {
printf("data at address %p not as expected\n",
(void *)((char *)addr + (i * H_PAGESIZE)));
}
}
if (i >= hpages)
printf("\t success!\n");
/* verify data */
printf("Verifying data at address %p\n", (void *)addr2);
for (i = 0; i < hpages; i++) {
if (*((char *)addr2 + (i * H_PAGESIZE)) != 'a') {
printf("data at address %p not as expected\n",
(void *)((char *)addr2 + (i * H_PAGESIZE)));
}
}
if (i >= hpages)
printf("\t success!\n");
return ret;
}
On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
> > On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
> >> The mremap system call has the ability to 'mirror' parts of an existing
> >> mapping. To do so, it creates a new mapping that maps the same pages as
> >> the original mapping, just at a different virtual address. This
> >> functionality has existed since at least the 2.6 kernel.
> >>
> >> This patch simply adds a new flag to mremap which will make this
> >> functionality part of the API. It maintains backward compatibility with
> >> the existing way of requesting mirroring (old_size == 0).
> >>
> >> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> >> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
> >
> > The patch breaks important invariant that anon page can be mapped into a
> > process only once.
>
> Actually, the patch does not add any new functionality. It only provides
> a new interface to existing functionality.
>
> Is it not possible to have an anon page mapped twice into the same process
> via system V shared memory? shmget(anon), shmat(), shmat.
> Of course, those are shared rather than private anon pages.
By anon pages I mean, private anon or file pages. These are subject to CoW.
> > What is going to happen to mirrored after CoW for instance?
> >
> > In my opinion, it shouldn't be allowed for anon/private mappings at least.
> > And with this limitation, I don't see much sense in the new interface --
> > just create mirror by mmap()ing the file again.
>
> The code today works for anon shared mappings. See simple program below.
>
> You are correct in that it makes little or no sense for private mappings.
> When looking closer at existing code, mremap() creates a new private
> mapping in this case. This is most likely a bug.
IIRC, existing code doesn't create mirrors of private pages as it requires
old_len to be zero. There's no way to get private pages mapped twice this
way.
--
Kirill A. Shutemov
On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>>>> The mremap system call has the ability to 'mirror' parts of an existing
>>>> mapping. To do so, it creates a new mapping that maps the same pages as
>>>> the original mapping, just at a different virtual address. This
>>>> functionality has existed since at least the 2.6 kernel.
>>>>
>>>> This patch simply adds a new flag to mremap which will make this
>>>> functionality part of the API. It maintains backward compatibility with
>>>> the existing way of requesting mirroring (old_size == 0).
>>>>
>>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>>>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>>>
>>> The patch breaks important invariant that anon page can be mapped into a
>>> process only once.
>>
>> Actually, the patch does not add any new functionality. It only provides
>> a new interface to existing functionality.
>>
>> Is it not possible to have an anon page mapped twice into the same process
>> via system V shared memory? shmget(anon), shmat(), shmat.
>> Of course, those are shared rather than private anon pages.
>
> By anon pages I mean, private anon or file pages. These are subject to CoW.
>
>>> What is going to happen to mirrored after CoW for instance?
>>>
>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>> And with this limitation, I don't see much sense in the new interface --
>>> just create mirror by mmap()ing the file again.
>>
>> The code today works for anon shared mappings. See simple program below.
>>
>> You are correct in that it makes little or no sense for private mappings.
>> When looking closer at existing code, mremap() creates a new private
>> mapping in this case. This is most likely a bug.
>
> IIRC, existing code doesn't create mirrors of private pages as it requires
> old_len to be zero. There's no way to get private pages mapped twice this
> way.
Correct.
As mentioned above, mremap does 'something' for private anon pages when
old_len == 0. However, this may be considered a bug. In this case, mremap
creates a new private anon mapping of length new_size. Since old_len == 0,
it does not unmap any of the old mapping. So, in this case mremap basically
creates a new private mapping (unrealted to the original) and does not
modify the old mapping.
--
Mike Kravetz
On 07/07/2017 10:44 PM, Mike Kravetz wrote:
> On 07/07/2017 01:45 AM, Anshuman Khandual wrote:
>> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>>> The mremap system call has the ability to 'mirror' parts of an existing
>>> mapping. To do so, it creates a new mapping that maps the same pages as
>>> the original mapping, just at a different virtual address. This
>>> functionality has existed since at least the 2.6 kernel.
>>>
>>> This patch simply adds a new flag to mremap which will make this
>>> functionality part of the API. It maintains backward compatibility with
>>> the existing way of requesting mirroring (old_size == 0).
>>>
>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>>
>> Yeah it all looks good. But why is this requirement that if
>> MREMAP_MAYMOVE is specified then old_size and new_size must
>> be equal.
>
> No real reason. I just wanted to clearly separate the new interface from
> the old. On second thought, it would be better to require old_size == 0
> as in the legacy interface.
That would be redundant. Mirroring will just happen because old_size is
0 whether we mention the MREMAP_MIRROR flag or not. IMHO it should just
mirror if the flag is specified irrespective of the old_size value.
On 07/07/2017 11:39 PM, Mike Kravetz wrote:
> On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>>> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>>>>> The mremap system call has the ability to 'mirror' parts of an existing
>>>>> mapping. To do so, it creates a new mapping that maps the same pages as
>>>>> the original mapping, just at a different virtual address. This
>>>>> functionality has existed since at least the 2.6 kernel.
>>>>>
>>>>> This patch simply adds a new flag to mremap which will make this
>>>>> functionality part of the API. It maintains backward compatibility with
>>>>> the existing way of requesting mirroring (old_size == 0).
>>>>>
>>>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>>>>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>>>>
>>>> The patch breaks important invariant that anon page can be mapped into a
>>>> process only once.
>>>
>>> Actually, the patch does not add any new functionality. It only provides
>>> a new interface to existing functionality.
>>>
>>> Is it not possible to have an anon page mapped twice into the same process
>>> via system V shared memory? shmget(anon), shmat(), shmat.
>>> Of course, those are shared rather than private anon pages.
>>
>> By anon pages I mean, private anon or file pages. These are subject to CoW.
>>
>>>> What is going to happen to mirrored after CoW for instance?
>>>>
>>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>>> And with this limitation, I don't see much sense in the new interface --
>>>> just create mirror by mmap()ing the file again.
>>>
>>> The code today works for anon shared mappings. See simple program below.
>>>
>>> You are correct in that it makes little or no sense for private mappings.
>>> When looking closer at existing code, mremap() creates a new private
>>> mapping in this case. This is most likely a bug.
>>
>> IIRC, existing code doesn't create mirrors of private pages as it requires
>> old_len to be zero. There's no way to get private pages mapped twice this
>> way.
>
> Correct.
> As mentioned above, mremap does 'something' for private anon pages when
> old_len == 0. However, this may be considered a bug. In this case, mremap
> creates a new private anon mapping of length new_size. Since old_len == 0,
> it does not unmap any of the old mapping. So, in this case mremap basically
> creates a new private mapping (unrealted to the original) and does not
> modify the old mapping.
>
Yeah, in my experiment, after the mremap() exists we have two different VMAs
which can contain two different set of data. No page sharing is happening.
On 07/09/2017 09:32 AM, Anshuman Khandual wrote:
> On 07/07/2017 11:39 PM, Mike Kravetz wrote:
>> On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
>>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>>>> What is going to happen to mirrored after CoW for instance?
>>>>>
>>>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>>>> And with this limitation, I don't see much sense in the new interface --
>>>>> just create mirror by mmap()ing the file again.
>>>>
>>>> The code today works for anon shared mappings. See simple program below.
>>>>
>>>> You are correct in that it makes little or no sense for private mappings.
>>>> When looking closer at existing code, mremap() creates a new private
>>>> mapping in this case. This is most likely a bug.
>>>
>>> IIRC, existing code doesn't create mirrors of private pages as it requires
>>> old_len to be zero. There's no way to get private pages mapped twice this
>>> way.
>>
>> Correct.
>> As mentioned above, mremap does 'something' for private anon pages when
>> old_len == 0. However, this may be considered a bug. In this case, mremap
>> creates a new private anon mapping of length new_size. Since old_len == 0,
>> it does not unmap any of the old mapping. So, in this case mremap basically
>> creates a new private mapping (unrealted to the original) and does not
>> modify the old mapping.
>>
>
> Yeah, in my experiment, after the mremap() exists we have two different VMAs
> which can contain two different set of data. No page sharing is happening.
So how does this actually work for the JVM garbage collector use case?
Aren't the garbage collected objects private anon?
Anyway this should be documented.
On 07/10/2017 09:22 AM, Vlastimil Babka wrote:
> On 07/09/2017 09:32 AM, Anshuman Khandual wrote:
>> On 07/07/2017 11:39 PM, Mike Kravetz wrote:
>>> On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
>>>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>>>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>>>>> What is going to happen to mirrored after CoW for instance?
>>>>>>
>>>>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>>>>> And with this limitation, I don't see much sense in the new interface --
>>>>>> just create mirror by mmap()ing the file again.
>>>>>
>>>>> The code today works for anon shared mappings. See simple program below.
>>>>>
>>>>> You are correct in that it makes little or no sense for private mappings.
>>>>> When looking closer at existing code, mremap() creates a new private
>>>>> mapping in this case. This is most likely a bug.
>>>>
>>>> IIRC, existing code doesn't create mirrors of private pages as it requires
>>>> old_len to be zero. There's no way to get private pages mapped twice this
>>>> way.
>>>
>>> Correct.
>>> As mentioned above, mremap does 'something' for private anon pages when
>>> old_len == 0. However, this may be considered a bug. In this case, mremap
>>> creates a new private anon mapping of length new_size. Since old_len == 0,
>>> it does not unmap any of the old mapping. So, in this case mremap basically
>>> creates a new private mapping (unrealted to the original) and does not
>>> modify the old mapping.
>>>
>>
>> Yeah, in my experiment, after the mremap() exists we have two different VMAs
>> which can contain two different set of data. No page sharing is happening.
>
> So how does this actually work for the JVM garbage collector use case?
> Aren't the garbage collected objects private anon?
Good point.
The sample program the JVM team gave me uses a shared anon mapping. As you
mention one would expect these mappings to be private. I have asked them
for more details on their use case.
> Anyway this should be documented.
Yes, their prototype work seems to take advantage of this existing undocumented
behavior. It seems we have been carrying this functionality for at least 13
years. It may be time to document.
--
Mike Kravetz
On Thu 06-07-17 09:17:26, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
I have to admit that this came as a suprise to me. There is no mention
about this special case in the man page and the mremap code is so
convoluted that I simply didn't see it there. I guess the only
reasonable usecase is when you do not have a fd for the shared memory.
Anyway the patch should fail with -EINVAL on private mappings as Kirill
already pointed out and this should go along with an update to the
man page which describes also the historical behavior. Make sure you
document that this is not really a mirroring (e.g. faulting page in one
address will automatically map it to the other mapping(s)) but merely a
copy of the range. Maybe MREMAP_COPY would be more appropriate name.
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> include/uapi/linux/mman.h | 5 +++--
> mm/mremap.c | 23 ++++++++++++++++-------
> tools/include/uapi/linux/mman.h | 5 +++--
> 3 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index ade4acd..6b3e0df 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -3,8 +3,9 @@
>
> #include <asm/mman.h>
>
> -#define MREMAP_MAYMOVE 1
> -#define MREMAP_FIXED 2
> +#define MREMAP_MAYMOVE 0x01
> +#define MREMAP_FIXED 0x02
> +#define MREMAP_MIRROR 0x04
>
> #define OVERCOMMIT_GUESS 0
> #define OVERCOMMIT_ALWAYS 1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cd8a1b1..f18ab36 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
> LIST_HEAD(uf_unmap);
>
> - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
> return ret;
>
> - if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
> + if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
> + !(flags & MREMAP_MAYMOVE))
> return ret;
>
> if (offset_in_page(addr))
> @@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> old_len = PAGE_ALIGN(old_len);
> new_len = PAGE_ALIGN(new_len);
>
> - /*
> - * We allow a zero old-len as a special case
> - * for DOS-emu "duplicate shm area" thing. But
> - * a zero new-len is nonsensical.
> - */
> + /* A zero new-len is nonsensical. */
> if (!new_len)
> return ret;
>
> + /*
> + * For backward compatibility, we allow a zero old-len to imply
> + * mirroring. This was originally a special case for DOS-emu.
> + */
> + if (!old_len)
> + flags |= MREMAP_MIRROR;
> + else if (flags & MREMAP_MIRROR) {
> + if (old_len != new_len)
> + return ret;
> + old_len = 0;
> + }
> +
> if (down_write_killable(¤t->mm->mmap_sem))
> return -EINTR;
>
> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
> index 81d8edf..069f7a5 100644
> --- a/tools/include/uapi/linux/mman.h
> +++ b/tools/include/uapi/linux/mman.h
> @@ -3,8 +3,9 @@
>
> #include <uapi/asm/mman.h>
>
> -#define MREMAP_MAYMOVE 1
> -#define MREMAP_FIXED 2
> +#define MREMAP_MAYMOVE 0x01
> +#define MREMAP_FIXED 0x02
> +#define MREMAP_MIRROR 0x04
>
> #define OVERCOMMIT_GUESS 0
> #define OVERCOMMIT_ALWAYS 1
> --
> 2.7.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Michal Hocko
SUSE Labs
On 07/11/2017 05:36 AM, Michal Hocko wrote:
> On Thu 06-07-17 09:17:26, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API. It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>
> I have to admit that this came as a suprise to me. There is no mention
> about this special case in the man page and the mremap code is so
> convoluted that I simply didn't see it there. I guess the only
> reasonable usecase is when you do not have a fd for the shared memory.
I was surprised as well when a JVM developer pointed this out.
>From the old e-mail thread, here is original use case:
shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732
shmat(11337732, 0, 0) = 0x40299000
shmctl(11337732, IPC_RMID, 0) = 0
mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0
mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000
The JVM team wants to do something similar. They are using
mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead
of shmget/shmat. As Vlastimil mentioned previously, one would not
expect a shared mapping for parts of the JVM heap. I am working
to get clarification from the JVM team.
> Anyway the patch should fail with -EINVAL on private mappings as Kirill
> already pointed out
Yes. I think this should be a separate patch. As mentioned earlier,
mremap today creates a new/additional private mapping if called in this
way with old_size == 0. To me, this is a bug.
> and this should go along with an update to the
> man page which describes also the historical behavior.
Yes, man page updates are a must.
One reason for the RFC was to determine if people thought we should:
1) Just document the existing old_size == 0 functionality
2) Create a more explicit interface such as a new mremap flag for this
functionality
I am waiting to see what direction people prefer before making any
man page updates.
> Make sure you
> document that this is not really a mirroring (e.g. faulting page in one
> address will automatically map it to the other mapping(s)) but merely a
> copy of the range. Maybe MREMAP_COPY would be more appropriate name.
Good point. mirror is the first word that came to mind, but it does
not exactly apply.
--
Mike Kravetz
>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> include/uapi/linux/mman.h | 5 +++--
>> mm/mremap.c | 23 ++++++++++++++++-------
>> tools/include/uapi/linux/mman.h | 5 +++--
>> 3 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index ade4acd..6b3e0df 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -3,8 +3,9 @@
>>
>> #include <asm/mman.h>
>>
>> -#define MREMAP_MAYMOVE 1
>> -#define MREMAP_FIXED 2
>> +#define MREMAP_MAYMOVE 0x01
>> +#define MREMAP_FIXED 0x02
>> +#define MREMAP_MIRROR 0x04
>>
>> #define OVERCOMMIT_GUESS 0
>> #define OVERCOMMIT_ALWAYS 1
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index cd8a1b1..f18ab36 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>> struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
>> LIST_HEAD(uf_unmap);
>>
>> - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>> + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
>> return ret;
>>
>> - if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
>> + if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
>> + !(flags & MREMAP_MAYMOVE))
>> return ret;
>>
>> if (offset_in_page(addr))
>> @@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>> old_len = PAGE_ALIGN(old_len);
>> new_len = PAGE_ALIGN(new_len);
>>
>> - /*
>> - * We allow a zero old-len as a special case
>> - * for DOS-emu "duplicate shm area" thing. But
>> - * a zero new-len is nonsensical.
>> - */
>> + /* A zero new-len is nonsensical. */
>> if (!new_len)
>> return ret;
>>
>> + /*
>> + * For backward compatibility, we allow a zero old-len to imply
>> + * mirroring. This was originally a special case for DOS-emu.
>> + */
>> + if (!old_len)
>> + flags |= MREMAP_MIRROR;
>> + else if (flags & MREMAP_MIRROR) {
>> + if (old_len != new_len)
>> + return ret;
>> + old_len = 0;
>> + }
>> +
>> if (down_write_killable(¤t->mm->mmap_sem))
>> return -EINTR;
>>
>> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
>> index 81d8edf..069f7a5 100644
>> --- a/tools/include/uapi/linux/mman.h
>> +++ b/tools/include/uapi/linux/mman.h
>> @@ -3,8 +3,9 @@
>>
>> #include <uapi/asm/mman.h>
>>
>> -#define MREMAP_MAYMOVE 1
>> -#define MREMAP_FIXED 2
>> +#define MREMAP_MAYMOVE 0x01
>> +#define MREMAP_FIXED 0x02
>> +#define MREMAP_MIRROR 0x04
>>
>> #define OVERCOMMIT_GUESS 0
>> #define OVERCOMMIT_ALWAYS 1
>> --
>> 2.7.5
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
On Tue, Jul 11, 2017 at 11:23:19AM -0700, Mike Kravetz wrote:
> I was surprised as well when a JVM developer pointed this out.
>
> From the old e-mail thread, here is original use case:
> shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732
> shmat(11337732, 0, 0) = 0x40299000
> shmctl(11337732, IPC_RMID, 0) = 0
> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0
> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000
>
> The JVM team wants to do something similar. They are using
> mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead
> of shmget/shmat. As Vlastimil mentioned previously, one would not
> expect a shared mapping for parts of the JVM heap. I am working
> to get clarification from the JVM team.
Why don't they use memfd_create instead? That's made so that the fd is
born anon unlinked so when the last reference is dropped all memory
associated with it is automatically freed. No need of IC_RMID and then
they can use mmap instead of mremap(len=0) to get a double map of it.
If they use mmap(MAP_ANONYMOUS|MAP_SHARED) it's not hugetlbfs, that
would have been the only issue.
Using hugetlbfs for JVM wouldn't be really flexible, better they try
to leverage THP on SHM or the hugetlbfs reservation gets in the way of
efficient use of the unused memory for memory allocations that don't
have a definitive size (i.e. JVM forks or more JVM are run in
parallel).
> Yes. I think this should be a separate patch. As mentioned earlier,
> mremap today creates a new/additional private mapping if called in this
> way with old_size == 0. To me, this is a bug.
Kernel by sheer luck should stay stable, but the result is weird and
it's unlikely intentional.
memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will
get the file pages correctly after a new mmap (even if there were cows
in the old MAP_PRIVATE mmap).
> One reason for the RFC was to determine if people thought we should:
> 1) Just document the existing old_size == 0 functionality
> 2) Create a more explicit interface such as a new mremap flag for this
> functionality
>
> I am waiting to see what direction people prefer before making any
> man page updates.
I guess old_size == 0 would better be dropped if possible, if
memfd_create fits perfectly your needs as I supposed above. If it's
not dropped then it's not very far from allowing mmap of /proc/self/mm
again (removed around so far as 2.3.x?).
Thanks,
Andrea
On 07/11/2017 02:02 PM, Andrea Arcangeli wrote:
> On Tue, Jul 11, 2017 at 11:23:19AM -0700, Mike Kravetz wrote:
>> I was surprised as well when a JVM developer pointed this out.
>>
>> From the old e-mail thread, here is original use case:
>> shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732
>> shmat(11337732, 0, 0) = 0x40299000
>> shmctl(11337732, IPC_RMID, 0) = 0
>> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0
>> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000
>>
>> The JVM team wants to do something similar. They are using
>> mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead
>> of shmget/shmat. As Vlastimil mentioned previously, one would not
>> expect a shared mapping for parts of the JVM heap. I am working
>> to get clarification from the JVM team.
>
> Why don't they use memfd_create instead? That's made so that the fd is
> born anon unlinked so when the last reference is dropped all memory
> associated with it is automatically freed. No need of IC_RMID and then
> they can use mmap instead of mremap(len=0) to get a double map of it.
Wow! I did not even know about memfd_create until you mentioned it.
That would certainly work for 'normal' pages.
> If they use mmap(MAP_ANONYMOUS|MAP_SHARED) it's not hugetlbfs, that
> would have been the only issue.
>
> Using hugetlbfs for JVM wouldn't be really flexible, better they try
> to leverage THP on SHM or the hugetlbfs reservation gets in the way of
> efficient use of the unused memory for memory allocations that don't
> have a definitive size (i.e. JVM forks or more JVM are run in
> parallel).
Well, the JVM has had a config option for the use of hugetlbfs for quite
some time. I assume they have already had to deal with these issues.
What prompted this discussion is that they want the mremap mirroring/
duplication functionality extended to support hugetlbfs. This is pretty
straight forward. But, I wanted to have a discussion about whether the
mremap(old_size == 0) functionality should be formally documented first.
Do note that if you actually create/mount a hugetlbfs filesystem and
use a fd in that filesystem you can get the desired functionality. However,
they want to avoid this extra step if possible and use mmap(anon, hugetlb).
I'm guessing that if memfd_create supported hugetlbfs, that would also
meet their needs. Any thoughts about extending memfd_create support to
hugetlbfs? I can't think of any big issues. In fact, 'under the covers'
there actually is a hugetlbfs file created for anon mappings. However,
that is not exposed to the user.
>> Yes. I think this should be a separate patch. As mentioned earlier,
>> mremap today creates a new/additional private mapping if called in this
>> way with old_size == 0. To me, this is a bug.
>
> Kernel by sheer luck should stay stable, but the result is weird and
> it's unlikely intentional.
Yes, that is why I think it is a bug. Not that kernel is unstable, but
rather the unintentional/unexpected result.
> memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will
> get the file pages correctly after a new mmap (even if there were cows
> in the old MAP_PRIVATE mmap).
>
>> One reason for the RFC was to determine if people thought we should:
>> 1) Just document the existing old_size == 0 functionality
>> 2) Create a more explicit interface such as a new mremap flag for this
>> functionality
>>
>> I am waiting to see what direction people prefer before making any
>> man page updates.
>
> I guess old_size == 0 would better be dropped if possible, if
> memfd_create fits perfectly your needs as I supposed above. If it's
> not dropped then it's not very far from allowing mmap of /proc/self/mm
> again (removed around so far as 2.3.x?).
Yes, in my google'ing it appears the first users of mremap(old_size == 0)
previously used mmap of /proc/self/mm.
If memfd_create can be extended to support hugetlbfs, then I might suggest
dropping the memfd_create(old_size == 0) support. Just a thought.
--
Mike Kravetz
On Tue, Jul 11, 2017 at 02:57:38PM -0700, Mike Kravetz wrote:
> Well, the JVM has had a config option for the use of hugetlbfs for quite
> some time. I assume they have already had to deal with these issues.
Yes, the config tweak exists well before THP existed but in production
I know nobody who used it because as you start more processes you risk
running out of hugetlbfs reservation and in addition the reservation
"wastes memory" at times.
> What prompted this discussion is that they want the mremap mirroring/
> duplication functionality extended to support hugetlbfs. This is pretty
> straight forward. But, I wanted to have a discussion about whether the
> mremap(old_size == 0) functionality should be formally documented first.
Agreed.
> Do note that if you actually create/mount a hugetlbfs filesystem and
> use a fd in that filesystem you can get the desired functionality. However,
> they want to avoid this extra step if possible and use mmap(anon, hugetlb).
I see, I thought they needed to use the mremap on pure SHM because of
the there was no MAP_HUGETLB in the mmap flags of the use case.
> I'm guessing that if memfd_create supported hugetlbfs, that would also
> meet their needs. Any thoughts about extending memfd_create support to
> hugetlbfs? I can't think of any big issues. In fact, 'under the covers'
> there actually is a hugetlbfs file created for anon mappings. However,
> that is not exposed to the user.
Yes, that should fit fine as MFD_HUGETLB or similar.
> Yes, that is why I think it is a bug. Not that kernel is unstable, but
> rather the unintentional/unexpected result.
The most unexpected is the old mapping isn't wiped, at least it
doesn't seem to cause trouble to anon as move_page_tables is
nullified (old_end == old_addr so the loop never runs).
> > memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will
> > get the file pages correctly after a new mmap (even if there were cows
> > in the old MAP_PRIVATE mmap).
> >
> >> One reason for the RFC was to determine if people thought we should:
> >> 1) Just document the existing old_size == 0 functionality
> >> 2) Create a more explicit interface such as a new mremap flag for this
> >> functionality
> >>
> >> I am waiting to see what direction people prefer before making any
> >> man page updates.
> >
> > I guess old_size == 0 would better be dropped if possible, if
> > memfd_create fits perfectly your needs as I supposed above. If it's
> > not dropped then it's not very far from allowing mmap of /proc/self/mm
> > again (removed around so far as 2.3.x?).
>
> Yes, in my google'ing it appears the first users of mremap(old_size == 0)
> previously used mmap of /proc/self/mm.
>
> If memfd_create can be extended to support hugetlbfs, then I might suggest
> dropping the memfd_create(old_size == 0) support. Just a thought.
memfd_create interface sounds more robust than this mremap trick,
they would have to deal with one more fd that's all.
old_len == 0 by nullifying move_page_tables will cause not harm to
anon pages however the place where we would drop the vma is do_munmap
here:
if (vm_flags & VM_ACCOUNT) {
vma->vm_flags &= ~VM_ACCOUNT;
excess = vma->vm_end - vma->vm_start - old_len;
[..]
if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
/* OOM: unable to split vma, just get accounts right */
vm_unacct_memory(excess >> PAGE_SHIFT);
excess = 0;
}
It looks like a split_vma allocation failure can leave the old vma
around in a equal way to old_len == 0 (but in such case all anon
payload will have been moved to the new vma). That also seems safe as
far as the kernel is concerned but it could cause userland failure if
you depend on SIGSEGV to trigger later on the original vma you thought
was implicitly munmapped (and in MAP_SHARED case it could even lead to
unexpected file corruption instead of an expected SIGSEGV). If nobody
ever depends on whatever is left on the old vma it's ok, but it could
still leave file handle pinned unexpectedly if it's not anon.
The other issue of the old_len = 0 trick is that will unexpectedly
wipe the VM_ACCOUNT from the original vma as side effect of the above,
but it'd only be noticeable if you care about strict accounting. So
there is at least such one glitch in it.
Thanks,
Andrea
On Tue 11-07-17 11:23:19, Mike Kravetz wrote:
> On 07/11/2017 05:36 AM, Michal Hocko wrote:
[...]
> > Anyway the patch should fail with -EINVAL on private mappings as Kirill
> > already pointed out
>
> Yes. I think this should be a separate patch. As mentioned earlier,
> mremap today creates a new/additional private mapping if called in this
> way with old_size == 0. To me, this is a bug.
Not only that. It clears existing ptes in the old mapping so the content
is lost. That is quite unexpected behavior. Now it is hard to assume
whether somebody relies on the behavior (I can easily imagine somebody
doing backup&clear in atomic way) so failing with EINVAL might break
userspace so I am not longer sure. Anyway this really needs to be
documented.
--
Michal Hocko
SUSE Labs
On 07/12/2017 04:46 AM, Michal Hocko wrote:
> On Tue 11-07-17 11:23:19, Mike Kravetz wrote:
>> On 07/11/2017 05:36 AM, Michal Hocko wrote:
> [...]
>>> Anyway the patch should fail with -EINVAL on private mappings as Kirill
>>> already pointed out
>>
>> Yes. I think this should be a separate patch. As mentioned earlier,
>> mremap today creates a new/additional private mapping if called in this
>> way with old_size == 0. To me, this is a bug.
>
> Not only that. It clears existing ptes in the old mapping so the content
> is lost. That is quite unexpected behavior. Now it is hard to assume
> whether somebody relies on the behavior (I can easily imagine somebody
> doing backup&clear in atomic way) so failing with EINVAL might break
> userspace so I am not longer sure. Anyway this really needs to be
> documented.
I am pretty sure it does not clear ptes in the old mapping, or modify it
in any way. Are you thinking they are cleared as part of the call to
move_page_tables? Since old_size == 0 (len as passed to move_page_tables),
the for loop in move_page_tables is not run and it doesn't do much of
anything in this case.
My plan is to look into adding hugetlbfs support to memfd_create, as this
would meet the user's needs. And, this is a much more sane API than this
mremap(old_size == 0) behavior.
If adding hugetlbfs support to memfd_create works out, I would like to
see mremap(old_size == 0) support dropped. Nobody here (kernel mm
development) seems to like it. However, as you note there may be somebody
depending on this behavior. What would be the process for removing
such support? AFAIK, it is not documented anywhere. If we do document
the behavior, then we will certainly be stuck with it for a long time.
--
Mike Kravetz
On Wed 12-07-17 09:55:48, Mike Kravetz wrote:
> On 07/12/2017 04:46 AM, Michal Hocko wrote:
> > On Tue 11-07-17 11:23:19, Mike Kravetz wrote:
> >> On 07/11/2017 05:36 AM, Michal Hocko wrote:
> > [...]
> >>> Anyway the patch should fail with -EINVAL on private mappings as Kirill
> >>> already pointed out
> >>
> >> Yes. I think this should be a separate patch. As mentioned earlier,
> >> mremap today creates a new/additional private mapping if called in this
> >> way with old_size == 0. To me, this is a bug.
> >
> > Not only that. It clears existing ptes in the old mapping so the content
> > is lost. That is quite unexpected behavior. Now it is hard to assume
> > whether somebody relies on the behavior (I can easily imagine somebody
> > doing backup&clear in atomic way) so failing with EINVAL might break
> > userspace so I am not longer sure. Anyway this really needs to be
> > documented.
>
> I am pretty sure it does not clear ptes in the old mapping, or modify it
> in any way. Are you thinking they are cleared as part of the call to
> move_page_tables? Since old_size == 0 (len as passed to move_page_tables),
> the for loop in move_page_tables is not run and it doesn't do much of
> anything in this case.
Dang. I have completely missed that we give old_len as the len
parameter. Then it is clear that this old_len == 0 trick never really
worked for MAP_PRIVATE because it simply fails the main invariant that
the content at the new location matches the old one. Care to send a
patch to clarify that and sent EINVAL or should I do it?
> My plan is to look into adding hugetlbfs support to memfd_create, as this
> would meet the user's needs. And, this is a much more sane API than this
> mremap(old_size == 0) behavior.
agreed
> If adding hugetlbfs support to memfd_create works out, I would like to
> see mremap(old_size == 0) support dropped. Nobody here (kernel mm
> development) seems to like it. However, as you note there may be somebody
> depending on this behavior. What would be the process for removing
> such support? AFAIK, it is not documented anywhere. If we do document
> the behavior, then we will certainly be stuck with it for a long time.
I would rather document it than remove it. From the past we know that
there are users and my experience tells me that once something is used
it lives its life for ever basically. And moreover it is not like this
costs us any maintenance burden to support the hack. Just make it more
obvious so that we do not have to rediscover it each time.
--
Michal Hocko
SUSE Labs
On 07/12/2017 11:16 PM, Michal Hocko wrote:
> On Wed 12-07-17 09:55:48, Mike Kravetz wrote:
>> On 07/12/2017 04:46 AM, Michal Hocko wrote:
>>> On Tue 11-07-17 11:23:19, Mike Kravetz wrote:
>>>> On 07/11/2017 05:36 AM, Michal Hocko wrote:
>>> [...]
>>>>> Anyway the patch should fail with -EINVAL on private mappings as Kirill
>>>>> already pointed out
>>>>
>>>> Yes. I think this should be a separate patch. As mentioned earlier,
>>>> mremap today creates a new/additional private mapping if called in this
>>>> way with old_size == 0. To me, this is a bug.
>>>
>>> Not only that. It clears existing ptes in the old mapping so the content
>>> is lost. That is quite unexpected behavior. Now it is hard to assume
>>> whether somebody relies on the behavior (I can easily imagine somebody
>>> doing backup&clear in atomic way) so failing with EINVAL might break
>>> userspace so I am not longer sure. Anyway this really needs to be
>>> documented.
>>
>> I am pretty sure it does not clear ptes in the old mapping, or modify it
>> in any way. Are you thinking they are cleared as part of the call to
>> move_page_tables? Since old_size == 0 (len as passed to move_page_tables),
>> the for loop in move_page_tables is not run and it doesn't do much of
>> anything in this case.
>
> Dang. I have completely missed that we give old_len as the len
> parameter. Then it is clear that this old_len == 0 trick never really
> worked for MAP_PRIVATE because it simply fails the main invariant that
> the content at the new location matches the old one. Care to send a
> patch to clarify that and sent EINVAL or should I do it?
Sent a patch (in separate e-mail thread) to return EINVAL for private
mappings.
>> If adding hugetlbfs support to memfd_create works out, I would like to
>> see mremap(old_size == 0) support dropped. Nobody here (kernel mm
>> development) seems to like it. However, as you note there may be somebody
>> depending on this behavior. What would be the process for removing
>> such support? AFAIK, it is not documented anywhere. If we do document
>> the behavior, then we will certainly be stuck with it for a long time.
>
> I would rather document it than remove it. From the past we know that
> there are users and my experience tells me that once something is used
> it lives its life for ever basically. And moreover it is not like this
> costs us any maintenance burden to support the hack. Just make it more
> obvious so that we do not have to rediscover it each time.
I will put together a patch to add a description of (old_size == 0)
behavior to the man page.
--
Mike Kravetz
On Thu, Jul 13, 2017 at 09:01:54AM -0700, Mike Kravetz wrote:
> Sent a patch (in separate e-mail thread) to return EINVAL for private
> mappings.
The way old_len == 0 behaves for MAP_PRIVATE seems more sane to me
than the alternative of copying pagetables for anon pages (as behaving
the way that way avoids to break anon pages invariants), despite it's
not creating an exact mirror of what was in the original vma as it
excludes any modification done to cowed anon pages.
By nullifying move_page_tables old_len == 0 is simply duping the vma
which is equivalent to a new mmap on the file for the MAP_PRIVATE
case, it has a deterministic result. The real question is if it
anybody is using it.
So an alternative would be to start by adding a WARN_ON_ONCE deprecation
warning instead of -EINVAL right away.
The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side
effect of using the old_len == 0 trick looks like a bug, I guess it
should get fixed if we intend to keep old_len and document it for the
long term.
Overall I'm more concerned about the fact an allocation failure in
do_munmap is unreported to userland and it will leave the old vma
intact like old_len == 0 would do (unless I'm misreading something
there). The VM_ACCOUNT wipe as side effect of old_len == 0 is not
major short term concern.
Thanks,
Andrea
On 07/13/2017 09:30 AM, Andrea Arcangeli wrote:
> On Thu, Jul 13, 2017 at 09:01:54AM -0700, Mike Kravetz wrote:
>> Sent a patch (in separate e-mail thread) to return EINVAL for private
>> mappings.
>
> The way old_len == 0 behaves for MAP_PRIVATE seems more sane to me
> than the alternative of copying pagetables for anon pages (as behaving
> the way that way avoids to break anon pages invariants), despite it's
> not creating an exact mirror of what was in the original vma as it
> excludes any modification done to cowed anon pages.
>
> By nullifying move_page_tables old_len == 0 is simply duping the vma
> which is equivalent to a new mmap on the file for the MAP_PRIVATE
> case, it has a deterministic result. The real question is if it
> anybody is using it.
As previously discussed, copying pagetables (via move_page_tables) does
not happen if old_len == 0. This is true for both for private and shared
mappings.
Here is my understanding of how things work for old_len == 0 of anon
mappings:
- shared mappings
- New vma is created at new virtual address
- vma refers to the same underlying object/pages as old vma
- after mremap, no page tables exist for new vma, they are
created as pages are accessed/faulted
- page at new_address is same as page at old_address
- private mappings
- New vma is created at new virtual address
- vma does not refer to same pages as old vma. It is a 'new'
private anon mapping.
- after mremap, no page tables exist for new vma. access to
the range of the new vma will result in faults that allocate
a new page.
- page at new_address is different than page at old_address
the new vma will result in new
So, the result of mremap(old_len == 0) on a private mapping is that it
simply creates a new private mapping. IMO, this is contrary to the purpose
of mremap. mremap should return a mapping that is somehow related to
the original mapping.
Perhaps you are thinking about mremap of a private file mapping? I was
not considering that case. I believe you are right. In this case a
private COW mapping based on the original mapping would be created. So,
this seems more in line with the intent of mremap. The new mapping is
still related to the old mapping.
With this in mind, what about returning EINVAL only for the anon private
mapping case?
However, if you have a fd (for a file mapping) then I can not see why
someone would be using the old_len == 0 trick. It would be more straight
forward to simply use mmap to create the additional mapping.
> So an alternative would be to start by adding a WARN_ON_ONCE deprecation
> warning instead of -EINVAL right away.
>
> The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side
> effect of using the old_len == 0 trick looks like a bug, I guess it
> should get fixed if we intend to keep old_len and document it for the
> long term.
Others seem to think we should keep old_len == 0 and document.
> Overall I'm more concerned about the fact an allocation failure in
> do_munmap is unreported to userland and it will leave the old vma
> intact like old_len == 0 would do (unless I'm misreading something
> there). The VM_ACCOUNT wipe as side effect of old_len == 0 is not
> major short term concern.
I assume you are concerned about the do_munmap call in move_vma? That
does indeed look to be of concern. This happens AFTER setting up the
new mapping. So, I'm thinking we should tear down the new mapping in
the case do_munmap of the old mapping fails? That 'should' simply
be a matter of:
- moving page tables back to original mapping
- remove/delete new vma
- I don't think we need to 'unmap' the new vma as there should be no
associated pages.
I'll look into doing this as well.
Just curious, do those userfaultfd callouts still work as desired in the
case of map duplication (old_len == 0)?
--
Mike Kravetz
On Thu, Jul 13, 2017 at 11:11:37AM -0700, Mike Kravetz wrote:
> Here is my understanding of how things work for old_len == 0 of anon
> mappings:
> - shared mappings
> - New vma is created at new virtual address
> - vma refers to the same underlying object/pages as old vma
> - after mremap, no page tables exist for new vma, they are
> created as pages are accessed/faulted
> - page at new_address is same as page at old_address
Yes, and this isn't backed by anon memory, it's backed by
shmem. "Shared anon mapping" is really synonymous of shmem, the fact
it's not a mmap of a tmpfs file is purely an API detail.
> - private mappings
> - New vma is created at new virtual address
> - vma does not refer to same pages as old vma. It is a 'new'
> private anon mapping.
> - after mremap, no page tables exist for new vma. access to
> the range of the new vma will result in faults that allocate
> a new page.
> - page at new_address is different than page at old_address
> the new vma will result in new
Yes, for a anon private mapping (so backed by real anonymous memory)
no payload in the old vma could possibly go in the new vma.
> So, the result of mremap(old_len == 0) on a private mapping is that it
> simply creates a new private mapping. IMO, this is contrary to the purpose
> of mremap. mremap should return a mapping that is somehow related to
> the original mapping.
I agree there's no point to ever use the mremap(old_len == 0)
undocumented trick, to create a new anon private mmap, when you could
use mmap instead and the result would be the same.
So it's plausible nobody could use it for it.
> Perhaps you are thinking about mremap of a private file mapping? I was
> not considering that case. I believe you are right. In this case a
> private COW mapping based on the original mapping would be created. So,
> this seems more in line with the intent of mremap. The new mapping is
> still related to the old mapping.
Yes my earlier example was all about filebacked private mappings, to
point out those also have a deterministic behavior with the old_len ==
0 trick and it could be still used because the IPC_RMID was executed
early on.
The point is that you could always use a plain new mmap instead of the
old_len == 0 trick, but that applies to shared mappings as well.
My argument is that if you keep it and document it for shared anon
mappings, I don't see something fundamentally wrong as keeping it for
private filebacked mappings too as the shmat ID may have been deleted
for those too.
> With this in mind, what about returning EINVAL only for the anon private
> mapping case?
The only case where there's no excuse to use mremap(old_len == 0) as
replacement for a new mmap is the private anon mappings case, so while
it may still break something (as opposed to a deprecation warning), I
guess the likely hood somebody is using it, is very low.
> However, if you have a fd (for a file mapping) then I can not see why
> someone would be using the old_len == 0 trick. It would be more straight
> forward to simply use mmap to create the additional mapping.
That applies to MAP_SHARED too and that's why deprecating the whole
undocumented old_len ==0 sounded and still sound attractive to me, but
doing it right away without a deprecation warning cycle, sounds too
risky.
> > So an alternative would be to start by adding a WARN_ON_ONCE deprecation
> > warning instead of -EINVAL right away.
> >
> > The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side
> > effect of using the old_len == 0 trick looks like a bug, I guess it
> > should get fixed if we intend to keep old_len and document it for the
> > long term.
>
> Others seem to think we should keep old_len == 0 and document.
The only case where it makes sense is after IPC_RMID, but with
memfd_create there's no point anymore to use IPC_RMID.
tmpfs/hugetlbfs/realfs files can be unlinked while the fd is still
open so again no need of the mremap(old_len == 0) trick.
Which is why I'd find it attractive to deprecate it if we could, but I
assume we can't drop it even if undocumented, which is why I felt a
deprecation warning would be suitable in this case (similar to
deprecation warning of sysfs and then dropped via config option). I am
assuming here that nobody is using it because it's undocumented and it
has a bug in the VM_ACCOUNT code too. Without a deprecation warning
it'd be hard to tell if the assumption is correct.
> I assume you are concerned about the do_munmap call in move_vma? That
Yes exactly.
> does indeed look to be of concern. This happens AFTER setting up the
> new mapping. So, I'm thinking we should tear down the new mapping in
> the case do_munmap of the old mapping fails? That 'should' simply
> be a matter of:
> - moving page tables back to original mapping
> - remove/delete new vma
Yes.
> - I don't think we need to 'unmap' the new vma as there should be no
> associated pages.
The new vma doesn't require memory allocations to drop as it was just
created by copy_vma so there's no risk of further failures in the
unwind.
After the unwind it'll return -ENOMEM to userland (which we don't
right now).
> I'll look into doing this as well.
It's mostly theoretical, the chances of an allocation failure
triggering exactly in that split_vma are basically zero, but I think
it'd be more correct and safer.
> Just curious, do those userfaultfd callouts still work as desired in the
> case of map duplication (old_len == 0)?
old_len == 0 is fine with userfaultfd because, len == 0 returns
-EINVAL in do_munmap before userfaultfd_unmap_prep is called.
Still looking at the VM_ACCOUNT adjustments around do_munmap:
mremap:
/* Conceal VM_ACCOUNT so old reservation is not undone */
if (vm_flags & VM_ACCOUNT) {
do_munmap:
if (uf) {
int error = userfaultfd_unmap_prep(vma, start, end, uf);
if (error)
return error;
}
/*
* If we need to split any vma, do it now to save pain later.
*
* Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
* unmapped vm_area_struct will remain in use: so lower split_vma
* places tmp vma above, and higher split_vma places tmp vma below.
*/
I don't see this assumption where it matters that on do_munmap
failure, mremap assumes the partially unmapped vma remains in use. In
fact it's not partially unmapped at all, it's only split at the
"start" address of the do_munmap but not unmapped.
mremap caller simply sets excess = 0 and assumes it's all still mapped
at the original vma as expected regardless of the order of the
__split_vma executed in do_munmap.
The whole VM_ACCOUNT logic in this place exists since the start of the
git history so I can't see the change originating the above comment,
but I assume the comment is wrong or simply confusing.
I don't see a problem in userfaultfd_unmap_prep failing with -ENOMEM
in relation to the VM_ACCOUNT logic above, before split_vma is called
(callee doesn't seem to make assumption).
However unrelated to mremap old_len == 0, but purely internal to
do_munmap and theoretical, if either of the two __split_vma fails
there's no need to send an unmap event and in fact it'd be wrong to,
so userfaultfd_unmap_prep should be moved after both split_vma succeded.
Thanks,
Andrea