LinuxLists.cc - [PATCH] munmap() don't check sysctl_max

2009-10-02 09:07:29

Subject: [PATCH] munmap() don't check sysctl_max_mapcount

Hi everyone,

Is this good idea?

====================================================================
>From 0b9de3b65158847d376e2786840f932361d00e08 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Fri, 18 Sep 2009 13:22:06 +0900
Subject: [PATCH] munmap() don't check sysctl_max_mapcount

On ia64, the following test program exit abnormally, because
glibc thread library called abort().

========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================

The fact is, glibc call munmap() when thread exitng time for freeing stack, and
it assume munlock() never fail. However, munmap() often make vma splitting
and it with many mapcount make -ENOMEM.

Oh well, stack unfreeing is not reasonable option. Also munlock() via free()
shouldn't failed.

Thus, munmap() shoudn't check max-mapcount. This patch does it.

test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>

#define THREAD_NUM 30000
#define MAL_SIZE (100*1024)

void *wait_thread(void *args)
{
void *addr;

addr = malloc(MAL_SIZE);
if(addr)
memset(addr, 1, MAL_SIZE);
sleep(1);

return NULL;
}

void *wait_thread2(void *args)
{
sleep(60);

return NULL;
}

int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;

ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}

ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}

for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;

ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}

sleep(3600);
return 0;
}
==================================================================

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mmap.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

Index: b/mm/mmap.c
===================================================================
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1832,7 +1832,7 @@ detach_vmas_to_be_unmapped(struct mm_str
* Split a vma into two pieces at address 'addr', a new vma is allocated
* either for the first part or the tail.
*/
-int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
unsigned long addr, int new_below)
{
struct mempolicy *pol;
@@ -1842,9 +1842,6 @@ int split_vma(struct mm_struct * mm, str
~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;

- if (mm->map_count >= sysctl_max_map_count)
- return -ENOMEM;
-
new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!new)
return -ENOMEM;
@@ -1884,6 +1881,15 @@ int split_vma(struct mm_struct * mm, str
return 0;
}

+int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long addr, int new_below)
+{
+ if (mm->map_count >= sysctl_max_map_count)
+ return -ENOMEM;
+
+ return __split_vma(mm, vma, addr, new_below);
+}
+
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -1919,7 +1925,7 @@ int do_munmap(struct mm_struct *mm, unsi
* places tmp vma above, and higher split_vma places tmp vma below.
*/
if (start > vma->vm_start) {
- int error = split_vma(mm, vma, start, 0);
+ int error = __split_vma(mm, vma, start, 0);
if (error)
return error;
prev = vma;
@@ -1928,7 +1934,7 @@ int do_munmap(struct mm_struct *mm, unsi
/* Does it split the last one? */
last = find_vma(mm, end);
if (last && end > last->vm_start) {
- int error = split_vma(mm, last, end, 1);
+ int error = __split_vma(mm, last, end, 1);
if (error)
return error;
}

2009-10-09 09:29:14

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount

On Fri, 2 Oct 2009, KOSAKI Motohiro wrote:

> Hi everyone,
>
> Is this good idea?

Sorry, I overlooked this earlier.

Correct me if I'm wrong, but from the look of your patch,
I believe anyone could increase their mapcount arbitrarily far beyond
sysctl_max_map_count, just by taking little bites out of a large mmap.

In which case there's not much point in having sysctl_max_map_count
at all. Perhaps there isn't much point to it anyway, and the answer
is just to raise it to where it catches runaways but interferes with
nobody else?

If you change your patch so that do_munmap() cannot increase the final
number vmas beyond sysctl_max_map_count, that would seem reasonable.
But would that satisfy your testcase? And does the testcase really
matter in practice? It doesn't seem to have upset anyone before.

Hugh

>
>
> ====================================================================
> From 0b9de3b65158847d376e2786840f932361d00e08 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Fri, 18 Sep 2009 13:22:06 +0900
> Subject: [PATCH] munmap() don't check sysctl_max_mapcount
>
> On ia64, the following test program exit abnormally, because
> glibc thread library called abort().
>
> ========================================================
> (gdb) bt
> #0 0xa000000000010620 in __kernel_syscall_via_break ()
> #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
> #2 0x2000000000324090 in abort () from /lib/libc.so.6.1
> #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
> #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
> #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
> ========================================================
>
> The fact is, glibc call munmap() when thread exitng time for freeing stack, and
> it assume munlock() never fail. However, munmap() often make vma splitting
> and it with many mapcount make -ENOMEM.
>
> Oh well, stack unfreeing is not reasonable option. Also munlock() via free()
> shouldn't failed.
>
> Thus, munmap() shoudn't check max-mapcount. This patch does it.
>
> test_max_mapcount.c
> ==================================================================
> #include<stdio.h>
> #include<stdlib.h>
> #include<string.h>
> #include<pthread.h>
> #include<errno.h>
> #include<unistd.h>
>
> #define THREAD_NUM 30000
> #define MAL_SIZE (100*1024)
>
> void *wait_thread(void *args)
> {
> void *addr;
>
> addr = malloc(MAL_SIZE);
> if(addr)
> memset(addr, 1, MAL_SIZE);
> sleep(1);
>
> return NULL;
> }
>
> void *wait_thread2(void *args)
> {
> sleep(60);
>
> return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> int i;
> pthread_t thread[THREAD_NUM], th;
> int ret, count = 0;
> pthread_attr_t attr;
>
> ret = pthread_attr_init(&attr);
> if(ret) {
> perror("pthread_attr_init");
> }
>
> ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> if(ret) {
> perror("pthread_attr_setdetachstate");
> }
>
> for (i = 0; i < THREAD_NUM; i++) {
> ret = pthread_create(&th, &attr, wait_thread, NULL);
> if(ret) {
> fprintf(stderr, "[%d] ", count);
> perror("pthread_create");
> } else {
> printf("[%d] create OK.\n", count);
> }
> count++;
>
> ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
> if(ret) {
> fprintf(stderr, "[%d] ", count);
> perror("pthread_create");
> } else {
> printf("[%d] create OK.\n", count);
> }
> count++;
> }
>
> sleep(3600);
> return 0;
> }
> ==================================================================
>
>
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/mmap.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> Index: b/mm/mmap.c
> ===================================================================
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1832,7 +1832,7 @@ detach_vmas_to_be_unmapped(struct mm_str
> * Split a vma into two pieces at address 'addr', a new vma is allocated
> * either for the first part or the tail.
> */
> -int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> +static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> unsigned long addr, int new_below)
> {
> struct mempolicy *pol;
> @@ -1842,9 +1842,6 @@ int split_vma(struct mm_struct * mm, str
> ~(huge_page_mask(hstate_vma(vma)))))
> return -EINVAL;
>
> - if (mm->map_count >= sysctl_max_map_count)
> - return -ENOMEM;
> -
> new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> if (!new)
> return -ENOMEM;
> @@ -1884,6 +1881,15 @@ int split_vma(struct mm_struct * mm, str
> return 0;
> }
>
> +int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> + unsigned long addr, int new_below)
> +{
> + if (mm->map_count >= sysctl_max_map_count)
> + return -ENOMEM;
> +
> + return __split_vma(mm, vma, addr, new_below);
> +}
> +
> /* Munmap is split into 2 main parts -- this part which finds
> * what needs doing, and the areas themselves, which do the
> * work. This now handles partial unmappings.
> @@ -1919,7 +1925,7 @@ int do_munmap(struct mm_struct *mm, unsi
> * places tmp vma above, and higher split_vma places tmp vma below.
> */
> if (start > vma->vm_start) {
> - int error = split_vma(mm, vma, start, 0);
> + int error = __split_vma(mm, vma, start, 0);
> if (error)
> return error;
> prev = vma;
> @@ -1928,7 +1934,7 @@ int do_munmap(struct mm_struct *mm, unsi
> /* Does it split the last one? */
> last = find_vma(mm, end);
> if (last && end > last->vm_start) {
> - int error = split_vma(mm, last, end, 1);
> + int error = __split_vma(mm, last, end, 1);
> if (error)
> return error;
> }

2009-10-12 10:14:50

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount

Hi Hugh

> > Hi everyone,
> >
> > Is this good idea?
>
> Sorry, I overlooked this earlier.
>
> Correct me if I'm wrong, but from the look of your patch,
> I believe anyone could increase their mapcount arbitrarily far beyond
> sysctl_max_map_count, just by taking little bites out of a large mmap.
>
> In which case there's not much point in having sysctl_max_map_count
> at all. Perhaps there isn't much point to it anyway, and the answer
> is just to raise it to where it catches runaways but interferes with
> nobody else?
>
> If you change your patch so that do_munmap() cannot increase the final
> number vmas beyond sysctl_max_map_count, that would seem reasonable.
> But would that satisfy your testcase? And does the testcase really
> matter in practice? It doesn't seem to have upset anyone before.

Very thank you for payed attention to my patch. Yes, this is real issue.
my customer suffer from it.

May I explain why you haven't seen this issue? this issue is architecture
independent problem. however register stack architecture (e.g. ia64, sparc)
dramatically increase the possibility of the heppen this issue.

Why? the stack of the typical architecture have one PROT_READ|PROT_WRITE area and one
PROT_NONE area (it is called as guard page).
if the process multiple thread, stack layout is

fig1)
+--------+--------+-------+-------+--------------
| thr-0 | thr-0 | thr-1 | thr-1 |
| guard | stack | guard | stack | ......
+--------+--------+-------+-------+-------------

Thus, stack freeing didn't make vma splitting. in the other hand,
the stack of the register stack architecture have two PROT_READ|PROT_WRITE
area and one PROT_NONE area.
then, stack layout is

fig2)
+-----------+--------+------------------+-------+------------------+-----------
| thr-0 | thr-0 | thr-0 stack | thr-1 | thr-1 stack | thr-2
| reg-stack | guard | + thr-1 regstack | guard | + thr-2 regstack | guard
+-----------+--------+------------------+-------+------------------+-----------

Then, stack unmapping may cause vma splitting.

However, non-register stack architecture don't free from this issue.
if the program use malloc(), it can makes fig2 layout. please run attached
test program (test_max_mapcount_for_86.c), it should addressed the problem.

And, I doubt I haven't catch your mention. May I ask some question?
Honestly I don't think max_map_count is important knob. it is strange
mutant of limit of virtual address space in the process.
At very long time ago (probably the stone age), linux doesn't have
vma rb_tree handling, then many vma directly cause find_vma slow down.
However current linux have good scalability. it can handle many vma issue.
So, Why do you think max_mapcount sould be strictly keeped?

Honestly, I doubt nobody suffer from removing sysctl_max_mapcount.

And yes, stack unmapping have exceptional charactatics. the guard zone
gurantee it never raise map_count.
So, I think the attached patch (0001-Don-t...) is the same as you talked about, right?
I can accept it. I haven't test it on ia64. however, at least it works
well on x86.

BUT, I still think kernel souldn't refuse any resource deallocation.
otherwise, people discourage proper resource deallocation and encourage
brutal intentional memory leak programming style. What do you think?

Attachments:

test_max_mapcount_for_86.c (1.30 kB)
0001-Don-t-make-ENOMEM-temporary-mapcount-exceeding-in-mu.patch (5.29 kB)
Download all attachments

2009-10-12 15:05:11

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount

On Mon, 12 Oct 2009, KOSAKI Motohiro wrote:
> >
> > If you change your patch so that do_munmap() cannot increase the final
> > number vmas beyond sysctl_max_map_count, that would seem reasonable.
> > But would that satisfy your testcase? And does the testcase really
> > matter in practice? It doesn't seem to have upset anyone before.
>
> Very thank you for payed attention to my patch. Yes, this is real issue.
> my customer suffer from it.

That's a good reason for a fix; though nothing you say explains why
they're operating right at the sysctl_max_map_count limit (which is
likely to give them further surprises), and cannot raise that limit.

>
> May I explain why you haven't seen this issue? this issue is architecture
> independent problem. however register stack architecture (e.g. ia64, sparc)
> dramatically increase the possibility of the heppen this issue.

Thanks for going to all this trouble; but I never doubted it could
happen, nor that some would be more likely to suffer than others.

> And, I doubt I haven't catch your mention. May I ask some question?
> Honestly I don't think max_map_count is important knob. it is strange
> mutant of limit of virtual address space in the process.
> At very long time ago (probably the stone age), linux doesn't have
> vma rb_tree handling, then many vma directly cause find_vma slow down.
> However current linux have good scalability. it can handle many vma issue.

I think there are probably several different reasons for the limit,
some perhaps buried in prehistory, yes, and others forgotten.

One reason is well-known to your colleague, KAMEZAWA-san:
the ELF core dump format only supports a ushort number of sections.

One reason will be to limit the amount of kernel memory which can
be pinned by a user program - why limit their ability to to lock down
user pages, if we let them run wild with kernel data structures?
The more important on 32-bit machines with more than 1GB of memory, as
the lowmem restriction comes to bite. But I probably should not have
mentioned that, I fear you'll now go on a hunt for other places where
we impose no such limit, and embarrass me greatly with the result ;)

And one reason will be the long vma->vm_next searches: less of an
issue nowadays, yes, and preemptible if you have CONFIG_PREEMPT=y;
but still might be something of a problem.

> So, Why do you think max_mapcount sould be strictly keeped?

I don't believe it's the most serious limit we have, and I'm no
expert on its origins; but I do believe that if we profess to have
some limit, then we have to enforce it. If we're going to allow
anybody to get around the limit, better just throw the limit away.

>
> Honestly, I doubt nobody suffer from removing sysctl_max_mapcount.

I expect Kame to disagree with you on that.

>
> And yes, stack unmapping have exceptional charactatics. the guard zone
> gurantee it never raise map_count.
> So, I think the attached patch (0001-Don-t...) is the same as you talked about, right?

Yes, I've not tested but that looks right to me (I did have to think a
bit to realize that the case where the munmap spans more than one vma
is fine with the check you've added). In the version below I've just
changed your code comment.

> I can accept it. I haven't test it on ia64. however, at least it works
> well on x86.
>
> BUT, I still think kernel souldn't refuse any resource deallocation.
> otherwise, people discourage proper resource deallocation and encourage
> brutal intentional memory leak programming style. What do you think?

I think you're a little too trusting. It's common enough that in order
to free one resource, we need just a little of another resource; and
it is frustrating when that other resource is tightly limited. But if
somebody owes you 10000 yen, and asks to borrow just another 1000 yen
to make some arrangement to pay you back, then the next day asks to
borrow just another 1000 yen to enhance that arrangement, then....

That's what I'm asking to guard against here. But if you're so
strongly against having that limit, please just get your customers
to raise it to INT_MAX: that should be enough to keep away from
its practical limitations, shouldn't it?

[PATCH] Don't make ENOMEM temporary mapcount exceeding in munmap()

From: KOSAKI Motohiro <[email protected]>

On ia64, the following test program exit abnormally, because
glibc thread library called abort().

========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================

The fact is, glibc call munmap() when thread exitng time for freeing stack, and
it assume munlock() never fail. However, munmap() often make vma splitting
and it with many mapcount make -ENOMEM.

Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.

This patch does it.

test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>

#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)

void *wait_thread(void *args)
{
void *addr;

addr = malloc(MAL_SIZE);
sleep(10);

return NULL;
}

void *wait_thread2(void *args)
{
sleep(60);

return NULL;
}

int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;

ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}

ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}

for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;

ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}

sleep(3600);
return 0;
}
==================================================================

Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mmap.c | 38 ++++++++++++++++++++++++++++++--------
1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 865830d..41ab576 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1830,10 +1830,10 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
}

/*
- * Split a vma into two pieces at address 'addr', a new vma is allocated
- * either for the first part or the tail.
+ * __split_vma() bypasses sysctl_max_map_count checking. We use this on the
+ * munmap path where it doesn't make sense to fail.
*/
-int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
unsigned long addr, int new_below)
{
struct mempolicy *pol;
@@ -1843,9 +1843,6 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;

- if (mm->map_count >= sysctl_max_map_count)
- return -ENOMEM;
-
new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!new)
return -ENOMEM;
@@ -1885,6 +1882,19 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
return 0;
}

+/*
+ * Split a vma into two pieces at address 'addr', a new vma is allocated
+ * either for the first part or the tail.
+ */
+int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long addr, int new_below)
+{
+ if (mm->map_count >= sysctl_max_map_count)
+ return -ENOMEM;
+
+ return __split_vma(mm, vma, addr, new_below);
+}
+
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -1920,7 +1930,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
* places tmp vma above, and higher split_vma places tmp vma below.
*/
if (start > vma->vm_start) {
- int error = split_vma(mm, vma, start, 0);
+ int error;
+
+ /*
+ * Make sure that map_count on return from munmap() will
+ * not exceed its limit; but let map_count go just above
+ * its limit temporarily, to help free resources as expected.
+ */
+ if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+ return -ENOMEM;
+
+ error = __split_vma(mm, vma, start, 0);
if (error)
return error;
prev = vma;
@@ -1929,7 +1949,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
/* Does it split the last one? */
last = find_vma(mm, end);
if (last && end > last->vm_start) {
- int error = split_vma(mm, last, end, 1);
+ int error = __split_vma(mm, last, end, 1);
if (error)
return error;
}
--
1.6.2.5

2009-10-13 00:56:47

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount

On Mon, 12 Oct 2009 16:04:08 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> On Mon, 12 Oct 2009, KOSAKI Motohiro wrote:
> > And, I doubt I haven't catch your mention. May I ask some question?
> > Honestly I don't think max_map_count is important knob. it is strange
> > mutant of limit of virtual address space in the process.
> > At very long time ago (probably the stone age), linux doesn't have
> > vma rb_tree handling, then many vma directly cause find_vma slow down.
> > However current linux have good scalability. it can handle many vma issue.
>
> I think there are probably several different reasons for the limit,
> some perhaps buried in prehistory, yes, and others forgotten.
>
> One reason is well-known to your colleague, KAMEZAWA-san:
> the ELF core dump format only supports a ushort number of sections.
>
yes.

> One reason will be to limit the amount of kernel memory which can
> be pinned by a user program - why limit their ability to to lock down
> user pages, if we let them run wild with kernel data structures?
> The more important on 32-bit machines with more than 1GB of memory, as
> the lowmem restriction comes to bite. But I probably should not have
> mentioned that, I fear you'll now go on a hunt for other places where
> we impose no such limit, and embarrass me greatly with the result ;)
>
> And one reason will be the long vma->vm_next searches: less of an
> issue nowadays, yes, and preemptible if you have CONFIG_PREEMPT=y;
> but still might be something of a problem.
>
> > So, Why do you think max_mapcount sould be strictly keeped?
>
> I don't believe it's the most serious limit we have, and I'm no
> expert on its origins; but I do believe that if we profess to have
> some limit, then we have to enforce it. If we're going to allow
> anybody to get around the limit, better just throw the limit away.
>
> >
> > Honestly, I doubt nobody suffer from removing sysctl_max_mapcount.
>
> I expect Kame to disagree with you on that.
>
> >
> > And yes, stack unmapping have exceptional charactatics. the guard zone
> > gurantee it never raise map_count.
> > So, I think the attached patch (0001-Don-t...) is the same as you talked about, right?
>
> Yes, I've not tested but that looks right to me (I did have to think a
> bit to realize that the case where the munmap spans more than one vma
> is fine with the check you've added). In the version below I've just
> changed your code comment.
>
> > I can accept it. I haven't test it on ia64. however, at least it works
> > well on x86.
> >
> > BUT, I still think kernel souldn't refuse any resource deallocation.
> > otherwise, people discourage proper resource deallocation and encourage
> > brutal intentional memory leak programming style. What do you think?
>
> I think you're a little too trusting. It's common enough that in order
> to free one resource, we need just a little of another resource; and
> it is frustrating when that other resource is tightly limited. But if
> somebody owes you 10000 yen, and asks to borrow just another 1000 yen
> to make some arrangement to pay you back, then the next day asks to
> borrow just another 1000 yen to enhance that arrangement, then....
>
> That's what I'm asking to guard against here. But if you're so
> strongly against having that limit, please just get your customers
> to raise it to INT_MAX: that should be enough to keep away from
> its practical limitations, shouldn't it?
>
>
I discussed with Kosaki. Ah, hmm, reporing our status.

- Even if we think the program which exceeds max_map_count and go abort()
as buggy program, we don't think abort() (in library) is very good.
So, we want to avoid this.

- We hear one of our collegue (debugger team) is now preparing ELF-extention
patches for kernel and gdb. We hear solaris has ELF-extention for handling more
than 65535 program headers and recent AMD64 ABI draft includes it.
We now think this extention should go first. We discuss him with our schedule.

- Considering "too much consume memory" attack, we need some limits.
Then, we wonder adding
- system-wide max_map_count (enough large)
or
- determine per process max_map_count based on host's memory size.

BTW, looking sysctl, there is threads-max.

[kamezawa@bluextal ~]$ cat /proc/sys/kernel/threads-max
409600

This number is system-wide and automatically determined at boot.
But, in fact, there is max_map_count and per process threads-max is determined
by it. We think this not very neat.

We'll consider more. Probably, we'll start from ELF extention.

Thanks,
-Kame

2009-10-13 02:50:52

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount

Hi

> > > If you change your patch so that do_munmap() cannot increase the final
> > > number vmas beyond sysctl_max_map_count, that would seem reasonable.
> > > But would that satisfy your testcase? And does the testcase really
> > > matter in practice? It doesn't seem to have upset anyone before.
> >
> > Very thank you for payed attention to my patch. Yes, this is real issue.
> > my customer suffer from it.
>
> That's a good reason for a fix; though nothing you say explains why
> they're operating right at the sysctl_max_map_count limit (which is
> likely to give them further surprises), and cannot raise that limit.

Probably, my last mail was not clear a bit. I talked about two thing
at the same time. sorry for ambiguous message.

(1) custmer faced bug
(2) my future view

In point (1), the limit itself is not problem at all. the customer
can change it. but nobody accept resource deallocation makes SIGABORT
internally. I think both kernel and glibc have fault.

removing sysctl_max_map_count is only my future view.

> > May I explain why you haven't seen this issue? this issue is architecture
> > independent problem. however register stack architecture (e.g. ia64, sparc)
> > dramatically increase the possibility of the heppen this issue.
>
> Thanks for going to all this trouble; but I never doubted it could
> happen, nor that some would be more likely to suffer than others.
>
> > And, I doubt I haven't catch your mention. May I ask some question?
> > Honestly I don't think max_map_count is important knob. it is strange
> > mutant of limit of virtual address space in the process.
> > At very long time ago (probably the stone age), linux doesn't have
> > vma rb_tree handling, then many vma directly cause find_vma slow down.
> > However current linux have good scalability. it can handle many vma issue.
>
> I think there are probably several different reasons for the limit,
> some perhaps buried in prehistory, yes, and others forgotten.
>
> One reason is well-known to your colleague, KAMEZAWA-san:
> the ELF core dump format only supports a ushort number of sections.

Ah! yes.
I had forgot it. thanks pointing this.

Yes, I agree we can't remove max_mapcount yet.

Side node: My co-worker working on implement enhanced ELF header to gdb,
it's derived from solaris. I expect we can remove the above ushort limitation
in this year.

> One reason will be to limit the amount of kernel memory which can
> be pinned by a user program - why limit their ability to to lock down
> user pages, if we let them run wild with kernel data structures?
> The more important on 32-bit machines with more than 1GB of memory, as
> the lowmem restriction comes to bite. But I probably should not have
> mentioned that, I fear you'll now go on a hunt for other places where
> we impose no such limit, and embarrass me greatly with the result ;)

hmhm, 32bit, I see.

Side note: 64K max_mapcount restrict number of thread to 32K. there seems
too small in modern 64bit. after solving ELF ushort issue, the default value
on 64bit might be considerable. I think.

> And one reason will be the long vma->vm_next searches: less of an
> issue nowadays, yes, and preemptible if you have CONFIG_PREEMPT=y;
> but still might be something of a problem.
>
> > So, Why do you think max_mapcount sould be strictly keeped?
>
> I don't believe it's the most serious limit we have, and I'm no
> expert on its origins; but I do believe that if we profess to have
> some limit, then we have to enforce it. If we're going to allow
> anybody to get around the limit, better just throw the limit away.

OK, I agree.

> > Honestly, I doubt nobody suffer from removing sysctl_max_mapcount.
>
> I expect Kame to disagree with you on that.

I have to say your expection is bingo! ;)

> > And yes, stack unmapping have exceptional charactatics. the guard zone
> > gurantee it never raise map_count.
> > So, I think the attached patch (0001-Don-t...) is the same as you talked about, right?
>
> Yes, I've not tested but that looks right to me (I did have to think a
> bit to realize that the case where the munmap spans more than one vma
> is fine with the check you've added). In the version below I've just
> changed your code comment.

Thank you!

> > I can accept it. I haven't test it on ia64. however, at least it works
> > well on x86.
> >
> > BUT, I still think kernel souldn't refuse any resource deallocation.
> > otherwise, people discourage proper resource deallocation and encourage
> > brutal intentional memory leak programming style. What do you think?
>
> I think you're a little too trusting. It's common enough that in order
> to free one resource, we need just a little of another resource; and
> it is frustrating when that other resource is tightly limited. But if
> somebody owes you 10000 yen, and asks to borrow just another 1000 yen
> to make some arrangement to pay you back, then the next day asks to
> borrow just another 1000 yen to enhance that arrangement, then....
>
> That's what I'm asking to guard against here. But if you're so
> strongly against having that limit, please just get your customers
> to raise it to INT_MAX: that should be enough to keep away from
> its practical limitations, shouldn't it?

The customer don't need to remove this limit. That's merely my personal
opinion.
Currently, various library and application don't check munmap() return value.
I have pessimistic expection to change them. In practice, we have two way to
get rid of suffer of munlock return value.
(1) removing the possibility of error return (i.g. remove sysctl_max_mapcount)
(2) limitation bump up until usual application never touch its limit.

I thought we can (1). but I've changed my opinion by this mail. probably
(2) is better. but it is long term issue. we can't do it until solving ELF
issue....

Thanks again. you clarified various viewpoint.