Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932556AbZJLPFL (ORCPT ); Mon, 12 Oct 2009 11:05:11 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756636AbZJLPFH (ORCPT ); Mon, 12 Oct 2009 11:05:07 -0400 Received: from mk-filter-3-a-1.mail.uk.tiscali.com ([212.74.100.54]:8446 "EHLO mk-filter-3-a-1.mail.uk.tiscali.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932490AbZJLPFE (ORCPT ); Mon, 12 Oct 2009 11:05:04 -0400 X-Trace: 269980944/mk-filter-3.mail.uk.tiscali.com/B2C/$b2c-THROTTLED-DYNAMIC/b2c-CUSTOMER-DYNAMIC-IP/80.41.74.254/None/hugh.dickins@tiscali.co.uk X-SBRS: None X-RemoteIP: 80.41.74.254 X-IP-MAIL-FROM: hugh.dickins@tiscali.co.uk X-SMTP-AUTH: X-MUA: X-IP-BHB: Once X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AigFAEvi0kpQKUr+/2dsb2JhbACBUdVLhC0E X-IronPort-AV: E=Sophos;i="4.44,546,1249254000"; d="scan'208";a="269980944" Date: Mon, 12 Oct 2009 16:04:08 +0100 (BST) From: Hugh Dickins X-X-Sender: hugh@sister.anvils To: KOSAKI Motohiro cc: LKML , linux-mm , Andrew Morton Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount In-Reply-To: <20091012184654.E4D0.A69D9226@jp.fujitsu.com> Message-ID: References: <20091002180533.5F77.A69D9226@jp.fujitsu.com> <20091012184654.E4D0.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9727 Lines: 276 On Mon, 12 Oct 2009, KOSAKI Motohiro wrote: > > > > If you change your patch so that do_munmap() cannot increase the final > > number vmas beyond sysctl_max_map_count, that would seem reasonable. > > But would that satisfy your testcase? And does the testcase really > > matter in practice? It doesn't seem to have upset anyone before. > > Very thank you for payed attention to my patch. Yes, this is real issue. > my customer suffer from it. That's a good reason for a fix; though nothing you say explains why they're operating right at the sysctl_max_map_count limit (which is likely to give them further surprises), and cannot raise that limit. > > May I explain why you haven't seen this issue? this issue is architecture > independent problem. however register stack architecture (e.g. ia64, sparc) > dramatically increase the possibility of the heppen this issue. Thanks for going to all this trouble; but I never doubted it could happen, nor that some would be more likely to suffer than others. > And, I doubt I haven't catch your mention. May I ask some question? > Honestly I don't think max_map_count is important knob. it is strange > mutant of limit of virtual address space in the process. > At very long time ago (probably the stone age), linux doesn't have > vma rb_tree handling, then many vma directly cause find_vma slow down. > However current linux have good scalability. it can handle many vma issue. I think there are probably several different reasons for the limit, some perhaps buried in prehistory, yes, and others forgotten. One reason is well-known to your colleague, KAMEZAWA-san: the ELF core dump format only supports a ushort number of sections. One reason will be to limit the amount of kernel memory which can be pinned by a user program - why limit their ability to to lock down user pages, if we let them run wild with kernel data structures? The more important on 32-bit machines with more than 1GB of memory, as the lowmem restriction comes to bite. But I probably should not have mentioned that, I fear you'll now go on a hunt for other places where we impose no such limit, and embarrass me greatly with the result ;) And one reason will be the long vma->vm_next searches: less of an issue nowadays, yes, and preemptible if you have CONFIG_PREEMPT=y; but still might be something of a problem. > So, Why do you think max_mapcount sould be strictly keeped? I don't believe it's the most serious limit we have, and I'm no expert on its origins; but I do believe that if we profess to have some limit, then we have to enforce it. If we're going to allow anybody to get around the limit, better just throw the limit away. > > Honestly, I doubt nobody suffer from removing sysctl_max_mapcount. I expect Kame to disagree with you on that. > > And yes, stack unmapping have exceptional charactatics. the guard zone > gurantee it never raise map_count. > So, I think the attached patch (0001-Don-t...) is the same as you talked about, right? Yes, I've not tested but that looks right to me (I did have to think a bit to realize that the case where the munmap spans more than one vma is fine with the check you've added). In the version below I've just changed your code comment. > I can accept it. I haven't test it on ia64. however, at least it works > well on x86. > > BUT, I still think kernel souldn't refuse any resource deallocation. > otherwise, people discourage proper resource deallocation and encourage > brutal intentional memory leak programming style. What do you think? I think you're a little too trusting. It's common enough that in order to free one resource, we need just a little of another resource; and it is frustrating when that other resource is tightly limited. But if somebody owes you 10000 yen, and asks to borrow just another 1000 yen to make some arrangement to pay you back, then the next day asks to borrow just another 1000 yen to enhance that arrangement, then.... That's what I'm asking to guard against here. But if you're so strongly against having that limit, please just get your customers to raise it to INT_MAX: that should be enough to keep away from its practical limitations, shouldn't it? [PATCH] Don't make ENOMEM temporary mapcount exceeding in munmap() From: KOSAKI Motohiro On ia64, the following test program exit abnormally, because glibc thread library called abort(). ======================================================== (gdb) bt #0 0xa000000000010620 in __kernel_syscall_via_break () #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1 #2 0x2000000000324090 in abort () from /lib/libc.so.6.1 #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0 #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0 #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1 ======================================================== The fact is, glibc call munmap() when thread exitng time for freeing stack, and it assume munlock() never fail. However, munmap() often make vma splitting and it with many mapcount make -ENOMEM. Oh well, that's crazy, because stack unmapping never increase mapcount. The maxcount exceeding is only temporary. internal temporary exceeding shouldn't make ENOMEM. This patch does it. test_max_mapcount.c ================================================================== #include #include #include #include #include #include #define THREAD_NUM 30000 #define MAL_SIZE (8*1024*1024) void *wait_thread(void *args) { void *addr; addr = malloc(MAL_SIZE); sleep(10); return NULL; } void *wait_thread2(void *args) { sleep(60); return NULL; } int main(int argc, char *argv[]) { int i; pthread_t thread[THREAD_NUM], th; int ret, count = 0; pthread_attr_t attr; ret = pthread_attr_init(&attr); if(ret) { perror("pthread_attr_init"); } ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); if(ret) { perror("pthread_attr_setdetachstate"); } for (i = 0; i < THREAD_NUM; i++) { ret = pthread_create(&th, &attr, wait_thread, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; ret = pthread_create(&thread[i], &attr, wait_thread2, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; } sleep(3600); return 0; } ================================================================== Signed-off-by: KOSAKI Motohiro Signed-off-by: Hugh Dickins --- mm/mmap.c | 38 ++++++++++++++++++++++++++++++-------- 1 files changed, 30 insertions(+), 8 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 865830d..41ab576 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1830,10 +1830,10 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Split a vma into two pieces at address 'addr', a new vma is allocated - * either for the first part or the tail. + * __split_vma() bypasses sysctl_max_map_count checking. We use this on the + * munmap path where it doesn't make sense to fail. */ -int split_vma(struct mm_struct * mm, struct vm_area_struct * vma, +static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long addr, int new_below) { struct mempolicy *pol; @@ -1843,9 +1843,6 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma, ~(huge_page_mask(hstate_vma(vma))))) return -EINVAL; - if (mm->map_count >= sysctl_max_map_count) - return -ENOMEM; - new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); if (!new) return -ENOMEM; @@ -1885,6 +1882,19 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma, return 0; } +/* + * Split a vma into two pieces at address 'addr', a new vma is allocated + * either for the first part or the tail. + */ +int split_vma(struct mm_struct * mm, struct vm_area_struct * vma, + unsigned long addr, int new_below) +{ + if (mm->map_count >= sysctl_max_map_count) + return -ENOMEM; + + return __split_vma(mm, vma, addr, new_below); +} + /* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the * work. This now handles partial unmappings. @@ -1920,7 +1930,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) * places tmp vma above, and higher split_vma places tmp vma below. */ if (start > vma->vm_start) { - int error = split_vma(mm, vma, start, 0); + int error; + + /* + * Make sure that map_count on return from munmap() will + * not exceed its limit; but let map_count go just above + * its limit temporarily, to help free resources as expected. + */ + if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) + return -ENOMEM; + + error = __split_vma(mm, vma, start, 0); if (error) return error; prev = vma; @@ -1929,7 +1949,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) /* Does it split the last one? */ last = find_vma(mm, end); if (last && end > last->vm_start) { - int error = split_vma(mm, last, end, 1); + int error = __split_vma(mm, last, end, 1); if (error) return error; } -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/