2004-09-25 16:23:39

by Andrea Arcangeli

[permalink] [raw]
Subject: heap-stack-gap for 2.6

This patch enforces a gap between heap and stack, both on the mmap side
(for heap) and on the growsdown page faults for stack. the gap is in
page units and it's sysctl configurable. Against CVS head.

This is needed for some critical app, that wants an higher degree of
protection against potential stack overflows from the kernel. This is
mostly a 32bit matter of course, since on 32bit those apps are using
a few gigs of heap and they get as near as they can to the stack (but if
something goes wrong a page fault must happen).


the default value of 1 avoids userspace apps like java to break, but
those apps will of course set by hand in the rc.d scripts a much higher
value. 1 is a sane default, if you want to tweak the default with
mainline inclusion that's fine with me. the sysctl can always be
disabled by setting it to 0 and then nobody will notice.

feature is fully enabled on x86* and ppc*. No idea about the ia64 and
s390x layouts but they've presumably a lot more address space not to
care about this (this is primarly needed on 32bit apps).

I didn't check the topdown model, in theory it should be extended to
cover that too, this is only working for the legacy model right now
because those apps aren't going to use topdown anyways.

Index: linux-2.5/arch/alpha/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/alpha/mm/fault.c,v
retrieving revision 1.15
diff -u -p -r1.15 fault.c
--- linux-2.5/arch/alpha/mm/fault.c 23 Sep 2004 05:59:47 -0000 1.15
+++ linux-2.5/arch/alpha/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -125,7 +125,7 @@ do_page_fault(unsigned long address, uns
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto bad_area;

/* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/arm/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/arm/mm/fault.c,v
retrieving revision 1.5
diff -u -p -r1.5 fault.c
--- linux-2.5/arch/arm/mm/fault.c 6 Aug 2004 23:06:02 -0000 1.5
+++ linux-2.5/arch/arm/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -208,7 +208,7 @@ survive:
goto survive;

check_stack:
- if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
+ if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr, NULL))
goto good_area;
out:
return fault;
Index: linux-2.5/arch/arm26/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/arm26/mm/fault.c,v
retrieving revision 1.3
diff -u -p -r1.3 fault.c
--- linux-2.5/arch/arm26/mm/fault.c 7 Sep 2003 23:53:07 -0000 1.3
+++ linux-2.5/arch/arm26/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -197,7 +197,7 @@ survive:
goto survive;

check_stack:
- if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
+ if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr, NULL))
goto good_area;
out:
return fault;
Index: linux-2.5/arch/cris/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/cris/mm/fault.c,v
retrieving revision 1.14
diff -u -p -r1.14 fault.c
--- linux-2.5/arch/cris/mm/fault.c 1 Jun 2004 15:52:29 -0000 1.14
+++ linux-2.5/arch/cris/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -207,7 +207,7 @@ do_page_fault(unsigned long address, str
if (address + PAGE_SIZE < rdusp())
goto bad_area;
}
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto bad_area;

/*
Index: linux-2.5/arch/i386/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/i386/mm/fault.c,v
retrieving revision 1.40
diff -u -p -r1.40 fault.c
--- linux-2.5/arch/i386/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.40
+++ linux-2.5/arch/i386/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -217,7 +217,7 @@ asmlinkage void do_page_fault(struct pt_
{
struct task_struct *tsk;
struct mm_struct *mm;
- struct vm_area_struct * vma;
+ struct vm_area_struct * vma, * prev_vma;
unsigned long address;
unsigned long page;
int write;
@@ -308,7 +308,13 @@ asmlinkage void do_page_fault(struct pt_
if (address + 32 < regs->esp)
goto bad_area;
}
- if (expand_stack(vma, address))
+ /*
+ * find_vma_prev is just a bit slower, because it cannot
+ * use the mmap_cache, so we run it only in the growsdown
+ * slow path and we leave find_vma in the fast path.
+ */
+ find_vma_prev(current->mm, address, &prev_vma);
+ if (expand_stack(vma, address, prev_vma))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/ia64/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ia64/mm/fault.c,v
retrieving revision 1.20
diff -u -p -r1.20 fault.c
--- linux-2.5/arch/ia64/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.20
+++ linux-2.5/arch/ia64/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -164,7 +164,7 @@ ia64_do_page_fault (unsigned long addres
if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start)
|| REGION_OFFSET(address) >= RGN_MAP_LIMIT)
goto bad_area;
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL /* FIXME? */))
goto bad_area;
} else {
vma = prev_vma;
Index: linux-2.5/arch/m68k/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/m68k/mm/fault.c,v
retrieving revision 1.6
diff -u -p -r1.6 fault.c
--- linux-2.5/arch/m68k/mm/fault.c 11 May 2004 14:54:30 -0000 1.6
+++ linux-2.5/arch/m68k/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -121,7 +121,7 @@ int do_page_fault(struct pt_regs *regs,
if (address + 256 < rdusp())
goto map_err;
}
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto map_err;

/*
Index: linux-2.5/arch/mips/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/mips/mm/fault.c,v
retrieving revision 1.12
diff -u -p -r1.12 fault.c
--- linux-2.5/arch/mips/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.12
+++ linux-2.5/arch/mips/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -75,7 +75,7 @@ asmlinkage void do_page_fault(struct pt_
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/parisc/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/parisc/mm/fault.c,v
retrieving revision 1.5
diff -u -p -r1.5 fault.c
--- linux-2.5/arch/parisc/mm/fault.c 13 Jan 2003 21:24:33 -0000 1.5
+++ linux-2.5/arch/parisc/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -196,7 +196,7 @@ good_area:

check_expansion:
vma = prev_vma;
- if (vma && (expand_stack(vma, address) == 0))
+ if (vma && (expand_stack(vma, address, NULL) == 0))
goto good_area;

/*
Index: linux-2.5/arch/ppc/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ppc/mm/fault.c,v
retrieving revision 1.22
diff -u -p -r1.22 fault.c
--- linux-2.5/arch/ppc/mm/fault.c 27 Jul 2004 04:02:20 -0000 1.22
+++ linux-2.5/arch/ppc/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -95,7 +95,7 @@ static int store_updates_sp(struct pt_re
int do_page_fault(struct pt_regs *regs, unsigned long address,
unsigned long error_code)
{
- struct vm_area_struct * vma;
+ struct vm_area_struct * vma, * prev_vma;
struct mm_struct *mm = current->mm;
siginfo_t info;
int code = SEGV_MAPERR;
@@ -175,7 +175,8 @@ int do_page_fault(struct pt_regs *regs,
&& (!user_mode(regs) || !store_updates_sp(regs)))
goto bad_area;
}
- if (expand_stack(vma, address))
+ find_vma_prev(mm, address, &prev_vma);
+ if (expand_stack(vma, address, prev_vma))
goto bad_area;

good_area:
Index: linux-2.5/arch/ppc64/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ppc64/mm/fault.c,v
retrieving revision 1.20
diff -u -p -r1.20 fault.c
--- linux-2.5/arch/ppc64/mm/fault.c 2 Aug 2004 17:11:09 -0000 1.20
+++ linux-2.5/arch/ppc64/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -86,7 +86,7 @@ static int store_updates_sp(struct pt_re
int do_page_fault(struct pt_regs *regs, unsigned long address,
unsigned long error_code)
{
- struct vm_area_struct * vma;
+ struct vm_area_struct * vma, * prev_vma;
struct mm_struct *mm = current->mm;
siginfo_t info;
unsigned long code = SEGV_MAPERR;
@@ -185,7 +185,8 @@ int do_page_fault(struct pt_regs *regs,
goto bad_area;
}

- if (expand_stack(vma, address))
+ find_vma_prev(mm, address, &prev_vma);
+ if (expand_stack(vma, address, prev_vma))
goto bad_area;

good_area:
Index: linux-2.5/arch/ppc64/mm/hugetlbpage.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/ppc64/mm/hugetlbpage.c,v
retrieving revision 1.32
diff -u -p -r1.32 hugetlbpage.c
--- linux-2.5/arch/ppc64/mm/hugetlbpage.c 17 Sep 2004 18:59:04 -0000 1.32
+++ linux-2.5/arch/ppc64/mm/hugetlbpage.c 25 Sep 2004 02:41:06 -0000
@@ -496,6 +496,7 @@ unsigned long arch_get_unmapped_area(str
full_search:
vma = find_vma(mm, addr);
while (TASK_SIZE - len >= addr) {
+ unsigned long __heap_stack_gap;
BUG_ON(vma && (addr >= vma->vm_end));

if (touches_hugepage_low_range(addr, len)) {
@@ -508,7 +509,13 @@ full_search:
vma = find_vma(mm, addr);
continue;
}
- if (!vma || addr + len <= vma->vm_start) {
+ if (!vma)
+ goto got_it;
+ __heap_stack_gap = 0;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ __heap_stack_gap = heap_stack_gap << PAGE_SHIFT;
+ if (addr + len + __heap_stack_gap <= vma->vm_start) {
+ got_it:
/*
* Remember the place where we stopped the search:
*/
Index: linux-2.5/arch/s390/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/s390/mm/fault.c,v
retrieving revision 1.20
diff -u -p -r1.20 fault.c
--- linux-2.5/arch/s390/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.20
+++ linux-2.5/arch/s390/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -225,7 +225,7 @@ do_exception(struct pt_regs *regs, unsig
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL /* FIXME? */))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/sh/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sh/mm/fault.c,v
retrieving revision 1.14
diff -u -p -r1.14 fault.c
--- linux-2.5/arch/sh/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.14
+++ linux-2.5/arch/sh/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -69,7 +69,7 @@ asmlinkage void do_page_fault(struct pt_
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/sh64/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sh64/mm/fault.c,v
retrieving revision 1.3
diff -u -p -r1.3 fault.c
--- linux-2.5/arch/sh64/mm/fault.c 8 Sep 2004 14:48:45 -0000 1.3
+++ linux-2.5/arch/sh64/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -188,7 +188,7 @@ asmlinkage void do_page_fault(struct pt_
#endif
goto bad_area;
}
- if (expand_stack(vma, address)) {
+ if (expand_stack(vma, address, NULL)) {
#ifdef DEBUG_FAULT
print_task(tsk);
printk("%s:%d fault, address is 0x%08x PC %016Lx textaccess %d writeaccess %d\n",
Index: linux-2.5/arch/sparc/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sparc/mm/fault.c,v
retrieving revision 1.19
diff -u -p -r1.19 fault.c
--- linux-2.5/arch/sparc/mm/fault.c 13 Jul 2004 18:02:33 -0000 1.19
+++ linux-2.5/arch/sparc/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -271,7 +271,7 @@ asmlinkage void do_sparc_fault(struct pt
goto good_area;
if(!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if(expand_stack(vma, address))
+ if(expand_stack(vma, address, NULL))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
@@ -524,7 +524,7 @@ inline void force_user_fault(unsigned lo
goto good_area;
if(!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
- if(expand_stack(vma, address))
+ if(expand_stack(vma, address, NULL))
goto bad_area;
good_area:
info.si_code = SEGV_ACCERR;
Index: linux-2.5/arch/sparc64/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/sparc64/mm/fault.c,v
retrieving revision 1.22
diff -u -p -r1.22 fault.c
--- linux-2.5/arch/sparc64/mm/fault.c 2 Sep 2004 08:19:11 -0000 1.22
+++ linux-2.5/arch/sparc64/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -409,7 +409,7 @@ continue_fault:
goto bad_area;
}
}
- if (expand_stack(vma, address))
+ if (expand_stack(vma, address, NULL))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/arch/um/kernel/trap_kern.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/um/kernel/trap_kern.c,v
retrieving revision 1.10
diff -u -p -r1.10 trap_kern.c
--- linux-2.5/arch/um/kernel/trap_kern.c 24 Aug 2004 18:18:53 -0000 1.10
+++ linux-2.5/arch/um/kernel/trap_kern.c 25 Sep 2004 01:46:10 -0000
@@ -30,7 +30,7 @@ int handle_page_fault(unsigned long addr
int is_write, int is_user, int *code_out)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma, *prev_vma;
pgd_t *pgd;
pmd_t *pmd;
pte_t *pte;
@@ -46,8 +46,11 @@ int handle_page_fault(unsigned long addr
goto good_area;
else if(!(vma->vm_flags & VM_GROWSDOWN))
goto out;
- else if(expand_stack(vma, address))
- goto out;
+ else {
+ find_vma_prev(mm, address, &prev_vma);
+ if(expand_stack(vma, address, prev_vma))
+ goto out;
+ }

good_area:
*code_out = SEGV_ACCERR;
Index: linux-2.5/arch/x86_64/kernel/sys_x86_64.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/x86_64/kernel/sys_x86_64.c,v
retrieving revision 1.18
diff -u -p -r1.18 sys_x86_64.c
--- linux-2.5/arch/x86_64/kernel/sys_x86_64.c 31 May 2004 03:07:42 -0000 1.18
+++ linux-2.5/arch/x86_64/kernel/sys_x86_64.c 25 Sep 2004 02:40:56 -0000
@@ -119,6 +119,7 @@ arch_get_unmapped_area(struct file *filp

full_search:
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+ unsigned long __heap_stack_gap;
/* At this point: (!vma || addr < vma->vm_end). */
if (end - len < addr) {
/*
@@ -131,7 +132,13 @@ full_search:
}
return -ENOMEM;
}
- if (!vma || addr + len <= vma->vm_start) {
+ if (!vma)
+ goto got_it;
+ __heap_stack_gap = 0;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ __heap_stack_gap = heap_stack_gap << PAGE_SHIFT;
+ if (addr + len + __heap_stack_gap <= vma->vm_start) {
+ got_it:
/*
* Remember the place where we stopped the search:
*/
Index: linux-2.5/arch/x86_64/mm/fault.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/arch/x86_64/mm/fault.c,v
retrieving revision 1.29
diff -u -p -r1.29 fault.c
--- linux-2.5/arch/x86_64/mm/fault.c 17 Sep 2004 19:00:12 -0000 1.29
+++ linux-2.5/arch/x86_64/mm/fault.c 25 Sep 2004 01:46:10 -0000
@@ -248,7 +248,7 @@ asmlinkage void do_page_fault(struct pt_
{
struct task_struct *tsk;
struct mm_struct *mm;
- struct vm_area_struct * vma;
+ struct vm_area_struct * vma, * prev_vma;
unsigned long address;
const struct exception_table_entry *fixup;
int write;
@@ -349,7 +349,13 @@ asmlinkage void do_page_fault(struct pt_
if (address + 128 < regs->rsp)
goto bad_area;
}
- if (expand_stack(vma, address))
+ /*
+ * find_vma_prev is just a bit slower, because it cannot
+ * use the mmap_cache, so we run it only in the growsdown
+ * slow path and we leave find_vma in the fast path.
+ */
+ find_vma_prev(current->mm, address, &prev_vma);
+ if (expand_stack(vma, address, prev_vma))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
Index: linux-2.5/include/linux/mm.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/mm.h,v
retrieving revision 1.190
diff -u -p -r1.190 mm.h
--- linux-2.5/include/linux/mm.h 3 Sep 2004 17:20:35 -0000 1.190
+++ linux-2.5/include/linux/mm.h 25 Sep 2004 02:32:38 -0000
@@ -730,7 +730,9 @@ void handle_ra_miss(struct address_space
unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */
-extern int expand_stack(struct vm_area_struct * vma, unsigned long address);
+extern int heap_stack_gap;
+extern int expand_stack(struct vm_area_struct * vma, unsigned long address,
+ struct vm_area_struct * prev_vma);

/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
Index: linux-2.5/include/linux/sysctl.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/sysctl.h,v
retrieving revision 1.79
diff -u -p -r1.79 sysctl.h
--- linux-2.5/include/linux/sysctl.h 24 Aug 2004 18:12:13 -0000 1.79
+++ linux-2.5/include/linux/sysctl.h 25 Sep 2004 01:46:10 -0000
@@ -167,6 +167,7 @@ enum
VM_HUGETLB_GROUP=25, /* permitted hugetlb group */
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
+ VM_HEAP_STACK_GAP=28, /* int: page gap between heap and stack */
};


Index: linux-2.5/kernel/sysctl.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/kernel/sysctl.c,v
retrieving revision 1.89
diff -u -p -r1.89 sysctl.c
--- linux-2.5/kernel/sysctl.c 24 Aug 2004 19:40:58 -0000 1.89
+++ linux-2.5/kernel/sysctl.c 25 Sep 2004 01:46:10 -0000
@@ -800,6 +800,14 @@ static ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+ {
+ .ctl_name = VM_HEAP_STACK_GAP,
+ .procname = "heap-stack-gap",
+ .data = &heap_stack_gap,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};

Index: linux-2.5/mm/mmap.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/mmap.c,v
retrieving revision 1.145
diff -u -p -r1.145 mmap.c
--- linux-2.5/mm/mmap.c 3 Sep 2004 17:22:55 -0000 1.145
+++ linux-2.5/mm/mmap.c 25 Sep 2004 01:46:10 -0000
@@ -58,6 +58,7 @@ int sysctl_overcommit_memory = 0; /* def
int sysctl_overcommit_ratio = 50; /* default is 50% */
int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
atomic_t vm_committed_space = ATOMIC_INIT(0);
+int heap_stack_gap = 1;

EXPORT_SYMBOL(sysctl_overcommit_memory);
EXPORT_SYMBOL(sysctl_overcommit_ratio);
@@ -1069,6 +1070,7 @@ arch_get_unmapped_area(struct file *filp
full_search:
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
+ unsigned long __heap_stack_gap;
if (TASK_SIZE - len < addr) {
/*
* Start a new search - just in case we missed
@@ -1080,7 +1082,13 @@ full_search:
}
return -ENOMEM;
}
- if (!vma || addr + len <= vma->vm_start) {
+ if (!vma)
+ goto got_it;
+ __heap_stack_gap = 0;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ __heap_stack_gap = heap_stack_gap << PAGE_SHIFT;
+ if (addr + len + __heap_stack_gap <= vma->vm_start) {
+ got_it:
/*
* Remember the place where we stopped the search:
*/
@@ -1315,13 +1323,17 @@ out:
}

#ifdef CONFIG_STACK_GROWSUP
-/*
- * vma is the first one with address > vma->vm_end. Have to extend vma.
- */
-int expand_stack(struct vm_area_struct * vma, unsigned long address)
+int expand_stack(struct vm_area_struct * vma, unsigned long address,
+ struct vm_area_struct * prev_vma)
{
unsigned long grow;

+ /*
+ * If you re-use the heap-stack-gap for a growsup stack you
+ * should remove this WARN_ON.
+ */
+ WARN_ON(prev_vma);
+
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

@@ -1373,7 +1385,7 @@ find_extend_vma(struct mm_struct *mm, un
vma = find_vma_prev(mm, addr, &prev);
if (vma && (vma->vm_start <= addr))
return vma;
- if (!prev || expand_stack(prev, addr))
+ if (!prev || expand_stack(prev, addr, NULL))
return NULL;
if (prev->vm_flags & VM_LOCKED) {
make_pages_present(addr, prev->vm_end);
@@ -1384,7 +1396,8 @@ find_extend_vma(struct mm_struct *mm, un
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
-int expand_stack(struct vm_area_struct *vma, unsigned long address)
+int expand_stack(struct vm_area_struct *vma, unsigned long address,
+ struct vm_area_struct *prev_vma)
{
unsigned long grow;

@@ -1402,10 +1415,13 @@ int expand_stack(struct vm_area_struct *
* anon_vma lock to serialize against concurrent expand_stacks.
*/
address &= PAGE_MASK;
+ if (prev_vma && unlikely(prev_vma->vm_end + (heap_stack_gap << PAGE_SHIFT) > address))
+ goto out_unlock;
grow = (vma->vm_start - address) >> PAGE_SHIFT;

/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
+ out_unlock:
anon_vma_unlock(vma);
return -ENOMEM;
}
@@ -1430,7 +1446,7 @@ int expand_stack(struct vm_area_struct *
struct vm_area_struct *
find_extend_vma(struct mm_struct * mm, unsigned long addr)
{
- struct vm_area_struct * vma;
+ struct vm_area_struct * vma, * prev_vma;
unsigned long start;

addr &= PAGE_MASK;
@@ -1442,7 +1458,8 @@ find_extend_vma(struct mm_struct * mm, u
if (!(vma->vm_flags & VM_GROWSDOWN))
return NULL;
start = vma->vm_start;
- if (expand_stack(vma, addr))
+ find_vma_prev(mm, addr, &prev_vma);
+ if (expand_stack(vma, addr, prev_vma))
return NULL;
if (vma->vm_flags & VM_LOCKED) {
make_pages_present(addr, start);


2004-09-25 23:57:18

by Rik van Riel

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Sat, 25 Sep 2004, Andrea Arcangeli wrote:

> I didn't check the topdown model, in theory it should be extended to
> cover that too, this is only working for the legacy model right now
> because those apps aren't going to use topdown anyways.

Looks like it should just work, topdown shouldn't affect
expand_stack() or find_vma_prev() ...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-09-26 00:41:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Sat, Sep 25, 2004 at 07:57:04PM -0400, Rik van Riel wrote:
> On Sat, 25 Sep 2004, Andrea Arcangeli wrote:
>
> > I didn't check the topdown model, in theory it should be extended to
> > cover that too, this is only working for the legacy model right now
> > because those apps aren't going to use topdown anyways.
>
> Looks like it should just work, topdown shouldn't affect
> expand_stack() or find_vma_prev() ...

expand_stack growsdown page faults are already covered.

but the mmap side of topdown doesn't seem covered instead. Think if an
application maps everything from TASK_UNMAPPED_BASE to the last byte of
heap allowed by topdown. Then userspace unmaps the nearest page to the
stack. Then the stack growsdown of 1 page. Then you mmap again for
a PAGE_SIZE area. You will then fill the gap, and that shouldn't happen,
mmap should fail instead to guarantee a failure notification to
userspace. This is what I meant with topdown not being fully covered.

BTW, I never tried to enforce a gap with MAP_FIXED, that's more a
feature than a bug, I expect people playing MAP_FIXED games, to know
what they're doing, and if they want they can also close the gap. But
they've to do it by hand. If they put a MAP_FIXED near the stack, the
growsdown page faults will then enforce the gap to be sure the stack
doesn't overwrite such MAP_FIXED.

So it's more a stack-growsdown only thing, but the get_unmapped_area
needs collaboration too to really enforce it, and that's the part
missing for topdown. MAP_FIXED is more than a stack-growsdown only thing
and so I still allow that.

2004-09-27 08:09:40

by Arjan van de Ven

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6


> I didn't check the topdown model, in theory it should be extended to
> cover that too, this is only working for the legacy model right now
> because those apps aren't going to use topdown anyways.

which "those apps" ?
The topdown model is used by default for all applications....
(exceptions being those started with setarch -L or when the sysadmin
sets the /proc tunable to legacy)


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2004-09-27 13:09:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Mon, Sep 27, 2004 at 10:09:13AM +0200, Arjan van de Ven wrote:
>
> > I didn't check the topdown model, in theory it should be extended to
> > cover that too, this is only working for the legacy model right now
> > because those apps aren't going to use topdown anyways.
>
> which "those apps" ?

those apps that wants to allocate as close as possible to the stack.
They're already using /proc/self/mapped_base, the gap of topdown isn't
configurable.

Also topdown may screwup some MAP_FIXED usage below the 1G mark, no?

2004-09-28 06:23:28

by Hui Huang

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

Andrea Arcangeli wrote:
> This patch enforces a gap between heap and stack, both on the mmap side
> (for heap) and on the growsdown page faults for stack. the gap is in
> page units and it's sysctl configurable. Against CVS head.
>
> This is needed for some critical app, that wants an higher degree of
> protection against potential stack overflows from the kernel. This is
> mostly a 32bit matter of course, since on 32bit those apps are using
> a few gigs of heap and they get as near as they can to the stack (but if
> something goes wrong a page fault must happen).
>
>
> the default value of 1 avoids userspace apps like java to break,

Ok, I'm a JVM guy and I worked on heap-stack-gap issue many moons ago.
The reason that heap-stack-gap on SuSE Linux used to crash Java is
because we are explicitly setting up a guard page to prevent heap-stack
collision (a stack guard is also required in order to throw stack
overflow exception). So in a java program, prev_vma in your patch does
not point to regular heap memory but the guard page. The hidden gap
would cause SIGSEGV to occur before the thread actually hits the guard,
that confused JVM. We had to work around it by reading the gap size from
/proc.

I'd recommend adding an extra check to see if prev_vma is read/write,
and ignoring heap-stack-gap if prev_vma is guard page. Having a hidden
gap does not offer any extra protection, it only confuses an application
if it manages stack guard.

regards,
-hui
but
> those apps will of course set by hand in the rc.d scripts a much higher
> value. 1 is a sane default, if you want to tweak the default with
> mainline inclusion that's fine with me. the sysctl can always be
> disabled by setting it to 0 and then nobody will notice.
>
> feature is fully enabled on x86* and ppc*. No idea about the ia64 and
> s390x layouts but they've presumably a lot more address space not to
> care about this (this is primarly needed on 32bit apps).
>
> I didn't check the topdown model, in theory it should be extended to
> cover that too, this is only working for the legacy model right now
> because those apps aren't going to use topdown anyways.
>

2004-09-28 14:19:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Mon, Sep 27, 2004 at 11:25:40PM -0700, Hui Huang wrote:
> Andrea Arcangeli wrote:
> >This patch enforces a gap between heap and stack, both on the mmap side
> >(for heap) and on the growsdown page faults for stack. the gap is in
> >page units and it's sysctl configurable. Against CVS head.
> >
> >This is needed for some critical app, that wants an higher degree of
> >protection against potential stack overflows from the kernel. This is
> >mostly a 32bit matter of course, since on 32bit those apps are using
> >a few gigs of heap and they get as near as they can to the stack (but if
> >something goes wrong a page fault must happen).
> >
> >
> >the default value of 1 avoids userspace apps like java to break,
>
> Ok, I'm a JVM guy and I worked on heap-stack-gap issue many moons ago.
> The reason that heap-stack-gap on SuSE Linux used to crash Java is
> because we are explicitly setting up a guard page to prevent heap-stack
> collision (a stack guard is also required in order to throw stack
> overflow exception). So in a java program, prev_vma in your patch does
> not point to regular heap memory but the guard page. The hidden gap
> would cause SIGSEGV to occur before the thread actually hits the guard,

this however doesn't offer the protection on the mmap side, but maybe
you limit the amount of mmaps with another logic.

> that confused JVM. We had to work around it by reading the gap size from
> /proc.

cool, so you're already using this heap-stack-gap API. Then maybe we
could increase the size to more than 1 page. 1 page is the minimum,
infact those apps asking for the feature have to increase it by hand in
/proc.

> I'd recommend adding an extra check to see if prev_vma is read/write,
> and ignoring heap-stack-gap if prev_vma is guard page. Having a hidden
> gap does not offer any extra protection, it only confuses an application
> if it manages stack guard.

you recommendation is valid, and this should work, the stack cannot be
read before it's written to. In theory userspace could track the size of
the stack and know it is supposed to be 0 and avoid a potential
memset(0) on local variables the first run, but I don't think gcc is
capable of doing that.


My only worry is that this mess up things with mprotect, I mean I can't
just check only for read-only vmas, I've also to trap when this
read-only vma infact becomes read-write again. So for simplicity of the
implementation I would prefer not to track readonly pages. If you've a
strong reason for tracking readonly pages let me know of course. I think
java already does fine by checking the heap-stack-gap.

2004-09-28 19:44:44

by Arjan van de Ven

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Mon, Sep 27, 2004 at 03:09:19PM +0200, Andrea Arcangeli wrote:
> > which "those apps" ?
>
> those apps that wants to allocate as close as possible to the stack.
> They're already using /proc/self/mapped_base, the gap of topdown isn't
> configurable.

/proc/self/mmaped_base doesn't exist...


> Also topdown may screwup some MAP_FIXED usage below the 1G mark, no?

no

map_fixed is map_fixed... if you give a hint the kernel will try that of
course.


Attachments:
(No filename) (455.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-28 22:23:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Tue, Sep 28, 2004 at 09:43:51PM +0200, Arjan van de Ven wrote:
> On Mon, Sep 27, 2004 at 03:09:19PM +0200, Andrea Arcangeli wrote:
> > > which "those apps" ?
> >
> > those apps that wants to allocate as close as possible to the stack.
> > They're already using /proc/self/mapped_base, the gap of topdown isn't
> > configurable.
>
> /proc/self/mmaped_base doesn't exist...

it does with this patch that should be included in mainline too. This
follows the redhat API that oracle requires (you invented it, didn't
you?) so you should be fine with it.

with mapped base people is free to allocate as much memory as the
hardware can, with topdown not.

Index: linux-2.6.8/fs/proc/base.c
===================================================================
--- linux-2.6.8.orig/fs/proc/base.c
+++ linux-2.6.8/fs/proc/base.c
@@ -59,8 +59,9 @@ enum pid_directory_inos {
PROC_TGID_STATM,
PROC_TGID_MAPS,
PROC_TGID_MOUNTS,
PROC_TGID_WCHAN,
+ PROC_TGID_MAPBASE,
#ifdef CONFIG_SCHEDSTATS
PROC_TGID_SCHEDSTAT,
#endif
#ifdef CONFIG_SECURITY
@@ -122,8 +123,11 @@ static struct pid_entry tgid_base_stuff[
E(PROC_TGID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_ROOT, "root", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE, "exe", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_MOUNTS, "mounts", S_IFREG|S_IRUGO),
+#ifdef __HAS_ARCH_PROC_MAPPED_BASE
+ E(PROC_TGID_MAPBASE, "mapped_base", S_IFREG|S_IRUSR|S_IWUSR),
+#endif
#ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR, "attr", S_IFDIR|S_IRUGO|S_IXUGO),
#endif
#ifdef CONFIG_KALLSYMS
@@ -696,8 +700,57 @@ static struct file_operations proc_mem_o
.write = mem_write,
.open = mem_open,
};

+#ifdef __HAS_ARCH_PROC_MAPPED_BASE
+static ssize_t mapbase_read(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[64];
+ size_t len;
+
+ len = sprintf(buffer, "%li\n", task->map_base) + 1;
+ if (*ppos >= len)
+ return 0;
+ if (count > len-*ppos)
+ count = len-*ppos;
+ if (copy_to_user(buf, buffer + *ppos, count))
+ return -EFAULT;
+ *ppos += count;
+ return count;
+}
+
+static ssize_t mapbase_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[64], *end;
+ unsigned long newbase;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ memset(buffer, 0, 64);
+ if (count > 62)
+ count = 62;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ newbase = simple_strtoul(buffer, &end, 0);
+ if (*end == '\n')
+ end++;
+ if (newbase > 0)
+ task->map_base = newbase;
+ if (end - buffer == 0)
+ return -EIO;
+ return end - buffer;
+}
+
+static struct file_operations proc_mapbase_operations = {
+ read: mapbase_read,
+ write: mapbase_write,
+};
+#endif /* __HAS_ARCH_PROC_MAPPED_BASE */
+
static struct inode_operations proc_mem_inode_operations = {
.permission = proc_permission,
};

@@ -1332,8 +1385,13 @@ static struct dentry *proc_pident_lookup
case PROC_TID_MAPS:
case PROC_TGID_MAPS:
inode->i_fop = &proc_maps_operations;
break;
+#ifdef __HAS_ARCH_PROC_MAPPED_BASE
+ case PROC_TGID_MAPBASE:
+ inode->i_fop = &proc_mapbase_operations;
+ break;
+#endif
case PROC_TID_MEM:
case PROC_TGID_MEM:
inode->i_op = &proc_mem_inode_operations;
inode->i_fop = &proc_mem_operations;
Index: linux-2.6.8/include/linux/sched.h
===================================================================
--- linux-2.6.8.orig/include/linux/sched.h
+++ linux-2.6.8/include/linux/sched.h
@@ -583,8 +583,11 @@ struct task_struct {
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
#endif
+
+/* TASK_UNMAPPED_BASE */
+ unsigned long map_base;
};

static inline pid_t process_group(struct task_struct *tsk)
{
@@ -596,8 +599,14 @@ extern void __put_task_struct(struct tas
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
#define put_task_struct(tsk) \
do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)

+#ifndef __TASK_UNMAPPED_BASE
+#define __TASK_UNMAPPED_BASE 0UL
+#else
+#define __HAS_ARCH_PROC_MAPPED_BASE
+#endif
+
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
Index: linux-2.6.8/include/linux/init_task.h
===================================================================
--- linux-2.6.8.orig/include/linux/init_task.h
+++ linux-2.6.8/include/linux/init_task.h
@@ -111,8 +111,9 @@ extern struct group_info init_groups;
.alloc_lock = SPIN_LOCK_UNLOCKED, \
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .map_base = __TASK_UNMAPPED_BASE, \
}



Index: linux-2.6.8/include/asm-um/processor-generic.h
===================================================================
--- linux-2.6.8.orig/include/asm-um/processor-generic.h
+++ linux-2.6.8/include/asm-um/processor-generic.h
@@ -115,9 +115,10 @@ extern unsigned long task_size;

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
*/
-#define TASK_UNMAPPED_BASE (0x40000000)
+#define __TASK_UNMAPPED_BASE (0x40000000)
+#define TASK_UNMAPPED_BASE (current->map_base)

extern void start_thread(struct pt_regs *regs, unsigned long entry,
unsigned long stack);

Index: linux-2.6.8/include/asm-x86_64/processor.h
===================================================================
--- linux-2.6.8.orig/include/asm-x86_64/processor.h
+++ linux-2.6.8/include/asm-x86_64/processor.h
@@ -171,11 +171,16 @@ static inline void clear_in_cr4 (unsigne
#define TASK_SIZE (0x0000007fc0000000UL)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
+ *
+ * /proc/pid/unmap_base is only supported for 32bit processes without
+ * 3GB personality for now.
*/
#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)
-#define TASK_UNMAPPED_32 PAGE_ALIGN(IA32_PAGE_OFFSET/3)
+#define __TASK_UNMAPPED_BASE (PAGE_ALIGN(0xffffe000 / 3))
+#define TASK_UNMAPPED_32 ((current->personality & ADDR_LIMIT_3GB) ? \
+ PAGE_ALIGN(0xc0000000 / 3) : PAGE_ALIGN(current->map_base))
#define TASK_UNMAPPED_64 PAGE_ALIGN(TASK_SIZE/3)
#define TASK_UNMAPPED_BASE \
(test_thread_flag(TIF_IA32) ? TASK_UNMAPPED_32 : TASK_UNMAPPED_64)

Index: linux-2.6.8/include/asm-ppc64/processor.h
===================================================================
--- linux-2.6.8.orig/include/asm-ppc64/processor.h
+++ linux-2.6.8/include/asm-ppc64/processor.h
@@ -516,10 +516,13 @@ extern struct task_struct *last_task_use
TASK_SIZE_USER32 : TASK_SIZE_USER64)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
+ *
+ * /proc/pid/unmap_base is only supported for 32bit processes for now.
*/
-#define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(STACK_TOP_USER32 / 4))
+#define __TASK_UNMAPPED_BASE (PAGE_ALIGN(STACK_TOP_USER32 / 4))
+#define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(current->map_base))
#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(STACK_TOP_USER64 / 4))

#define TASK_UNMAPPED_BASE ((test_thread_flag(TIF_32BIT)||(ppcdebugset(PPCDBG_BINFMT_32ADDR))) ? \
TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
Index: linux-2.6.8/include/asm-i386/processor.h
===================================================================
--- linux-2.6.8.orig/include/asm-i386/processor.h
+++ linux-2.6.8/include/asm-i386/processor.h
@@ -294,9 +294,10 @@ extern unsigned int mca_pentium_flag;

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
*/
-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE (current->map_base)
+#define __TASK_UNMAPPED_BASE PAGE_ALIGN(TASK_SIZE/3)

#define HAVE_ARCH_PICK_MMAP_LAYOUT

/*
Index: linux-2.6.8/include/asm-s390/processor.h
===================================================================
--- linux-2.6.8.orig/include/asm-s390/processor.h
+++ linux-2.6.8/include/asm-s390/processor.h
@@ -61,9 +61,10 @@ extern struct task_struct *last_task_use
*/
#ifndef __s390x__

# define TASK_SIZE (0x80000000UL)
-# define TASK_UNMAPPED_BASE (TASK_SIZE / 2)
+# define TASK_UNMAPPED_BASE (current->map_base)
+# define __TASK_UNMAPPED_BASE (TASK_SIZE / 2)
# define DEFAULT_TASK_SIZE (0x80000000UL)

#else /* __s390x__ */


> > Also topdown may screwup some MAP_FIXED usage below the 1G mark, no?
>
> no
>
> map_fixed is map_fixed... if you give a hint the kernel will try that of
> course.

Yeah, map fix is map fixed and when you execute map fixed on a existing
mapping becaue topdown moved below the 1G mark (a place where there
could never have been a "hinted" mapping before), the existing mapping
will be destroyed and the application will behave randomly.

isn't the whole point of topdown to gain ~1G more of RAM. A 1G area that
couldn't possibly be used before, and where people today can use
MAP_FIXED without colliding with dynamically allocated heap like for
mallocs. topdown breaks that assumption and can break random apps in
random ways.

Or did I misunderstood something? If topdown still forbids you to use
the first 1G of address space, then what's the point?!?

2004-09-29 06:06:29

by Arjan van de Ven

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 12:19:33AM +0200, Andrea Arcangeli wrote:
> On Tue, Sep 28, 2004 at 09:43:51PM +0200, Arjan van de Ven wrote:
> > On Mon, Sep 27, 2004 at 03:09:19PM +0200, Andrea Arcangeli wrote:
> > > > which "those apps" ?
> > >
> > > those apps that wants to allocate as close as possible to the stack.
> > > They're already using /proc/self/mapped_base, the gap of topdown isn't
> > > configurable.
> >
> > /proc/self/mmaped_base doesn't exist...
>
> it does with this patch that should be included in mainline too. This
> follows the redhat API that oracle requires (you invented it, didn't
> you?) so you should be fine with it.

> with mapped base people is free to allocate as much memory as the
> hardware can, with topdown not.

oh? you mean that 1Mb gap between stack and topdown? Every ISV I talked to
said they could get more VA space with topdown than with the suse
mmaped_base hack... :)

> Yeah, map fix is map fixed and when you execute map fixed on a existing
> mapping becaue topdown moved below the 1G mark (a place where there
> could never have been a "hinted" mapping before), the existing mapping
> will be destroyed and the application will behave randomly.

MAP_FIXED is to be used only on things YOU mmaped before.
>
> isn't the whole point of topdown to gain ~1G more of RAM. A 1G area that
> couldn't possibly be used before

wrong; brk() is there which is also used by malloc() and internally by the C
library.

> mallocs. topdown breaks that assumption and can break random apps in
> random ways.

do you have proof for that?


> Or did I misunderstood something? If topdown still forbids you to use
> the first 1G of address space, then what's the point?!?

You missed that you can only use MAP_FIXED on mmaps YOU mmaped before.
That's true for basically all operating systems because the runtime
(including C library) is allowed to malloc, to brk(), to mmap etc etc.


Attachments:
(No filename) (1.87 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-29 08:39:56

by Hui Huang

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

Andrea Arcangeli wrote:
> On Mon, Sep 27, 2004 at 11:25:40PM -0700, Hui Huang wrote:
>
>>Andrea Arcangeli wrote:
>>
>>>This patch enforces a gap between heap and stack, both on the mmap side
>>>(for heap) and on the growsdown page faults for stack. the gap is in
>>>page units and it's sysctl configurable. Against CVS head.
>>>
>>>This is needed for some critical app, that wants an higher degree of
>>>protection against potential stack overflows from the kernel. This is
>>>mostly a 32bit matter of course, since on 32bit those apps are using
>>>a few gigs of heap and they get as near as they can to the stack (but if
>>>something goes wrong a page fault must happen).
>>>
>>>
>>>the default value of 1 avoids userspace apps like java to break,
>>
>>Ok, I'm a JVM guy and I worked on heap-stack-gap issue many moons ago.
>>The reason that heap-stack-gap on SuSE Linux used to crash Java is
>>because we are explicitly setting up a guard page to prevent heap-stack
>>collision (a stack guard is also required in order to throw stack
>>overflow exception). So in a java program, prev_vma in your patch does
>>not point to regular heap memory but the guard page. The hidden gap
>>would cause SIGSEGV to occur before the thread actually hits the guard,
>
>
> this however doesn't offer the protection on the mmap side, but maybe
> you limit the amount of mmaps with another logic.

Hi Andrea,

Java heap is managed by JVM, its address space is reserved upfront
and does not grow or shrink. Our problem is mainly on the stack side.

>
>
>>that confused JVM. We had to work around it by reading the gap size from
>>/proc.
>
>
> cool, so you're already using this heap-stack-gap API. Then maybe we
> could increase the size to more than 1 page. 1 page is the minimum,
> infact those apps asking for the feature have to increase it by hand in
> /proc.
>

Should be Ok. But our "fix" only handles heap-stack-gap in a special
case. The problem is we only know the gap size, not its location.
We would have to know if a thread has MAP_GROWSDOWN stack or not to
tell if it has a gap near the stack end. Basically, we assume only
the main thread stack is MAP_GROWSDOWN and every other thread is
created with MAP_FIXED stack (i.e. the gap only exists in main
thread stack). Otherwise we wouldn't be able to tell if a SEGV is
due to stack overflow in the heap-stack-gap or something else.
It's not ideal, but works as long as LinuxThreads or NPTL don't
change back to MAP_GROWSDOWN.

>
>>I'd recommend adding an extra check to see if prev_vma is read/write,
>>and ignoring heap-stack-gap if prev_vma is guard page. Having a hidden
>>gap does not offer any extra protection, it only confuses an application
>>if it manages stack guard.
>
>
> you recommendation is valid, and this should work, the stack cannot be
> read before it's written to. In theory userspace could track the size of
> the stack and know it is supposed to be 0 and avoid a potential
> memset(0) on local variables the first run, but I don't think gcc is
> capable of doing that.
>

Sorry, I actually meant read _or_ write; that is, disable gap if
prev_vma is PROT_NONE. The heap-stack-gap is essentially a floating
PROT_NONE page. So if you see a PROT_NONE page below a growsdown stack,
you could ignore the gap and allow stack to grow without losing any
protection.

>
> My only worry is that this mess up things with mprotect, I mean I can't
> just check only for read-only vmas, I've also to trap when this
> read-only vma infact becomes read-write again.

You mean if there's heap memory right below the previously read-only
vma, now it becomes read-write the stack can keep growing down and
it can overwrite heap without a trap? That's a good point. We don't
see that problem because the way we handle stack and heap. But it's
certainly a possibility.

So for simplicity of the
> implementation I would prefer not to track readonly pages. If you've a
> strong reason for tracking readonly pages let me know of course. I think
> java already does fine by checking the heap-stack-gap.

Indeed having a special case for PROT_NONE would complicate the
situation, and it's not that simple as you pointed out.

However, my concern is that a hidden gap floating in memory makes it
very hard to manage stacks, while on the other hand applications
need heap-stack-gap protection could simply call mprotect to
explicitly set up a range of memory as PROT_NONE. I can't imagine
java being the only app out there that needs to handle stack
overflows. Other people could see similar problems too (i.e. SEGV
while cr2 is in the gap, not in the guard page).

Now another wish - instead of a system-wide property, a per-process
mechanism to change heap-stack-gap. Applications that need the gap
can set up the appropriate size, while others like Java that manage
heap and stack can simply disable the gap. Is it possible?

thanks,
-hui

2004-09-29 14:12:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 08:05:21AM +0200, Arjan van de Ven wrote:
> oh? you mean that 1Mb gap between stack and topdown? Every ISV I talked to
> said they could get more VA space with topdown than with the suse
> mmaped_base hack... :)

/*
* Top of mmap area (just below the process stack).
*
* Leave an at least ~128 MB hole.
*/
#define MIN_GAP (128*1024*1024)
#define MAX_GAP (TASK_SIZE/6*5)

where does your 1M comes from? it's a minimum 128Mbytes.

> MAP_FIXED is to be used only on things YOU mmaped before.

where is that written?

> > isn't the whole point of topdown to gain ~1G more of RAM. A 1G area that
> > couldn't possibly be used before
>
> wrong; brk() is there which is also used by malloc() and internally by the C
> library.

that's malloc, but mmaps don't fit into it.

So sure, apps where careful about malloc, but not about mmap. mmap was
guaranteed not to mess in the area below 1G, now it does and can break
stuff.

> do you have proof for that?

do I need to write the exploited testcase? just let me know, it'll take
only a few minutes.

as usual, since the screwup generated by topdown would be random and at
runtime, one cannot exclude it could generate a security compromise.

Amittedly it sounds very unlikely that can break stuff, but it can.

> You missed that you can only use MAP_FIXED on mmaps YOU mmaped before.
> That's true for basically all operating systems because the runtime
> (including C library) is allowed to malloc, to brk(), to mmap etc etc.

small malloc works below the 1G area, but it's the application that has
to use malloc. I'm talking about an application that uses heavy mmap,
MAP_FIXED below 1G and _never_ mallocs. It's up to you to call malloc,
if you didn't, you had the guarantee that you could use the space near
1G. The mapped_base is the safe API to tell the kernel it can use the
memory below 1G for the heap.

now the best thing is the ADDR_COMPAT_LAYOUT personality, that is what
can make it safe, and I hope it's enabled by default on all apps, but
I'm afraid it's the other way around, i.e. that the application will be
marked "compatible" after it breaks at runtime. Plus the testing will be
decreased since most people runs with unlimited stack (which defaults to
bottomup beahviour).

the single fact you added ADDR_COMPAT_LAYOUT means you're very well
aware there are apps that will break, or there would be no point for a
"COMPAT" option, if it was really backwards compatible by default, or do
I misunderstand the semantics of ADDR_COMPAT_LAYOUT personality?

2004-09-29 14:28:41

by Arjan van de Ven

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 04:11:51PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 29, 2004 at 08:05:21AM +0200, Arjan van de Ven wrote:
> > oh? you mean that 1Mb gap between stack and topdown? Every ISV I talked to
> > said they could get more VA space with topdown than with the suse
> > mmaped_base hack... :)
>
> /*
> * Top of mmap area (just below the process stack).
> *
> * Leave an at least ~128 MB hole.
> */
> #define MIN_GAP (128*1024*1024)
> #define MAX_GAP (TASK_SIZE/6*5)
>
> where does your 1M comes from? it's a minimum 128Mbytes.

the patch posted recently on this list reduced it to 1Mb.

> > MAP_FIXED is to be used only on things YOU mmaped before.
>
> where is that written?

it's basic common sence

> > wrong; brk() is there which is also used by malloc() and internally by the C
> > library.
>
> that's malloc, but mmaps don't fit into it.

MAP_FIXED happily will go over an malloc(), it's both just vma's


> > do you have proof for that?
>
> do I need to write the exploited testcase? just let me know, it'll take
> only a few minutes.

your claim was that existing apps would break.
I know you can write a testcase, one can write a testcase even for your
proposed patch showing breakage.

> small malloc works below the 1G area, but it's the application that has
> to use malloc.

the application, the C library, the dynamic linker, all shared libs it links
against.

> now the best thing is the ADDR_COMPAT_LAYOUT personality, that is what
> can make it safe, and I hope it's enabled by default on all apps, but
> I'm afraid it's the other way around, i.e. that the application will be
> marked "compatible" after it breaks at runtime. Plus the testing will be
> decreased since most people runs with unlimited stack (which defaults to
> bottomup beahviour).

You are wrong; the default is 8Mb stack limit in the kernel; I absolutely do
not see where you claim from "most people run with unlimited stack" comes
from.

> the single fact you added ADDR_COMPAT_LAYOUT means you're very well
> aware there are apps that will break, or there would be no point for a
> "COMPAT" option, if it was really backwards compatible by default, or do
> I misunderstand the semantics of ADDR_COMPAT_LAYOUT personality?

I am aware of 2 applications breaking. Both did

int ptr;

ptr=malloc(bigchunk);
if (ptr < 0)
assume_alloc_failure();

Both are open source and since long fixed. Still it made sense to add a
safety net (akpm asked for it; for example in rhel3 or FC2 we do not have one at
all, nor do those kernels have mmaped_base hack).


Attachments:
(No filename) (2.50 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-29 14:27:03

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 01:42:12AM -0700, Hui Huang wrote:
> Sorry, I actually meant read _or_ write; that is, disable gap if
> prev_vma is PROT_NONE. The heap-stack-gap is essentially a floating
> PROT_NONE page. So if you see a PROT_NONE page below a growsdown stack,
> you could ignore the gap and allow stack to grow without losing any
> protection.

PROT_NONE would be ever safer than read-only indeed.

> You mean if there's heap memory right below the previously read-only
> vma, now it becomes read-write the stack can keep growing down and
> it can overwrite heap without a trap? That's a good point. We don't
> see that problem because the way we handle stack and heap. But it's
> certainly a possibility.

yes, even with PROT_NONE I'd still need to trap mprotect calls, and see
if the gap is still enforced after the mprotect. Then I would need to
return some sort of failure in mprotect if the gap is not enforced...

> So for simplicity of the
> >implementation I would prefer not to track readonly pages. If you've a
> >strong reason for tracking readonly pages let me know of course. I think
> >java already does fine by checking the heap-stack-gap.
>
> Indeed having a special case for PROT_NONE would complicate the
> situation, and it's not that simple as you pointed out.
>
> However, my concern is that a hidden gap floating in memory makes it
> very hard to manage stacks, while on the other hand applications
> need heap-stack-gap protection could simply call mprotect to
> explicitly set up a range of memory as PROT_NONE. I can't imagine
> java being the only app out there that needs to handle stack
> overflows. Other people could see similar problems too (i.e. SEGV
> while cr2 is in the gap, not in the guard page).

I'm afraid most other apps don't do that (or if they do, they set it low
enough that it doesn't bother the further heap-stack-gap protection),
and on 64bit archs that's ok (only 32bit is a problem). java is safer
than the others of course.

> Now another wish - instead of a system-wide property, a per-process
> mechanism to change heap-stack-gap. Applications that need the gap
> can set up the appropriate size, while others like Java that manage
> heap and stack can simply disable the gap. Is it possible?

This is a very reasonable request, and yes, this is definitely possible,
I feel this is a much better feature than the mprotect trapping, which
is possible too, but it sounds not worth it.

As for the api, should that be a prctl, /proc/<pid>/heap-stack-gap, or a
brand new syscall? I personally enjoy the /proc/<pid>/heap-stack-gap
approach, but that's just me, the prctl and syscall would be more
efficient at runtime (but the point is that this doesn't need to be more
efficient and echo xx > /proc/<pid>/heap-stack-gap is so much easier to
play with for experiments)

If we can choose the API together I'll take care of implementing the
rest. I'd like to still leave a global sysctl for global system safety
(since most apps needs it and they won't all be required to tune it by
hand), so that the final number will be choose with:

if (likely(per_process_setting < 0))
return global_sysctl;
else
return per_process_setting;

2004-09-29 15:06:12

by Arjan van de Ven

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 04:53:44PM +0200, Andrea Arcangeli wrote:
> > I am aware of 2 applications breaking. Both did
>
> oh, so you see, I wasn't *that* wrong if you're already aware of 2 apps
> breaking.

and you would have known it too if you had looked up the changeset that
implemented flex mmap since it was documented there.


Attachments:
(No filename) (334.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-29 15:06:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 04:25:21PM +0200, Arjan van de Ven wrote:
> the patch posted recently on this list reduced it to 1Mb.

I must have overlooked that post, sorry (I was just reading latest
kernel CVS).

> > > MAP_FIXED is to be used only on things YOU mmaped before.
> >
> > where is that written?
>
> it's basic common sence

that sounds a bit overoptimistic assumption to me, especially if it was
not written anywhere. Plus applications do use MAP_FIXED, and myself,
the only place I would use it, is slightly below the 1G mark. that's the
safest place if you can find today, especially if you do heavy mmaps
(normally the brk never gets as high as 1G, no matter what the app is
doing, while you can easily reach the stack, or easily go below 1G with
topdown).

> I know you can write a testcase, one can write a testcase even for your
> proposed patch showing breakage.

my patch, will never genrate random mm corruption that will make the app
behave randomly at runtime.

The only testcase you can write for my patch, is a testcase that will
get a sigsegv and it'll get killed gracefully, like it happened to java
in the past. that's a bit different than random mm corruption. And the
only application where this testcase was suprios todate was java, which
got fixed meanwhile, and nothing wrong could ever happen with a spurious
sigsegv. if an application can't handle sigsegv gracefully, then the
application itself is broken, since any software bug can always generate
the sigsegv.

if the overridden mmap with MAP_FIXED would be guaranteed to generate a
sigsegv I wouldn't be talking about that right now.

> You are wrong; the default is 8Mb stack limit in the kernel; I absolutely do
> not see where you claim from "most people run with unlimited stack" comes
> from.

maybe I'm biased because I do run with unlimited stack? And I'm not
going to change that since I really like recursion. I want a stack-gap
instead, to be sure to get a sisgsegv if the recursion overflows on the
heap ;)

> I am aware of 2 applications breaking. Both did

oh, so you see, I wasn't *that* wrong if you're already aware of 2 apps
breaking.

Frankly I don't see the need of these kind of 32bit hacks that may break
stuff, when people is finally moving x86-64.

2004-09-29 15:27:54

by Alan

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Mer, 2004-09-29 at 15:11, Andrea Arcangeli wrote:
> > MAP_FIXED is to be used only on things YOU mmaped before.
>
> where is that written?

MAP_FIXED is for very very special cases only. You can't combine
MAP_FIXED with glibc for example because the memory allocators may clash
with it. You can use it to map over an existing object but that is about
it.

As the man page says

"Use of this option is discouraged."


2004-09-29 15:40:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 05:01:48PM +0200, Arjan van de Ven wrote:
> and you would have known it too if you had looked up the changeset that
> implemented flex mmap since it was documented there.

I did read the source from kernel CVS and I don't see any mention of
that in the sourcecode, I value source more than metadata in CVS logs.

2004-09-29 15:45:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 03:24:46PM +0100, Alan Cox wrote:
> As the man page says
>
> "Use of this option is discouraged."

and as a matter of fact the vast majority of mmaps executed by databases
with vlm are using MAP_FIXED ;)

I'm not saying databases themself are going to break with topdown, I'm just saying
there are applications out there making heavy use of MAP_FIXED, no
matter if the above manpage says it's discouraged. depending on
applications not to use it, isn't reliable and it's prone to break
randomly at runtime with topdown, this is my _only_ point, and it's a
tangible fact as far as I can tell (not a purerly theoretical issue like
the set_pte not being exceuted in asm).

If you want to make your kernel really safe, then disable MAP_FIXED and
return -EINVAL, then I agree it will be safe.

2004-09-29 22:32:32

by Hui Huang

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

Andrea Arcangeli wrote:
>
>>Now another wish - instead of a system-wide property, a per-process
>>mechanism to change heap-stack-gap. Applications that need the gap
>>can set up the appropriate size, while others like Java that manage
>>heap and stack can simply disable the gap. Is it possible?
>
>
> This is a very reasonable request, and yes, this is definitely possible,
> I feel this is a much better feature than the mprotect trapping, which
> is possible too, but it sounds not worth it.
>
> As for the api, should that be a prctl, /proc/<pid>/heap-stack-gap, or a
> brand new syscall? I personally enjoy the /proc/<pid>/heap-stack-gap
> approach, but that's just me, the prctl and syscall would be more
> efficient at runtime (but the point is that this doesn't need to be more
> efficient and echo xx > /proc/<pid>/heap-stack-gap is so much easier to
> play with for experiments)

Hi Andrea,

From our perspective prctl is the most attractive approach. Performance
is not an issue, as it's only used once during startup. The problem
with /proc is that we are not always able to access /proc (e.g.
java code is run with chroot). Stack overflow in chroot'ed program
probably is a corner case, but I'm afraid we'll have to deal with it
if the heap-stack-gap becomes more widespread. System call is better
but we need to remember different syscall numbers on different
platforms.

>
> If we can choose the API together I'll take care of implementing the
> rest. I'd like to still leave a global sysctl for global system safety
> (since most apps needs it and they won't all be required to tune it by
> hand), so that the final number will be choose with:
>
> if (likely(per_process_setting < 0))
> return global_sysctl;
> else
> return per_process_setting;


Sounds good.

thanks,
-hui

2004-10-12 00:56:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: heap-stack-gap for 2.6

On Wed, Sep 29, 2004 at 03:01:13PM -0700, Hui Huang wrote:
> > if (likely(per_process_setting < 0))
> > return global_sysctl;
> > else
> > return per_process_setting;
>
>
> Sounds good.

here we go, please review (the overflow checks are the only ones really
worth reviewing since they're exploitable if not correct):

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9-rc4/heap-stack-gap

Andrew, could you merge it in mainline after 2.6.9 is out?

Thanks!