2007-06-27 02:45:08

by Davide Libenzi

[permalink] [raw]
Subject: [patch 2/3] MAP_NOZERO - implement sys_brk2()

The following patch implements the sys_brk2() syscall, that nothing is
other than a sys_brk() with an extra "flags" parameter. This can be used
to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
pages if possible.



Signed-off-by: Davide Libenzi <[email protected]>


- Davide


---
include/linux/mm.h | 3 ++-
include/linux/syscalls.h | 1 +
mm/mmap.c | 22 ++++++++++++++++++----
3 files changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6.mod/include/linux/mm.h
===================================================================
--- linux-2.6.mod.orig/include/linux/mm.h 2007-06-25 19:27:42.000000000 -0700
+++ linux-2.6.mod/include/linux/mm.h 2007-06-26 18:08:28.000000000 -0700
@@ -1099,7 +1099,8 @@
}

extern int do_munmap(struct mm_struct *, unsigned long, size_t);
-
+extern unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+ unsigned long vmflags);
extern unsigned long do_brk(unsigned long, unsigned long);

/* filemap.c */
Index: linux-2.6.mod/include/linux/syscalls.h
===================================================================
--- linux-2.6.mod.orig/include/linux/syscalls.h 2007-06-25 19:14:49.000000000 -0700
+++ linux-2.6.mod/include/linux/syscalls.h 2007-06-26 18:08:28.000000000 -0700
@@ -263,6 +263,7 @@
asmlinkage long sys_fremovexattr(int fd, char __user *name);

asmlinkage unsigned long sys_brk(unsigned long brk);
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags);
asmlinkage long sys_mprotect(unsigned long start, size_t len,
unsigned long prot);
asmlinkage unsigned long sys_mremap(unsigned long addr,
Index: linux-2.6.mod/mm/mmap.c
===================================================================
--- linux-2.6.mod.orig/mm/mmap.c 2007-06-25 19:14:49.000000000 -0700
+++ linux-2.6.mod/mm/mmap.c 2007-06-26 18:08:28.000000000 -0700
@@ -35,6 +35,8 @@
#define arch_mmap_check(addr, len, flags) (0)
#endif

+#define BRK_ALLOWED_FLAGS (VM_NOZERO)
+
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end);
@@ -234,7 +236,7 @@
return next;
}

-asmlinkage unsigned long sys_brk(unsigned long brk)
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags)
{
unsigned long rlim, retval;
unsigned long newbrk, oldbrk;
@@ -271,8 +273,10 @@
if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
goto out;

+ flags = BRK_ALLOWED_FLAGS & calc_vm_flag_bits(flags);
+
/* Ok, looks good - let it rip. */
- if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
+ if (do_brk_flags(oldbrk, newbrk-oldbrk, flags) != oldbrk)
goto out;
set_brk:
mm->brk = brk;
@@ -282,6 +286,11 @@
return retval;
}

+asmlinkage unsigned long sys_brk(unsigned long brk)
+{
+ return sys_brk2(brk, 0);
+}
+
#ifdef DEBUG_MM_RB
static int browse_rb(struct rb_root *root)
{
@@ -1863,7 +1872,8 @@
* anonymous maps. eventually we may be able to do some
* brk-specific accounting here.
*/
-unsigned long do_brk(unsigned long addr, unsigned long len)
+unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+ unsigned long vmflags)
{
struct mm_struct * mm = current->mm;
struct vm_area_struct * vma, * prev;
@@ -1882,7 +1892,7 @@
if (is_hugepage_only_range(mm, addr, len))
return -EINVAL;

- flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
+ flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags | vmflags;

error = arch_mmap_check(addr, len, flags);
if (error)
@@ -1959,6 +1969,10 @@
return addr;
}

+unsigned long do_brk(unsigned long addr, unsigned long len)
+{
+ return do_brk_flags(addr, len, 0);
+}
EXPORT_SYMBOL(do_brk);

/* Release all mmaps. */


2007-06-27 03:07:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

Davide Libenzi wrote:
> The following patch implements the sys_brk2() syscall, that nothing is
> other than a sys_brk() with an extra "flags" parameter. This can be used
> to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> pages if possible.

Since programs can get back free()d memory after a malloc(),
with the old contents of the memory intact, surely your
MAP_NONZERO behavior could be the default for programs that
can get away with it?

Maybe we could use some magic ELF header, similar to the
way non-executable stack is handled?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-06-27 03:33:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Tue, 26 Jun 2007, Rik van Riel wrote:

> Davide Libenzi wrote:
> > The following patch implements the sys_brk2() syscall, that nothing is
> > other than a sys_brk() with an extra "flags" parameter. This can be used
> > to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> > pages if possible.
>
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NONZERO behavior could be the default for programs that
> can get away with it?
>
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

Well, the quick&ugly glibc patch simply uses an environment variable, just
because I wanted to bench the kernel build with using the same glibc+gcc.
Yes, it can be the default behaviour for the allocator. The patch handles
calloc() correctly, by forcibly zeroing memory in such calls.
But other software must be taught too, to use MAP_NOZERO when they do not
need zeroed memory. I did that for the gcc garbage collector.



- Davide


2007-06-27 03:45:38

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/26/07, Rik van Riel <[email protected]> wrote:
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NONZERO behavior could be the default for programs that
> can get away with it?
>
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

No. This is an implementation detail of the libc version. The malloc
as compiled today is expecting brk-ed memory to be zeroed. This
default can of course be changed (it's a simple define) but you cannot
make this the default behavior for brk.

2007-06-27 03:48:33

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/26/07, Davide Libenzi <[email protected]> wrote:
> The following patch implements the sys_brk2() syscall, that nothing is
> other than a sys_brk() with an extra "flags" parameter.

Shouldn't we wait for Linus' sys_indirect to arrive and make this
another syscall which takes advantage of it?

2007-06-27 03:55:43

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Tue, 26 Jun 2007, Ulrich Drepper wrote:

> On 6/26/07, Davide Libenzi <[email protected]> wrote:
> > The following patch implements the sys_brk2() syscall, that nothing is
> > other than a sys_brk() with an extra "flags" parameter.
>
> Shouldn't we wait for Linus' sys_indirect to arrive and make this
> another syscall which takes advantage of it?

I acutally have the code for it, but I never posted it since it did not
receive a too warm review (and the only user was the fdmap thingy).
OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
brk2() would not be needed, no?



- Davide


2007-06-27 04:12:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

Ulrich Drepper wrote:
> On 6/26/07, Rik van Riel <[email protected]> wrote:
>> Since programs can get back free()d memory after a malloc(),
>> with the old contents of the memory intact, surely your
>> MAP_NONZERO behavior could be the default for programs that
>> can get away with it?
>>
>> Maybe we could use some magic ELF header, similar to the
>> way non-executable stack is handled?
>
> No. This is an implementation detail of the libc version. The malloc
> as compiled today is expecting brk-ed memory to be zeroed. This
> default can of course be changed (it's a simple define) but you cannot
> make this the default behavior for brk.

After going through the first malloc()/free() cycle, surely
the memory will no longer be zeroed on the second malloc() ?

What makes the first brk malloc so special?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-06-27 05:02:55

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/26/07, Davide Libenzi <[email protected]> wrote:
> I acutally have the code for it, but I never posted it since it did not
> receive a too warm review (and the only user was the fdmap thingy).

Only user of sys_indirect? There will be quite a few right away.
Every syscall that returns a file descriptor needs O_CLOEXEC support
(socket, pipe, epoll_create, ...)


> OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> brk2() would not be needed, no?

No. mmap calls create individual VMAs which gets expensive. There
are also some hardware drivers which get more expensive the more VMAs
there are. I want to go away as much as possible from mmap for
malloc.

2007-06-27 05:04:22

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/26/07, Rik van Riel <[email protected]> wrote:
> After going through the first malloc()/free() cycle, surely
> the memory will no longer be zeroed on the second malloc() ?

If returned to the system, sure.


> What makes the first brk malloc so special?

If the memory is zeroed it needs not be initialized by malloc. No
calloc zeroing, no pointer clearing.

Anyway, it's irrelevant what the benefits are, the fact is current
code depends on brk to zero the memory and you'd break the ABI if
you'd change it.

2007-06-27 12:33:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Tue, 26 Jun 2007, Ulrich Drepper wrote:
> On 6/26/07, Davide Libenzi <[email protected]> wrote:
>
> > OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> > brk2() would not be needed, no?
>
> No. mmap calls create individual VMAs which gets expensive. There
> are also some hardware drivers which get more expensive the more VMAs
> there are. I want to go away as much as possible from mmap for
> malloc.

Not so: if an mmap can be done by extending either adjacent vma (prot
and flags and file and offset all match up), that's what's done and no
separate vma is created. (And adjacent vmas get merged when mprotect
removes the difference in protection.)

I don't think there's any such reason to prefer brk to mmap. But you
may have encountered something which we in the kernel are thinking of
as an insignificant corner case, which is actually breaking things up
badly in practice (I recall the kernel's internal VM_ACCOUNT flag,
relating to non-overcommit accounting, which might get turned on at
any time, sometimes preventing a vma merge you'd otherwise expect).
Please let me know if you've a test case which shows more vmas than
expected.

Hugh

2007-06-27 16:00:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> On Tue, 26 Jun 2007, Ulrich Drepper wrote:
> > On 6/26/07, Davide Libenzi <[email protected]> wrote:
> >
> > > OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> > > brk2() would not be needed, no?
> >
> > No. mmap calls create individual VMAs which gets expensive. There
> > are also some hardware drivers which get more expensive the more VMAs
> > there are. I want to go away as much as possible from mmap for
> > malloc.
>
> Not so: if an mmap can be done by extending either adjacent vma (prot
> and flags and file and offset all match up), that's what's done and no
> separate vma is created. (And adjacent vmas get merged when mprotect
> removes the difference in protection.)
>
> I don't think there's any such reason to prefer brk to mmap. But you
> may have encountered something which we in the kernel are thinking of
> as an insignificant corner case, which is actually breaking things up
> badly in practice (I recall the kernel's internal VM_ACCOUNT flag,
> relating to non-overcommit accounting, which might get turned on at
> any time, sometimes preventing a vma merge you'd otherwise expect).
> Please let me know if you've a test case which shows more vmas than
> expected.

The only way I can see vma fragmentation happen in that case, is if
userspace uses a mixture of mmaps and mallocs, and flags+prots of the two
does not match. The glibc allocator seems to support it just fine. There's
a macro where you specify if the heaps are contiguous or not.


- Davide


2007-06-27 16:06:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Tue, 26 Jun 2007, Ulrich Drepper wrote:

> On 6/26/07, Davide Libenzi <[email protected]> wrote:
> > I acutally have the code for it, but I never posted it since it did not
> > receive a too warm review (and the only user was the fdmap thingy).
>
> Only user of sys_indirect? There will be quite a few right away.
> Every syscall that returns a file descriptor needs O_CLOEXEC support
> (socket, pipe, epoll_create, ...)

Ok, I'll try to add flags support to those "fd generators" over the code I
have. That is only i386 nd x86-64 ATM (sys_indirect requires a thin asm
wrapper - gcc asm will not do it for me).



- Davide


2007-06-27 17:01:55

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/27/07, Hugh Dickins <[email protected]> wrote:
> Not so: if an mmap can be done by extending either adjacent vma (prot
> and flags and file and offset all match up), that's what's done and no
> separate vma is created. (And adjacent vmas get merged when mprotect
> removes the difference in protection.)

mmap return values are randomized. If they would be mergable
something would be wrong.


> I don't think there's any such reason to prefer brk to mmap.

Talk to the Quadrics people. Some of their interconnect adapters
incur significant costs with separate VMAs.


> Please let me know if you've a test case which shows more vmas than
> expected.

Not related to this, but we already have "too many" VMAs for some
definition of "too many". Since we split VMA when the protection
changes each thread stack consists at least of two VMA (three for
ia64). There was an approach at some point to push the access flags
down and allow VMAs to stretch further. Why was this deemed
unsuitable? It could have quite a bit with situations with many
threads (Java). As I've told back when this came up, we had one
customer where the VMA lookup actually showed up in profiles.

2007-06-27 17:43:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <[email protected]> wrote:
> > Not so: if an mmap can be done by extending either adjacent vma (prot
> > and flags and file and offset all match up), that's what's done and no
> > separate vma is created. (And adjacent vmas get merged when mprotect
> > removes the difference in protection.)
>
> mmap return values are randomized. If they would be mergable
> something would be wrong.

The absolute virtual addresses are randomized, yes; but do a sequence
of mmaps and they come back adjacent to each other, and so mergable.
Or does your distro include a kernel patch to randomize them relative
to each other?

> > I don't think there's any such reason to prefer brk to mmap.
>
> Talk to the Quadrics people. Some of their interconnect adapters
> incur significant costs with separate VMAs.
>
> > Please let me know if you've a test case which shows more vmas than
> > expected.
>
> Not related to this, but we already have "too many" VMAs for some
> definition of "too many". Since we split VMA when the protection
> changes each thread stack consists at least of two VMA (three for
> ia64). There was an approach at some point to push the access flags
> down and allow VMAs to stretch further. Why was this deemed
> unsuitable?

Blaisorblade's patch, yes. I disliked it at the time I reviewed
it a couple of years back: it made a system call many of us regret
even nastier, and relied on tiresome per-arch pagetable encodings.
It seemed likely to cause us trouble down the line - though
undoubtedly it provides real benefits too.

I never found time to review it when it came around again recently,
thought Blaisorblade might prefer me to turn a blind eye to it ;)
and let others opine instead. But I think noone else found time to
review it either. I expect it's rather nicer now, but doubt it's
actually attractive.

It _might_ turn out to be more attractive, not to rely on that
peculiar sys_remap_file_pages, but make all our vmas independent
of protections, and hang differently protected ranges off them.
Maybe.

> It could have quite a bit with situations with many
> threads (Java). As I've told back when this came up, we had one
> customer where the VMA lookup actually showed up in profiles.

In honesty, I should add that I dislike and distrust Davide's
MAP_NOZERO very much indeed! Would much rather leave my cpus
spending a little time in clear_page(). A uid in struct page
(though I'm sure we could find somewhere to tuck it away) -
the horror, the horror! But I've so far failed to find a killer
argument against it, and am hoping for someone else to do so.

Curmudgeonly,
Hugh

2007-06-27 18:14:08

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> In honesty, I should add that I dislike and distrust Davide's
> MAP_NOZERO very much indeed! Would much rather leave my cpus
> spending a little time in clear_page(). A uid in struct page
> (though I'm sure we could find somewhere to tuck it away) -
> the horror, the horror! But I've so far failed to find a killer
> argument against it, and am hoping for someone else to do so.

Little time? Please, do not trust me. Start oprofile and run a kernel
build. Look, I'm not even talking about som micro benchmark explicitly
built to exploit the thing. A kernel build.
You will find clear_page to be the *1st* kernel entry after cc1 and as.
That is bad for two reasons. The time it spends in there, and the cache it
blows.



- Davide


2007-06-27 18:33:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Davide Libenzi wrote:
> On Wed, 27 Jun 2007, Hugh Dickins wrote:
>
> > In honesty, I should add that I dislike and distrust Davide's
> > MAP_NOZERO very much indeed! Would much rather leave my cpus
> > spending a little time in clear_page(). A uid in struct page
> > (though I'm sure we could find somewhere to tuck it away) -
> > the horror, the horror! But I've so far failed to find a killer
> > argument against it, and am hoping for someone else to do so.
>
> Little time? Please, do not trust me. Start oprofile and run a kernel
> build. Look, I'm not even talking about som micro benchmark explicitly
> built to exploit the thing. A kernel build.
> You will find clear_page to be the *1st* kernel entry after cc1 and as.
> That is bad for two reasons. The time it spends in there, and the cache it
> blows.

I don't doubt that it shows real benefits; but dangerously cutting
corners usually shows benefits too. Relying on a uid at this level
feels very wrong to me - but as I said, I've not found a killer
argument against it.

And we both know that clear_page features so high in part because
it's bringing cachelines in from the cold, which are about to be
accessed again by userspace; so it's often not so bad as it appears.

Though I probably wouldn't be citing that argument if we were
talking about offloading clear_page to another engine.

Hugh

2007-06-27 18:45:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > On Wed, 27 Jun 2007, Hugh Dickins wrote:
> >
> > > In honesty, I should add that I dislike and distrust Davide's
> > > MAP_NOZERO very much indeed! Would much rather leave my cpus
> > > spending a little time in clear_page(). A uid in struct page
> > > (though I'm sure we could find somewhere to tuck it away) -
> > > the horror, the horror! But I've so far failed to find a killer
> > > argument against it, and am hoping for someone else to do so.
> >
> > Little time? Please, do not trust me. Start oprofile and run a kernel
> > build. Look, I'm not even talking about som micro benchmark explicitly
> > built to exploit the thing. A kernel build.
> > You will find clear_page to be the *1st* kernel entry after cc1 and as.
> > That is bad for two reasons. The time it spends in there, and the cache it
> > blows.
>
> I don't doubt that it shows real benefits; but dangerously cutting
> corners usually shows benefits too. Relying on a uid at this level
> feels very wrong to me - but as I said, I've not found a killer
> argument against it.

The reason why I posted is exactly so other ppl can look at it and find
possible flaws in the way pages and retired. If an effective UID was able
to see (or it generated) the data on that page, it should be able to get
that page back uncleared (when VM_NOZERO is set).
>From a performance POV, a 2-3% boost on a non-micro-bench test like a
kernel build is not exaclty peanuts. And for more heavy malloc/anon-mmap
appliations, the boost goes up to 10-15%. That is not exactly what I
call "little time" ;)



- Davide


2007-06-27 18:52:38

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On 6/27/07, Hugh Dickins <[email protected]> wrote:
> The absolute virtual addresses are randomized, yes; but do a sequence
> of mmaps and they come back adjacent to each other, and so mergable.
> Or does your distro include a kernel patch to randomize them relative
> to each other?

Each individual mmap is supposed to be randomized, yes. If this
doesn't happen in one of our kernels right now something broken. You
wouldn't have effective ASLR if all relative address remain the same.


> It _might_ turn out to be more attractive, not to rely on that
> peculiar sys_remap_file_pages, but make all our vmas independent
> of protections, and hang differently protected ranges off them.
> Maybe.

That's what I think is done or at least should be done.

If you want to be shocked, look at some really big Java apps.
Hundreds or thousands of threads, lots of mmap allocation. We might
have 10,000 VMAs. Searching becomes a problem and if the protection
information be stored somewhere else and you safe the VMA data
structures there is even some memory saving possible.

I definitely thing that this is an area which warrants looking at. We
haven't yet seen the worst of VMA usage. The move to 64-bit is only
just beginning and wait what people think they can do with 48+ bits of
address space.

2007-06-27 19:00:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <[email protected]> wrote:
>> Not so: if an mmap can be done by extending either adjacent vma (prot
>> and flags and file and offset all match up), that's what's done and no
>> separate vma is created. (And adjacent vmas get merged when mprotect
>> removes the difference in protection.)
>
> mmap return values are randomized. If they would be mergable
> something would be wrong.

You can easily pass a hint address to mmap(), causing it to
extend a malloc arena instead of creating a new VMA.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-06-27 19:23:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Rik van Riel wrote:

> Ulrich Drepper wrote:
> > On 6/27/07, Hugh Dickins <[email protected]> wrote:
> > > Not so: if an mmap can be done by extending either adjacent vma (prot
> > > and flags and file and offset all match up), that's what's done and no
> > > separate vma is created. (And adjacent vmas get merged when mprotect
> > > removes the difference in protection.)
> >
> > mmap return values are randomized. If they would be mergable
> > something would be wrong.
>
> You can easily pass a hint address to mmap(), causing it to
> extend a malloc arena instead of creating a new VMA.

You mean something like the F_DUPFD, where you pass an fd and that serves
as from-there-up hint?



- Davide


2007-06-27 19:32:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <[email protected]> wrote:
> > The absolute virtual addresses are randomized, yes; but do a sequence
> > of mmaps and they come back adjacent to each other, and so mergable.
> > Or does your distro include a kernel patch to randomize them relative
> > to each other?
>
> Each individual mmap is supposed to be randomized, yes. If this
> doesn't happen in one of our kernels right now something broken. You
> wouldn't have effective ASLR if all relative address remain the same.

Ah, I guess it's an exec_shield thing: vanilla kernel gives us
adjacent mmaps which are mergable. As Rik suggest, to make your
mmaps adjacent, use the mmap(addr,...) hint. (But please don't
then complain that they're not relatively randomized ;)

> > It _might_ turn out to be more attractive, not to rely on that
> > peculiar sys_remap_file_pages, but make all our vmas independent
> > of protections, and hang differently protected ranges off them.
> > Maybe.
>
> That's what I think is done or at least should be done.

Noted.

Hugh

2007-06-27 22:12:04

by Nicholas Miell

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 2007-06-27 at 11:45 -0700, Davide Libenzi wrote:
> On Wed, 27 Jun 2007, Hugh Dickins wrote:
>
> > On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > > On Wed, 27 Jun 2007, Hugh Dickins wrote:
> > >
> > > > In honesty, I should add that I dislike and distrust Davide's
> > > > MAP_NOZERO very much indeed! Would much rather leave my cpus
> > > > spending a little time in clear_page(). A uid in struct page
> > > > (though I'm sure we could find somewhere to tuck it away) -
> > > > the horror, the horror! But I've so far failed to find a killer
> > > > argument against it, and am hoping for someone else to do so.
> > >
> > > Little time? Please, do not trust me. Start oprofile and run a kernel
> > > build. Look, I'm not even talking about som micro benchmark explicitly
> > > built to exploit the thing. A kernel build.
> > > You will find clear_page to be the *1st* kernel entry after cc1 and as.
> > > That is bad for two reasons. The time it spends in there, and the cache it
> > > blows.
> >
> > I don't doubt that it shows real benefits; but dangerously cutting
> > corners usually shows benefits too. Relying on a uid at this level
> > feels very wrong to me - but as I said, I've not found a killer
> > argument against it.
>
> The reason why I posted is exactly so other ppl can look at it and find
> possible flaws in the way pages and retired. If an effective UID was able
> to see (or it generated) the data on that page, it should be able to get
> that page back uncleared (when VM_NOZERO is set).
> >From a performance POV, a 2-3% boost on a non-micro-bench test like a
> kernel build is not exaclty peanuts. And for more heavy malloc/anon-mmap
> appliations, the boost goes up to 10-15%. That is not exactly what I
> call "little time" ;)
>
> - Davide
>

I don't think the security issues with this will ever make it
worthwhile.

Consider:

1) euid is not sufficient, you need to store away arbitrary LSM
information and call LSM hooks to decide security equivalence. The same
applies to VServer or whatever other container system you use.

2) Two processes, A and B, are in separate VFS namespaces but have
equivalent security identity according to LSM. Process A reads data from
file F which is not visible in process's B's namespace. You have to
prevent process B from ever getting a page that once contained data from
file F.

3) mlock() is often used by programs like GPG to prevent decrypted
secret keys from ever getting swapped out. You need to zero all
once-mlocked pages before they get reused to prevent that page from
getting swapped to disk or application bugs from leaking the key.

--
Nicholas Miell <[email protected]>

2007-06-28 00:18:12

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Nicholas Miell wrote:

> 1) euid is not sufficient, you need to store away arbitrary LSM
> information and call LSM hooks to decide security equivalence. The same
> applies to VServer or whatever other container system you use.

The EUID that is used now, can easily be any cookie. It can be an LSM
cookie (if LSM is active in the system). We don't do complex checks, like
group permission & Co. We assume that if a UID-cookie had such data
available (or it generated it), it can have it back uncleared.



> 2) Two processes, A and B, are in separate VFS namespaces but have
> equivalent security identity according to LSM. Process A reads data from
> file F which is not visible in process's B's namespace. You have to
> prevent process B from ever getting a page that once contained data from
> file F.

They have the *same* security identity. It means that at any time such
security identity can access resources on both VFS (if it is allowed to
access such resources - according to security rules in place, LSM or not).
Data is either generated by the security identity, or it is faulted in
(and it means that the security identity had the GO from the security
provisioning to access such resource).



> 3) mlock() is often used by programs like GPG to prevent decrypted
> secret keys from ever getting swapped out. You need to zero all
> once-mlocked pages before they get reused to prevent that page from
> getting swapped to disk or application bugs from leaking the key.

GPG and other security software do also memclear on top of mlock, to
prevent such memory staying alive at all. Just for example, you don't want
in any case that after an munlock you app core and data goes down.



- Davide


2007-06-28 02:58:22

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007, Davide Libenzi wrote:

> On Wed, 27 Jun 2007, Nicholas Miell wrote:
>
> > 1) euid is not sufficient, you need to store away arbitrary LSM
> > information and call LSM hooks to decide security equivalence. The same
> > applies to VServer or whatever other container system you use.
>
> The EUID that is used now, can easily be any cookie. It can be an LSM
> cookie (if LSM is active in the system). We don't do complex checks, like
> group permission & Co. We assume that if a UID-cookie had such data
> available (or it generated it), it can have it back uncleared.

(looking through the LSM/SeLinux jungle)
Also, LSM/SeLinux could disable completely the feature, at request. Just
assign a known-to-be-invalid UID to mm->owner_uid (passign through
an(other) hook), and pages will never be recycled.



- Davide


2007-06-30 07:52:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()

On Wed, 27 Jun 2007 15:11:49 -0700 Nicholas Miell <[email protected]> wrote:

> I don't think the security issues with this will ever make it
> worthwhile.

eh, security issues are a corner case.

The vast majority of Linux machines are used by a single user who has admin
access anyway. This includes all embedded, all consumer and most laptop
and desktop.

So a reasonable way of getting the benefit of this change into most
people's hands is to forget about the uid/euid issues altogether and just
have a big fat knob which enables this feature, system-wide. (Radical,
huh. But then, I liked single user linux.)

A significant problem I see with any such approach is that it yet again
weakens the overall testing and QA effort: libc and the kernel now need to
be tested with and without this feature, and it's yet another question to
be asked of the bug reporters.

(But please take none of this as endorsement. For some reason the whole
thing gives me the creepies).