I was using oprofile to sample some userspace code I am working on,
and I was continuously noticing clear_page in the top three entries
of the oprofile logs.
Also, a simple kernel build on my dual Opteron with 8GB of RAM
shows clear_page as the first kernel entry, second only to the
userspace cc1 and as.
Most userspace code uses malloc() (and anonymous mappings) in
such a way that the memory returned via kernel->glibc is written
almost immediately. The POSIX definition of malloc() also does
not require the returned memory to be zeroed (as calloc() does).
So I implemented a rather quick hack that introduces a new mmap() flag
MAP_NOZERO (only valid for anonymous mappings) and the vma counter-part
VM_NOZERO. Also, a new sys_brk2() has been introduced to accept a new
flags parameter. A brief description of the patches follows in the next
emails.
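For illustration only, here is a minimal sketch of how a caller might request such a mapping. MAP_NOZERO is the flag proposed in this thread and is not defined by mainline headers, so the sketch defines it as 0 (an assumption made purely so it builds and runs on an unpatched system, where it degrades to a normal zeroed mapping). The helper names alloc_scratch() and fill_and_release() are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* MAP_NOZERO is the flag proposed in this thread; it does not exist in
 * mainline kernels. Defining it as 0 is an assumption that turns the
 * request into a plain anonymous mapping on an unpatched system. */
#ifndef MAP_NOZERO
#define MAP_NOZERO 0
#endif

/* Map len bytes that the caller promises to overwrite before reading.
 * Returns NULL on failure. */
static void *alloc_scratch(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NOZERO, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Typical malloc()-style use: the buffer is filled right away, so any
 * zeroing done by the kernel beforehand would have been wasted work.
 * Returns 0 on success, -1 on failure. */
static int fill_and_release(size_t len, unsigned char value)
{
    unsigned char *buf = alloc_scratch(len);
    if (buf == NULL)
        return -1;
    memset(buf, value, len);
    int ok = (buf[0] == value && buf[len - 1] == value);
    munmap(buf, len);
    return ok ? 0 : -1;
}
```

The caller never reads the buffer before writing it, which is the only usage pattern for which a non-zeroed page is harmless.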
I first hacked Val's ebizzy to accept a new '-N' flag to make use of
MAP_NOZERO:
http://infohost.nmt.edu/~val/patches/ebizzy.tar.gz
http://www.xmailserver.org/ebizzy-nzmmap-0.3.diff
On my box, ebizzy performance improved by 10% to 15%.
The userspace code I am working on (which uses malloc() quite heavily) saw
a performance jump of around 14%.
In both cases, clear_page dropped way down in the oprofile logs.
I then coded quick (and rather ugly) hacks for glibc and gcc to
make them use the new features (MAP_NOZERO and sys_brk2()):
http://www.xmailserver.org/glibc-nzmalloc-tweaks
http://www.xmailserver.org/gcc-nozero-hack
I then tried a 2.6.22-rc5 kernel build using the newly built glibc
and gcc (with and w/out no-zero enabling options/env-vars), and
when using the no-zero mode, clear_page went way down in the oprofile
logs and build time dropped by about 2.5% to 3%.
I did not have the time (or the will) to tweak as and ld as well.
These are some test utilities to verify the no-zero behaviour of MAP_NOZERO
(and sys_brk2()):
http://www.xmailserver.org/nzmmap-test.c
http://www.xmailserver.org/nzmalloc-test.c
http://www.xmailserver.org/smiffy.c
To run nzmalloc-test you need a patched glibc (using glibc-nzmalloc-tweaks).
The smiffy one should be run under a user that has no other processes
running and that owns no files on the system. It verifies that all the
pages it gets from the kernel are zeroed (otherwise "Houston, we have a problem ...").
It has been running on my system without barfing for more than two days.
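The kind of check described above can be sketched roughly as follows. This is not smiffy.c itself, just a minimal illustration under the assumption that the checker maps ordinary anonymous memory (without MAP_NOZERO) and scans each page for nonzero bytes. The function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if every byte in [p, p + len) is zero, 0 otherwise. */
static int region_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* Map npages of ordinary anonymous memory and count the pages that
 * contain any nonzero byte. A nonzero count would be the "Houston, we
 * have a problem" case. Returns -1 if the mapping cannot be created. */
static long count_dirty_pages(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = npages * (size_t)page;
    unsigned char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    long dirty = 0;
    for (size_t i = 0; i < npages; i++)
        if (!region_is_zero(base + i * (size_t)page, (size_t)page))
            dirty++;

    munmap(base, len);
    return dirty;
}
```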
How crazy is that?
- Davide
ChangeLog:
* Version 2
o Reusing _mapcount instead of adding a new field in the page struct
o Added a fix for a setuid+exec/ptrace race (Andy spotted)
On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote:
> I was using oprofile to sample some userspace code I am working on,
> and I was continuously noticing clear_page in the top three
> entries of the oprofile logs.
>
> Also, a simple kernel build on my dual Opteron with 8GB of RAM
> shows clear_page as the first kernel entry, second only to the
> userspace cc1 and as. Most userspace code uses malloc()
> (and anonymous mappings) in such a way that the memory returned
> via kernel->glibc is written almost immediately. The POSIX
> definition of malloc() also does not require the returned
> memory to be zeroed (as calloc() does).
>
> So I implemented a rather quick hack that introduces a new mmap()
> flag MAP_NOZERO (only valid for anonymous mappings) and the vma
> counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced
> to accept a new flags parameter. A brief description of the
> patches follows in the next emails.
Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some
kind for security sensitive applications. Basically, I wouldn't want
my ssh-agent pages holding private SSH keys to be reused by my web
browser which then gets exploited :-D. It would also be a massive
information leak under SELinux. To fix it properly according to the
SELinux model you would need to tag each page with a label
immediately after it's freed and then do an access-vector-check
against the old page and the new process before allowing reuse. On
the other hand, that would probably be at least as expensive as
zeroing the page.
Cheers,
Kyle Moffett
Kyle Moffett wrote:
> On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote:
>> I was using oprofile to sample some userspace code I am working on,
>> and I was continuously noticing clear_page in the top three entries
>> of the oprofile logs.
>>
>> Also, a simple kernel build on my dual Opteron with 8GB of RAM
>> shows clear_page as the first kernel entry, second only to the
>> userspace cc1 and as. Most userspace code uses malloc()
>> (and anonymous mappings) in such a way that the memory returned via
>> kernel->glibc is written almost immediately. The POSIX definition
>> of malloc() also does not require the returned memory to be
>> zeroed (as calloc() does).
>>
>> So I implemented a rather quick hack that introduces a new mmap() flag
>> MAP_NOZERO (only valid for anonymous mappings) and the vma
>> counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced to
>> accept a new flags parameter. A brief description of the patches
>> follows in the next emails.
>
> Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some kind
> for security sensitive applications.
That wants MAP_PRIVATE so that the kernel can also decide to not
swap these pages out to an unencrypted swap area.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
On 6/28/07, Rik van Riel wrote:
> That wants MAP_PRIVATE so that the kernel can also decide to not
> swap these pages out to an unencrypted swap area.
That's not what MAP_PRIVATE means. MAP_PRIVATE is the opposite of
MAP_SHARED. It's meaningless for anonymous memory (which is what
ssh-agent etc would use) and for file-backed data it definitely allows
swapping unless you use mlock().
On Thu, 28 Jun 2007, Kyle Moffett wrote:
> On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote:
> > I was using oprofile to sample some userspace code I am working on, and I
> > was continuously noticing clear_page in the top three entries of the
> > oprofile logs.
> >
> > Also, a simple kernel build on my dual Opteron with 8GB of RAM shows
> > clear_page as the first kernel entry, second only to the userspace cc1
> > and as. Most userspace code uses malloc() (and anonymous mappings)
> > in such a way that the memory returned via kernel->glibc is written
> > almost immediately. The POSIX definition of malloc() also does not
> > require the returned memory to be zeroed (as calloc() does).
> >
> > So I implemented a rather quick hack that introduces a new mmap() flag
> > MAP_NOZERO (only valid for anonymous mappings) and the vma counter-part
> > VM_NOZERO. Also, a new sys_brk2() has been introduced to accept a new flags
> > parameter. A brief description of the patches follows in the next emails.
>
> Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some kind for
> security sensitive applications. Basically, I wouldn't want my ssh-agent
> pages holding private SSH keys to be reused by my web browser which then gets
> exploited :-D
Well, if your web browser and your ssh session are running under the same
user, and your web browser gets hacked, someone is basically logged in to
your system as your user. They can ptrace-attach to your ssh-agent and take
it from there. They can also, more simply, read your .ssh directory for
that matter ;)
> It would also be a massive information leak under SELinux. To
> fix it properly according to the SELinux model you would need to tag each page
> with a label immediately after it's freed and then do an access-vector-check
> against the old page and the new process before allowing reuse. On the other
> hand, that would probably be at least as expensive as zeroing the page.
SELinux could use a simple hook and disable the feature per-task and
globally. Just assign an invalid UID to mm->owner_uid and pages will never
be reused. It could also hook mmap and clear off MAP_NOZERO.
- Davide
On Thu, Jun 28, 2007 at 10:57:00PM -0400, Kyle Moffett wrote:
> On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote:
> >So I implemented a rather quick hack that introduces a new mmap()
> >flag MAP_NOZERO (only valid for anonymous mappings) and the vma
> >counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced
> >to accept a new flags parameter. A brief description of the
> >patches follows in the next emails.
>
> Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some
> kind for security sensitive applications. Basically, I wouldn't want
> my ssh-agent pages holding private SSH keys to be reused by my web
> browser which then gets exploited :-D.
PGP at least (and I think GPG still) did overwrite keys before calling
free(), and attempted to use mlock(). Looks like ssh-agent doesn't use
mlock -- at least it hasn't in this case:
% grep Lck /proc/`pidof ssh-agent`/status
VmLck: 0 kB
% ulimit -a | grep lock
file size (blocks) unlimited
core file size (blocks) 0
locked-in-memory size (kb) 32
file locks unlimited
Requiring security-sensitive apps to use a new flag to get safe behavior
is dangerous. Better to be safe by default and turn on the
less-safe-but-faster behavior for the cases that benefit from it.
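As a rough illustration of the "overwrite before free() and try mlock()" pattern mentioned above, here is a minimal sketch. It is not taken from PGP, GPG, or ssh-agent. The struct secret_buf type and its helpers are hypothetical names introduced only for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Overwrite through a volatile pointer so the compiler cannot discard
 * the stores as dead writes ahead of free(). */
static void wipe(void *p, size_t len)
{
    volatile unsigned char *v = p;
    while (len--)
        *v++ = 0;
}

/* Hypothetical holder for key material. */
struct secret_buf {
    unsigned char *data;
    size_t len;
    int locked;             /* 1 if mlock() succeeded */
};

/* Allocate the buffer and try to pin it in RAM. Returns 0 on success,
 * -1 if the allocation itself fails. */
static int secret_buf_init(struct secret_buf *s, size_t len)
{
    s->data = malloc(len);
    if (s->data == NULL)
        return -1;
    s->len = len;
    /* mlock() may fail when RLIMIT_MEMLOCK is small (32 kB in the ulimit
     * output above); the result is recorded rather than treated as fatal. */
    s->locked = (mlock(s->data, len) == 0);
    return 0;
}

/* Wipe first, then unlock and free, so the freed memory never carries
 * the secret regardless of how the kernel later hands the page out. */
static void secret_buf_destroy(struct secret_buf *s)
{
    wipe(s->data, s->len);
    if (s->locked)
        munlock(s->data, s->len);
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

An application that wipes its own buffers this way does not depend on the kernel zeroing pages, but as noted above, relying on every sensitive application to do so is the less safe default.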
> It would also be a massive
> information leak under SELinux. To fix it properly according to the
> SELinux model you would need to tag each page with a label
> immediately after it's freed and then do an access-vector-check
> against the old page and the new process before allowing reuse. On
> the other hand, that would probably be at least as expensive as
> zeroing the page.
I still think that using uid in mm_struct is wrong, and some kind of
abstraction is required. I called this "free pool" in
<[email protected]>, but I think that name is
misleading -- I am not proposing that this should be part of the
management of free pages, but should be a label which abstracts "safe to
share freed pages among" groups. Then different SELinux protection
domains would simply have different labels.
-andy
On Fri, 29 Jun 2007, Andy Isaacson wrote:
> On Thu, Jun 28, 2007 at 10:57:00PM -0400, Kyle Moffett wrote:
> > On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote:
> > >So I implemented a rather quick hack that introduces a new mmap()
> > >flag MAP_NOZERO (only valid for anonymous mappings) and the vma
> > >counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced
> > >to accept a new flags parameter. A brief description of the
> > >patches follows in the next emails.
> >
> > Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some
> > kind for security sensitive applications. Basically, I wouldn't want
> > my ssh-agent pages holding private SSH keys to be reused by my web
> > browser which then gets exploited :-D.
>
> PGP at least (and I think GPG still) did overwrite keys before calling
> free(), and attempted to use mlock(). Looks like ssh-agent doesn't use
> mlock -- at least it hasn't in this case:
> % grep Lck /proc/`pidof ssh-agent`/status
> VmLck: 0 kB
> % ulimit -a | grep lock
> file size (blocks) unlimited
> core file size (blocks) 0
> locked-in-memory size (kb) 32
> file locks unlimited
>
> Requiring security-sensitive apps to use a new flag to get safe behavior
> is dangerous. Better to be safe by default and turn on the
> less-safe-but-faster behavior for the cases that benefit from it.
Can you better explain what MAP_NOZERO would alter in such a case?
> I still think that using uid in mm_struct is wrong, and some kind of
> abstraction is required. I called this "free pool" in
> <[email protected]>, but I think that name is
> misleading -- I am not proposing that this should be part of the
> management of free pages, but should be a label which abstracts "safe to
> share freed pages among" groups. Then different SELinux protection
> domains would simply have different labels.
I think I answered this one at least a couple of times, but anyway. First,
that can be whatever cookie we choose. At the moment the UID is used because
it fits more easily into _mapcount. Second, SELinux will be able to
disable the feature on a per-process basis, or globally.
Anything else?
- Davide
On Jun 29, 2007, at 16:12:58, Davide Libenzi wrote:
> On Fri, 29 Jun 2007, Andy Isaacson wrote:
>> I still think that using uid in mm_struct is wrong, and some kind
>> of abstraction is required. I called this "free pool" in
>> <[email protected]>, but I think that name is
>> misleading -- I am not proposing that this should be part of the
>> management of free pages, but should be a label which abstracts
>> "safe to share freed pages among" groups. Then different SELinux
>> protection domains would simply have different labels.
>
> I think I answered this one at least a couple of times, but anyway.
> First, that can be whatever cookie we choose. At the moment the UID is
> used because it fits more easily into _mapcount. Second, SELinux
> will be able to disable the feature on a per-process basis, or
> globally.
>
> Anything else?
Well I would be very interested in actually being able to use this
feature under SELinux, I think that just the underlying "can-I-use-
this-page" logic needs modification. Maybe "MAP_REUSABLE"? That
would both imply that we can accept reused (IE: nonzeroed) memory
*AND* that the current page may be reused (IE: remapped without
zeroing), although you could conceivably have one flag for each
case. The userspace allocator should be able to (when prompted by
MAP_REUSABLE) look in a page "pool" of sorts before falling back to a
zeroed page. The pool would be created for a given "key" the first
time it unmaps MAP_REUSABLE pages, possibly using the memory freed by
said unmap.
The real trick is how to define the "key". The default, without
LSMs, should be something like the UID. SELinux, on the other hand,
would probably want to use some kind of hash of the label as the
"key", (and store the label in each pool, as well). That way SELinux
could have a simple access-vector check for process:reusepage, as
well as an access-vector check and type transition for
"freereusablepage". Then a policy could allow most user processes to
unconditionally reuse pages (which would end up in the access-vector-
cache and therefore be fast), while security-sensitive processes like
ssh-agent could neither reuse pages nor have their pages reused, even
if they request it.
Cheers,
Kyle Moffett
On Fri, 29 Jun 2007, Kyle Moffett wrote:
> Well I would be very interested in actually being able to use this feature
> under SELinux, I think that just the underlying "can-I-use-this-page" logic
> needs modification. Maybe "MAP_REUSABLE"? That would both imply that we can
> accept reused (IE: nonzeroed) memory *AND* that the current page may be reused
> (IE: remapped without zeroing), although you could conceivably have one flag
> for each case. The userspace allocator should be able to (when prompted by
> MAP_REUSABLE) look in a page "pool" of sorts before falling back to a zeroed
> page. The pool would be created for a given "key" the first time it unmaps
> MAP_REUSABLE pages, possibly using the memory freed by said unmap.
Hmm, why would you need MAP_REUSABLE? If a page is visible at any time for
a given UID, and you have a login under such UID, you can fetch the content
of the page at any time (ie, ptrace_attach, gdb, ...). And if you are
not under a UID login, you'll never get to see that page. ATM not even the
classical "root can see everything" rule is applied.
I think the focus should be to find a case where, under the currently
implemented policy for MAP_NOZERO, MAP_NOZERO represents a loss of security
WRT no MAP_NOZERO. I have not been able to find one yet, although Andy
found a potential one in the setuid+exec/ptrace race (fixed by a patch
that should IMO go in in any case).
The more ppl think about breaking it, the better it is.
> The real trick is how to define the "key". The default, without LSMs, should
> be something like the UID. SELinux, on the other hand, would probably want to
> use some kind of hash of the label as the "key", (and store the label in each
> pool, as well). That way SELinux could have a simple access-vector check for
> process:reusepage, as well as an access-vector check and type transition for
> "freereusablepage". Then a policy could allow most user processes to
> unconditionally reuse pages (which would end up in the access-vector-cache and
> therefore be fast), while security-sensitive processes like ssh-agent could
> neither reuse pages nor have their pages reused, even if they request it.
It is very easy, assuming a simple unsigned long cookie is enough for
SELinux, to fit this into the current MAP_NOZERO. Well, we have to change
something in the current struct page _mapcount reuse, but that is doable.
There is one line to change, that is the line where the cookie is assigned
to the mm_struct. From there on, it's all handled the same way.
If the hash is any longer than unsigned long, I don't really think it is ever
gonna fly, since it is stored inside a struct page.
Also, if you start introducing "keys" whose domain is wider than a single
user, then you'll surely run into some sort of problem. This is why the
current code does not even try to go into the "group" policies.
IMO all this may make sense if 1) it is very simple 2) it limits code
and data structure bloat to very little or nothing 3) it stays all the way
on the safe side, at the cost of losing some possibly valid pages that could
be recycled. After all, MAP_NOZERO is a hint and not a requirement.
I think that the current method (either UID or KEY based) is simple, does
not add extra management pools and, *so far*, is not showing security
leaks.
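To make the cookie idea concrete, here is a small userspace model. It is only a sketch of the policy as described in this thread, not the actual patch: struct model_page, COOKIE_NONE, and the model_* functions are hypothetical, and the real code keeps the cookie in the struct page _mapcount field rather than in a separate member.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096
/* Assumed "never reuse" marker, standing in for an invalid UID. */
#define COOKIE_NONE ((unsigned long)-1)

struct model_page {
    unsigned long cookie;                  /* owner recorded at free time */
    unsigned char data[MODEL_PAGE_SIZE];   /* stale contents, if any */
};

/* On free: remember which owner the stale contents belong to. */
static void model_free_page(struct model_page *pg, unsigned long owner_cookie)
{
    pg->cookie = owner_cookie;
}

/* On allocation: skip clearing only when the caller asked for a
 * non-zeroed page AND its cookie matches the page's previous owner.
 * In every other case stay on the safe side and zero the page.
 * Returns 1 if the page was zeroed, 0 if it was handed over as is. */
static int model_alloc_page(struct model_page *pg, unsigned long mm_cookie,
                            int nozero)
{
    int reuse = nozero && mm_cookie != COOKIE_NONE && pg->cookie == mm_cookie;

    if (!reuse)
        memset(pg->data, 0, sizeof(pg->data));
    pg->cookie = COOKIE_NONE;              /* page is in use again */
    return !reuse;
}
```

Because a mismatch always falls back to zeroing, the flag stays a hint: the worst case is a missed optimization, never an unzeroed page from another owner.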
- Davide
On Jun 30, 2007, at 15:03:07, Davide Libenzi wrote:
> Hmm, why would you need MAP_REUSABLE? If a page is visible at any
> time for a given UID, and you have a login under such UID, you can
> fetch the content of the page at any time (ie, ptrace_attach,
> gdb, ...).
Not under SELinux or other LSMs. I suppose those could live without
a 15% performance improvement in some workloads, but it would be nice
if we could avoid it. Essentially, UID is a really poor way to
define process-security-equivalence classes in some systems. If you
really want to define such classes then you need to add LSM hooks to
manage the equivalence classes.
> I think the focus should be to find a case where under the
> currently implemented policy for MAP_NOZERO, MAP_NOZERO represents a
> loss of security WRT no MAP_NOZERO.
Very simple case:
SELinux is turned on, an s9 (IE: TOP_SECRET) process calls free(),
and an s3 (IE: UNCLASSIFIED) process calls malloc(), getting the data
from the TOP_SECRET process.
>> The real trick is how to define the "key". The default, without
>> LSMs, should be something like the UID. SELinux, on the other
>> hand, would probably want to use some kind of hash of the label as
>> the "key", (and store the label in each pool, as well). That way
>> SELinux could have a simple access-vector check for
>> process:reusepage, as well as an access-vector check and type
>> transition for "freereusablepage". Then a policy could allow most
>> user processes to unconditionally reuse pages (which would end up
>> in the access-vector-cache and therefore be fast), while security-
>> sensitive processes like ssh-agent could neither reuse pages nor
>> have their pages reused, even if they request it.
>
> It is very easy, assuming a simple unsigned long cookie is enough
> for SELinux, to fit this into the current MAP_NOZERO. Well, we have to
> change something in the current struct page _mapcount reuse, but
> that is doable. There is one line to change, that is the line where
> the cookie is assigned to the mm_struct.
I think if you create the concept of a "process equivalence class"
and add an LSM hook for it, then the unsigned long could just store
which equivalence class the page is in. The default without LSMs
would be to use mutual-ptraceability as the equivalence class (IE:
the UID with a proviso for SUID binaries). LSMs should be able to
create a process_equivalence_class hook which when called returns an
unsigned long identifying the "equivalence class" (IE: pool) into
which the page is placed when freed (or ((unsigned long)-1) to
forcibly zero the page). When a process requests a maybe-not-zeroed
page, the LSM hook would be called again to determine what
equivalence class should be used, (or ((unsigned long)-1) for dont-
use-any-class).
Cheers,
Kyle Moffett
On Sat, 30 Jun 2007, Kyle Moffett wrote:
> On Jun 30, 2007, at 15:03:07, Davide Libenzi wrote:
> > Hmm, why would you need MAP_REUSABLE? If a page is visible at any time for a
> > given UID, and you have a login under such UID, you can fetch the content of
> > the page at any time (ie, ptrace_attach, gdb, ...).
>
> Not under SELinux or other LSMs. I suppose those could live without a 15%
> performance improvement in some workloads, but it would be nice if we could
> avoid it. Essentially, UID is a really poor way to define
> process-security-equivalence classes in some systems. If you really want to
> define such classes then you need to add LSM hooks to manage the equivalence
> classes.
>
> > I think the focus should be to find a case where under the currently
> > implemented policy for MAP_NOZERO, MAP_NOZERO represents a loss of security
> > WRT no MAP_NOZERO.
>
> Very simple case:
> SELinux is turned on, an s9 (IE: TOP_SECRET) process calls free(), and an s3
> (IE: UNCLASSIFIED) process calls malloc(), getting the data from the
> TOP_SECRET process.
Note that you use *s3* and *s9*. Those will be two different context cookies.
SELinux will have its own way to set the cookie in the mm_struct, to *s3*
in one case, and to *s9* in the other case. This ensures that
they'll never see each other's pages.
> > It is very easy, assuming a simple unsigned long cookie is enough for
> > SELinux, to fit this into the current MAP_NOZERO. Well, we have to change
> > something in the current struct page _mapcount reuse, but that is doable.
> > There is one line to change, that is the line where the cookie is assigned
> > to the mm_struct.
>
> I think if you create the concept of a "process equivalence class" and add an
> LSM hook for it, then the unsigned long could just store which equivalence
> class the page is in. The default without LSMs would be to use
> mutual-ptraceability as the equivalence class (IE: the UID with a proviso for
> SUID binaries). LSMs should be able to create a process_equivalence_class
> hook which when called returns an unsigned long identifying the "equivalence
> class" (IE: pool) into which the page is placed when freed (or ((unsigned
> long)-1) to forcibly zero the page). When a process requests a
> maybe-not-zeroed page, the LSM hook would be called again to determine what
> equivalence class should be used, (or ((unsigned long)-1) for
> dont-use-any-class).
I'd rather not create anything or add anything. SELinux
can set the cookie as it likes, and it can also disable the feature. Since
SELinux is definitely not the common case, I'd prefer to keep the simple
unsigned long cookie abstraction, and let them handle it however they like.
- Davide
On Jun 30, 2007, at 19:57:18, Davide Libenzi wrote:
> On Sat, 30 Jun 2007, Kyle Moffett wrote:
>> Very simple case: SELinux is turned on, an s9 (IE: TOP_SECRET)
>> process calls free(), and an s3 (IE: UNCLASSIFIED) process calls
>> malloc(), getting the data from the TOP_SECRET process.
>
> Note that you use *s3* and *s9*. Those will be two different
> context cookies. SELinux will have its own way to set the cookie
> in the mm_struct, to *s3* in one case, and to *s9* in the other
> case. This ensures that they'll never see each other's
> pages.
Except s3 and s9 aren't complete cookies. A complete label might be:
"system_u:system_r:apache2_t:s3" for an unclassified apache web-
server process, or "kmoffett_u:secadmin_r:usershell_t:s9" for me
logged in with a top-secret label in my security-administrator role.
That's why you'd need to call an LSM hook to get a unique identifier,
as the LSM would actually need to allocate identifiers for
equivalence classes. Secondly, processes may change labels as they
run, so you couldn't just call it once and cache the result, you
would need to call it for every freed page (or every re-use of a page).
>>> It is very easy, assuming a simple unsigned long cookie is enough
>>> for SELinux, to fit this into the current MAP_NOZERO. Well, we have to
>>> change something in the current struct page _mapcount reuse, but
>>> that is doable. There is one line to change, that is the line where
>>> the cookie is assigned to the mm_struct.
I do think a single unsigned long cookie would work, as long as you
have an LSM hook:
    int process_equivalence_class(struct task_struct *task,
                                  unsigned long *result);
If it returns 0 then you can use the result as a page cookie,
otherwise you can't reuse pages for this process at all.
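A userspace model of that hook's contract might look like the sketch below. Everything here is hypothetical: struct model_task stands in for struct task_struct, the no_reuse field stands in for whatever decision an LSM policy would make, and the default policy of returning the UID is the "without LSMs" case discussed earlier in the thread.

```c
#include <stddef.h>

/* Stand-in for struct task_struct, reduced to what the model needs. */
struct model_task {
    unsigned long uid;
    int no_reuse;           /* assumed policy bit: never share freed pages */
};

/* Model of the proposed hook: returns 0 and stores a page cookie in
 * *result, or returns -1 when pages must not be reused for this task.
 * The default policy here simply uses the UID as the class. */
static int model_equivalence_class(const struct model_task *task,
                                   unsigned long *result)
{
    if (task == NULL || result == NULL || task->no_reuse)
        return -1;
    *result = task->uid;
    return 0;
}

/* Two tasks may exchange unzeroed pages only if the hook yields a
 * cookie for both of them and the cookies are equal. */
static int model_may_share(const struct model_task *a,
                           const struct model_task *b)
{
    unsigned long ca, cb;

    if (model_equivalence_class(a, &ca) != 0 ||
        model_equivalence_class(b, &cb) != 0)
        return 0;
    return ca == cb;
}
```

Since labels can change while a process runs, a real implementation would have to call the hook at each free or reuse rather than cache the result, as noted above.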
Cheers,
Kyle Moffett