2010-08-21 00:00:42

by Linus Torvalds

Subject: [RFC] mlock/stack guard interaction fixup

Ian (and others),
here's a three-patch series that uses the doubly linked list to do
your mlock() case hopefully correctly.

NOTE! It's untested. The first patch (which is the slightly scary one)
is tested to some degree, the two other ones I checked that they
compile, but that's it.

I'm not going to apply them to my main tree unless they get testing
and acks. And as mentioned, I've not done any of the changes that
having a vm_prev pointer can allow in other places.
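
For those reading along without the attachments: the heart of patch #1
is threading a vm_prev pointer through the vma list. An insertion site
then looks roughly like this (a sketch of the shape, with a made-up
helper name, not the patch itself):

    /* Keep vm_prev consistent wherever vm_next is assigned. */
    static void link_vma(struct mm_struct *mm, struct vm_area_struct *vma,
                         struct vm_area_struct *prev)
    {
            struct vm_area_struct *next;

            vma->vm_prev = prev;
            if (prev) {
                    next = prev->vm_next;
                    prev->vm_next = vma;
            } else {
                    next = mm->mmap;        /* inserting at the head */
                    mm->mmap = vma;
            }
            vma->vm_next = next;
            if (next)
                    next->vm_prev = vma;
    }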

Comments? Fixes? Braindamage?

Linus


Attachments:
0001-mm-make-the-vma-list-be-doubly-linked.patch (4.75 kB)
0002-mm-make-the-mlock-stack-guard-page-checks-stricter.patch (1.63 kB)
0003-mm-make-stack-guard-page-logic-use-vm_prev-pointer.patch (1.47 kB)

2010-08-21 00:20:23

by Mike Snitzer

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Fri, Aug 20, 2010 at 7:59 PM, Linus Torvalds
<[email protected]> wrote:
> Ian (and others),
> here's a three-patch series that uses the doubly linked list to do
> your mlock() case hopefully correctly.
>
> NOTE! It's untested. The first patch (which is the slightly scary one)
> is tested to some degree, the two other ones I checked that they
> compile, but that's it.
>
> I'm not going to apply them to my main tree unless they get testing
> and acks. And as mentioned, I've not done any of the changes that
> having a vm_prev pointer can allow in other places.
>
> Comments? Fixes? Braindamage?

In 0002-mm-make-the-mlock-stack-guard-page-checks-stricter.patch you
switch __mlock_vma_pages_range from 'start += PAGE_SIZE' to 'addr +=
PAGE_SIZE'.

So would that be a bugfix for commit d7824370? Seems likely given
start isn't used after that point.
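
For context, the top of __mlock_vma_pages_range() after the patch reads
roughly as follows (paraphrased from memory, not a verbatim quote of the
tree):

    unsigned long addr = start;     /* the cursor the fault loop uses */
    int nr_pages = (end - start) / PAGE_SIZE;
    ...
    /* Don't try to access the guard page of a stack vma. Bumping
     * "start" here (as d7824370 did) skips nothing, since "start" is
     * never read again once nr_pages has been computed; advancing
     * "addr" is what actually makes the loop skip the guard page. */
    if (stack_guard_page(vma, start)) {
            addr += PAGE_SIZE;      /* was: start += PAGE_SIZE; */
            nr_pages--;
    }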

Mike

2010-08-21 00:54:59

by Linus Torvalds

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Fri, Aug 20, 2010 at 5:20 PM, Mike Snitzer <[email protected]> wrote:
>
> In 0002-mm-make-the-mlock-stack-guard-page-checks-stricter.patch you
> switch __mlock_vma_pages_range from 'start += PAGE_SIZE' to 'addr +=
> PAGE_SIZE'.
>
> So would that be a bugfix for commit d7824370? Seems likely given
> start isn't used after that point.

Yup, that's a bug-fix. Although the real bug was having two variables
(addr and start) both be the same thing. Probably for some random
historical reason. And I didn't fix that.

Linus

2010-08-21 11:56:24

by Ian Campbell

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Fri, 2010-08-20 at 16:59 -0700, Linus Torvalds wrote:
> Ian (and others),
> here's a three-patch series that uses the doubly linked list to do
> your mlock() case hopefully correctly.

Thanks Linus.

> NOTE! It's untested. The first patch (which is the slightly scary one)
> is tested to some degree, the two other ones I checked that they
> compile, but that's it.

I applied them on top of 2.6.35.3 and they fix the simple test case I
posted yesterday as well as the original issue I was seeing with the Xen
toolstack on 2.6.32.20.

I don't know that they are particularly good tests for this change but I
also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
issue. Are there any good mlock heavy workloads?

Out of interest, why is there no guard page for the VM_GROWSUP stack
case? Is it just that the memory layout on PA-RISC makes the
stack-grows-into-the-heap scenario impossible?

> I'm not going to apply them to my main tree unless they get testing
> and acks.

Tested-by: Ian Campbell <[email protected]>

> And as mentioned, I've not done any of the changes that
> having a vm_prev pointer can allow in other places.
>
> Comments? Fixes? Braindamage?
>
> Linus

--
Ian Campbell

Providence, New Jersey, is one of the few cities where Velveeta cheese
appears on the gourmet shelf.



2010-08-21 15:49:40

by Linus Torvalds

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
>
> I don't know that they are particularly good tests for this change but I
> also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
> issue. Are there any good mlock heavy workloads?

mlock itself isn't very interesting, I think more interesting is
testing that the doubly linked list handles all the cases correctly.
Something that splits mappings, unmaps partial ones etc etc. Running
something like Electric Fence is probably a good idea.

The happy news is that we really didn't have lots of assignments to
vma->vm_next - they were all pretty cleanly separated into just a
couple of cases. So I'm pretty confident in the patches. But...

> Out of interest, why is there no guard page for the VM_GROWSUP stack
> case? Is it just that the memory layout on PA-RISC makes the stack grows
> into the heap scenario impossible?

No, it's just that I can't find it in myself to care about PA-RISC, so
I never wrote the code. I don't think anything else has a grows-up
stack. And even if I were to write the code, I couldn't even test it.

It should be reasonably easy to do the VM_GROWSUP case too, but
somebody with a PA-RISC would need to do it.

Linus

2010-08-21 16:08:46

by Sam Ravnborg

Subject: Re: [RFC] mlock/stack guard interaction fixup

>
> > Out of interest, why is there no guard page for the VM_GROWSUP stack
> > case? Is it just that the memory layout on PA-RISC makes the stack grows
> > into the heap scenario impossible?
>
> No, it's just that I can't find it in myself to care about PA-RISC, so
> I never wrote the code. I don't think anything else has a grows-up
> stack. And even if I were to write the code, I couldn't even test it.
>
> It should be reasonably easy to do the VM_GROWSUP case too, but
> somebody with a PA-RISC would need to do it.

Tony Luck already provided a VM_GROWSUP version.

See: http://lkml.org/lkml/2010/8/20/325
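
Roughly, the grows-up case mirrors the grows-down check, with the guard
page at the top of the vma rather than the bottom. A sketch of the shape
(not Tony's actual patch; see the link above for that):

    if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
            struct vm_area_struct *next = vma->vm_next;

            /* Abutting mapping above? Then it's a split stack vma. */
            if (next && next->vm_start == address + PAGE_SIZE)
                    return 0;

            return expand_upwards(vma, address + PAGE_SIZE);
    }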

[It is signed off by Tony Luc - but I guess they know each other ;-) ]

Sam

2010-08-22 06:58:05

by Ian Campbell

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sat, 2010-08-21 at 08:48 -0700, Linus Torvalds wrote:
> On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
> >
> > I don't know that they are particularly good tests for this change but I
> > also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
> > issue. Are there any good mlock heavy workloads?
>
> mlock itself isn't very interesting, I think more interesting is
> testing that the doubly linked list handles all the cases correctly.
> Something that splits mappings, unmaps partial ones etc etc. Running
> something like Electric Fence is probably a good idea.

EF_DISABLE_BANNER=1 EF_ALLOW_MALLOC_0=1 LD_PRELOAD=libefence.so.0.0 make

craps out pretty quickly with:

CC init/main.o

ElectricFence Exiting: mprotect() failed: Cannot allocate memory
make[1]: *** [init/main.o] Error 255
make: *** [init] Error 2

but it does that with 2.6.35.3, 2.6.35.2, 2.6.35.1 and 2.6.35 too, so it
doesn't seem to be breakage relating to any of the stack guard stuff.

>
> The happy news is that we really didn't have lots of assignments to
> vma->vm_next - they were all pretty cleanly separated into just a
> couple of cases. So I'm pretty confident in the patches. But...
>
> > Out of interest, why is there no guard page for the VM_GROWSUP stack
> > case? Is it just that the memory layout on PA-RISC makes the stack grows
> > into the heap scenario impossible?
>
> No, it's just that I can't find it in myself to care about PA-RISC, so
> I never wrote the code. I don't think anything else has a grows-up
> stack. And even if I were to write the code, I couldn't even test it.
>
> It should be reasonably easy to do the VM_GROWSUP case too, but
> somebody with a PA-RISC would need to do it.
>
> Linus
>

--
Ian Campbell

Everybody is going somewhere!! It's probably a garage sale or a
disaster Movie!!



2010-08-22 07:33:15

by Ian Campbell

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, 2010-08-22 at 07:57 +0100, Ian Campbell wrote:
> On Sat, 2010-08-21 at 08:48 -0700, Linus Torvalds wrote:
> > On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
> > >
> > > I don't know that they are particularly good tests for this change but I
> > > also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
> > > issue. Are there any good mlock heavy workloads?
> >
> > mlock itself isn't very interesting, I think more interesting is
> > testing that the doubly linked list handles all the cases correctly.
> > Something that splits mappings, unmaps partial ones etc etc. Running
> > something like Electric Fence is probably a good idea.
>
> EF_DISABLE_BANNER=1 EF_ALLOW_MALLOC_0=1 LD_PRELOAD=libefence.so.0.0 make
>
> craps out pretty quickly with:
>
> CC init/main.o
>
> ElectricFence Exiting: mprotect() failed: Cannot allocate memory
> make[1]: *** [init/main.o] Error 255
> make: *** [init] Error 2
>
> but it does that with 2.6.35.3, 2.6.35.2, 2.6.35.1 and 2.6.35 too so it
> doesn't seem to be breakage relating to any of the stack guard stuff

I increased the vm.max_map_count sysctl and now things are rolling
along. Will let you know how I get on.

--
Ian Campbell

Arithmetic:
An obscure art no longer practiced in the world's developed countries.



2010-08-22 09:55:33

by Ian Campbell

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, 2010-08-22 at 08:33 +0100, Ian Campbell wrote:
> On Sun, 2010-08-22 at 07:57 +0100, Ian Campbell wrote:
> > On Sat, 2010-08-21 at 08:48 -0700, Linus Torvalds wrote:
> > > On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
> > > >
> > > > I don't know that they are particularly good tests for this change but I
> > > > also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
> > > > issue. Are there any good mlock heavy workloads?
> > >
> > > mlock itself isn't very interesting, I think more interesting is
> > > testing that the doubly linked list handles all the cases correctly.
> > > Something that splits mappings, unmaps partial ones etc etc. Running
> > > something like Electric Fence is probably a good idea.
> >
> > EF_DISABLE_BANNER=1 EF_ALLOW_MALLOC_0=1 LD_PRELOAD=libefence.so.0.0 make
> >
> > craps out pretty quickly with:
> >
> > CC init/main.o
> >
> > ElectricFence Exiting: mprotect() failed: Cannot allocate memory
> > make[1]: *** [init/main.o] Error 255
> > make: *** [init] Error 2
> >
> > but it does that with 2.6.35.3, 2.6.35.2, 2.6.35.1 and 2.6.35 too so it
> > doesn't seem to be breakage relating to any of the stack guard stuff
>
> I increased the vm.max_map_count sysctl and now things are rolling
> along. Will let you know how I get on.

So it's slow and memory intensive as hell due to efence, so my test box is
struggling[0], but it has compiled 270+ .o files successfully, so I think
it's OK from that perspective. I think it'll be quite a while before I
can say it's passed an allmodconfig under efence though.

In the meantime I notice you've committed the patches. Can we get them
queued up for stable backports at some point? I appreciate you might
want them to bake for a bit longer in 2.6.36-rc first.

Greg, we are talking about:
0e8e50e20c837eeec8323bba7dcd25fe5479194c mm: make stack guard page logic use vm_prev pointer
7798330ac8114c731cfab83e634c6ecedaa233d7 mm: make the mlock() stack guard page checks stricter
297c5eee372478fc32fec5fe8eed711eedb13f3d mm: make the vma list be doubly linked

Cheers,
Ian.

[0] It's a 4G x 4 core box which normally does -j16 builds with no swap.
Now a -j4 will send it deep into the 16G of swap I added this morning,
and even a -j3 is a bit tight. I knew efence was resource intensive but
this still surprised me.

--
Ian Campbell

Who does not love wine, women, and song,
Remains a fool his whole life long.
-- Johann Heinrich Voss



2010-08-22 16:44:05

by Linus Torvalds

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, Aug 22, 2010 at 2:55 AM, Ian Campbell <[email protected]> wrote:
>
> In the meantime I notice you've committed the patches. Can we get them
> queued up for stable backports at some point? I appreciate you might
> want them to bake for a bit longer in 2.6.36-rc first.

Yeah, I think I want them in -rc2 (which I'll do today) and get some
more testing first. The patches are simple, but let's see if I missed
anything else for a couple of days.

Linus

2010-08-22 17:27:02

by Greg KH

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, Aug 22, 2010 at 10:55:17AM +0100, Ian Campbell wrote:
> On Sun, 2010-08-22 at 08:33 +0100, Ian Campbell wrote:
> > On Sun, 2010-08-22 at 07:57 +0100, Ian Campbell wrote:
> > > On Sat, 2010-08-21 at 08:48 -0700, Linus Torvalds wrote:
> > > > On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
> > > > >
> > > > > I don't know that they are particularly good tests for this change but I
> > > > > also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
> > > > > issue. Are there any good mlock heavy workloads?
> > > >
> > > > mlock itself isn't very interesting, I think more interesting is
> > > > testing that the doubly linked list handles all the cases correctly.
> > > > Something that splits mappings, unmaps partial ones etc etc. Running
> > > > something like Electric Fence is probably a good idea.
> > >
> > > EF_DISABLE_BANNER=1 EF_ALLOW_MALLOC_0=1 LD_PRELOAD=libefence.so.0.0 make
> > >
> > > craps out pretty quickly with:
> > >
> > > CC init/main.o
> > >
> > > ElectricFence Exiting: mprotect() failed: Cannot allocate memory
> > > make[1]: *** [init/main.o] Error 255
> > > make: *** [init] Error 2
> > >
> > > but it does that with 2.6.35.3, 2.6.35.2, 2.6.35.1 and 2.6.35 too so it
> > > doesn't seem to be breakage relating to any of the stack guard stuff
> >
> > I increased the vm.max_map_count sysctl and now things are rolling
> > along. Will let you know how I get on.
>
> So its slow and memory intensive as hell due to efence so my test box is
> struggling[0] but it has compiled 270+ .o files successfully so I think
> it's OK from that perspective. I think it'll be quite a while before I
> can say its passed an allmodconfig under efence though.
>
> In the meantime I notice you've committed the patches. Can we get them
> queued up for stable backports at some point? I appreciate you might
> want them to bake for a bit longer in 2.6.36-rc first.
>
> Greg, we are talking about:
> 0e8e50e20c837eeec8323bba7dcd25fe5479194c mm: make stack guard page logic use vm_prev pointer
> 7798330ac8114c731cfab83e634c6ecedaa233d7 mm: make the mlock() stack guard page checks stricter
> 297c5eee372478fc32fec5fe8eed711eedb13f3d mm: make the vma list be doubly linked

I must be missing something, but aren't these patches just "cleanups"
and changing the logic here to be nicer? Or do they fix real problems
with the previous stack guard stuff?

Is it the second one you really need here?

thanks,

greg k-h

2010-08-22 18:22:24

by Linus Torvalds

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, Aug 22, 2010 at 10:25 AM, Greg KH <[email protected]> wrote:
>>
>> Greg, we are talking about:
>> 0e8e50e20c837eeec8323bba7dcd25fe5479194c mm: make stack guard page logic use vm_prev pointer
>> 7798330ac8114c731cfab83e634c6ecedaa233d7 mm: make the mlock() stack guard page checks stricter
>> 297c5eee372478fc32fec5fe8eed711eedb13f3d mm: make the vma list be doubly linked
>
> I must be missing something, but aren't these patches just "cleanups"
> and changing the logic here to be nicer? Or do they fix real problems
> with the previous stack guard stuff?
>
> Is it the second one you really need here?

They're all "required" (#2 needs #1, and #3 is a fix for something
that can happen in the same circumstances in which #2 makes any
difference).

Although you do need to have some really odd things going on for any
of them to make any difference. Notably, you need to do mlock or
mprotect on the stack segment, which no sane program does.

That said, the change from

start += PAGE_SIZE;

to

addr += PAGE_SIZE;

in __mlock_vma_pages_range() (in #3) is a bugfix even for the normal
mlockall() case. Not that anybody will realistically care about that
either: the failure case just doesn't realistically ever matter
(it expands the stack, which the code tries to avoid, and possibly
forgets to mlock the bottom of the stack).

So I wouldn't call them high priority. Ian is doing something _really_
odd. Doing hypercalls from user space on stuff that is on the stack,
rather than just copying it to some stable area is dodgy. And I
guarantee that doing the crazy mlock dance is slower than the copy, so
it's complex, fragile, _and_ slow.

Linus

2010-08-22 19:09:40

by Greg KH

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, Aug 22, 2010 at 11:21:34AM -0700, Linus Torvalds wrote:
> On Sun, Aug 22, 2010 at 10:25 AM, Greg KH <[email protected]> wrote:
> >>
> >> Greg, we are talking about:
> >> 0e8e50e20c837eeec8323bba7dcd25fe5479194c mm: make stack guard page logic use vm_prev pointer
> >> 7798330ac8114c731cfab83e634c6ecedaa233d7 mm: make the mlock() stack guard page checks stricter
> >> 297c5eee372478fc32fec5fe8eed711eedb13f3d mm: make the vma list be doubly linked
> >
> > I must be missing something, but aren't these patches just "cleanups"
> > and changing the logic here to be nicer? Or do they fix real problems
> > with the previous stack guard stuff?
> >
> > Is it the second one you really need here?
>
> They're all "required" (#2 needs #1, and #3 is a fix for something
> that can happen in the same circumstances that #2 makes any
> difference).

Ok, thanks.

> Although you do need to have some really odd things going on for any
> of them to make any difference. Notably, you need to do mlock or
> mprotect on the stack segment, which no sane program does.
>
> That said, the change from
>
> start += PAGE_SIZE;
>
> to
>
> addr += PAGE_SIZE;
>
> in __mlock_vma_pages_range() (in #3) is a bugfix even for the normal
> mlockall() case. Not that anybody will realistically care about that
> either: the failure case just doesn't really realistically ever matter
> (it expands the stack which the code tries to avoid, and possibly
> forgets to mlock the bottom of the stack).
>
> So I wouldn't call them high priority. Ian is doing something _really_
> odd. Doing hypercalls from user space on stuff that is on the stack,
> rather than just copying it to some stable area is dodgy. And I
> guarantee that doing the crazy mlock dance is slower than the copy, so
> it's complex, fragile, _and_ slow.

Heh, ok, I'll not worry about this for the .27 kernel then, that makes
it a lot easier for me :)

thanks,

greg k-h

2010-08-23 09:01:14

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Fri, 2010-08-20 at 16:59 -0700, Linus Torvalds wrote:
>
> Comments? Fixes? Braindamage?

Running User Mode Linux (if it still builds) is a terrific test case for
the vma code. I remember that that was the thing that caused the most
explosions back when I was poking at this code.

/me goes look at the actual patches..

2010-08-23 09:22:44

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sun, 2010-08-22 at 11:21 -0700, Linus Torvalds wrote:
> Doing hypercalls from user space on stuff that is on the stack,
> rather than just copying it to some stable area is dodgy. And I
> guarantee that doing the crazy mlock dance is slower than the copy, so
> it's complex, fragile, _and_ slow.
>
That sounds broken too. Not having read the initial problem and such, it
sounds like the hypercall would expect the pages to stay pinned, and
mlock doesn't actually guarantee that at all.

2010-08-23 16:01:34

by Ian Jackson

Subject: Re: [RFC] mlock/stack guard interaction fixup

Linus Torvalds writes ("Re: [RFC] mlock/stack guard interaction fixup"):
> Although you do need to have some really odd things going on for any
> of them to make any difference. Notably, you need to do mlock or
> mprotect on the stack segment, which no sane program does.

mlocking the stack is entirely sensible and normal for a real-time
program. Most such programs use mlockall but there is no particular
reason why a program that has some more specific requirements shouldn't
use mlock to lock only a part of the stack. (Perhaps it has only one
real-time thread?)
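
Such a program might do no more than something like this (an
illustrative sketch, using the GNU pthread_getattr_np() extension):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>

    /* Lock only the calling thread's stack. */
    static int lock_my_stack(void)
    {
            pthread_attr_t attr;
            void *stack;
            size_t size;
            int ret;

            if (pthread_getattr_np(pthread_self(), &attr) != 0)
                    return -1;
            ret = pthread_attr_getstack(&attr, &stack, &size);
            pthread_attr_destroy(&attr);
            return ret ? -1 : mlock(stack, size);
    }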

Locking, including of the stack, is discussed extensively here:
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
section "Memory locking functions" subsection "Requirements".

Ian.

2010-08-23 16:26:15

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 16:42 +0100, [email protected]
wrote:
>
> mlocking the stack is entirely sensible and normal for a real-time
> program. Most such programs use mlockall but there is no particular
> reason why a program that has some more specific requirements shouldn't
> use mlock to lock only a part of the stack. (Perhaps it has only one
> real-time thread?)

RT apps should pre-allocate and mlock their stack in advance (and
pre-fault too for the paranoid).

mlockall is a very bad interface and should really not be used.

2010-08-23 16:34:17

by Luck, Tony

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Sat, Aug 21, 2010 at 9:08 AM, Sam Ravnborg <[email protected]> wrote:
> Tony Luck already provided a VM_GROWSUP version.
>
> See: http://lkml.org/lkml/2010/8/20/325
>
> [It is signed off by Tony Luc - but I guess they know each other ;-) ]

Tony Luc spends too much time looking at the To: and Cc: to make
sure that he spelled *other* people's names correctly.

That patch doesn't apply any more because of the latest change to look
at vm_prev instead of calling find_vma() [N.B. the block comment above
check_stack_guard_page() still talks about find_vma()]. I can fix up my
patch ... but I have to wonder whether the new code doesn't leave a
hole again. It assumes that any VM_GROWSDOWN object that is
found below the current one is the result of the stack vma having been
split. But couldn't an attacker have used MAP_GROWSDOWN when
placing their sneaky stack smashing mapping just below the stack?
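
For reference, the check in question, in the form it takes on the mlock
side (annotated with the concern):

    /* Is the vma a continuation of the stack vma above it? */
    static inline int vma_stack_continue(struct vm_area_struct *vma,
                                         unsigned long addr)
    {
            /* Note: an attacker's MAP_GROWSDOWN mapping placed directly
             * below the stack satisfies this just as well as a split-off
             * piece of the stack vma does. */
            return vma && (vma->vm_end == addr) &&
                   (vma->vm_flags & VM_GROWSDOWN);
    }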

-Tony

2010-08-23 17:18:21

by Ian Jackson

Subject: Re: [RFC] mlock/stack guard interaction fixup

Peter Zijlstra writes ("Re: [RFC] mlock/stack guard interaction fixup"):
> On Mon, 2010-08-23 at 16:42 +0100, [email protected]
> wrote:
> > mlocking the stack is entirely sensible and normal for a real-time
> > program. Most such programs use mlockall but there is no particular
> > reason why a program that has some more specific requirements shouldn't
> > use mlock to lock only a part of the stack. (Perhaps it has only one
> > real-time thread?)
>
> RT apps should pre-allocate and mlock their stack in advance (and
> pre-fault too for the paranoid).

Are you allowed to mlock a stack page which has not yet been faulted
in? What effect does it have? I wasn't able to find a convincing
de jure answer to this question.

But you seem, like me, to be disagreeing with Linus's assertion that
calling mlock() on the stack is something no sane program does?

> mlockall is a very bad interface and should really not be used.

You are directly contradicting the advice in SuS, to which I just gave
a reference. You're free to do so of course but it might be worth
explaining in a bit more detail why the advice in SuS is wrong.

Ian.

2010-08-23 17:35:22

by Linus Torvalds

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, Aug 23, 2010 at 10:18 AM, Ian Jackson
<[email protected]> wrote:
>
> But you seem, like me, to be disagreeing with Linus's assertion that
> calling mlock() on the stack is something no sane program does?

Note: I don't think it's generally sane to mlock() a _part_ of the stack.

I think it's entirely sane to lock the whole stack (and that includes
expanding it to some expected maximum value). That makes sense as a
"we cannot afford to run out of memory" or "we must not allow the
pages to hit disk" kind of protection.

However, using mlock on part of the stack is dubious. It's also
dubious as a way to pin particular pages in the page tables, because
it's not necessarily something that the semantics guarantee
(historically mlock just guarantees that they won't be swapped out,
not that they will necessarily maintain some particular mapping).

There's also a difference between "resident in RAM" and "that physical
page is guaranteed to be mapped at that virtual address".

Quite frankly, I personally believe that people who play games with
mlock are misguided. The _one_ special case is for protecting keys or
private data that you do not want to hit the disk in some unencrypted
mode, and quite frankly, you should strive to handle those way more
specially than just putting them in some random place ("on the stack"
or "in some malloc()'ed area"). The sane model for doing that is
generally to explicitly mmap() and mlock the area, so that you get a
very controlled access pattern, and never have to worry about things
like COW etc.
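
Concretely, something along these lines (a minimal userspace sketch,
error handling abbreviated):

    #include <stddef.h>
    #include <sys/mman.h>

    /* A dedicated anonymous mapping, locked so it should never be
     * swapped, for holding the sensitive data. */
    static void *alloc_secret_area(size_t len)
    {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return NULL;
            if (mlock(p, len) != 0) {
                    munmap(p, len);
                    return NULL;
            }
            return p;
    }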

Because trust me, COW and mlock() is _interesting_. As in "I suspect
lots of systems have bugs", and "the semantics don't really guarantee
that you won't have to wait for something to be paged out in order for
the allocation for the COW to be satisfied".

I suspect that if you use mlock for _any_ other reason than protecting
a particular very sensitive piece of information, you should use
mlockall(MCL_FUTURE). IOW, if you use mlock because you have realtime
issues, there is no excuse to ever use anything else, imho. And even
then, I guarantee that things like copy-on-write is going to be
"interesting".

I realize that people hate mlockall() (and particularly MCL_FUTURE),
and yes, it's a bloated thing that you can't reasonably use on a large
process. But dammit, if you have RT issues, you shouldn't _have_ some
big bloated process. You should have a small statically linked server
that is RT, and nothing else.

People who use mlock any other way tend to be the people who can't be
bothered to do it right, so they do some hacky half-way crap.

Linus

2010-08-23 17:40:42

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 18:18 +0100, Ian Jackson wrote:
> Peter Zijlstra writes ("Re: [RFC] mlock/stack guard interaction fixup"):
> > On Mon, 2010-08-23 at 16:42 +0100, [email protected]
> > wrote:
> > > mlocking the stack is entirely sensible and normal for a real-time
> > > program. Most such programs use mlockall but there is no particular
> > > reason why a program that has some more specific requirements shouldn't
> > > use mlock to lock only a part of the stack. (Perhaps it has only one
> > > real-time thread?)
> >
> > RT apps should pre-allocate and mlock their stack in advance (and
> > pre-fault too for the paranoid).
>
> Are you allowed to mlock a stack page which has not yet been faulted
> in ? What effect does it have ? I wasn't able to find a convincing
> de jure answer to this question.

mlock() seems to call make_pages_present(), so it's a moot point.

> But you seem, like me, to be disagreeing with Linus's assertion that
> calling mlock() on the stack is something no sane program does?

I think the case that Linus called daft is splitting the stack vma
through mlock/mprotect, which is indeed something rarely done.

> > mlockall is a very bad interface and should really not be used.
>
> You are directly contradicting the advice in SuS, to which I just gave
> a reference. You're free to do so of course but it might be worth
> explaining in a bit more detail why the advice in SuS is wrong.

Because a real RT program will have a significant !RT part, and calling
mlock on everything is a massive resource waste.

Furthermore, mlockall gives the false impression that everything is good
to go for RT, disregarding pretty much everything that makes an RT app.

There's lots more to RT than sched_setscheduler() and
mlockall(MCL_FUTURE).

If a library is RT suited, it will already mlock() all relevant memory
allocations (or provide an option to do so, or strictly work on
externally allocated memory, which the user will then have mlock()'ed).

2010-08-23 17:53:55

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 10:34 -0700, Linus Torvalds wrote:
> It's also
> dubious as a way to pin particular pages in the page tables, because
> it's not necessarily something that the semantics guarantee
> (historically mlock just guarantees that they won't be swapped out,
> not that they will necessarily maintain some particular mapping).

Quite so, and esp. now that page migration gets used more with all that
memory compaction stuff that got merged.

2010-08-23 18:00:02

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 10:34 -0700, Linus Torvalds wrote:
> I suspect that if you use mlock for _any_ other reason than protecting
> a particular very sensitive piece of information, you should use
> mlockall(MCL_FUTURE). IOW, if you use mlock because you have realtime
> issues, there is no excuse to ever use anything else, imho. And even
> then, I guarantee that things like copy-on-write is going to be
> "interesting".
>
> I realize that people hate mlockall() (and particularly MCL_FUTURE),
> and yes, it's a bloated thing that you can't reasonably use on a large
> process. But dammit, if you have RT issues, you shouldn't _have_ some
> big bloated process. You should have a small statically linked server
> that is RT, and nothing else.

Us real-time people have been telling people to not use mlockall() at
all.

While small !glibc statically linked RT components using shared memory
interfaces to !RT apps could work, it's not how people actually write
their apps. They write big monolithic threaded apps where some threads
are RT.

[ in part because there doesn't seem to be a usable !glibc
libpthread/librt implementation out there, in part because people use
crap like Java-RT ]

2010-08-23 18:43:16

by Darren Hart

Subject: Re: [RFC] mlock/stack guard interaction fixup

On 08/23/2010 10:59 AM, Peter Zijlstra wrote:
> On Mon, 2010-08-23 at 10:34 -0700, Linus Torvalds wrote:
>> I suspect that if you use mlock for _any_ other reason than protecting
>> a particular very sensitive piece of information, you should use
>> mlockall(MCL_FUTURE). IOW, if you use mlock because you have realtime
>> issues, there is no excuse to ever use anything else, imho. And even
>> then, I guarantee that things like copy-on-write is going to be
>> "interesting".
>>
>> I realize that people hate mlockall() (and particularly MCL_FUTURE),
>> and yes, it's a bloated thing that you can't reasonably use on a large
>> process. But dammit, if you have RT issues, you shouldn't _have_ some
>> big bloated process. You should have a small statically linked server
>> that is RT, and nothing else.
>
> Us real-time people have been telling people to not use mlockall() at
> all.

Well, we have at least two camps of people here I guess. When people
come to me with unexplainable latencies, paging is one of the things we
check for, and mlockall() is a good way to test if avoiding that paging
will help them - so I have been known to recommend it on occasion.

> While small !glibc statically linked RT components using shared memory
> interfaces to !RT apps could work its not how people actually write
> their apps. They write big monolithic threaded apps where some threads
> are RT.
>
> [ in part because there doesn't seem to be a usable !glibc
> libpthread/librt implementation out there, in part because people use
> crap like Java-RT ]

Which is also missing some performance and functionality due to the lack
of complete pthread support for priority inheritance (and the complete
disinterest in fixing it by certain maintainers).

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-08-23 18:50:27

by Jeremy Fitzhardinge

Subject: Re: [RFC] mlock/stack guard interaction fixup

On 08/23/2010 10:34 AM, Linus Torvalds wrote:
> Quite frankly, I personally believe that people who play games with
> mlock are misguided. The _one_ special case is for protecting keys or
> private data that you do not want to hit the disk in some unencrypted
> mode, and quite frankly, you should strive to handle those way more
> specially than just putting them in some random place ("on the stack"
> or "in some malloc()'ed area"). The sane model for doing that is
> generally to explicitly mmap() and mlock the area, so that you get a
> very controlled access pattern, and never have to worry about things
> like COW etc.

Is that guaranteed to work (in Linux or in general)? mlock has always
meant "won't generate disk IO to fault in/evict", but does it prevent
dirty pages from being written out so long as they also remain
resident? Or does it depend on the precise type of page you're
mlocking? For example, what does mlock on a shared writeable mapping mean?

J

2010-08-23 18:53:10

by Jeremy Fitzhardinge

Subject: Re: [RFC] mlock/stack guard interaction fixup

On 08/23/2010 10:18 AM, Ian Jackson wrote:
> Are you allowed to mlock a stack page which has not yet been faulted
> in ? What effect does it have ? I wasn't able to find a convincing
> de jure answer to this question.
>
> But you seem, like me, to be disagreeing with Linus's assertion that
> calling mlock() on the stack is something no sane program does?

Doing hypercalls from userspace is a silly hack to avoid putting dom0
hypercall knowledge into the kernel. mlock in that area has always been
problematic (initially on Solaris, and increasingly on Linux) and we're
going to have to fix it at some point. I wouldn't spend a lot of time
defending it.

J

2010-08-23 19:04:12

by Ian Campbell

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 10:34 -0700, Linus Torvalds wrote:
> On Mon, Aug 23, 2010 at 10:18 AM, Ian Jackson
> <[email protected]> wrote:
> >
> > But you seem, like me, to be disagreeing with Linus's assertion that
> > calling mlock() on the stack is something no sane programs does ?
>
> Note: I don't think it's generally sane to mlock() a _part_ of the stack.

Neither do I. The Xen toolstack even has a mechanism for bouncing data
to a special area for use as hypercall arguments. I expected it was just
a few corner cases which didn't use it, but when I started looking into
it due to this conversation I discovered it's not as widely used as it
should be. I'm working to fix that on the Xen end.

> [...] It's also
> dubious as a way to pin particular pages in the page tables, because
> it's not necessarily something that the semantics guarantee
> (historically mlock just guarantees that they won't be swapped out,
> not that they will necessarily maintain some particular mapping).

Xen's usage doesn't require that the physical page backing an mlocked
virtual address never changes, just that it doesn't change without
obeying the regular TLB flush semantics.

Ian.
--
Ian Campbell

Familiarity breeds attempt.

2010-08-23 19:08:06

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 11:50 -0700, Jeremy Fitzhardinge wrote:
> On 08/23/2010 10:34 AM, Linus Torvalds wrote:
> > Quite frankly, I personally believe that people who play games with
> > mlock are misguided. The _one_ special case is for protecting keys or
> > private data that you do not want to hit the disk in some unencrypted
> > mode, and quite frankly, you should strive to handle those way more
> > specially than just putting them in some random place ("on the stack"
> > or "in some malloc()'ed area"). The sane model for doing that is
> > generally to explicitly mmap() and mlock the area, so that you get a
> > very controlled access pattern, and never have to worry about things
> > like COW etc.
>
> Is that guaranteed to work (in Linux or in general)? mlock has always
> meant "won't generate disk IO to fault in/evicted", but does it prevent
> dirty pages from being written out so long as they also remain
> resident? Or does it depend on the precise type of page you're
> mlocking? For example, what does mlock on a shared writeable mapping mean?

mlock() simply avoids major faults, nothing more.

I think both page migration and page-out for shared pages where some
maps are !mlocked can cause unmaps and thus minor faults.

mlock and dirty do not interact, they will still be cleaned/written out
as normal.

2010-08-23 19:23:26

by Jeremy Fitzhardinge

Subject: Re: [RFC] mlock/stack guard interaction fixup

On 08/23/2010 12:07 PM, Peter Zijlstra wrote:
> mlock() simply avoids major faults, nothing more.
>
> I think both page migration and page-out for shared pages where some
> maps are !mlocked can cause unmaps and thus minor faults.
>
> mlock and dirty do not interact, they will still be cleaned/written out
> as normal.

So mlock is useless for preventing secret stuff from being written to disk.

J

2010-08-23 19:26:55

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 12:23 -0700, Jeremy Fitzhardinge wrote:
> On 08/23/2010 12:07 PM, Peter Zijlstra wrote:
> > mlock() simply avoids major faults, nothing more.
> >
> > I think both page migration and page-out for shared pages where some
> > maps are !mlocked can cause unmaps and thus minor faults.
> >
> > mlock and dirty do not interact, they will still be cleaned/written out
> > as normal.
>
> So mlock is useless for preventing secret stuff from being written to disk.

Well, if you put your sekrit in a file map, sure.

Use a mmap(MAP_ANONYMOUS|MAP_LOCKED) and madvise(MADV_DONTFORK) for your
sekrits.

2010-08-23 19:54:15

by Jeremy Fitzhardinge

Subject: Re: [RFC] mlock/stack guard interaction fixup

On 08/23/2010 12:26 PM, Peter Zijlstra wrote:
> On Mon, 2010-08-23 at 12:23 -0700, Jeremy Fitzhardinge wrote:
>> On 08/23/2010 12:07 PM, Peter Zijlstra wrote:
>>> mlock() simply avoids major faults, nothing more.
>>>
>>> I think both page migration and page-out for shared pages where some
>>> maps are !mlocked can cause unmaps and thus minor faults.
>>>
>>> mlock and dirty do not interact, they will still be cleaned/written out
>>> as normal.
>> So mlock is useless for preventing secret stuff from being written to disk.
> Well, if you put your sekrit in a file map, sure.
>
> Use a mmap(MAP_ANONYMOUS|MAP_LOCKED) and madvise(MADV_DONTFORK) for your
> sekrits.

Won't dirty anonymous pages also get written to swap?

J

2010-08-24 07:09:33

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Mon, 2010-08-23 at 12:54 -0700, Jeremy Fitzhardinge wrote:
> On 08/23/2010 12:26 PM, Peter Zijlstra wrote:
> > On Mon, 2010-08-23 at 12:23 -0700, Jeremy Fitzhardinge wrote:
> >> On 08/23/2010 12:07 PM, Peter Zijlstra wrote:
> >>> mlock() simply avoids major faults, nothing more.
> >>>
> >>> I think both page migration and page-out for shared pages where some
> >>> maps are !mlocked can cause unmaps and thus minor faults.
> >>>
> >>> mlock and dirty do not interact, they will still be cleaned/written out
> >>> as normal.
> >> So mlock is useless for preventing secret stuff from being written to disk.
> > Well, if you put your sekrit in a file map, sure.
> >
> > Use a mmap(MAP_ANONYMOUS|MAP_LOCKED) and madvise(MADV_DONTFORK) for your
> > sekrits.
>
> Won't dirty anonymous pages also get written to swap?

Not if all the maps are mlocked (a private mapping like the above would
only have a single map); there'd be no point, would there?

2010-08-24 07:20:31

by Peter Zijlstra

Subject: Re: [RFC] mlock/stack guard interaction fixup

On Tue, 2010-08-24 at 09:08 +0200, Peter Zijlstra wrote:

> > Won't dirty anonymous pages also get written to swap?

Also, usually when we talk about dirty pages we mean file pages.

2010-08-25 08:28:39

by Stefan Bader

Subject: Re: [Stable-review] [RFC] mlock/stack guard interaction fixup

On 08/22/2010 07:25 PM, Greg KH wrote:
> On Sun, Aug 22, 2010 at 10:55:17AM +0100, Ian Campbell wrote:
>> On Sun, 2010-08-22 at 08:33 +0100, Ian Campbell wrote:
>>> On Sun, 2010-08-22 at 07:57 +0100, Ian Campbell wrote:
>>>> On Sat, 2010-08-21 at 08:48 -0700, Linus Torvalds wrote:
>>>>> On Sat, Aug 21, 2010 at 4:56 AM, Ian Campbell <[email protected]> wrote:
>>>>>>
>>>>>> I don't know that they are particularly good tests for this change but I
>>>>>> also ran allmodconfig kernel build and ltp on 2.6.35.3+fixes without
>>>>>> issue. Are there any good mlock heavy workloads?
>>>>>
>>>>> mlock itself isn't very interesting, I think more interesting is
>>>>> testing that the doubly linked list handles all the cases correctly.
>>>>> Something that splits mappings, unmaps partial ones etc etc. Running
>>>>> something like Electric Fence is probably a good idea.
>>>>
>>>> EF_DISABLE_BANNER=1 EF_ALLOW_MALLOC_0=1 LD_PRELOAD=libefence.so.0.0 make
>>>>
>>>> craps out pretty quickly with:
>>>>
>>>> CC init/main.o
>>>>
>>>> ElectricFence Exiting: mprotect() failed: Cannot allocate memory
>>>> make[1]: *** [init/main.o] Error 255
>>>> make: *** [init] Error 2
>>>>
>>>> but it does that with 2.6.35.3, 2.6.35.2, 2.6.35.1 and 2.6.35 too so it
>>>> doesn't seem to be breakage relating to any of the stack guard stuff
>>>
>>> I increased the vm.max_map_count sysctl and now things are rolling
>>> along. Will let you know how I get on.
>>
>> So its slow and memory intensive as hell due to efence so my test box is
>> struggling[0] but it has compiled 270+ .o files successfully so I think
>> it's OK from that perspective. I think it'll be quite a while before I
>> can say its passed an allmodconfig under efence though.
>>
>> In the meantime I notice you've committed the patches. Can we get them
>> queued up for stable backports at some point? I appreciate you might
>> want them to bake for a bit longer in 2.6.36-rc first.
>>
>> Greg, we are talking about:
>> 0e8e50e20c837eeec8323bba7dcd25fe5479194c mm: make stack guard page logic use vm_prev pointer
>> 7798330ac8114c731cfab83e634c6ecedaa233d7 mm: make the mlock() stack guard page checks stricter
>> 297c5eee372478fc32fec5fe8eed711eedb13f3d mm: make the vma list be doubly linked
>
> I must be missing something, but aren't these patches just "cleanups"
> and changing the logic here to be nicer? Or do they fix real problems
> with the previous stack guard stuff?
>
> Is it the second one you really need here?
>

Ok, I saw Greg picking up the new batch, which is good, as I think the old
logic to expand the stack could give false positives when called with an
addr above the lowest stack vma.

There seems to be only one little piece left: the /proc output would likely
look like there are holes in the stack vma. Something like the below should
fix it. And maybe the two checking functions might then move to a common
header (as the code is shamelessly borrowed from mlock). Ok, in the case
here the addr == vma->vm_start check is always true.

-Stefan

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 439fc1f..2e67dac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -203,6 +203,21 @@ static int do_maps_open(struct inode *inode, struct file *f
         return ret;
 }
 
+/* Is the vma a continuation of the stack vma above it? */
+static inline int vma_stack_continue(struct vm_area_struct *vma,
+                                     unsigned long addr)
+{
+        return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
+}
+
+static inline int stack_guard_page(struct vm_area_struct *vma,
+                                   unsigned long addr)
+{
+        return (vma->vm_flags & VM_GROWSDOWN) &&
+                (vma->vm_start == addr) &&
+                !vma_stack_continue(vma->vm_prev, addr);
+}
+
 static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
         struct mm_struct *mm = vma->vm_mm;
@@ -223,7 +238,7 @@ static void show_map_vma(struct seq_file *m, struct vm_area_
 
         /* We don't show the stack guard page in /proc/maps */
         start = vma->vm_start;
-        if (vma->vm_flags & VM_GROWSDOWN)
+        if (stack_guard_page(vma, start))
                 start += PAGE_SIZE;
 
         seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",

> thanks,
>
> greg k-h