The current interaction between mlockall(MCL_FUTURE) and mmap has a
deficiency. In 'normal' mode, without MCL_FUTURE in force, the default
is that new memory mappings are not locked, but mmap provides MAP_LOCKED
specifically to override that default. However, with MCL_FUTURE toggled
to on, there is no analogous way to tell mmap to override the default. The
proposed MAP_UNLOCKED flag would resolve this deficiency.
The benefit of the patch is that it makes it possible for an application
which has previously called mlockall(MCL_FUTURE) to selectively exempt
new memory mappings from memory locking, on a per-mmap-call basis. There
is currently no thread-safe way for an application to do this as
toggling MCL_FUTURE around calls to mmap is racy in a multi-threaded
context. Other threads may manipulate the address space during the
window where MCL_FUTURE is off, subverting the programmers intended
memory locking semantics.
The ability to exempt specific memory mappings from memory locking is
necessary when the region to be mapped is larger than physical memory.
In such cases a call to mmap the region cannot succeed, unless
MAP_UNLOCKED is available.
Acked-by: WANG Cong <[email protected]>
Acked-by: Chris Wright <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
---
I keep the acks since the patch is exactly the same, only commit message
is changed.
Commit message is mostly copied from Andrew C. Morrow email. Hope now it
is OK. Thank you Andrew :)
v1->v2
- adding new flag to all archs
- fixing typo
v2->v3
- one more typo fix
v3->v4
- return error if MAP_LOCKED | MAP_UNLOCKED is specified
v4->v5
- rebase to latest head
v5->v6
- commit message rewritten
diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
index 99c56d4..cfc51ac 100644
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x200000 /* force page unlocking */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
index c892bfb..3e4d108 100644
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */
/*
* Flags for msync
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
index 9749c8a..4e8b9bf 100644
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index d4a7f64..7d33f01 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -27,6 +27,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */
#ifdef __KERNEL__
#ifdef CONFIG_PPC64
diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index c3029ad..f80d203 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -22,6 +22,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
index fca4db4..c62bcd8 100644
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */
/*
* Flags for msync
diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 32c8bd6..59e0f29 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
diff --git a/mm/mmap.c b/mm/mmap.c
index ee22989..4bda220 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -962,6 +962,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (!can_do_mlock())
return -EPERM;
+ if (flags & MAP_UNLOCKED)
+ vm_flags &= ~VM_LOCKED;
+
/* mlock MCL_FUTURE? */
if (vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
@@ -1050,7 +1053,10 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
struct file *file = NULL;
unsigned long retval = -EBADF;
- if (!(flags & MAP_ANONYMOUS)) {
+ if (unlikely((flags & (MAP_LOCKED | MAP_UNLOCKED)) ==
+ (MAP_LOCKED | MAP_UNLOCKED))) {
+ return -EINVAL;
+ } else if (!(flags & MAP_ANONYMOUS)) {
if (unlikely(flags & MAP_HUGETLB))
return -EINVAL;
file = fget(fd);
--
Gleb.
Hi Gleb,
On Mon, Jan 18, 2010 at 3:37 PM, Gleb Natapov <[email protected]> wrote:
> The current interaction between mlockall(MCL_FUTURE) and mmap has a
> deficiency. In 'normal' mode, without MCL_FUTURE in force, the default
> is that new memory mappings are not locked, but mmap provides MAP_LOCKED
> specifically to override that default. However, with MCL_FUTURE toggled
> to on, there is no analogous way to tell mmap to override the default. The
> proposed MAP_UNLOCKED flag would resolve this deficiency.
>
> The benefit of the patch is that it makes it possible for an application
> which has previously called mlockall(MCL_FUTURE) to selectively exempt
> new memory mappings from memory locking, on a per-mmap-call basis. There
> is currently no thread-safe way for an application to do this as
> toggling MCL_FUTURE around calls to mmap is racy in a multi-threaded
> context. Other threads may manipulate the address space during the
> window where MCL_FUTURE is off, subverting the programmers intended
> memory locking semantics.
>
> The ability to exempt specific memory mappings from memory locking is
> necessary when the region to be mapped is larger than physical memory.
> In such cases a call to mmap the region cannot succeed, unless
> MAP_UNLOCKED is available.
The changelog doesn't mention what kind of applications would want to
use this. Are there some? Using mlockall(MCL_FUTURE) but then having
some memory regions MAP_UNLOCKED sounds like a strange combination to
me.
On Mon, Jan 18, 2010 at 04:09:35PM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Mon, Jan 18, 2010 at 3:37 PM, Gleb Natapov <[email protected]> wrote:
> > The current interaction between mlockall(MCL_FUTURE) and mmap has a
> > deficiency. In 'normal' mode, without MCL_FUTURE in force, the default
> > is that new memory mappings are not locked, but mmap provides MAP_LOCKED
> > specifically to override that default. However, with MCL_FUTURE toggled
> > to on, there is no analogous way to tell mmap to override the default. The
> > proposed MAP_UNLOCKED flag would resolve this deficiency.
> >
> > The benefit of the patch is that it makes it possible for an application
> > which has previously called mlockall(MCL_FUTURE) to selectively exempt
> > new memory mappings from memory locking, on a per-mmap-call basis. There
> > is currently no thread-safe way for an application to do this as
> > toggling MCL_FUTURE around calls to mmap is racy in a multi-threaded
> > context. Other threads may manipulate the address space during the
> > window where MCL_FUTURE is off, subverting the programmers intended
> > memory locking semantics.
> >
> > The ability to exempt specific memory mappings from memory locking is
> > necessary when the region to be mapped is larger than physical memory.
> > In such cases a call to mmap the region cannot succeed, unless
> > MAP_UNLOCKED is available.
>
> The changelog doesn't mention what kind of applications would want to
> use this. Are there some? Using mlockall(MCL_FUTURE) but then having
> some memory regions MAP_UNLOCKED sounds like a strange combination to
> me.
The specific use cases were discussed in the thread following previous
version of the patch. I can describe my specific use case in a change log
and I can copy what Andrew said about his case, but is it really needed in
a commit message itself? It boils down to greater control over when and
where application can get major fault. There are applications that need
this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
sure that all memory allocated behind my application's back (by dynamic
linker, libraries, stack) will be locked otherwise?
--
Gleb.
> this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> sure that all memory allocated behind my application's back (by dynamic
> linker, libraries, stack) will be locked otherwise?
If you add this flag you can't do that anyway - some library will
helpfully start up using it and then you are completely stuffed or will
be back in two or three years adding MLOCKALL_ALWAYS.
Alan
On Mon, Jan 18, 2010 at 02:32:32PM +0000, Alan Cox wrote:
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> If you add this flag you can't do that anyway - some library will
> helpfully start up using it and then you are completely stuffed or will
> be back in two or three years adding MLOCKALL_ALWAYS.
>
Libraries can do many other bad things. They can do mlockall(0) today
too and this is not the reason to ditch mlockall(). I don't expect libc will
do that though.
--
Gleb.
On Mon, 2010-01-18 at 14:32 +0000, Alan Cox wrote:
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> If you add this flag you can't do that anyway - some library will
> helpfully start up using it and then you are completely stuffed or will
> be back in two or three years adding MLOCKALL_ALWAYS.
Agreed, mlockall() is a very bad interface and should not be used for a
plethora of reasons, this being one of them.
The thing is, if you cant trust your library to do sane things, then
don't use it.
On Mon, Jan 18, 2010 at 03:49:58PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:32 +0000, Alan Cox wrote:
> > > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > > sure that all memory allocated behind my application's back (by dynamic
> > > linker, libraries, stack) will be locked otherwise?
> >
> > If you add this flag you can't do that anyway - some library will
> > helpfully start up using it and then you are completely stuffed or will
> > be back in two or three years adding MLOCKALL_ALWAYS.
>
> Agreed, mlockall() is a very bad interface and should not be used for a
> plethora of reasons, this being one of them.
>
There are valid uses for mlockall() and even if the interface is bad there
is no alternative right now, so why not fix one of it problems?
> The thing is, if you cant trust your library to do sane things, then
> don't use it.
>
Agreed, the are things that sane library should never do: exit() or output
debug info to stdio or meddle with memory mlock/munlock behind application's
back.
--
Gleb.
On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> There are valid uses for mlockall()
That's debatable.
On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
>
>> There are valid uses for mlockall()
>>
> That's debatable.
>
>
Real-time?
--
error compiling committee.c: too many arguments to function
On Mon, Jan 18, 2010 at 04:06:34PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > There are valid uses for mlockall()
>
> That's debatable.
>
Well, I have use for it. You can look at previous thread were I described
it and suggest alternatives.
--
Gleb.
On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:
> On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
> > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> >
> >> There are valid uses for mlockall()
> >>
> > That's debatable.
> >
> >
>
> Real-time?
I would not advice that, just mlock() the text and data you need for the
real-time thread. mlockall() is a really blunt instrument.
On 01/18/2010 05:14 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:
>
>> On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
>>
>>> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
>>>
>>>
>>>> There are valid uses for mlockall()
>>>>
>>>>
>>> That's debatable.
>>>
>>>
>>>
>> Real-time?
>>
> I would not advice that, just mlock() the text and data you need for the
> real-time thread. mlockall() is a really blunt instrument.
>
May not be feasible due to libraries.
--
error compiling committee.c: too many arguments to function
On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > I would not advice that, just mlock() the text and data you need for the
> > real-time thread. mlockall() is a really blunt instrument.
> >
>
> May not be feasible due to libraries.
Esp for the real-time case I could advise not to use those libraries
then, since they're clearly not designed for that use case.
On Mon, 18 Jan 2010 16:24:07 +0100
Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > > I would not advice that, just mlock() the text and data you need for the
> > > real-time thread. mlockall() is a really blunt instrument.
> > >
> >
> > May not be feasible due to libraries.
>
> Esp for the real-time case I could advise not to use those libraries
> then, since they're clearly not designed for that use case.
In "hard" real time cases an awful lot of libraries have things like
memory allocations in them and don't care about stack growth which can
cause faults and sleeps. The memory allocator if you are running threaded
was not real time priority aware either last time I checked so the
standard libraries are not going to give the behaviour you want unless
you have a proper RT environment, and even then it may be a bit iffy here
and there.
On Mon, 2010-01-18 at 15:41 +0000, Alan Cox wrote:
> On Mon, 18 Jan 2010 16:24:07 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > > > I would not advice that, just mlock() the text and data you need for the
> > > > real-time thread. mlockall() is a really blunt instrument.
> > > >
> > >
> > > May not be feasible due to libraries.
> >
> > Esp for the real-time case I could advise not to use those libraries
> > then, since they're clearly not designed for that use case.
>
> In "hard" real time cases an awful lot of libraries have things like
> memory allocations in them and don't care about stack growth which can
> cause faults and sleeps. The memory allocator if you are running threaded
> was not real time priority aware either last time I checked so the
> standard libraries are not going to give the behaviour you want unless
> you have a proper RT environment, and even then it may be a bit iffy here
> and there.
I'm quite aware of that, which is why we recommend people to
pre-allocate, mlock() and pre-fault everything in advance and make sure
the RT thread doesn't touch any data/text outside of that and uses a
limited set of system calls.
You can also do that for stacks using pthread_attr_setstack().
On Mon, Jan 18, 2010 at 4:19 PM, Gleb Natapov <[email protected]> wrote:
> The specific use cases were discussed in the thread following previous
> version of the patch. I can describe my specific use case in a change log
> and I can copy what Andrew said about his case, but is it really needed in
> a commit message itself? It boils down to greater control over when and
> where application can get major fault. There are applications that need
> this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> sure that all memory allocated behind my application's back (by dynamic
> linker, libraries, stack) will be locked otherwise?
Again, why do you want to MCL_FUTURE but then go and use MAP_UNLOCKED?
"Greater control" is not an argument for adding a new API that needs
to be maintained forever, a real world use case is.
And yes, this stuff needs to be in the changelog. Whether you want to
spell it out or post an URL to some previous discussion is up to you.
On Mon, Jan 18, 2010 at 06:05:38PM +0200, Pekka Enberg wrote:
> On Mon, Jan 18, 2010 at 4:19 PM, Gleb Natapov <[email protected]> wrote:
> > The specific use cases were discussed in the thread following previous
> > version of the patch. I can describe my specific use case in a change log
> > and I can copy what Andrew said about his case, but is it really needed in
> > a commit message itself? It boils down to greater control over when and
> > where application can get major fault. There are applications that need
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> Again, why do you want to MCL_FUTURE but then go and use MAP_UNLOCKED?
I need to have all my memory locked except one big (bigger then main
memory) chunk. I either need to rewrite my application and all libraries
to use memory allocator that return locked memory, may be even rewrite
dynamic loader to use this allocator to lock executable code too, or run
mlockall(MCL_FUTURE|MCL_CURRENT) at startup and exempt one allocation
from this rule. Note that I can't allocate it locked and then unlock
since allocation will fail. Actually for me it hangs kernel last I
checked.
> "Greater control" is not an argument for adding a new API that needs
> to be maintained forever, a real world use case is.
>
If there is real world use case for mlockall() there is real use case for
this too. People seems to be trying to convince me that I don't need
mlockall() without proposing alternatives. The only alternative I see
lock everything from userspace.
> And yes, this stuff needs to be in the changelog. Whether you want to
> spell it out or post an URL to some previous discussion is up to you.
The discussion was here just a couple of days ago. Here is the link
were I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
If you think it needs to be spelled out in commit log I'll do it.
--
Gleb.
On Mon, Jan 18, 2010 at 04:14:43PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:
> > On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
> > > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > >
> > >> There are valid uses for mlockall()
> > >>
> > > That's debatable.
> > >
> > >
> >
> > Real-time?
>
> I would not advice that, just mlock() the text and data you need for the
> real-time thread. mlockall() is a really blunt instrument.
>
Yes it is blunt, but the patch makes it less so.
--
Gleb.
Hi Gleb,
On Mon, Jan 18, 2010 at 7:08 PM, Gleb Natapov <[email protected]> wrote:
>> "Greater control" is not an argument for adding a new API that needs
>> to be maintained forever, a real world use case is.
>>
> If there is real world use case for mlockall() there is real use case for
> this too. People seems to be trying to convince me that I don't need
> mlockall() without proposing alternatives. The only alternative I see
> lock everything from userspace.
>
>> And yes, this stuff needs to be in the changelog. Whether you want to
>> spell it out or post an URL to some previous discussion is up to you.
> The discussion was here just a couple of days ago. Here is the link
> were I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
> If you think it needs to be spelled out in commit log I'll do it.
So this is a performance thing? Btw, is there are reason you can't use
plain mlock() for it as suggested by Peter earlier?
Pekka
On Mon, Jan 18, 2010 at 08:09:26PM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Mon, Jan 18, 2010 at 7:08 PM, Gleb Natapov <[email protected]> wrote:
> >> "Greater control" is not an argument for adding a new API that needs
> >> to be maintained forever, a real world use case is.
> >>
> > If there is real world use case for mlockall() there is real use case for
> > this too. People seems to be trying to convince me that I don't need
> > mlockall() without proposing alternatives. The only alternative I see
> > lock everything from userspace.
> >
> >> And yes, this stuff needs to be in the changelog. Whether you want to
> >> spell it out or post an URL to some previous discussion is up to you.
> > The discussion was here just a couple of days ago. Here is the link
> > were I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
> > If you think it needs to be spelled out in commit log I'll do it.
>
> So this is a performance thing? Btw, is there are reason you can't use
> plain mlock() for it as suggested by Peter earlier?
>
I can't realistically chase every address space mapping changes and mlock
new areas. The only way other then mlockall() is to use custom memory
allocator that allocates mlocked memory.
--
Gleb.
> I can't realistically chase every address space mapping changes and mlock
> new areas. The only way other then mlockall() is to use custom memory
> allocator that allocates mlocked memory.
Which keeps all the special cases in your app rather than in every single
users kernel. That seems to be the right way up, especially as you can
make a library of it !
Alan
> On Mon, Jan 18, 2010 at 04:06:34PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > > There are valid uses for mlockall()
> >
> > That's debatable.
> >
> Well, I have use for it. You can look at previous thread were I described
> it and suggest alternatives.
Please stop suck.
This is the reviewing. The reviewers shouldn't need to look at all
previous thread. It mean your description isn't enough.
On Mon, Jan 18, 2010 at 07:10:31PM +0000, Alan Cox wrote:
> > I can't realistically chase every address space mapping changes and mlock
> > new areas. The only way other then mlockall() is to use custom memory
> > allocator that allocates mlocked memory.
>
> Which keeps all the special cases in your app rather than in every single
> users kernel. That seems to be the right way up, especially as you can
> make a library of it !
>
Are you advocating rewriting mlockall() in userspace? It may be possible,
but will require rewriting half of libc. Everything that changes process
address space should support mlocking (memory allocation functions, dynamic
loading, strdup). Allocations can be done only with mmap() since brk()
can allocate mlocked memory atomically. And of course if third party library
uses mmap syscall directly instead of using libc one you are SOL. Been
there already, worked on project that replaced libc memory allocations functions
because it had to track when memory is returned to OS, not just internal
libc pool (MPI). This is pain in the arse and on top of that it doesn't work
reliably. Some things are better be done on OS level.
The thread took a direction of bashing mlockall(). This is especially
strange since proposed patch actually makes mlockall() more fine
grained and thus more useful.
--
Gleb.
Hi Gleb,
On Tue, Jan 19, 2010 at 9:17 AM, Gleb Natapov <[email protected]> wrote:
> The thread took a direction of bashing mlockall(). This is especially
> strange since proposed patch actually makes mlockall() more fine
> grained and thus more useful.
No, the thread took a direction of you not being able to properly
explain why we want MMAP_UNLOCKED in the kernel. It seems useless for
real-time and I've yet to figure out why you need _mlockall()_ if it's
a performance thing.
It would be probably useful if you could point us to the application
source code that actually wants this feature.
Pekka
On Tue, Jan 19, 2010 at 09:37:05AM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 9:17 AM, Gleb Natapov <[email protected]> wrote:
> > The thread took a direction of bashing mlockall(). This is especially
> > strange since proposed patch actually makes mlockall() more fine
> > grained and thus more useful.
>
> No, the thread took a direction of you not being able to properly
> explain why we want MMAP_UNLOCKED in the kernel. It seems useless for
It is needed in the kernel because this is the only proper (aka thread
safe) way to mmap area bigger the main memory after mlockall(MCL_FUTURE).
Do you agree we that? Now you can ask why is this needed and this is
valid question.
> real-time and I've yet to figure out why you need _mlockall()_ if it's
> a performance thing.
I don't do real-time so will not argue how useful it is for that,
but it seems to me that people who argue that it is not useful for real
time don't do it either and the only person in this thread who does real
time uses mlockall(). Hmm strange.
In my case (virtualization) I want to test/profile guest under heavy swapping
of a guests memory, so I intentionally create memory shortage by creating
guest much large then host memory, but I want system to swap out only
guest's memory.
>
> It would be probably useful if you could point us to the application
> source code that actually wants this feature.
>
This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
at the beginning of the main() and changes guest memory allocation to
use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
that I should rewrite qemu + all library it uses. You see why I can't
take them seriously?
--
Gleb.
Hi Gleb,
On Tue, Jan 19, 2010 at 9:52 AM, Gleb Natapov <[email protected]> wrote:
>> It would be probably useful if you could point us to the application
>> source code that actually wants this feature.
>>
> This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> at the beginning of the main() and changes guest memory allocation to
> use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> that I should rewrite qemu + all library it uses. You see why I can't
> take them seriously?
Well, that's not going to be portable, is it, so the application
design would still be broken, no? Did you try using (or extending)
posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
me that you're trying to use a big hammer (mlock) when a polite hint
for the VM would probably be sufficient for it do its job.
Pekka
On Tue, Jan 19, 2010 at 10:07:07AM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 9:52 AM, Gleb Natapov <[email protected]> wrote:
> >> It would be probably useful if you could point us to the application
> >> source code that actually wants this feature.
> >>
> > This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> > at the beginning of the main() and changes guest memory allocation to
> > use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> > that I should rewrite qemu + all library it uses. You see why I can't
> > take them seriously?
>
> Well, that's not going to be portable, is it, so the application
KVM is not portable ;) and that is what my main interest is.
> design would still be broken, no? Did you try using (or extending)
> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
After mlockall() I can't even allocate guest address space. Or do you mean
instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
page table for the address range (which is not what I need) and does not
have any long time effect.
> me that you're trying to use a big hammer (mlock) when a polite hint
> for the VM would probably be sufficient for it do its job.
>
I what to tell to VM "swap this, don't swap that" and as far as I see
there is no other way to do it currently.
--
Gleb.
Hi Gleb,
On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
>> design would still be broken, no? Did you try using (or extending)
>> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> After mlockall() I can't even allocate guest address space. Or do you mean
> instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> page table for the address range (which is not what I need) and does not
> have any long time effect.
Oh right, MADV_DONTNEED is no good.
On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
>> me that you're trying to use a big hammer (mlock) when a polite hint
>> for the VM would probably be sufficient for it do its job.
>>
> I what to tell to VM "swap this, don't swap that" and as far as I see
> there is no other way to do it currently.
Yeah, which is why I was suggesting that maybe posix_madvise() needs
to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
used as a hint by mm/vmscan.c to first swap the guest address spaces.
Pekka
On Tue, Jan 19, 2010 at 10:44:23AM +0200, Pekka Enberg wrote:
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
> >> me that you're trying to use a big hammer (mlock) when a polite hint
> >> for the VM would probably be sufficient for it do its job.
> >>
> > I what to tell to VM "swap this, don't swap that" and as far as I see
> > there is no other way to do it currently.
>
> Yeah, which is why I was suggesting that maybe posix_madvise() needs
> to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
> used as a hint by mm/vmscan.c to first swap the guest address spaces.
>
If such thing would exist may be I would have used it since swapping out
of a wrong page is not live or death matter in my case, but mlockall()
provides me with exactly what I need and without swapping out wrong
pages. Speaking about adding such madvise call wouldn't it be even
harder to justify? It obviously not good enough for real-time use and my
case, I admit, is unusual. Also if we start prioritise memory why stop
on binary, why not set value like "this memory is more important then
that memory by factor of 5"?
--
Gleb.
> In my case (virtualization) I want to test/profile guest under heavy swapping
> of a guests memory, so I intentionally create memory shortage by creating
> guest much large then host memory, but I want system to swap out only
> guest's memory.
So this isn't an API question this is an obscure corner case testing
question.
>
> >
> > It would be probably useful if you could point us to the application
> > source code that actually wants this feature.
> >
> This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> at the beginning of the main() and changes guest memory allocation to
> use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> that I should rewrite qemu + all library it uses. You see why I can't
> take them seriously?
And you want millions of users to have kernels with weird extra functions
whole sole value is one test environment you wish to run
See why we can't take you seriously either ?
On Tue, Jan 19, 2010 at 11:54:42AM +0000, Alan Cox wrote:
> > In my case (virtualization) I want to test/profile guest under heavy swapping
> > of a guests memory, so I intentionally create memory shortage by creating
> > guest much large then host memory, but I want system to swap out only
> > guest's memory.
>
> So this isn't an API question this is an obscure corner case testing
> question.
>
It is real use case scenario where the kernel doesn't provider me with
enough rope. You, of course, can dismiss it as "obscure corner case". You
can't expect issues with mlockall() which is corner case by itself to be
mainstream, can you?
> > >
> > > It would be probably useful if you could point us to the application
> > > source code that actually wants this feature.
> > >
> > This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> > at the beginning of the main() and changes guest memory allocation to
> > use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> > that I should rewrite qemu + all library it uses. You see why I can't
> > take them seriously?
>
> And you want millions of users to have kernels with weird extra functions
> whole sole value is one test environment you wish to run
>
We are talking about 4 lines of code that other people find useful too
and they commented in this thread. This wouldn't be the first kernel
feature not used by millions of people.
> See why we can't take you seriously either ?
>
I was taking about solutions. Thank you.
--
Gleb.
Hi, Pekka.
On Tue, 2010-01-19 at 10:44 +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
> >> design would still be broken, no? Did you try using (or extending)
> >> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> > After mlockall() I can't even allocate guest address space. Or do you mean
> > instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> > page table for the address range (which is not what I need) and does not
> > have any long time effect.
>
> Oh right, MADV_DONTNEED is no good.
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
> >> me that you're trying to use a big hammer (mlock) when a polite hint
> >> for the VM would probably be sufficient for it do its job.
> >>
> > I what to tell to VM "swap this, don't swap that" and as far as I see
> > there is no other way to do it currently.
>
> Yeah, which is why I was suggesting that maybe posix_madvise() needs
> to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
> used as a hint by mm/vmscan.c to first swap the guest address spaces.
>
> Pekka
Gleb. How about using MADV_SEQUENTIAL on guest memory?
It makes that pages of guest are moved into inactive reclaim list more
fast. It means it is likely to swap out faster than other pages if it
isn't hit during inactive list.
--
Kind regards,
Minchan Kim
On Tue, Jan 19, 2010 at 2:48 PM, Minchan Kim <[email protected]> wrote:
> Gleb. How about using MADV_SEQUENTIAL on guest memory?
> It makes that pages of guest are moved into inactive reclaim list more
> fast. It means it is likely to swap out faster than other pages if it
> isn't hit during inactive list.
Yeah, something like that but we don't want the readahead. OTOH, it's
not clear what Gleb's real problem is. Are the guest address spaces
anonymous or file backed? Which parts of the emulator are swapped out
that are causing the problem? Maybe it's a VM balancing issue that
mlock papers over?
Pekka
> > And you want millions of users to have kernels with weird extra functions
> > whole sole value is one test environment you wish to run
> >
> We are talking about 4 lines of code that other people find useful too
> and they commented in this thread. This wouldn't be the first kernel
> feature not used by millions of people.
It wouldn't be the first completely dumb mistake in the kernel either,
but one dumb mistake doesn't argue for including others
On Tue, Jan 19, 2010 at 03:18:11PM +0200, Pekka Enberg wrote:
> On Tue, Jan 19, 2010 at 2:48 PM, Minchan Kim <[email protected]> wrote:
> > Gleb. How about using MADV_SEQUENTIAL on guest memory?
> > It makes that pages of guest are moved into inactive reclaim list more
> > fast. It means it is likely to swap out faster than other pages if it
> > isn't hit during inactive list.
>
> Yeah, something like that but we don't want the readahead. OTOH, it's
> not clear what Gleb's real problem is. Are the guest address spaces
> anonymous or file backed?
Anonymous.
> Which parts of the emulator are swapped out
> that are causing the problem?
I don't want anything that can be used during guest runtime to be
swapped out. And I run 2G guest in 512M container, so eventually
everything is swapped out :)
> Maybe it's a VM balancing issue that
> mlock papers over?
>
There is no problem. I do measurements on how host swapping affects
guest and I don't want qemu code to be swapped out.
--
Gleb.
On Tue, 2010-01-19 at 09:52 +0200, Gleb Natapov wrote:
> In my case (virtualization) I want to test/profile guest under heavy swapping
> of a guests memory, so I intentionally create memory shortage by creating
You mean "guest memory" that is area emulated DRAM in qemu?
It is anonymous vma.
> guest much large then host memory, but I want system to swap out only
> guest's memory.
Couldn't you use MADV_SEQUENTIAL on only guest memory area?
It doesn't make side effect about readahead since it's anon area.
And it would make do best effort to swap out guest's memory.
--
Kind regards,
Minchan Kim
On Tue, Jan 19, 2010 at 11:07:23PM +0900, Minchan Kim wrote:
> On Tue, 2010-01-19 at 09:52 +0200, Gleb Natapov wrote:
>
> > In my case (virtualization) I want to test/profile guest under heavy swapping
> > of a guests memory, so I intentionally create memory shortage by creating
>
> You mean "guest memory" that is area emulated DRAM in qemu?
> It is anonymous vma.
>
> > guest much large then host memory, but I want system to swap out only
> > guest's memory.
>
> Couldn't you use MADV_SEQUENTIAL on only guest memory area?
> It doesn't make side effect about readahead since it's anon area.
> And it would make do best effort to swap out guest's memory.
>
I can try, should be better then nothing I guess. Doesn't guaranty that
emulator memory will not be swapped out though.
--
Gleb.
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <[email protected]> wrote:
> >> design would still be broken, no? Did you try using (or extending)
> >> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> > After mlockall() I can't even allocate guest address space. Or do you mean
> > instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> > page table for the address range (which is not what I need) and does not
> > have any long time effect.
>
> Oh right, MADV_DONTNEED is no good.
Off topic:
posix_madvise(MADV_DONTNEED) is nop. glibc's posix_madvise(MADV_DONTNEED)
don't call linux's madvise(MADV_DONTNEED).
It's because madvise(MADV_DONTNEED) is not POSIX compliant.
The behavior of linux madvise(MADV_DONTNEED) is similar to Solaris (or *BSD)
madvise(MADV_FREE).