2011-05-08 18:56:01

by Mikulas Patocka

Subject: [PATCH] Don't mlock guardpage if the stack is growing up

Don't mlock guardpage if the stack is growing up

The Linux kernel excludes the guard page when performing mlock on a VMA
with a down-growing stack. However, some architectures have an up-growing
stack, and the guard page should be excluded from locking in that case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
an up-growing stack). lvm2 counts the number of pages in use when locking
and when unlocking and reports an internal error if the numbers mismatch.
On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard
page; this allocates one more page and triggers the internal error in lvm2.

Signed-off-by: Mikulas Patocka <[email protected]>

---
include/linux/mm.h | 6 ++++++
mm/memory.c | 21 ++++++++++++---------
2 files changed, 18 insertions(+), 9 deletions(-)

Index: linux-2.6.39-rc6-fast/include/linux/mm.h
===================================================================
--- linux-2.6.39-rc6-fast.orig/include/linux/mm.h 2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/include/linux/mm.h 2011-05-07 05:59:52.000000000 +0200
@@ -1016,6 +1016,12 @@ static inline int vma_stack_continue(str
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6.39-rc6-fast/mm/memory.c
===================================================================
--- linux-2.6.39-rc6-fast.orig/mm/memory.c 2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/mm/memory.c 2011-05-07 05:59:52.000000000 +0200
@@ -1412,9 +1412,12 @@ no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
+	return ((vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_continue(vma->vm_prev, addr)) ||
+	       ((vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
 }
 
 /**
@@ -1551,18 +1554,18 @@ int __get_user_pages(struct task_struct
 			continue;
 		}
 
-		/*
-		 * If we don't actually want the page itself,
-		 * and it's the stack guard page, just skip it.
-		 */
-		if (!pages && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
 			/*
+			 * If we don't actually want the page itself,
+			 * and it's the stack guard page, just skip it.
+			 */
+			if (!pages && stack_guard_page(vma, start))
+				goto next_page;
+
+			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
 			 */


2011-05-08 20:12:08

by Hugh Dickins

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Sun, 8 May 2011, Mikulas Patocka wrote:

> Don't mlock guardpage if the stack is growing up
>
> Linux kernel excludes guard page when performing mlock on a VMA with
> down-growing stack. However, some architectures have up-growing stack and
> locking the guard page should be excluded in this case too.
>
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> up-growing stack). lvm2 calculates the number of used pages when locking
> and when unlocking and reports an internal error if the numbers mismatch.
> On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard
> page, this causes allocation of one more page and internal error in lvm2.
>
> Signed-off-by: Mikulas Patocka <[email protected]>

Interesting, I'd convinced myself that the growsup case was safe,
because of how we always approach the vma from its bottom end.

I've added Michel to the Cc, he's the one with the best grasp here.

Could you point us to where lvm2 is making these calculations?
I don't understand quite what it's doing.

Thanks,
Hugh

>
> ---
> include/linux/mm.h | 6 ++++++
> mm/memory.c | 21 ++++++++++++---------
> 2 files changed, 18 insertions(+), 9 deletions(-)
>
> Index: linux-2.6.39-rc6-fast/include/linux/mm.h
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/include/linux/mm.h 2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/include/linux/mm.h 2011-05-07 05:59:52.000000000 +0200
> @@ -1016,6 +1016,12 @@ static inline int vma_stack_continue(str
> return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
> }
>
> +/* Is the vma a continuation of the stack vma below it? */
> +static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
> +{
> + return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
> +}
> +
> extern unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len);
> Index: linux-2.6.39-rc6-fast/mm/memory.c
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/mm/memory.c 2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/mm/memory.c 2011-05-07 05:59:52.000000000 +0200
> @@ -1412,9 +1412,12 @@ no_page_table:
>
> static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
> {
> - return (vma->vm_flags & VM_GROWSDOWN) &&
> + return ((vma->vm_flags & VM_GROWSDOWN) &&
> (vma->vm_start == addr) &&
> - !vma_stack_continue(vma->vm_prev, addr);
> + !vma_stack_continue(vma->vm_prev, addr)) ||
> + ((vma->vm_flags & VM_GROWSUP) &&
> + (vma->vm_end == addr + PAGE_SIZE) &&
> + !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
> }
>
> /**
> @@ -1551,18 +1554,18 @@ int __get_user_pages(struct task_struct
> continue;
> }
>
> - /*
> - * If we don't actually want the page itself,
> - * and it's the stack guard page, just skip it.
> - */
> - if (!pages && stack_guard_page(vma, start))
> - goto next_page;
> -
> do {
> struct page *page;
> unsigned int foll_flags = gup_flags;
>
> /*
> + * If we don't actually want the page itself,
> + * and it's the stack guard page, just skip it.
> + */
> + if (!pages && stack_guard_page(vma, start))
> + goto next_page;
> +
> + /*
> * If we have a pending SIGKILL, don't keep faulting
> * pages and potentially allocating memory.
> */

2011-05-08 21:48:01

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
<[email protected]> wrote:
>
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> up-growing stack). lvm2 calculates the number of used pages when locking
> and when unlocking and reports an internal error if the numbers mismatch.

This patch won't apply on current kernels (including stable) because
of commit a1fde08c74e9 that changed the test of "pages" to instead
just test "flags & FOLL_MLOCK".

That should be trivial to fix up.

However, I really don't much like this complex test:

>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>  {
> -	return (vma->vm_flags & VM_GROWSDOWN) &&
> +	return ((vma->vm_flags & VM_GROWSDOWN) &&
> 		(vma->vm_start == addr) &&
> -		!vma_stack_continue(vma->vm_prev, addr);
> +		!vma_stack_continue(vma->vm_prev, addr)) ||
> +	       ((vma->vm_flags & VM_GROWSUP) &&
> +		(vma->vm_end == addr + PAGE_SIZE) &&
> +		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>  }

in that format. It gets really hard to read, and I think you'd be
better off writing it as two helper functions (or macros) for the two
cases, and then have

static inline int stack_guard_page(struct vm_area_struct *vma,
				   unsigned long addr)
{
	return stack_guard_page_growsdown(vma, addr) ||
	       stack_guard_page_growsup(vma, addr);
}

I'd also like to verify that it doesn't actually generate any extra
code for the common case (iirc VM_GROWSUP is 0 for the architectures
that don't need it, and so the compiler shouldn't generate any extra
code, but I'd like that mentioned and verified explicitly).

Hmm?

Other than that it looks ok to me.

That said, could we please fix LVM to not do that crazy sh*t in the
first place? The STACK_GROWSUP case is never going to have a lot of
testing, this is just sad.

Linus

2011-05-09 11:01:12

by Mikulas Patocka

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up



On Sun, 8 May 2011, Linus Torvalds wrote:

> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
> <[email protected]> wrote:
> >
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> > up-growing stack). lvm2 calculates the number of used pages when locking
> > and when unlocking and reports an internal error if the numbers mismatch.
>
> This patch won't apply on current kernels (including stable) because
> of commit a1fde08c74e9 that changed the test of "pages" to instead
> just test "flags & FOLL_MLOCK".
>
> That should be trivial to fix up.
>
> However, I really don't much like this complex test:
>
> >  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
> >  {
> > -	return (vma->vm_flags & VM_GROWSDOWN) &&
> > +	return ((vma->vm_flags & VM_GROWSDOWN) &&
> > 		(vma->vm_start == addr) &&
> > -		!vma_stack_continue(vma->vm_prev, addr);
> > +		!vma_stack_continue(vma->vm_prev, addr)) ||
> > +	       ((vma->vm_flags & VM_GROWSUP) &&
> > +		(vma->vm_end == addr + PAGE_SIZE) &&
> > +		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
> >  }
>
> in that format. It gets really hard to read, and I think you'd be
> better off writing it as two helper functions (or macros) for the two
> cases, and then have
>
> static inline int stack_guard_page(struct vm_area_struct *vma,
> unsigned long addr)
> {
> return stack_guard_page_growsdown(vma, addr) ||
> stack_guard_page_growsup(vma, addr);
> }
>
> I'd also like to verify that it doesn't actually generate any extra
> code for the common case (iirc VM_GROWSUP is 0 for the architectures
> that don't need it, and so the compiler shouldn't generate any extra
> code, but I'd like that mentioned and verified explicitly).
>
> Hmm?
>
> Other than that it looks ok to me.
>
> That said, could we please fix LVM to not do that crazy sh*t in the
> first place? The STACK_GROWSUP case is never going to have a lot of
> testing, this is just sad.

LVM reads the process's maps from /proc/self/maps and locks them with mlock.

Why doesn't it use mlockall()? Because glibc maps all locales into the
process. Glibc packs all locales into a single >100MB file and maps that
file into every process. Even if the process uses just one locale, glibc
maps them all.

So, when LVM used mlockall, it consumed >100MB of memory, and that caused
out-of-memory problems in system installers.

So an alternate way of locking was added to LVM --- read all the maps and
lock them, except for the glibc locale file.

The real fix would be for glibc not to map 100MB into every process.

> Linus
>

This is updated patch. I tested it on x86-64 and it doesn't change
generated code.

Mikulas

---

Don't lock guardpage if the stack is growing up

The Linux kernel excludes the guard page when performing mlock on a VMA
with a down-growing stack. However, some architectures have an up-growing
stack, and the guard page should be excluded from locking in that case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
an up-growing stack). lvm2 counts the number of pages in use when locking
and when unlocking and reports an internal error if the numbers mismatch.

Signed-off-by: Mikulas Patocka <[email protected]>

---
include/linux/mm.h | 10 +++++++++-
mm/memory.c | 21 ++++++++++++++++++---
2 files changed, 27 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2011-05-09 12:33:50.000000000 +0200
+++ linux-2.6/include/linux/mm.h 2011-05-09 12:42:05.000000000 +0200
@@ -1011,11 +1011,19 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_stack_growsdown_continue(struct vm_area_struct *vma,
+					       unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2011-05-09 12:33:51.000000000 +0200
+++ linux-2.6/mm/memory.c 2011-05-09 12:41:38.000000000 +0200
@@ -1410,11 +1410,21 @@ no_page_table:
 	return page;
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+static inline int stack_guard_page_growsdown(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return (vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_growsdown_continue(vma->vm_prev, addr);
+}
+
+
+static inline int stack_guard_page_growsup(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE);
 }

/**
@@ -1554,13 +1564,18 @@ int __get_user_pages(struct task_struct
 		/*
 		 * For mlock, just skip the stack guard page.
 		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) &&
+		    stack_guard_page_growsdown(vma, start))
 			goto next_page;
 
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
+			if ((gup_flags & FOLL_MLOCK) &&
+			    stack_guard_page_growsup(vma, start))
+				goto next_page;
+
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
 			 */

2011-05-09 11:12:27

by Mikulas Patocka

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Sun, 8 May 2011, Hugh Dickins wrote:

> On Sun, 8 May 2011, Mikulas Patocka wrote:
>
> > Don't mlock guardpage if the stack is growing up
> >
> > Linux kernel excludes guard page when performing mlock on a VMA with
> > down-growing stack. However, some architectures have up-growing stack and
> > locking the guard page should be excluded in this case too.
> >
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> > up-growing stack). lvm2 calculates the number of used pages when locking
> > and when unlocking and reports an internal error if the numbers mismatch.
> > On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard
> > page, this causes allocation of one more page and internal error in lvm2.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> Interesting, I'd convinced myself that the growsup case was safe,
> because of how we always approach the vma from its bottom end.
>
> I've added Michel to the Cc, he's the one with the best grasp here.
>
> Could you point us to where lvm2 is making these calculations?
> I don't understand quite what it's doing.
>
> Thanks,
> Hugh

See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file
and locks each map with mlock, except for the glibc locale file.

It calculates how much memory it locked when locking and when unlocking,
and prints an internal error if the numbers differ. The internal error
normally means that some memory was allocated while the maps were locked
(which is wrong).

However, with an up-growing stack, the internal error is always triggered,
because mlock() of the stack touches the guard page and allocates one more
page.
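The scheme described above can be sketched in user space roughly like this. This is a hypothetical simplification for illustration only, not LVM2's actual code (see lib/mm/memlock.c for that); the helper names and the "locale-archive" pattern match are assumptions:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Parse one line of /proc/self/maps into start/end addresses and the
 * (possibly empty) pathname.  Returns 0 on success. */
static int parse_maps_line(const char *line, unsigned long *start,
			   unsigned long *end, char *path, size_t pathlen)
{
	char perms[8];
	int consumed = 0;

	path[0] = '\0';
	if (sscanf(line, "%lx-%lx %7s %*x %*x:%*x %*u%n",
		   start, end, perms, &consumed) < 3)
		return -1;
	while (line[consumed] == ' ')
		consumed++;
	snprintf(path, pathlen, "%s", line + consumed);
	path[strcspn(path, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

/* Skip the glibc locale archive (the exact pattern is an assumption). */
static int should_lock(const char *path)
{
	return strstr(path, "locale-archive") == NULL;
}

/* Walk /proc/self/maps and mlock every region except the locale archive.
 * Returns the number of bytes successfully locked, or -1 on error.
 * Without CAP_IPC_LOCK most mlock() calls will fail; failures are simply
 * skipped here to keep the sketch short. */
static long lock_all_maps(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512], path[256];
	unsigned long start, end;
	long locked = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (parse_maps_line(line, &start, &end, path, sizeof(path)))
			continue;
		if (!should_lock(path))
			continue;
		if (mlock((void *)start, end - start) == 0)
			locked += end - start;
	}
	fclose(f);
	return locked;
}
```

LVM2 additionally records the byte counts at lock time and unlock time and compares them, which is exactly the check that trips over the guard page on GROWSUP architectures.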

Mikulas

2011-05-09 11:44:41

by Zdenek Kabelac

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

Dne 9.5.2011 13:01, Mikulas Patocka napsal(a):
>
>
> On Sun, 8 May 2011, Linus Torvalds wrote:
>
>> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
>> <[email protected]> wrote:
>>>
>>> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
>>> up-growing stack). lvm2 calculates the number of used pages when locking
>>> and when unlocking and reports an internal error if the numbers mismatch.
>>
>> This patch won't apply on current kernels (including stable) because
>> of commit a1fde08c74e9 that changed the test of "pages" to instead
>> just test "flags & FOLL_MLOCK".
>>
>> That should be trivial to fix up.
>>
>> However, I really don't much like this complex test:
>>
>>> static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>>> {
>>> - return (vma->vm_flags & VM_GROWSDOWN) &&
>>> + return ((vma->vm_flags & VM_GROWSDOWN) &&
>>> (vma->vm_start == addr) &&
>>> - !vma_stack_continue(vma->vm_prev, addr);
>>> + !vma_stack_continue(vma->vm_prev, addr)) ||
>>> + ((vma->vm_flags & VM_GROWSUP) &&
>>> + (vma->vm_end == addr + PAGE_SIZE) &&
>>> + !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>>> }
>>
>> in that format. It gets really hard to read, and I think you'd be
>> better off writing it as two helper functions (or macros) for the two
>> cases, and then have
>>
>> static inline int stack_guard_page(struct vm_area_struct *vma,
>> unsigned long addr)
>> {
>> return stack_guard_page_growsdown(vma, addr) ||
>> stack_guard_page_growsup(vma, addr);
>> }
>>
>> I'd also like to verify that it doesn't actually generate any extra
>> code for the common case (iirc VM_GROWSUP is 0 for the architectures
>> that don't need it, and so the compiler shouldn't generate any extra
>> code, but I'd like that mentioned and verified explicitly).
>>
>> Hmm?
>>
>> Other than that it looks ok to me.
>>
>> That said, could we please fix LVM to not do that crazy sh*t in the
>> first place? The STACK_GROWSUP case is never going to have a lot of
>> testing, this is just sad.
>
> LVM reads process maps from /proc/self/maps and locks them with mlock.
>
> Why it doesn't use mlockall()? Because glibc maps all locales to the
> process. Glibc packs all locales to a 100MB file and maps that file to
> every process. Even if the process uses just one locale, glibc maps all.
>
> So, when LVM used mlockall, it consumed >100MB memory and it caused
> out-of-memory problems in system installers.
>
> So, alternate way of locking was added to LVM --- read all maps and lock
> them, except for the glibc locale file.
>
> The real fix would be to fix glibc not to map 100MB to every process.
>

I should probably add a few words here.

Glibc knows a few more ways around this --- it can work with one locale
file per language, or even without mmap, allocating the locales in memory.
It usually depends on the distribution: Fedora decided to combine all
locales into one huge file (>100MB), while Ubuntu/Debian mmap each locale
individually (usually on the order of megabytes).

LVM supports both ways: either the user may select in lvm.conf to always
use mlockall, or he may switch to mlock-ing individual memory areas, where
those memory parts that cannot be executed during the suspend state and
cannot cause a memory deadlock are not locked into memory. As a 'bonus',
it's internally used for tracking algorithmic bugs.

Zdenek

2011-05-09 15:58:20

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 4:12 AM, Mikulas Patocka
<[email protected]> wrote:
>
> See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file
> and locks each map with mlock, except for glibc locale file.

Hmm. One thing that strikes me is this problem also implies that the
/proc/self/maps file is wrong for the GROWSUP case, isn't it?

So I think we should not just apply your lock fix, but then *also*
apply something like this:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addfd9803..080980880c7f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -214,7 +214,7 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	int flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
-	unsigned long start;
+	unsigned long start, end;
 	dev_t dev = 0;
 	int len;
 
@@ -227,13 +227,16 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 
 	/* We don't show the stack guard page in /proc/maps */
 	start = vma->vm_start;
-	if (vma->vm_flags & VM_GROWSDOWN)
-		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
-			start += PAGE_SIZE;
+	if (stack_guard_page_growsdown(vma, start))
+		start += PAGE_SIZE;
+	end = vma->vm_end;
+	if (stack_guard_page_growsup(vma, end))
+		end -= PAGE_SIZE;
+
 
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
 			start,
-			vma->vm_end,
+			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',

NOTE! The above actually assumes that your
"stack_guard_page_growsup()" has been changed to actually take the
"next page" value, which I think makes more sense (ie it's the "end of
stack", the same way "stack_guard_page_growsdown()" address is).

Hmm? I don't have any growsup machine to test with, can you do that?

Linus

2011-05-09 21:09:04

by Alasdair G Kergon

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> Dne 9.5.2011 13:01, Mikulas Patocka napsal(a):
> > So, when LVM used mlockall, it consumed >100MB memory and it caused
> > out-of-memory problems in system installers.

The way glibc and some distros have implemented locales makes mlockall()
quite useless for many purposes now in practice IMHO. We discussed the
options at the time and wrote the extra code to implement the only
solution we were offered.

(1) There was no possibility of some distros not using a single locale
database (reasons: optimisation and packaging convenience);
(2) there was no possibility of glibc providing a mechanism to unmap the
locale database (if you reset your locale to the default glibc will not
unmap the file and indeed doesn't rule out doing something similar with
other files in future - these are glibc internals and it's none of our
business).

Alasdair

2011-05-09 22:16:03

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. One thing that strikes me is this problem also implies that the
> /proc/self/maps file is wrong for the GROWSUP case, isn't it?
>
> So I think we should not just apply your lock fix, but then *also*
> apply something like this:

Actually, I think we might be better off with something like this.

It makes a few more changes:

- move the stack guard page checking in __get_user_pages() into the
rare case (ie we didn't find a page), since that's the only case we
care about (the thing about the guard page is that we don't want to call
"handle_mm_fault()"). As a result, it's off any path where we can
possibly care about performance, so we might as well have a nice
helper function for both the grow-up and grow-down cases, instead of
trying to be clever and only look at the grow-down case for the first
page in the vma like you did in your patch.

End result: simpler, more straightforward code.

- Move the growsup/down helper functions to <linux/mm.h>, since the
/proc code really wants to use them too. That means that the
"vma_stack_continue()" function (which now got split up into two
cases, for the up/down cases) is now entirely just an internal helper
function - nobody else uses it, and the real interface are the
"stack_guard_page_xyz()" functions. Renamed to be simpler.

- changed the naming of those stack_guard_page functions to use
_start and _end instead of growsup/growsdown, since they actually take
the start or the end of the page as the argument (to match the
semantics of the afore-mentioned helpers)

- and finally, make /proc/<pid>/maps use these helpers for both the
up/down case, so now /proc/self/maps should work well for the growsup
case too.

Hmm?

The only oddish case is IA64 that actually has a stack that grows
*both* up and down. That means that I could make up a stack mapping
that has a single virtual page in it, that is both the start *and* the
end page. Now /proc/self/maps would actually show such a mapping with
"negative" size. That's interesting.

It would be easy enough to have a "if (end < start) end = start" there
for that case, but maybe it's actually interesting information.

Regardless, I'd like to hear whether this patch really does work on
PA-RISC and especially IA64. I think those are the only cases that
have a GROWSUP stack. And the IA64 case that supports both is the most
interesting, everybody else does just one or the other.

Linus


Attachments:
patch.diff (3.74 kB)

2011-05-09 22:19:15

by James Bottomley

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, 2011-05-09 at 15:07 -0700, Linus Torvalds wrote:
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.

So I can test the patch, if you tell me how. I don't use lvm2, so it
would have to be a simple test case.

James

2011-05-09 22:26:42

by Mikulas Patocka

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up



On Mon, 9 May 2011, Linus Torvalds wrote:

> On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Hmm. One thing that strikes me is this problem also implies that the
> > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> >
> > So I think we should not just apply your lock fix, but then *also*
> > apply something like this:
>
> Actually, I think we might be better off with something like this.
>
> It makes a few more changes:
>
> - move the stack guard page checking in __get_user_pages() into the
> rare case (ie we didn't find a page), since that's the only case we
> care about (the thing about the guard page is that don't want to call
> "handle_mm_fault()"). As a result, it's off any path where we can
> possibly care about performance, so we might as well have a nice
> helper function for both the grow-up and grow-down cases, instead of
> trying to be clever and only look at the grow-down case for the first
> page in the vma like you did in your patch.
>
> End result: simpler, more straightforward code.
>
> - Move the growsup/down helper functions to <linux/mm.h>, since the
> /proc code really wants to use them too. That means that the
> "vma_stack_continue()" function (which now got split up into two
> cases, for the up/down cases) is now entirely just an internal helper
> function - nobody else uses it, and the real interface are the
> "stack_guard_page_xyz()" functions. Renamed to be simpler.
>
> - changed that naming of those stack_guard_page functions to use
> _start and _end instead of growsup/growsdown, since it actually takes
> the start or the end of the page as the argument (to match the
> semantics of the afore-mentioned helpers)
>
> - and finally, make /proc/<pid>/maps use these helpers for both the
> up/down case, so now /proc/self/maps should work well for the growsup
> case too.
>
> Hmm?
>
> The only oddish case is IA64 that actually has a stack that grows
> *both* up and down. That means that I could make up a stack mapping
> that has a single virtual page in it, that is both the start *and* the
> end page. Now /proc/self/maps would actually show such a mapping with
> "negative" size. That's interesting.
>
> It would be easy enough to have a "if (end < start) end = start" there
> for that case, but maybe it's actually interesting information.
>
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.
>
> Linus

I will test it after a week, now I'm traveling away.

Mikulas

2011-05-09 22:32:37

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 3:19 PM, James Bottomley
<[email protected]> wrote:
>
> So I can test the patch, if you tell me how. I don't use lvm2, so it
> would have to be a simple test case.

Something like the attached. Run it as root (it needs root just for
the mlockall), and see whether the stack it shows changes.

With current kernels, I think the stack expands by one page during the
mlockall (for STACK_GROWSUP), with the patch it shouldn't.

Linus


Attachments:
t.c (400.00 B)
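The t.c attachment itself isn't preserved here. A minimal test along the lines Linus describes might look like the following sketch; this is a hypothetical reconstruction, not the actual attachment, and the function names are assumptions:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Return the size in bytes of the first mapping annotated [stack] in
 * /proc/self/maps, or 0 if none is found.  (As Tony notes later in the
 * thread, the upward-growing stack on ia64 is not annotated, so this
 * only sees the conventional stack.) */
static unsigned long stack_size(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	unsigned long start, end, size = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "[stack]") &&
		    sscanf(line, "%lx-%lx", &start, &end) == 2) {
			size = end - start;
			break;
		}
	}
	fclose(f);
	return size;
}

/* Print the stack mapping size before and after mlockall().  On a buggy
 * kernel with STACK_GROWSUP, the mlockall faults in the guard page and
 * the mapping grows by one page. */
static void guard_page_probe(void)
{
	unsigned long before = stack_size();

	if (mlockall(MCL_CURRENT) != 0)	/* needs root / CAP_IPC_LOCK */
		perror("mlockall");

	printf("stack before: %lu bytes, after: %lu bytes\n",
	       before, stack_size());
}
```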

2011-05-09 22:45:15

by Matthew Wilcox

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> > Why it doesn't use mlockall()? Because glibc maps all locales to the
> > process. Glibc packs all locales to a 100MB file and maps that file to
> > every process. Even if the process uses just one locale, glibc maps all.
> >
> > So, when LVM used mlockall, it consumed >100MB memory and it caused
> > out-of-memory problems in system installers.
> >
> > So, alternate way of locking was added to LVM --- read all maps and lock
> > them, except for the glibc locale file.
> >
> > The real fix would be to fix glibc not to map 100MB to every process.
>
> I should add here probably few words.
>
> Glibc knows few more ways around - so it could work only with one locale file
> per language, or even without using mmap and allocating them in memory.
> Depends on the distribution usually - Fedora decided to combine all locales
> into one huge file (>100MB) - Ubuntu/Debian mmaps each locales individually
> (usually ~MB)

Sounds to me like glibc should introduce an mlockmost() call that does all
the work for you ...

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-05-09 22:53:38

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 3:31 PM, Linus Torvalds
<[email protected]> wrote:
> With current kernels, I think the stack expands by one page during the
> mlockall (for STACK_GROWSUP), with the patch it shouldn't.

Tried this on ia64 (with a mod because the upward growing stack isn't
blessed with
the [stack] annotation, only the downward growing stack gets that honour).

ia64 builds, boots, and processes can still grow stacks (both of them). The
patched kernel doesn't change the size of the upwardly growing stack across
the mlockall().

-Tony

P.S. while we could start both stacks on the same page and have them grow
away from the start point, ia64 actually starts them out a fair distance apart
and lets them run into each other (if you have enough memory to let them
grow that far, and if ulimit -s doesn't stop them earlier)
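As an aside, the growth direction of the conventional call stack can be probed from user space with a classic trick: compare the address of a local variable in a callee with one in its caller. This is a heuristic sketch, not an ABI guarantee (aggressive inlining or unusual ABIs can fool it, hence the GCC-specific noinline attribute), and on ia64 it only sees the memory stack, not the upward-growing register backing store:

```c
#include <assert.h>
#include <stdint.h>

/* Return -1 if a deeper frame sits at a lower address (stack grows
 * down), +1 if at a higher address (stack grows up).  noinline keeps
 * the compiler from collapsing the two frames into one. */
__attribute__((noinline))
static int compare_frames(uintptr_t caller_local)
{
	volatile char callee_local;

	return (uintptr_t)&callee_local < caller_local ? -1 : 1;
}

static int stack_direction(void)
{
	volatile char here;

	return compare_frames((uintptr_t)&here);
}
```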

2011-05-09 22:56:35

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 3:45 PM, Matthew Wilcox <[email protected]> wrote:
>
> Sounds to me like glibc should introduce an mlockmost() call that does all
> the work for you ...

That sounds like the worst of all worlds. Nobody will use it, now
there's two places to break, and how do you describe what you don't
want mlocked?

I dunno. There are so few applications that use mlock at *all* that I
wonder if it's even worth worrying about. And this one case we've
hopefully now fixed anyway.

But if people really want to fix mlockall(), I'd suggest some way of
just marking certain mappings as "sparse mappings", and then we can
just teach mlockall to not lock them. And then glibc could just mark
its locale archive that way - and maybe others would mark other
mappings.

We could still make such a "sparse mapping" act as locked for the
actual pages that are mapped into it - so it would have some kind of
real semantics. You could do a "mlockall(..)" on the process, and then
touch the sparse pages you know you want, and they'd be guaranteed to
be mapped after that.

We might even have a "mlockall(MCL_SPARSE)" flag that does that for
everything - basically guarantee that "whatever is mapped in now will
remain mapped in, but I won't bother paging it in just because you ask
me to do a mlockall".

So there are sane semantics for the concept, and it would be easy to
do in the kernel. Whether it's worth doing, I dunno.

Linus

2011-05-09 22:58:56

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 3:53 PM, Tony Luck <[email protected]> wrote:
>
> P.S. While we could start both stacks on the same page and have them grow
> away from the start point, ia64 actually starts them out a fair distance
> apart and lets them run into each other (if you have enough memory to let
> them grow that far, and if ulimit -s doesn't stop them earlier)

Ahh, so you never actually have one single mapping that has both flags set?

In that case, I won't even worry about it.

One thing I did want to verify: did the mlockall() actually change the
stack size without that patch? Just to double-check that the patch
actually did change semantics visibly.

Linus

2011-05-09 23:08:47

by Tony Luck

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 3:58 PM, Linus Torvalds
<[email protected]> wrote:
> Ahh, so you never actually have one single mapping that has both flags set?
>
> In that case, I won't even worry about it.

Definitely not for normal processes - I'm not sure how both stacks are
set up for threads.

> One thing I did want to verify: did the mlockall() actually change the
> stack size without that patch? Just to double-check that the patch
> actually did change semantics visibly.

On an unpatched system I see this (lots more than one page of growth -
pages are 64K on this config):
6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0

On a patched system I see (this one has 16K pages - no growth)
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

-Tony

2011-05-09 23:17:55

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 4:08 PM, Tony Luck <[email protected]> wrote:
>
> Definitely not for normal processes - I'm not sure how both stacks are
> set up for threads.

We don't actually allow user space to set the growsup/growsdown bits
any more (we have PROT_GROWSUP and PROT_GROWSDOWN, but that is to
allow mprotect to not give an exact range, but say "apply this to the
end of a growsup/growsdown segment").

So the only thing that has those bits are things that the kernel sets
explicitly at exec time. So if ia64 doesn't set it, we're all good.

>> One thing I did want to verify: did the mlockall() actually change the
>> stack size without that patch? Just to double-check that the patch
>> actually did change semantics visibly.
>
> On an unpatched system I see this (lots more than one page of growth -
> pages are 64K on this config):
> 6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
> 6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0
>
> On a patched system I see (this one has 16K pages - no growth)
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

Ok, I'll consider it tested. I'll commit it with Mikulas as author,
but note that I edited it so he won't get the blame if there's some
problem.

Linus

2011-05-09 23:26:17

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
<[email protected]> wrote:
>
> Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> but note that I edited it so he won't get the blame if there's some
> problem.

Oh, and I marked it for stable too, although I don't know if any
distribution really cares about parisc or ia64. And I'm not sure that
ia64 even saw the lvm2 failure case - I'd have expected to hear about
it if it actually happens there.

Linus

2011-05-10 04:13:00

by James Bottomley

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, 2011-05-09 at 16:25 -0700, Linus Torvalds wrote:
> On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> > but note that I edited it so he won't get the blame if there's some
> > problem.
>
> Oh, and I marked it for stable too, although I don't know if any
> distribution really cares about parisc or ia64. And I'm not sure that
> ia64 even saw the lvm2 failure case - I'd have expected to hear about
> it if it actually happens there.

Well, it's a done deal, but here's the proof on parisc too:

Before:

c0266000-c0289000 rwxp 00000000 00:00 0 [stack]
c0266000-c0a66000 rwxp 00000000 00:00 0 [stack]

After:

bffee000-c0010000 rwxp 00000000 00:00 0 [stack]
bffee000-c0010000 rwxp 00000000 00:00 0 [stack]

James

2011-05-10 22:57:37

by Alasdair G Kergon

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Mon, May 09, 2011 at 03:56:11PM -0700, Linus Torvalds wrote:
> So there are sane semantics for the concept, and it would be easy to
> do in the kernel. Whether it's worth doing, I dunno.

At this point we have a workaround that seems to work for us. I find it
ugly and it has needed some tweaking already but we can cope.

If others find similar problems to ours and start replicating the logic
we're using, then that would be the time to give serious thought to a
clean 'sparse' extension. (Maybe marking individual mappings as MAP_SPARSE
and adding a MCL_NO_SPARSE option to ignore them sounds most promising to me.)

(What other software packages make use of mlockall() and under what
circumstances?)

Alasdair

2011-05-11 15:58:11

by Milan Broz

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On 05/11/2011 12:57 AM, Alasdair G Kergon wrote:
> (What other software packages make use of mlockall() and under what
> circumstances?)

Another one is cryptsetup, for commands which manipulate keys in memory.
(Users of the libcryptsetup library are not forced to lock memory; it is
an optional call. But cryptsetup itself, as a libcryptsetup user, always
locks memory.)

I am not happy with mlockall() either, but the lvm2 workaround is quite
complicated.

Basically it wants to lock the memory holding explicitly allocated keys
(though this could be rewritten to use a specific locked page), but it
also needs to lock various libdevmapper buffers when issuing the dmcrypt
config ioctl (the mapping table and messages contain the key). That is
why mlockall(MCL_CURRENT|MCL_FUTURE) was the simplest way (and no
problems have been reported yet).
(Not that it is perfect, but it is better than nothing... Of course, more
important is to wipe the memory holding keys after use.)

Milan

2011-05-12 02:13:21

by Linus Torvalds

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Wed, May 11, 2011 at 1:42 AM, Milan Broz <[email protected]> wrote:
>
> Another one is cryptsetup [..]

Quite frankly, all security-related uses should always be happy about
a "MCL_SPARSE" model, since there is no point in ever bringing in
pages that haven't been used. The whole (and only) point of
mlock[all]() for them is the "avoid to push to disk" issue.

I do wonder if we really should ever do the page-in at all. We might
simply be better off always just saying "we'll lock pages you've
touched, that's it".

Linus

2011-05-12 09:06:17

by Zdenek Kabelac

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On 12.5.2011 at 04:12, Linus Torvalds wrote:
> On Wed, May 11, 2011 at 1:42 AM, Milan Broz <[email protected]> wrote:
>>
>> Another one is cryptsetup [..]
>
> Quite frankly, all security-related uses should always be happy about
> a "MCL_SPARSE" model, since there is no point in ever bringing in
> pages that haven't been used. The whole (and only) point of
> mlock[all]() for them is the "avoid to push to disk" issue.
>
> I do wonder if we really should ever do the page-in at all. We might
> simply be better off always just saying "we'll lock pages you've
> touched, that's it".
>


For LVM we need to ensure that any code which might be executed while a disk
is in the suspended state is paged in and locked - so we would need MCL_SPARSE
only for several selected 'unneeded' libraries, as we are obviously not really
able to tell which parts of glibc might be needed across all code paths
(though I guess we could find some limits). But if we are sure that some
libraries and locale files will never be used during the suspend state, then
we do not care about those pages at all.

So it's not that we would need only MCL_SPARSE all the time - we would
probably need some control to switch e.g. glibc into MCL_ALL.

Zdenek

2011-05-15 22:18:28

by Mikulas Patocka

Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up

On Tue, 10 May 2011, Mikulas Patocka wrote:

> On Mon, 9 May 2011, Linus Torvalds wrote:
>
> > On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > Hmm. One thing that strikes me is this problem also implies that the
> > > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> > >
> > > So I think we should not just apply your lock fix, but then *also*
> > > apply something like this:
> >
> > Actually, I think we might be better off with something like this.
> >
> > It makes a few more changes:
> >
> > - move the stack guard page checking in __get_user_pages() into the
> > rare case (ie we didn't find a page), since that's the only case we
> > care about (the thing about the guard page is that don't want to call
> > "handle_mm_fault()"). As a result, it's off any path where we can
> > possibly care about performance, so we might as well have a nice
> > helper function for both the grow-up and grow-down cases, instead of
> > trying to be clever and only look at the grow-down case for the first
> > page in the vma like you did in your patch.
> >
> > End result: simpler, more straightforward code.
> >
> > - Move the growsup/down helper functions to <linux/mm.h>, since the
> > /proc code really wants to use them too. That means that the
> > "vma_stack_continue()" function (which now got split up into two
> > cases, for the up/down cases) is now entirely just an internal helper
> > function - nobody else uses it, and the real interface are the
> > "stack_guard_page_xyz()" functions. Renamed to be simpler.
> >
> > - changed that naming of those stack_guard_page functions to use
> > _start and _end instead of growsup/growsdown, since it actually takes
> > the start or the end of the page as the argument (to match the
> > semantics of the afore-mentioned helpers)
> >
> > - and finally, make /proc/<pid>/maps use these helpers for both the
> > up/down case, so now /proc/self/maps should work well for the growsup
> > case too.
> >
> > Hmm?
> >
> > The only oddish case is IA64 that actually has a stack that grows
> > *both* up and down. That means that I could make up a stack mapping
> > that has a single virtual page in it, that is both the start *and* the
> > end page. Now /proc/self/maps would actually show such a mapping with
> > "negative" size. That's interesting.
> >
> > It would be easy enough to have a "if (end < start) end = start" there
> > for that case, but maybe it's actually interesting information.
> >
> > Regardless, I'd like to hear whether this patch really does work on
> > PA-RISC and especially IA64. I think those are the only cases that
> > have a GROWSUP stack. And the IA64 case that supports both is the most
> > interesting, everybody else does just one or the other.
> >
> > Linus
>
> I will test it after a week, now I'm traveling away.
>
> Mikulas

Hi

I tested 2.6.39-rc7 on PA-RISC and confirm that it works.

Mikulas