Hi Hugh! We're running our tests on latest vanilla kernel all the time,
and recently we've got an issue on restore:
https://github.com/xemul/criu/issues/322
| (00.410614) 4: cg: Cgroups 1 inherited from parent
| (00.410858) 4: Opened local page read 3 (parent 0)
| (00.410961) 4: premap 0x00000000400000-0x00000000406000 -> 00007fe65badf000
| (00.410981) 4: premap 0x00000000605000-0x00000000606000 -> 00007fe65bae5000
| (00.410997) 4: premap 0x00000000606000-0x00000000607000 -> 00007fe65bae6000
| (00.411013) 4: premap 0x000000025a0000-0x000000025c1000 -> 00007fe65bae7000
| (00.411036) 4: Error (criu/mem.c:726): Unable to remap a private vma: Invalid argument
| (00.412779) 1: Error (criu/cr-restore.c:1465): 4 exited, status=1
Andrew has narrowed it down to the commit
| commit 1be7107fbe18eed3e319a6c3e83c78254b693acb
| Author: Hugh Dickins <[email protected]>
| Date: Mon Jun 19 04:03:24 2017 -0700
|
| mm: larger stack guard gap, between vmas
and looking into the patch I see the procfs output has been changed
| diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
| index f0c8b33..520802d 100644
| --- a/fs/proc/task_mmu.c
| +++ b/fs/proc/task_mmu.c
| @@ -300,11 +300,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
|
| /* We don't show the stack guard page in /proc/maps */
| start = vma->vm_start;
| - if (stack_guard_page_start(vma, start))
| - start += PAGE_SIZE;
| end = vma->vm_end;
| - if (stack_guard_page_end(vma, end))
| - end -= PAGE_SIZE;
|
| seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
| seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
For which we of course are not ready because we've been implying the
guard page is returned here so we adjust addresses locally when saving
them into images.
So now we need to figure out somehow if show_map_vma accounts [PAGE_SIZE|guard_area] or not,
I guess we might use kernel version here but it won't be working fine on custom kernels,
or kernels with the patch backported.
Second I guess we might need to detect @stack_guard_gap runtime as
well but not yet sure because we only have found this problem and
hasn't been investigating it deeply yet. Hopefully will do in a
day or couple (I guess we still have some time before the final
kernel release).
Cyrill
On Tue, 20 Jun 2017, Cyrill Gorcunov wrote:
> Hi Hugh! We're running our tests on latest vanilla kernel all the time,
> and recently we've got an issue on restore:
>
> https://github.com/xemul/criu/issues/322
>
> | (00.410614) 4: cg: Cgroups 1 inherited from parent
> | (00.410858) 4: Opened local page read 3 (parent 0)
> | (00.410961) 4: premap 0x00000000400000-0x00000000406000 -> 00007fe65badf000
> | (00.410981) 4: premap 0x00000000605000-0x00000000606000 -> 00007fe65bae5000
> | (00.410997) 4: premap 0x00000000606000-0x00000000607000 -> 00007fe65bae6000
> | (00.411013) 4: premap 0x000000025a0000-0x000000025c1000 -> 00007fe65bae7000
> | (00.411036) 4: Error (criu/mem.c:726): Unable to remap a private vma: Invalid argument
> | (00.412779) 1: Error (criu/cr-restore.c:1465): 4 exited, status=1
>
> Andrew has narrowed it down to the commit
>
> | commit 1be7107fbe18eed3e319a6c3e83c78254b693acb
> | Author: Hugh Dickins <[email protected]>
> | Date: Mon Jun 19 04:03:24 2017 -0700
> |
> | mm: larger stack guard gap, between vmas
>
> and looking into the patch I see the procfs output has been changed
>
> | diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> | index f0c8b33..520802d 100644
> | --- a/fs/proc/task_mmu.c
> | +++ b/fs/proc/task_mmu.c
> | @@ -300,11 +300,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
> |
> | /* We don't show the stack guard page in /proc/maps */
> | start = vma->vm_start;
> | - if (stack_guard_page_start(vma, start))
> | - start += PAGE_SIZE;
> | end = vma->vm_end;
> | - if (stack_guard_page_end(vma, end))
> | - end -= PAGE_SIZE;
> |
> | seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
> | seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
>
> For which we of course are not ready because we've been implying the
> guard page is returned here so we adjust addresses locally when saving
> them into images.
>
> So now we need to figure out somehow if show_map_vma accounts [PAGE_SIZE|guard_area] or not,
> I guess we might use kernel version here but it won't be working fine on custom kernels,
> or kernels with the patch backported.
>
> Second I guess we might need to detect @stack_guard_gap runtime as
> well but not yet sure because we only have found this problem and
> hasn't been investigating it deeply yet. Hopefully will do in a
> day or couple (I guess we still have some time before the final
> kernel release).
Sorry for breaking you: we realized there was some risk of that.
Would it be acceptable to you, to judge which kind of a kernel it is,
by whether it has a global variable stack_guard_gap? I don't know
if that would be a horrible hack, or the kind of thing that you're
used to doing all over the place. Judging by kernel version will
be awkward, since the patch is being backported to stable kernels.
But I'm surprised by your explanation above: maybe I'm confused,
or maybe the explanation is different. Because as I see it, the
change I made in that patch *maintained* consistency for CRIU:
It used to be the case that there was a gap page included in the
extent of the stack vma, but it didn't really belong in there,
therefore show_map_vma() massaged the addresses shown to conceal it.
Whereas now with the 1be7107fbe18 commit, the gap (page or more)
is not included in the extent of the stack vma, so there's no
longer any need to massage the addresses shown to conceal it.
We do need to understand this fairly quickly, since those stable
backports will pose more of a problem for you than the v4.12
release itself.
Hugh
On Tue, Jun 20, 2017 at 03:23:20AM -0700, Hugh Dickins wrote:
>
> Sorry for breaking you: we realized there was some risk of that.
>
> Would it be acceptable to you, to judge which kind of a kernel it is,
> by whether it has a global variable stack_guard_gap? I don't know
> if that would be a horrible hack, or the kind of thing that you're
> used to doing all over the place. Judging by kernel version will
> be awkward, since the patch is being backported to stable kernels.
Wait, maybe we could use VmFlags from /proc/$pid/smaps for that?
I mean we show "gd/gu" flag there is it's stack area. Say we can
add additional flag which would point that we should not delete
guard page from the output. Currently we've in criu
/* Add a guard page only if here is enough space for it */
if ((vma_area->e->flags & MAP_GROWSDOWN) &&
*prev_end < vma_area->e->start)
vma_area->e->start -= PAGE_SIZE; /* Guard page */
So that on the restore we use mmap with MAP_FIXED. Hugh, I'm still
analyzing the problem in criu, maybe this code snippet the only
problem and just lifting up smaps flags will be enough. Just
gimme some more time.
> But I'm surprised by your explanation above: maybe I'm confused,
> or maybe the explanation is different. Because as I see it, the
> change I made in that patch *maintained* consistency for CRIU:
>
> It used to be the case that there was a gap page included in the
> extent of the stack vma, but it didn't really belong in there,
> therefore show_map_vma() massaged the addresses shown to conceal it.
>
> Whereas now with the 1be7107fbe18 commit, the gap (page or more)
> is not included in the extent of the stack vma, so there's no
> longer any need to massage the addresses shown to conceal it.
>
> We do need to understand this fairly quickly, since those stable
> backports will pose more of a problem for you than the v4.12
> release itself.
Seems patches already are in fly for most of distros. So yes,
I'm trying my best right now.
Cyrill
On 06/20, Cyrill Gorcunov wrote:
>
> | diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> | index f0c8b33..520802d 100644
> | --- a/fs/proc/task_mmu.c
> | +++ b/fs/proc/task_mmu.c
> | @@ -300,11 +300,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
> |
> | /* We don't show the stack guard page in /proc/maps */
> | start = vma->vm_start;
> | - if (stack_guard_page_start(vma, start))
> | - start += PAGE_SIZE;
> | end = vma->vm_end;
> | - if (stack_guard_page_end(vma, end))
> | - end -= PAGE_SIZE;
> |
> | seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
> | seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
>
> For which we of course are not ready because we've been implying the
> guard page is returned here so we adjust addresses locally when saving
> them into images.
>
> So now we need to figure out somehow if show_map_vma accounts [PAGE_SIZE|guard_area] or not,
> I guess we might use kernel version here but it won't be working fine on custom kernels,
> or kernels with the patch backported.
You can write a simple test. Just do mmap(MAP_GROWSDOWN) and look at
/proc/self/maps. If it reports vm_start + PAGE_SIZE rather than addr
returned by mmap, then the kernel is old.
> Second I guess we might need to detect @stack_guard_gap runtime as
> well
I do not think so. criu does not need to know about the new guard area
at all. It simply doesn't exist from user-space pov.
In fact, I think this should have been true even before this change, just
stack_guard_page_start() was not accurate and this is the reason (I guess)
you had to play with stack guard; the first page (hidden by show_map_vma)
can have a valid stack data, for example if the application played with
MAP_FIXED or munmap().
So I think you should simply disable, say, unmap_guard_pages() and most
of all other MAP_GROWSDOWN code in criu.
Oleg.
On Tue, Jun 20, 2017 at 12:51:16PM +0200, Oleg Nesterov wrote:
>
> You can write a simple test. Just do mmap(MAP_GROWSDOWN) and look at
> /proc/self/maps. If it reports vm_start + PAGE_SIZE rather than addr
> returned by mmap, then the kernel is old.
>
> > Second I guess we might need to detect @stack_guard_gap runtime as
> > well
>
> I do not think so. criu does not need to know about the new guard area
> at all. It simply doesn't exist from user-space pov.
>
> In fact, I think this should have been true even before this change, just
> stack_guard_page_start() was not accurate and this is the reason (I guess)
> you had to play with stack guard; the first page (hidden by show_map_vma)
> can have a valid stack data, for example if the application played with
> MAP_FIXED or munmap().
>
> So I think you should simply disable, say, unmap_guard_pages() and most
> of all other MAP_GROWSDOWN code in criu.
Looks like a good plan. Thanks Oleg! Gonna try it.
Cyrill
On Tue, Jun 20, 2017 at 12:51:16PM +0200, Oleg Nesterov wrote:
>
> So I think you should simply disable, say, unmap_guard_pages() and most
> of all other MAP_GROWSDOWN code in criu.
I've sent out a series, lets see if it pass the tests.
https://lists.openvz.org/pipermail/criu/2017-June/038205.html
On Tue, Jun 20, 2017 at 03:23:20AM -0700, Hugh Dickins wrote:
...
>
> We do need to understand this fairly quickly, since those stable
> backports will pose more of a problem for you than the v4.12
> release itself.
The patches for criu are on the fly. Still one of the test case
start failing with the new kernels. Basically the test does
the following:
- allocate growsdown memory area
- touch first byte (which before the patch force the kernel
to extend the stack allocating new page)
- touch first-1 byte
---
int main(int argc, char **argv)
{
char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
volatile char *p;
start_addr = mmap(NULL, PAGE_SIZE * 10, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
if (start_addr == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("start_addr %lx\n", start_addr);
munmap(start_addr, PAGE_SIZE * 10);
fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
if (fake_grow_down == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("start_addr %lx\n", fake_grow_down);
p = fake_grow_down;
*p-- = 'c';
*p = 'b';
...
}
---
This start failing because
| static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned long address)
function get dropped off. Hugh, it is done on intent and
userspace programs have to extend stack manually?
Cyrill
On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
> (add Adrian)
>
> On 06/21, Cyrill Gorcunov wrote:
> >
> > The patches for criu are on the fly. Still one of the test case
> > start failing with the new kernels. Basically the test does
> > the following:
>
> Cyrill, please read the last email I sent you in another (private) discussion.
> Most probably you should throw out some tests which assume the kernel has the
> stack-guard-page hack, it was replaced by the stack-guard-hole hack ;)
Yes, thank you.
> > - allocate growsdown memory area
> > - touch first byte (which before the patch force the kernel
> > to extend the stack allocating new page)
> > - touch first-1 byte
> >
> > ---
> > int main(int argc, char **argv)
> > {
> > char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
> > volatile char *p;
> >
> > start_addr = mmap(NULL, PAGE_SIZE * 10, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > if (start_addr == MAP_FAILED) {
> > printf("Can't mal a new region");
> > return 1;
> > }
> > printf("start_addr %lx\n", start_addr);
> > munmap(start_addr, PAGE_SIZE * 10);
> >
> > fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
> > PROT_READ | PROT_WRITE,
> > MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
> > if (fake_grow_down == MAP_FAILED) {
> > printf("Can't mal a new region");
> > return 1;
> > }
> > printf("start_addr %lx\n", fake_grow_down);
> >
> > p = fake_grow_down;
> > *p-- = 'c';
>
> I guess this works? I mean, *p-- = 'c' should not fail...
It fails.
>
> > *p = 'b';
>
> OK, now we need to expand the stack. This can fail or not. This depends on
> whether this vma (created by mmap(MAP_GROWSDOWN) has a stack_guard_gap hole
> between its ->vm_prev.
>
> > function get dropped off. Hugh, it is done on intent and
> > userspace programs have to extend stack manually?
>
> No. a MAP_GROWSDOWN area should grow automatically. Unless the hole between
> vm_prev becomes less than stack_guard_gap.
>
> This is the whole point of guard hole, or guard page we had before. Just the
> previous implementation was not accurate, that is why criu had to have some
> hacks to workaround.
>
> It no longer needs to know about guard hole/page/whatever. Just remove
> (conditionalize) all the MAP_GROWSDOWN code. Except, of course, you still
> need to record MAP_GROWSDOWN in vma_area->e->flags (_vmflag_match), in order
> to restore this vma correctly.
Oleg, look, it seems I've been testing on the wrong VM :) (Sign, so many
opened at once it's easy to forget in which one you're runngin)
Here is the complete code. It supposed to _extend_ stack but it fails
on the latest master + Hugh's [PATCH] mm: fix new crash in unmapped_area_topdown()
---
[root@fc2 criu]# ~/st2
start_addr 7fe6162a8000
start_addr 7fe6163d9000
Segmentation fault (core dumped)
---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#define PAGE_SIZE 4096
int main(int argc, char **argv)
{
char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
volatile char *p;
start_addr = mmap(NULL, PAGE_SIZE * 512, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
if (start_addr == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("start_addr %lx\n", start_addr);
munmap(start_addr, PAGE_SIZE * 512);
start_addr += PAGE_SIZE * 300;
fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
if (fake_grow_down == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("start_addr %lx\n", fake_grow_down);
p = fake_grow_down;
*p-- = 'c';
*p = 'b';
/* overlap the guard page of fake_grow_down */
test_addr = mmap(start_addr + PAGE_SIZE * 3, PAGE_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
if (test_addr == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("test_addr %lx\n", test_addr);
grow_down = mmap(start_addr + PAGE_SIZE * 2, PAGE_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
if (grow_down == MAP_FAILED) {
printf("Can't mal a new region");
return 1;
}
printf("grow_down %lx\n", grow_down);
munmap(test_addr, PAGE_SIZE);
if (fake_grow_down[0] != 'c' || *(fake_grow_down - 1) != 'b') {
printf("%c %c\n", fake_grow_down[0], *(fake_grow_down - 1));
return 1;
}
p = grow_down;
*p-- = 'z';
*p = 'x';
return 0;
}
(add Adrian)
On 06/21, Cyrill Gorcunov wrote:
>
> The patches for criu are on the fly. Still one of the test case
> start failing with the new kernels. Basically the test does
> the following:
Cyrill, please read the last email I sent you in another (private) discussion.
Most probably you should throw out some tests which assume the kernel has the
stack-guard-page hack, it was replaced by the stack-guard-hole hack ;)
> - allocate growsdown memory area
> - touch first byte (which before the patch force the kernel
> to extend the stack allocating new page)
> - touch first-1 byte
>
> ---
> int main(int argc, char **argv)
> {
> char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
> volatile char *p;
>
> start_addr = mmap(NULL, PAGE_SIZE * 10, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> if (start_addr == MAP_FAILED) {
> printf("Can't mal a new region");
> return 1;
> }
> printf("start_addr %lx\n", start_addr);
> munmap(start_addr, PAGE_SIZE * 10);
>
> fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
> PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
> if (fake_grow_down == MAP_FAILED) {
> printf("Can't mal a new region");
> return 1;
> }
> printf("start_addr %lx\n", fake_grow_down);
>
> p = fake_grow_down;
> *p-- = 'c';
I guess this works? I mean, *p-- = 'c' should not fail...
> *p = 'b';
OK, now we need to expand the stack. This can fail or not. This depends on
whether this vma (created by mmap(MAP_GROWSDOWN) has a stack_guard_gap hole
between its ->vm_prev.
> function get dropped off. Hugh, it is done on intent and
> userspace programs have to extend stack manually?
No. a MAP_GROWSDOWN area should grow automatically. Unless the hole between
vm_prev becomes less than stack_guard_gap.
This is the whole point of guard hole, or guard page we had before. Just the
previous implementation was not accurate, that is why criu had to have some
hacks to workaround.
It no longer needs to know about guard hole/page/whatever. Just remove
(conditionalize) all the MAP_GROWSDOWN code. Except, of course, you still
need to record MAP_GROWSDOWN in vma_area->e->flags (_vmflag_match), in order
to restore this vma correctly.
Oleg.
On Wed, Jun 21, 2017 at 06:22:56PM +0300, Cyrill Gorcunov wrote:
> This start failing because
>
> | static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned long address)
>
> function get dropped off. Hugh, it is done on intent and
> userspace programs have to extend stack manually?
Sorry for noise. I forget to add 1M gap for testing.
All is fine!
On 06/21, Cyrill Gorcunov wrote:
>
> On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
> > >
> > > p = fake_grow_down;
> > > *p-- = 'c';
> >
> > I guess this works? I mean, *p-- = 'c' should not fail...
>
> It fails.
Hmm. Impossible ;) could you add the additional printf's to re-check?
> Here is the complete code. It supposed to _extend_ stack but it fails
> on the latest master + Hugh's [PATCH] mm: fix new crash in unmapped_area_topdown()
> ---
> [root@fc2 criu]# ~/st2
> start_addr 7fe6162a8000
> start_addr 7fe6163d9000
> Segmentation fault (core dumped)
> ---
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> #include <sys/mman.h>
>
> #define PAGE_SIZE 4096
>
> int main(int argc, char **argv)
> {
> char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
> volatile char *p;
>
> start_addr = mmap(NULL, PAGE_SIZE * 512, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> if (start_addr == MAP_FAILED) {
> printf("Can't mal a new region");
> return 1;
> }
> printf("start_addr %lx\n", start_addr);
> munmap(start_addr, PAGE_SIZE * 512);
>
> start_addr += PAGE_SIZE * 300;
>
> fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
> PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
> if (fake_grow_down == MAP_FAILED) {
> printf("Can't mal a new region");
> return 1;
> }
> printf("start_addr %lx\n", fake_grow_down);
>
> p = fake_grow_down;
> *p-- = 'c';
once again, I can't believe this STORE can fail...
> *p = 'b';
Ah. I forgot about another kernel "feature" ;) not related to the recent guard
page changes...
Could you test the patch below?
Oleg.
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a0..edc5d68 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
- if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
+if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
On 06/21/2017 08:01 PM, Oleg Nesterov wrote:
> On 06/21, Cyrill Gorcunov wrote:
>>
>> On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
>>>>
>>>> p = fake_grow_down;
>>>> *p-- = 'c';
>>>
>>> I guess this works? I mean, *p-- = 'c' should not fail...
>>
>> It fails.
>
> Hmm. Impossible ;) could you add the additional printf's to re-check?
>
>> Here is the complete code. It supposed to _extend_ stack but it fails
>> on the latest master + Hugh's [PATCH] mm: fix new crash in unmapped_area_topdown()
>> ---
>> [root@fc2 criu]# ~/st2
>> start_addr 7fe6162a8000
>> start_addr 7fe6163d9000
>> Segmentation fault (core dumped)
>> ---
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> #include <sys/mman.h>
>>
>> #define PAGE_SIZE 4096
>>
>> int main(int argc, char **argv)
>> {
>> char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
>> volatile char *p;
>>
>> start_addr = mmap(NULL, PAGE_SIZE * 512, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>> if (start_addr == MAP_FAILED) {
>> printf("Can't mal a new region");
>> return 1;
>> }
>> printf("start_addr %lx\n", start_addr);
>> munmap(start_addr, PAGE_SIZE * 512);
>>
>> start_addr += PAGE_SIZE * 300;
>>
>> fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
>> PROT_READ | PROT_WRITE,
>> MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
>> if (fake_grow_down == MAP_FAILED) {
>> printf("Can't mal a new region");
>> return 1;
>> }
>> printf("start_addr %lx\n", fake_grow_down);
>>
>> p = fake_grow_down;
>> *p-- = 'c';
>
> once again, I can't believe this STORE can fail...
>
>> *p = 'b';
>
> Ah. I forgot about another kernel "feature" ;) not related to the recent guard
> page changes...
>
> Could you test the patch below?
Well, for me only the second store caused segfault.
With your patch it passes..
>
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 8ad91a0..edc5d68 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> * and pusha to work. ("enter $65535, $31" pushes
> * 32 pointers and then decrements %sp by 65535.)
> */
> - if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> +if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> bad_area(regs, error_code, address);
> return;
> }
>
--
Dmitry
Hi Oleg,
On Wed, Jun 21, 2017 at 07:01:29PM +0200, Oleg Nesterov wrote:
> Ah. I forgot about another kernel "feature" ;) not related to the recent guard
> page changes...
>
> Could you test the patch below?
FWIW, I've just checked here on 3.10.107-rc and while I observe the same
regression as Cyrill with his reproducer, I can confirm that your patch
fixes it on this kernel.
Willy
On 06/21, Oleg Nesterov wrote:
>
> On 06/21, Cyrill Gorcunov wrote:
> >
> > On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
> > > >
> > > > p = fake_grow_down;
> > > > *p-- = 'c';
> > >
> > > I guess this works? I mean, *p-- = 'c' should not fail...
> >
> > It fails.
>
> Hmm. Impossible ;) could you add the additional printf's to re-check?
Yes, please...
> > p = fake_grow_down;
> > *p-- = 'c';
>
> once again, I can't believe this STORE can fail...
>
> > *p = 'b';
>
> Ah. I forgot about another kernel "feature" ;) not related to the recent guard
> page changes...
On the second thought it is actually connected, and it seems we really need to
remove this code in do_page_fault... I'll re-check tomorrow, need to run away.
> Could you test the patch below?
>
> Oleg.
>
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 8ad91a0..edc5d68 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> * and pusha to work. ("enter $65535, $31" pushes
> * 32 pointers and then decrements %sp by 65535.)
> */
> - if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> +if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> bad_area(regs, error_code, address);
> return;
> }
On 06/21/2017 08:15 PM, Dmitry Safonov wrote:
> On 06/21/2017 08:01 PM, Oleg Nesterov wrote:
>> On 06/21, Cyrill Gorcunov wrote:
>>>
>>> On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
>>>>>
>>>>> p = fake_grow_down;
>>>>> *p-- = 'c';
>>>>
>>>> I guess this works? I mean, *p-- = 'c' should not fail...
>>>
>>> It fails.
>>
>> Hmm. Impossible ;) could you add the additional printf's to re-check?
>>
>>> Here is the complete code. It supposed to _extend_ stack but it fails
>>> on the latest master + Hugh's [PATCH] mm: fix new crash in
>>> unmapped_area_topdown()
>>> ---
>>> [root@fc2 criu]# ~/st2
>>> start_addr 7fe6162a8000
>>> start_addr 7fe6163d9000
>>> Segmentation fault (core dumped)
>>> ---
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <errno.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <unistd.h>
>>>
>>> #include <sys/mman.h>
>>>
>>> #define PAGE_SIZE 4096
>>>
>>> int main(int argc, char **argv)
>>> {
>>> char *start_addr, *start_addr1, *fake_grow_down, *test_addr,
>>> *grow_down;
>>> volatile char *p;
>>>
>>> start_addr = mmap(NULL, PAGE_SIZE * 512, PROT_READ | PROT_WRITE,
>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>> if (start_addr == MAP_FAILED) {
>>> printf("Can't mal a new region");
>>> return 1;
>>> }
>>> printf("start_addr %lx\n", start_addr);
>>> munmap(start_addr, PAGE_SIZE * 512);
>>>
>>> start_addr += PAGE_SIZE * 300;
>>>
>>> fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
>>> PROT_READ | PROT_WRITE,
>>> MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN,
>>> -1, 0);
>>> if (fake_grow_down == MAP_FAILED) {
>>> printf("Can't mal a new region");
>>> return 1;
>>> }
>>> printf("start_addr %lx\n", fake_grow_down);
>>>
>>> p = fake_grow_down;
>>> *p-- = 'c';
>>
>> once again, I can't believe this STORE can fail...
>>
>>> *p = 'b';
>>
>> Ah. I forgot about another kernel "feature" ;) not related to the
>> recent guard
>> page changes...
>>
>> Could you test the patch below?
>
> Well, for me only the second store caused segfault.
> With your patch it passes..
The only question I have - how is it connected to guard page?
Before the former patch the test has passed. Why?
And this thing seems to be in kernel since 2009.
>
>>
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 8ad91a0..edc5d68 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned
>> long error_code,
>> * and pusha to work. ("enter $65535, $31" pushes
>> * 32 pointers and then decrements %sp by 65535.)
>> */
>> - if (unlikely(address + 65536 + 32 * sizeof(unsigned long) <
>> regs->sp)) {
>> +if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned
>> long) < regs->sp)) {
>> bad_area(regs, error_code, address);
>> return;
>> }
>>
>
--
Dmitry
On 06/21, Dmitry Safonov wrote:
>
> The only question I have - how is it connected to guard page?
Because with stack guard page do_page_fault() almost never needs to
call expand_stack(), thus this check was almost never tested, I guess.
Probably it should go away now.
I'll write the changelog and patch tomorrow, unless someone does this
before.
Oleg.
On 06/21/2017 08:31 PM, Oleg Nesterov wrote:
> On 06/21, Dmitry Safonov wrote:
>>
>> The only question I have - how is it connected to guard page?
>
> Because with stack guard page do_page_fault() almost never needs to
> call expand_stack(), thus this check was almost never tested, I guess.
> Probably it should go away now.
Ah, indeed! Thanks for explanation.
>
> I'll write the changelog and patch tomorrow, unless someone does this
> before.
>
> Oleg.
>
--
Dmitry
On 06/21/2017 08:31 PM, Oleg Nesterov wrote:
> On 06/21, Dmitry Safonov wrote:
>>
>> The only question I have - how is it connected to guard page?
>
> Because with stack guard page do_page_fault() almost never needs to
> call expand_stack(), thus this check was almost never tested, I guess.
> Probably it should go away now.
>
> I'll write the changelog and patch tomorrow, unless someone does this
> before.
Ugh, maybe it's also worth now to update man 2 mmap.
At this moment, mmap() will no more return address one page lower
and "guard" is no more a page:
> MAP_GROWSDOWN
> This flag is used for stacks. It indicates to the kernel virtual
> memory system that the mapping should extend downward in
> memory. The return address is one page lower than the memory
> area that is actually created in the process's virtual address
> space. Touching an address in the "guard" page below the mapping
> will cause the mapping to grow by a page. This growth can be
> repeated until the mapping grows to within a page of the high end
> of the next lower mapping, at which point touching the "guard"
> page will result in a SIGSEGV signal.
CC'ing Michael
--
Dmitry
On Wed, Jun 21, 2017 at 07:15:45PM +0200, Oleg Nesterov wrote:
> On 06/21, Oleg Nesterov wrote:
> >
> > On 06/21, Cyrill Gorcunov wrote:
> > >
> > > On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > p = fake_grow_down;
> > > > > *p-- = 'c';
> > > >
> > > > I guess this works? I mean, *p-- = 'c' should not fail...
> > >
> > > It fails.
> >
> > Hmm. Impossible ;) could you add the additional printf's to re-check?
>
> Yes, please...
I meant of course second store, sorry for confusion :)
>
> > > p = fake_grow_down;
> > > *p-- = 'c';
> >
> > once again, I can't believe this STORE can fail...
> >
> > > *p = 'b';
> >
> > Ah. I forgot about another kernel "feature" ;) not related to the recent guard
> > page changes...
>
> On the second thought it is actually connected, and it seems we really need to
> remove this code in do_page_fault... I'll re-check tomorrow, need to run away.
>
> > Could you test the patch below?
> >
> > Oleg.
> >
> >
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 8ad91a0..edc5d68 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> > * and pusha to work. ("enter $65535, $31" pushes
> > * 32 pointers and then decrements %sp by 65535.)
> > */
> > - if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> > +if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> > bad_area(regs, error_code, address);
> > return;
> > }
It does the trick
[root@fc2 ~]# ~/st2
start_addr 7fce7c3f8000
start_addr 7fce7c529000
test_addr 7fce7c527000
grow_down 7fce7c526000
Cyrill
On Wed, 21 Jun 2017, Dmitry Safonov wrote:
> On 06/21/2017 08:31 PM, Oleg Nesterov wrote:
> > On 06/21, Dmitry Safonov wrote:
> > >
> > > The only question I have - how is it connected to guard page?
> >
> > Because with stack guard page do_page_fault() almost never needs to
> > call expand_stack(), thus this check was almost never tested, I guess.
> > Probably it should go away now.
> >
> > I'll write the changelog and patch tomorrow, unless someone does this
> > before.
>
> Ugh, maybe it's also worth now to update man 2 mmap.
>
> At this moment, mmap() will no more return address one page lower
> and "guard" is no more a page:
>
> > MAP_GROWSDOWN
> > This flag is used for stacks. It indicates to the kernel virtual
> > memory system that the mapping should extend downward in
> > memory. The return address is one page lower than the memory
> > area that is actually created in the process's virtual address
> > space. Touching an address in the "guard" page below the mapping
> > will cause the mapping to grow by a page. This growth can be
> > repeated until the mapping grows to within a page of the high end
> > of the next lower mapping, at which point touching the "guard"
> > page will result in a SIGSEGV signal.
>
> CC'ing Michael
That does go into rather more detail than I like to see: I suppose
the man pages on my machines are rather old, and only show the first
two innocuous sentences about MAP_GROWSDOWN.
Michael, v4.12-rc6 enlarges the stack guard gap from one page to 256
pages (by default). But quite what the man page ought to say will
depend on the outcome of the discussion in the lkp-robot thread.
(Or perhaps it isn't a discussion, but me feeling over-anxious
about how Linus has decided.) Maybe the robot will settle it.
Hugh
On Wed, Jun 21, 2017 at 06:24:18PM -0700, Hugh Dickins wrote:
> >
> > At this moment, mmap() will no more return address one page lower
> > and "guard" is no more a page:
> >
> > > MAP_GROWSDOWN
> > > This flag is used for stacks. It indicates to the kernel virtual
> > > memory system that the mapping should extend downward in
> > > memory. The return address is one page lower than the memory
> > > area that is actually created in the process's virtual address
> > > space. Touching an address in the "guard" page below the mapping
> > > will cause the mapping to grow by a page. This growth can be
> > > repeated until the mapping grows to within a page of the high end
> > > of the next lower mapping, at which point touching the "guard"
> > > page will result in a SIGSEGV signal.
> >
> > CC'ing Michael
>
> That does go into rather more detail than I like to see: I suppose
> the man pages on my machines are rather old, and only show the first
> two innocuous sentences about MAP_GROWSDOWN.
On my fedora24 too :) And I rather wonder why guard page is mentioned
here, because as far as I remember the only "cut off a guard page"
case was proc/$pid/[s]maps output. mmap() returned exact vma's start
address all the time, isn't it? Where the guard page is rather an
internal kernel's handling of page faults for growsdown areas.
> Michael, v4.12-rc6 enlarges the stack guard gap from one page to 256
> pages (by default). But quite what the man page ought to say will
> depend on the outcome of the discussion in the lkp-robot thread.
> (Or perhaps it isn't a discussion, but me feeling over-anxious
> about how Linus has decided.) Maybe the robot will settle it.
Cyrill
Cyrill,
I am replying to my own email because I got lost in numerous threads/emails
connected to stack guard/gap problems. IIRC you confirmed that the 1st load
doesn't fail and the patch fixes the problem. So everything is clear, and we
will discuss this change in another thread.
But let me add that (imo) you should not change this test-case. You simply
should not run it if kerndat_mm_guard_page_maps() detects the new kernel at
startup.
The new version makes no sense for criu, afaics. Yes, yes, thank you very
much for this test-case, it found the kernel regression ;) But criu has
nothing to do with this problem, and it is not clear right now if we are
going to fix it or not.
With the recent kernel changes criu should never look outside of start-end
region reported by /proc/maps; and restore doesn't even need to know if a
GROWSDOWN region will actually grow or not, because (iiuc) you do not need
to auto-grow the stack vma during restore, criu re-creates the whole vma
with the same length using MAP_FIXED and it should never write below the
addr returned by mmap(MAP_FIXED).
So (afaics) the only complication is that the process can be dumped on
a system running with (say) stack_guard_gap=4K kernel parameter, and then
restored on another system running with stack_guard_gap=1M. In this case
the application may fail after restore if it tries to auto-grow the stack,
but this is unlikely and this is another story.
Oleg.
On 06/21, Oleg Nesterov wrote:
>
> On 06/21, Cyrill Gorcunov wrote:
> >
> > On Wed, Jun 21, 2017 at 05:57:30PM +0200, Oleg Nesterov wrote:
> > > >
> > > > p = fake_grow_down;
> > > > *p-- = 'c';
> > >
> > > I guess this works? I mean, *p-- = 'c' should not fail...
> >
> > It fails.
>
> Hmm. Impossible ;) could you add the additional printf's to re-check?
>
> > Here is the complete code. It supposed to _extend_ stack but it fails
> > on the latest master + Hugh's [PATCH] mm: fix new crash in unmapped_area_topdown()
> > ---
> > [root@fc2 criu]# ~/st2
> > start_addr 7fe6162a8000
> > start_addr 7fe6163d9000
> > Segmentation fault (core dumped)
> > ---
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <errno.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> >
> > #include <sys/mman.h>
> >
> > #define PAGE_SIZE 4096
> >
> > int main(int argc, char **argv)
> > {
> > char *start_addr, *start_addr1, *fake_grow_down, *test_addr, *grow_down;
> > volatile char *p;
> >
> > start_addr = mmap(NULL, PAGE_SIZE * 512, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > if (start_addr == MAP_FAILED) {
> > printf("Can't mal a new region");
> > return 1;
> > }
> > printf("start_addr %lx\n", start_addr);
> > munmap(start_addr, PAGE_SIZE * 512);
> >
> > start_addr += PAGE_SIZE * 300;
> >
> > fake_grow_down = mmap(start_addr + PAGE_SIZE * 5, PAGE_SIZE,
> > PROT_READ | PROT_WRITE,
> > MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
> > if (fake_grow_down == MAP_FAILED) {
> > printf("Can't mal a new region");
> > return 1;
> > }
> > printf("start_addr %lx\n", fake_grow_down);
> >
> > p = fake_grow_down;
> > *p-- = 'c';
>
> once again, I can't believe this STORE can fail...
>
> > *p = 'b';
>
> Ah. I forgot about another kernel "feature" ;) not related to the recent guard
> page changes...
>
> Could you test the patch below?
>
> Oleg.
>
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 8ad91a0..edc5d68 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1416,7 +1416,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> * and pusha to work. ("enter $65535, $31" pushes
> * 32 pointers and then decrements %sp by 65535.)
> */
> - if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> +if (0) if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> bad_area(regs, error_code, address);
> return;
> }
On Thu, Jun 22, 2017 at 04:23:00PM +0200, Oleg Nesterov wrote:
> Cyrill,
>
> I am replying to my own email because I got lost in numerous threads/emails
> connected to stack guard/gap problems. IIRC you confirmed that the 1st load
> doesn't fail and the patch fixes the problem. So everything is clear, and we
> will discuss this change in another thread.
Yes.
> But let me add that (imo) you should not change this test-case. You simply
> should not run it if kerndat_mm_guard_page_maps() detects the new kernel at
> startup.
>
> The new version makes no sense for criu, afaics. Yes, yes, thank you very
> much for this test-case, it found the kernel regression ;) But criu has
> nothing to do with this problem, and it is not clear right now if we are
> going to fix it or not.
To be fair the first reporter is Andrew Vagin :) He wrote the test and
poked me to look into. If we're not going to fix it in the kernel then
sure -- we won't run it on new kernels (hell knows though, what else
application may fail, as Linus pointed it's perfectly valid to map and
autogrow the vma).
> With the recent kernel changes criu should never look outside of start-end
> region reported by /proc/maps; and restore doesn't even need to know if a
> GROWSDOWN region will actually grow or not, because (iiuc) you do not need
> to auto-grow the stack vma during restore, criu re-creates the whole vma
> with the same length using MAP_FIXED and it should never write below the
> addr returned by mmap(MAP_FIXED).
Yes, and we already do, thanks.
> So (afaics) the only complication is that the process can be dumped on
> a system running with (say) stack_guard_gap=4K kernel parameter, and then
> restored on another system running with stack_guard_gap=1M. In this case
> the application may fail after restore if it tries to auto-grow the stack,
> but this is unlikely and this is another story.
Yes, it's different problem and it would be cool to be able to fetch this
value somehow (maybe via sysfs or something). Otherwise if such container
migration case happen we simply find error code in the restore log and
I fear it won't be clear that the error happened exactly because of
gap settings variation.