2011-02-24 05:39:52

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] mm: fix possible cause of a page_mapped BUG

Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
a hole with madvise(,, MADV_REMOVE). That path is under mutex, and
cannot be explained by lack of serialization in unmap_mapping_range().

Reviewing the code, I found one place where vm_truncate_count handling
should have been updated, when I switched at the last minute from one
way of managing the restart_addr to another: mremap move changes the
virtual addresses, so it ought to adjust the restart_addr.

But rather than exporting the notion of restart_addr from memory.c, or
converting to restart_pgoff throughout, simply reset vm_truncate_count
to 0 to force a rescan if mremap move races with preempted truncation.

We have no confirmation that this fixes Robert's BUG,
but it is a fix that's worth making anyway.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mremap.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

--- 2.6.38-rc6/mm/mremap.c 2011-01-18 22:04:56.000000000 -0800
+++ linux/mm/mremap.c 2011-02-23 15:29:52.000000000 -0800
@@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
*/
mapping = vma->vm_file->f_mapping;
spin_lock(&mapping->i_mmap_lock);
- if (new_vma->vm_truncate_count &&
- new_vma->vm_truncate_count != vma->vm_truncate_count)
- new_vma->vm_truncate_count = 0;
+ new_vma->vm_truncate_count = 0;
}

/*


2011-02-28 23:35:42

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

> But rather than exporting the notion of restart_addr from memory.c, or
> converting to restart_pgoff throughout, simply reset vm_truncate_count
> to 0 to force a rescan if mremap move races with preempted truncation.
>
> We have no confirmation that this fixes Robert's BUG,
> but it is a fix that's worth making anyway.

Hi, I don't have currently access to my rs232/console testing machine
(lame excuse but it helps a lot;), cause I'm working currently OOtO,
so I'll try to test it asap - which is probably Mar 15th or so.

Btw, the fuzzer is here: http://code.google.com/p/iknowthis/

I think i was trying it with this revision:
http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
newest 'iknowthis' supports x86-64 natively), so feel free to try it.

It used to crash the machine (it's BUG_ON but the system became
unusable) in matter of hours. Btw, when I was testing it for the last
time it Ooopsed much more frequently in proc_readdir (I sent report in
one of earliet e-mails).

> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
>  mm/mremap.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- 2.6.38-rc6/mm/mremap.c      2011-01-18 22:04:56.000000000 -0800
> +++ linux/mm/mremap.c   2011-02-23 15:29:52.000000000 -0800
> @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
>                 */
>                mapping = vma->vm_file->f_mapping;
>                spin_lock(&mapping->i_mmap_lock);
> -               if (new_vma->vm_truncate_count &&
> -                   new_vma->vm_truncate_count != vma->vm_truncate_count)
> -                       new_vma->vm_truncate_count = 0;
> +               new_vma->vm_truncate_count = 0;
>        }
>
>        /*
>



--
Robert Święcki

2011-03-17 15:40:52

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

Hi,

On Tue, Mar 1, 2011 at 12:35 AM, Robert Święcki <[email protected]> wrote:
>> But rather than exporting the notion of restart_addr from memory.c, or
>> converting to restart_pgoff throughout, simply reset vm_truncate_count
>> to 0 to force a rescan if mremap move races with preempted truncation.
>>
>> We have no confirmation that this fixes Robert's BUG,
>> but it is a fix that's worth making anyway.
>
> Hi, I don't have currently access to my rs232/console testing machine
> (lame excuse but it helps a lot;), cause I'm working currently OOtO,
> so I'll try to test it asap - which is probably Mar 15th or so.

So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
problems, and never seen anything about mremap in 2.6.38 (yet), as it
had been happening in 2.6.37-rc2. The output goes to
http://alt.swiecki.net/linux_kernel/ - I'm still trying.

> Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
>
> I think i was trying it with this revision:
> http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
> newest 'iknowthis' supports x86-64 natively), so feel free to try it.
>
> It used to crash the machine (it's BUG_ON but the system became
> unusable) in matter of hours. Btw, when I was testing it for the last
> time it Ooopsed much more frequently in proc_readdir (I sent report in
> one of earliet e-mails).
>
>> Signed-off-by: Hugh Dickins <[email protected]>
>> ---
>>
>>  mm/mremap.c |    4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> --- 2.6.38-rc6/mm/mremap.c      2011-01-18 22:04:56.000000000 -0800
>> +++ linux/mm/mremap.c   2011-02-23 15:29:52.000000000 -0800
>> @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
>>                 */
>>                mapping = vma->vm_file->f_mapping;
>>                spin_lock(&mapping->i_mmap_lock);
>> -               if (new_vma->vm_truncate_count &&
>> -                   new_vma->vm_truncate_count != vma->vm_truncate_count)
>> -                       new_vma->vm_truncate_count = 0;
>> +               new_vma->vm_truncate_count = 0;
>>        }
>>
>>        /*
>>
>
>
>
> --
> Robert Święcki
>



--
Robert Święcki

2011-03-19 05:35:04

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Thu, 17 Mar 2011, Robert Swiecki wrote:
> On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <[email protected]> wrote:
>
> So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
> problems, and never seen anything about mremap in 2.6.38 (yet),

Thanks a lot for getting back to this, Robert, and thanks for the update.
I won't be celebrating, but this sounds like good news for my mremap patch.

> as it had been happening in 2.6.37-rc2. The output goes to
> http://alt.swiecki.net/linux_kernel/ - I'm still trying.

A problem in sys_mlock: I've Cc'ed Michel who is the current expert.

A problem in sys_munlock: Michel again, except vma_prio_tree_add is
implicated, and I used to be involved with that. I've appended below
a debug patch which I wrote years ago, and have largely forgotten, but
Andrew keeps it around in mmotm: we might learn more if you add that
into your kernel build.

A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir.
I did spend a while looking into that when you first reported it.
I'm pretty sure, from the register values, that it's a result of
a pid number (in some places signed int, in some places unsigned)
getting unexpectedly sign-extended to negative, so indexing before
the beginning of an array; but I never tracked down the root of the
problem, and failed to reproduce it with odd lseeks on the directory.

Ah, the one you report now comes from compat_sys_getdents,
whereas the original one came from compat_sys_old_readdir: okay,
I had been wondering whether it was peculiar to the old_readdir case,
but no, it's reproduced with getdents too. Might be peculiar to compat.

Anyway, I've Cc'ed Eric who will be the best for that one.

And a couple of watchdog problems: I haven't even glanced at
those, hope someone else can suggest a good way forward on them.

Hugh

>
> > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
> >
> > I think i was trying it with this revision:
> > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
> > newest 'iknowthis' supports x86-64 natively), so feel free to try it.
> >
> > It used to crash the machine (it's BUG_ON but the system became
> > unusable) in matter of hours. Btw, when I was testing it for the last
> > time it Ooopsed much more frequently in proc_readdir (I sent report in
> > one of earliet e-mails).

From: Hugh Dickins <[email protected]>

Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446),
and one was sighted a couple of years ago. No reason yet to suppose
they're prio_tree bugs, but we can't tell much about them without seeing
the vmas.

So dump vma and the one it's supposed to resemble: I had expected to use
print_hex_dump(), but that's designed for u8 dumps, whereas almost every
field of vm_area_struct is either a pointer or an unsigned long - which
look nonsense dumped as u8s.

Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep
this vma out of the tree (truncation and swapout won't be able to find it).
How safe this is depends on what the error really is; but we hold a file's
i_mmap_lock here, so it may be impossible to recover from BUG_ON.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: Jayson Santos <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/prio_tree.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c
--- a/mm/prio_tree.c~prio_tree-debugging-patch
+++ a/mm/prio_tree.c
@@ -67,6 +67,20 @@
* vma->shared.vm_set.head == NULL ==> a list node
*/

+static void dump_vma(struct vm_area_struct *vma)
+{
+ void **ptr = (void **) vma;
+ int i;
+
+ printk("vm_area_struct at %p:", ptr);
+ for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) {
+ if (!(i & 3))
+ printk("\n");
+ printk(" %p", *ptr);
+ }
+ printk("\n");
+}
+
/*
* Add a new vma known to map the same set of pages as the old vma:
* useful for fork's dup_mmap as well as vma_prio_tree_insert below.
@@ -74,14 +88,23 @@
*/
void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
{
- /* Leave these BUG_ONs till prio_tree patch stabilizes */
- BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
- BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
-
vma->shared.vm_set.head = NULL;
vma->shared.vm_set.parent = NULL;

- if (!old->shared.vm_set.parent)
+ if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) ||
+ HEAP_INDEX(vma) != HEAP_INDEX(old))) {
+ /*
+ * This should never happen, yet it has been seen a few times:
+ * we cannot say much about it without seeing the vma contents.
+ */
+ dump_vma(vma);
+ dump_vma(old);
+ /*
+ * Don't try to link this (corrupt?) vma into the (corrupt?)
+ * prio_tree, but arrange for its removal to succeed later.
+ */
+ INIT_LIST_HEAD(&vma->shared.vm_set.list);
+ } else if (!old->shared.vm_set.parent)
list_add(&vma->shared.vm_set.list,
&old->shared.vm_set.list);
else if (old->shared.vm_set.head)

2011-04-01 14:34:33

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Sat, Mar 19, 2011 at 6:34 AM, Hugh Dickins <[email protected]> wrote:
> On Thu, 17 Mar 2011, Robert Swiecki wrote:
>> On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <[email protected]> wrote:
>>
>> So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
>> problems, and never seen anything about mremap in 2.6.38 (yet),
>
> Thanks a lot for getting back to this, Robert, and thanks for the update.
> I won't be celebrating, but this sounds like good news for my mremap patch.
>
>> as it had been happening in 2.6.37-rc2. The output goes to
>> http://alt.swiecki.net/linux_kernel/ - I'm still trying.
>
> A problem in sys_mlock: I've Cc'ed Michel who is the current expert.
>
> A problem in sys_munlock: Michel again, except vma_prio_tree_add is
> implicated, and I used to be involved with that.  I've appended below
> a debug patch which I wrote years ago, and have largely forgotten, but
> Andrew keeps it around in mmotm: we might learn more if you add that
> into your kernel build.

Hey, I'll apply your patch and check it out. In the meantime I
triggered another Oops (NULL-ptr deref via sys_mprotect).

The oops is here:

http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt

> A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir.
> I did spend a while looking into that when you first reported it.
> I'm pretty sure, from the register values, that it's a result of
> a pid number (in some places signed int, in some places unsigned)
> getting unexpectedly sign-extended to negative, so indexing before
> the beginning of an array; but I never tracked down the root of the
> problem, and failed to reproduce it with odd lseeks on the directory.
>
> Ah, the one you report now comes from compat_sys_getdents,
> whereas the original one came from compat_sys_old_readdir: okay,
> I had been wondering whether it was peculiar to the old_readdir case,
> but no, it's reproduced with getdents too.  Might be peculiar to compat.
>
> Anyway, I've Cc'ed Eric who will be the best for that one.
>
> And a couple of watchdog problems: I haven't even glanced at
> those, hope someone else can suggest a good way forward on them.
>
> Hugh
>
>>
>> > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
>> >
>> > I think i was trying it with this revision:
>> > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
>> > newest 'iknowthis' supports x86-64 natively), so feel free to try it.
>> >
>> > It used to crash the machine (it's BUG_ON but the system became
>> > unusable) in matter of hours. Btw, when I was testing it for the last
>> > time it Ooopsed much more frequently in proc_readdir (I sent report in
>> > one of earliet e-mails).
>
> From: Hugh Dickins <[email protected]>
>
> Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446),
> and one was sighted a couple of years ago.  No reason yet to suppose
> they're prio_tree bugs, but we can't tell much about them without seeing
> the vmas.
>
> So dump vma and the one it's supposed to resemble: I had expected to use
> print_hex_dump(), but that's designed for u8 dumps, whereas almost every
> field of vm_area_struct is either a pointer or an unsigned long - which
> look nonsense dumped as u8s.
>
> Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep
> this vma out of the tree (truncation and swapout won't be able to find it).
>  How safe this is depends on what the error really is; but we hold a file's
> i_mmap_lock here, so it may be impossible to recover from BUG_ON.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: Jayson Santos <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
>  mm/prio_tree.c |   33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c
> --- a/mm/prio_tree.c~prio_tree-debugging-patch
> +++ a/mm/prio_tree.c
> @@ -67,6 +67,20 @@
>  *     vma->shared.vm_set.head == NULL ==> a list node
>  */
>
> +static void dump_vma(struct vm_area_struct *vma)
> +{
> +       void **ptr = (void **) vma;
> +       int i;
> +
> +       printk("vm_area_struct at %p:", ptr);
> +       for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) {
> +               if (!(i & 3))
> +                       printk("\n");
> +               printk(" %p", *ptr);
> +       }
> +       printk("\n");
> +}
> +
>  /*
>  * Add a new vma known to map the same set of pages as the old vma:
>  * useful for fork's dup_mmap as well as vma_prio_tree_insert below.
> @@ -74,14 +88,23 @@
>  */
>  void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
>  {
> -       /* Leave these BUG_ONs till prio_tree patch stabilizes */
> -       BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
> -       BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
> -
>        vma->shared.vm_set.head = NULL;
>        vma->shared.vm_set.parent = NULL;
>
> -       if (!old->shared.vm_set.parent)
> +       if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) ||
> +                   HEAP_INDEX(vma)  != HEAP_INDEX(old))) {
> +               /*
> +                * This should never happen, yet it has been seen a few times:
> +                * we cannot say much about it without seeing the vma contents.
> +                */
> +               dump_vma(vma);
> +               dump_vma(old);
> +               /*
> +                * Don't try to link this (corrupt?) vma into the (corrupt?)
> +                * prio_tree, but arrange for its removal to succeed later.
> +                */
> +               INIT_LIST_HEAD(&vma->shared.vm_set.list);
> +       } else if (!old->shared.vm_set.parent)
>                list_add(&vma->shared.vm_set.list,
>                                &old->shared.vm_set.list);
>        else if (old->shared.vm_set.head)
>



--
Robert Święcki

2011-04-01 15:52:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <[email protected]> wrote:
>
> Hey, I'll apply your patch and check it out. In the meantime I
> triggered another Oops (NULL-ptr deref via sys_mprotect).
>
> The oops is here:
>
> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt

That's not a NULL pointer dereference. That's a BUG_ON().

And for some reason you've turned off the BUG_ON() messages, saving
some tiny amount of memory.

Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
would be this one:

BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));

but it is possible that gcc has shuffled things around (so it _might_
be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
get a filename and line number. One reason I hate -O2 in cases like
this is that the basic block movement makes it way harder to actually
debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
or whatever it's called).

Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh?

I do wonder what triggers this. Is it a huge-page vma? We seem to be
lacking the check to see that mprotect() is on a hugepage boundary -
and that seems bogus. Or am I missing some check? The new transparent
hugepage support splits the page, but what if it's a _static_ hugepage
thing?

But why would that affect the radix_index thing? I have no idea. I'd
like to blame the anon_vma rewrites last year, but I can't see why
that should matter either. Again, hugepages had some special rules, I
think (and that would explain why nobody normal sees this).

Guys, please give this one a look.

Linus

2011-04-01 16:21:26

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Fri, Apr 1, 2011 at 5:44 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <[email protected]> wrote:
>>
>> Hey, I'll apply your patch and check it out. In the meantime I
>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>
>> The oops is here:
>>
>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>
> That's not a NULL pointer dereference. That's a BUG_ON().
>
> And for some reason you've turned off the BUG_ON() messages, saving
> some tiny amount of memory.

Is it possible to turn it off via config flags? Looking into
arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
always manifests itself somehow") and I have
CONFIG_DEBUG_BUGVERBOSE=y.

This BUG/Oopps was triggered before I applied Hugh's patch on a vanilla kernel.

Anything that could help you debugging this? Uploading kernel image
(unfortunately I've overwritten this one), dumping more kgdb data? I
must admit I'm not up-to-date with current linux kernel debugging
techniques. The kernel config is here:
http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt

For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
> would be this one:
>
>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>
> but it is possible that gcc has shuffled things around (so it _might_
> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
> get a filename and line number. One reason I hate -O2 in cases like
> this is that the basic block movement makes it way harder to actually
> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
> or whatever it's called).
>
> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>
> I do wonder what triggers this. Is it a huge-page vma? We seem to be
> lacking the check to see that mprotect() is on a hugepage boundary -
> and that seems bogus. Or am I missing some check? The new transparent
> hugepage support splits the page, but what if it's a _static_ hugepage
> thing?
>
> But why would that affect the radix_index thing? I have no idea. I'd
> like to blame the anon_vma rewrites last year, but I can't see why
> that should matter either. Again, hugepages had some special rules, I
> think (and that would explain why nobody normal sees this).
>
> Guys, please give this one a look.
>
>                          Linus
>



--
Robert Święcki

2011-04-01 16:36:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <[email protected]> wrote:
>
> Is it possible to turn it off via config flags? Looking into
> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
> always manifests itself somehow") and I have
> CONFIG_DEBUG_BUGVERBOSE=y.

Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.

Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
be surprised. It's not the first time I've seen debugging features
that just make debugging a mess.

> Anything that could help you debugging this? Uploading kernel image
> (unfortunately I've overwritten this one), dumping more kgdb data?

So in this case kgdb just dropped the most important data on the floor.

But if you have kdb active next time, print out the vma/old contents
in that function that has the BUG() in it.

> I must admit I'm not up-to-date with current linux kernel debugging
> techniques. The kernel config is here:
> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>
> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

Oh, don't do that. -O0 makes the code totally unreadable (the compiler
just does _stupid_ things, making the asm code look so horrible that
you can't match it up against anything sane), and -fno-inline isn't
worth the pain either.

-Os is much better than those.

But in this case, just getting the filename and line number would have
made the thing moot anyway - without kdb it _should_ have said
something clear like

kernel BUG at %s:%u!

where %s:%u is the filename and line number.

Linus

2011-04-02 01:46:18

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <[email protected]> wrote:
>>
>> Hey, I'll apply your patch and check it out. In the meantime I
>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>
>> The oops is here:
>>
>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>
> That's not a NULL pointer dereference. That's a BUG_ON().
>
> And for some reason you've turned off the BUG_ON() messages, saving
> some tiny amount of memory.
>
> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
> would be this one:
>
>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>
> but it is possible that gcc has shuffled things around (so it _might_
> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
> get a filename and line number. One reason I hate -O2 in cases like
> this is that the basic block movement makes it way harder to actually
> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
> or whatever it's called).
>
> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>
> I do wonder what triggers this. Is it a huge-page vma? We seem to be
> lacking the check to see that mprotect() is on a hugepage boundary -
> and that seems bogus. Or am I missing some check? The new transparent
> hugepage support splits the page, but what if it's a _static_ hugepage
> thing?
>
> But why would that affect the radix_index thing? I have no idea. I'd
> like to blame the anon_vma rewrites last year, but I can't see why
> that should matter either. Again, hugepages had some special rules, I
> think (and that would explain why nobody normal sees this).
>
> Guys, please give this one a look.

I do intend to look, but I think it makes more sense to wait until
Robert has reproduced it (or something like it) with my debugging
patch in.

He's fuzzing, so no reason to get anxious about recent changes, it may
have been lurking there for years. He did already report a
vma_prio_tree_add() crash, which led me to send him the patch: so the
issue seems to be reproducible, and the patch dumps out the
vma_area_structs involved, which has some hope of telling us more.

Down the years we've had about three earlier reports of crashes there:
so rare we've tended to put them down to bad memory or slab
corruption, but never had anything like a reproducible case to study
until now.

Hugh

2011-04-02 04:01:47

by Hui Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Sat, Apr 2, 2011 at 00:35, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <[email protected]> wrote:
>>
>> Is it possible to turn it off via config flags? Looking into
>> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
>> always manifests itself somehow") and I have
>> CONFIG_DEBUG_BUGVERBOSE=y.
>
> Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.
>
> Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
> be surprised. It's not the first time I've seen debugging features
> that just make debugging a mess.
>
>> Anything that could help you debugging this? Uploading kernel image
>> (unfortunately I've overwritten this one), dumping more kgdb data?
>
> So in this case kgdb just dropped the most important data on the floor.
>
> But if you have kdb active next time, print out the vma/old contents
> in that function that has the BUG() in it.
>
>> I must admit I'm not up-to-date with current linux kernel debugging
>> techniques. The kernel config is here:
>> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>>
>> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

Hi Robert,

I am not sure you can success with build trunk with -O0 -fno-inline.
I suggest you try the patch in
http://code.google.com/p/kgtp/downloads/detail?name=co.patch.
It add a option in "Kernel hacking" called "Compile with almost no
optimization". It will make kernel be built without -O2. It support
x86_32, x86_64 and arm.

PS, maybe you can try kgtp (https://code.google.com/p/kgtp/) debug your kernel.

Thanks,
Hui

>
> Oh, don't do that. -O0 makes the code totally unreadable (the compiler
> just does _stupid_ things, making the asm code look so horrible that
> you can't match it up against anything sane), and -fno-inline isn't
> worth the pain either.
>
> -Os is much better than those.
>
> But in this case, just getting the filename and line number would have
> made the thing moot anyway - without kdb it _should_ have said
> something clear like
>
>   kernel BUG at %s:%u!
>
> where %s:%u is the filename and line number.
>
>                          Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

2011-04-04 12:46:29

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <[email protected]> wrote:
> On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <[email protected]> wrote:
>>>
>>> Hey, I'll apply your patch and check it out. In the meantime I
>>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>>
>>> The oops is here:
>>>
>>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>>
>> That's not a NULL pointer dereference. That's a BUG_ON().
>>
>> And for some reason you've turned off the BUG_ON() messages, saving
>> some tiny amount of memory.
>>
>> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
>> would be this one:
>>
>>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>>
>> but it is possible that gcc has shuffled things around (so it _might_
>> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
>> get a filename and line number. One reason I hate -O2 in cases like
>> this is that the basic block movement makes it way harder to actually
>> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
>> or whatever it's called).
>>
>> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>>
>> I do wonder what triggers this. Is it a huge-page vma? We seem to be
>> lacking the check to see that mprotect() is on a hugepage boundary -
>> and that seems bogus. Or am I missing some check? The new transparent
>> hugepage support splits the page, but what if it's a _static_ hugepage
>> thing?
>>
>> But why would that affect the radix_index thing? I have no idea. I'd
>> like to blame the anon_vma rewrites last year, but I can't see why
>> that should matter either. Again, hugepages had some special rules, I
>> think (and that would explain why nobody normal sees this).
>>
>> Guys, please give this one a look.
>
> I do intend to look, but I think it makes more sense to wait until
> Robert has reproduced it (or something like it) with my debugging
> patch in.

Hi Hugh,

I did two things, included your patch, and compiled with
CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
~2 days under fuzzing (with getdents and readdir syscalls disabled in
the fuzzer). I don't think -Os has any bigger influence on how mm
internally works therefore I must attribute the change to your patch
(was it patch which fixes something or merely dumps vma structures in
case of any problem?).


==============
BTW, another problem arose in the meantime, not sure if anyhow related
to things we're discussing here, although 'btc 0' in kdb shows that
processor 0 hangs in sys_mlock - I did it in two different moments, to
exclude any coincidences. After those 2 days of fuzzing, 'ps wuax'
stopped working, i.e. it prints some output, then stops, cannot be
killed with -SIGKILL etc. I'll let it run for the time being, I can
dump more data in this PID 17750 if anybody wants:

strace:

# strace -f ps wwuax
....
open("/proc/17750/status", O_RDONLY) = 6
read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777
close(6) = 0
open("/proc/17750/cmdline", O_RDONLY) = 6
read(6,

Process 17750 also cannot be killed. Attaching more data:

# cat /proc/17750/status
Name: iknowthis
State: D (disk sleep)
Tgid: 17750
Pid: 17750
PPid: 1
TracerPid: 0
Uid: 1001 1001 1001 1001
Gid: 1001 1001 1001 1001
FDSize: 64
Groups: 1001
VmPeak: 7752 kB
VmSize: 5760 kB
VmLck: 32 kB
VmHWM: 4892 kB
VmRSS: 3068 kB
VmData: 2472 kB
VmStk: 408 kB
VmExe: 160 kB
VmLib: 2684 kB
VmPTE: 44 kB
VmSwap: 0 kB
Threads: 1
SigQ: 218/16382
SigPnd: 0000000000000b00
ShdPnd: 0000400000000503
SigBlk: 0000000000000000
SigIgn: 0000000001001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 01
Cpus_allowed_list: 0
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 43330
nonvoluntary_ctxt_switches: 4436

# cat /proc/17750/wchan
call_rwsem_down_write_failed

# cat /proc/17750/maps (hangs)

(from kdb)

[0]kdb> btp 17750
Stack traceback for pid 17750
0xffff88011e772dc0 17750 1 0 0 D 0xffff88011e773240 iknowthis
<c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c>
ffff88011cbcffd8<c>
<c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c>
ffff88011e773178<c>
<c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c>
0000000000013f00<c>
Call Trace:
[<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d
[<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15
[<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff81e27da0>] ? down_write+0x25/0x27
[<ffffffff8115f041>] do_coredump+0x14f/0x9a5
[<ffffffff8114d4e2>] ? T.1006+0x17/0x32
[<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f
[<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42
[<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6
[<ffffffff8103e0c1>] do_signal+0x72/0x67d
[<ffffffff81096807>] ? child_wait_callback+0x0/0x58
[<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41
[<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9
[<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68
[<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80
[<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e

[0]kdb> btc 0
Stack traceback for pid 10350
0xffff88011b6badc0 10350 1 1 0 R 0xffff88011b6bb240 *iknowthis2
<c> ffff8800cfc03db8<c> 0000000000000000<c>
Call Trace:
<#DB> <<EOE>> <IRQ> [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c
[<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e
[<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2
[<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2
[<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111
[<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150
[<ffffffff810ea044>] ? move_native_irq+0x19/0x6d
[<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f
[<ffffffff8104198f>] ? handle_irq+0x88/0x91
[<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3
[<ffffffff81e29193>] ? ret_from_intr+0x0/0x15
<EOI> [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad
[<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad
[<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118
[<ffffffff81134002>] ? sys_mlock+0xe8/0xf6
[<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5

[0]kdb> btc 1
Stack traceback for pid 9409
0xffff88011c9816e0 9409 1 1 1 R 0xffff88011c981b60 iknowthis2
<c> ffff88011dc21ec8

--
Robert Święcki

2011-04-04 13:02:11

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Sat, Apr 2, 2011 at 6:01 AM, Hui Zhu <[email protected]> wrote:
> On Sat, Apr 2, 2011 at 00:35, Linus Torvalds
> <[email protected]> wrote:
>> On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <[email protected]> wrote:
>>>
>>> Is it possible to turn it off via config flags? Looking into
>>> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
>>> always manifests itself somehow") and I have
>>> CONFIG_DEBUG_BUGVERBOSE=y.
>>
>> Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.
>>
>> Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
>> be surprised. It's not the first time I've seen debugging features
>> that just make debugging a mess.
>>
>>> Anything that could help you debugging this? Uploading kernel image
>>> (unfortunately I've overwritten this one), dumping more kgdb data?
>>
>> So in this case kgdb just dropped the most important data on the floor.
>>
>> But if you have kdb active next time, print out the vma/old contents
>> in that function that has the BUG() in it.
>>
>>> I must admit I'm not up-to-date with current linux kernel debugging
>>> techniques. The kernel config is here:
>>> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>>>
>>> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)
>
> Hi Robert,
>
> I am not sure you can success with build trunk with  -O0 -fno-inline.
> I suggest you try the patch in
> http://code.google.com/p/kgtp/downloads/detail?name=co.patch.
> It add a option in "Kernel hacking" called "Compile with almost no
> optimization". It will make kernel be built without -O2. It support
> x86_32, x86_64 and arm.

HI,

Yeah.. -O0 doesn't build smoothly, it seems that building with -O0 is
not required right now, but I'll keep your patch in mind in case it
becomes necessary. Thanks for the tip.

> PS, maybe you can try kgtp (https://code.google.com/p/kgtp/)  debug your kernel.
>
>>
>> Oh, don't do that. -O0 makes the code totally unreadable (the compiler
>> just does _stupid_ things, making the asm code look so horrible that
>> you can't match it up against anything sane), and -fno-inline isn't
>> worth the pain either.
>>
>> -Os is much better than those.
>>
>> But in this case, just getting the filename and line number would have
>> made the thing moot anyway - without kdb it _should_ have said
>> something clear like
>>
>>   kernel BUG at %s:%u!
>>
>> where %s:%u is the filename and line number.
>>
>>                          Linus
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>



--
Robert Święcki

2011-04-04 18:30:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Mon, Apr 4, 2011 at 5:46 AM, Robert Święcki <[email protected]> wrote:
> On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <[email protected]> wrote:
>> On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <[email protected]> wrote:
>>>>
>>>> Hey, I'll apply your patch and check it out. In the meantime I
>>>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>>>
>>>> The oops is here:
>>>>
>>>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>>>
>>> That's not a NULL pointer dereference. That's a BUG_ON().
>>>
>>> And for some reason you've turned off the BUG_ON() messages, saving
>>> some tiny amount of memory.
>>>
>>> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
>>> would be this one:
>>>
>>>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>>>
>>> but it is possible that gcc has shuffled things around (so it _might_
>>> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
>>> get a filename and line number. One reason I hate -O2 in cases like
>>> this is that the basic block movement makes it way harder to actually
>>> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
>>> or whatever it's called).
>>>
>>> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>>>
>>> I do wonder what triggers this. Is it a huge-page vma? We seem to be
>>> lacking the check to see that mprotect() is on a hugepage boundary -
>>> and that seems bogus. Or am I missing some check? The new transparent
>>> hugepage support splits the page, but what if it's a _static_ hugepage
>>> thing?
>>>
>>> But why would that affect the radix_index thing? I have no idea. I'd
>>> like to blame the anon_vma rewrites last year, but I can't see why
>>> that should matter either. Again, hugepages had some special rules, I
>>> think (and that would explain why nobody normal sees this).
>>>
>>> Guys, please give this one a look.
>>
>> I do intend to look, but I think it makes more sense to wait until
>> Robert has reproduced it (or something like it) with my debugging
>> patch in.
>
> Hi Hugh,
>
> I did two things, included your patch, and compiled with
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
> ~2 days under fuzzing (with getdents and readdir syscalls disabled in
> the fuzzer). I don't think -Os has any bigger influence on how mm
> internally works therefore I must attribute the change to your patch
> (was it patch which fixes something or merely dumps vma structures in
> case of any problem?).

I'm sorry, I should have explained the patch a little more. Along
with dumping out the vma structs, it does change the BUG or BUGs there
to WARN_ONs, allowing the system to continue if it's not too badly
corrupted, though leaking some structure memory (if the structs have
been reused, it's probably not safe to assume we still have ownership
of them). So if the problem has occurred again, it should be leaving
WARNING messages and vma struct dumps in your /var/log/messages -
please look for them and send them in if found.

Perhaps we should simply include the patch in mainline kernel: it
doesn't do much good just lingering in mmotm, but seems to be helping
your system to limp along longer.

>
>
> ==============
> BTW, another problem arose in the meantime, not sure if anyhow related
> to things we're discussing here, although 'btc 0' in kdb shows that
> processor 0 hangs in sys_mlock - I did it in two different moments, to
> exclude any coincidences. After those 2 days of fuzzing, 'ps wuax'
> stopped working, i.e. it prints some output, then stops, cannot be
> killed with -SIGKILL etc. I'll let it run for the time being, I can
> dump more data in this PID 17750 if anybody wants:
>
> strace:
>
> # strace -f ps wwuax
> ....
> open("/proc/17750/status", O_RDONLY)    = 6
> read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777
> close(6)                                = 0
> open("/proc/17750/cmdline", O_RDONLY)   = 6
> read(6,
>
> Process 17750 also cannot be killed. Attaching more data:
>
> # cat /proc/17750/status
> Name:   iknowthis
> State:  D (disk sleep)
> Tgid:   17750
> Pid:    17750
> PPid:   1
> TracerPid:      0
> Uid:    1001    1001    1001    1001
> Gid:    1001    1001    1001    1001
> FDSize: 64
> Groups: 1001
> VmPeak:     7752 kB
> VmSize:     5760 kB
> VmLck:        32 kB
> VmHWM:      4892 kB
> VmRSS:      3068 kB
> VmData:     2472 kB
> VmStk:       408 kB
> VmExe:       160 kB
> VmLib:      2684 kB
> VmPTE:        44 kB
> VmSwap:        0 kB
> Threads:        1
> SigQ:   218/16382
> SigPnd: 0000000000000b00
> ShdPnd: 0000400000000503
> SigBlk: 0000000000000000
> SigIgn: 0000000001001000
> SigCgt: 0000000000000000
> CapInh: 0000000000000000
> CapPrm: 0000000000000000
> CapEff: 0000000000000000
> CapBnd: ffffffffffffffff
> Cpus_allowed:   01
> Cpus_allowed_list:      0
> Mems_allowed:   00000000,00000001
> Mems_allowed_list:      0
> voluntary_ctxt_switches:        43330
> nonvoluntary_ctxt_switches:     4436
>
> # cat /proc/17750/wchan
> call_rwsem_down_write_failed
>
> # cat /proc/17750/maps (hangs)
>
> (from kdb)
>
> [0]kdb> btp 17750
> Stack traceback for pid 17750
> 0xffff88011e772dc0    17750        1  0    0   D  0xffff88011e773240  iknowthis
> <c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c>
> ffff88011cbcffd8<c>
> <c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c>
> ffff88011e773178<c>
> <c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c>
> 0000000000013f00<c>
> Call Trace:
>  [<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d
>  [<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15
>  [<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20
>  [<ffffffff81e27da0>] ? down_write+0x25/0x27
>  [<ffffffff8115f041>] do_coredump+0x14f/0x9a5
>  [<ffffffff8114d4e2>] ? T.1006+0x17/0x32
>  [<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f
>  [<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42
>  [<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6
>  [<ffffffff8103e0c1>] do_signal+0x72/0x67d
>  [<ffffffff81096807>] ? child_wait_callback+0x0/0x58
>  [<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41
>  [<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9
>  [<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68
>  [<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80
>  [<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e
>
> [0]kdb>  btc 0
> Stack traceback for pid 10350
> 0xffff88011b6badc0    10350        1  1    0   R  0xffff88011b6bb240 *iknowthis2
> <c> ffff8800cfc03db8<c> 0000000000000000<c>
> Call Trace:
>  <#DB>  <<EOE>>  <IRQ>  [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c
>  [<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e
>  [<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2
>  [<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2
>  [<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111
>  [<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150
>  [<ffffffff810ea044>] ? move_native_irq+0x19/0x6d
>  [<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f
>  [<ffffffff8104198f>] ? handle_irq+0x88/0x91
>  [<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3
>  [<ffffffff81e29193>] ? ret_from_intr+0x0/0x15
>  <EOI>  [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad
>  [<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad
>  [<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118
>  [<ffffffff81134002>] ? sys_mlock+0xe8/0xf6
>  [<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5
>
> [0]kdb> btc 1
> Stack traceback for pid 9409
> 0xffff88011c9816e0     9409        1  1    1   R  0xffff88011c981b60  iknowthis2
> <c> ffff88011dc21ec8

Sorry, I've no time to think about this one at the moment (at LSF).
Does this look similar to what you previously reported on mlock?

Hugh

2011-04-05 12:21:43

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

>> Hi Hugh,
>>
>> I did two things, included your patch, and compiled with
>> CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
>> ~2 days under fuzzing (with getdents and readdir syscalls disabled in
>> the fuzzer). I don't think -Os has any bigger influence on how mm
>> internally works therefore I must attribute the change to your patch
>> (was it patch which fixes something or merely dumps vma structures in
>> case of any problem?).
>
> I'm sorry, I should have explained the patch a little more.  Along
> with dumping out the vma structs, it does change the BUG or BUGs there
> to WARN_ONs, allowing the system to continue if it's not too badly
> corrupted, though leaking some structure memory (if the structs have
> been reused, it's probably not safe to assume we still have ownership
> of them).  So if the problem has occurred again, it should be leaving
> WARNING messages and vma struct dumps in your /var/log/messages -
> please look for them and send them in if found.

Here it is, I'll leave it in this state (kdb) in case you need some
remote debugging

<4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
<4>[ 1523.884381] Hardware name: Precision WorkStation 390
<4>[ 1523.889703] Pid: 13801, comm: iknowthis2 Not tainted 2.6.38 #2
<4>[ 1523.895544] Call Trace:
<4>[ 1523.898000] [<ffffffff810b5d2a>] ? warn_slowpath_common+0x7a/0xb0
<4>[ 1523.904195] [<ffffffff810b5d7a>] ? warn_slowpath_null+0x1a/0x20
<4>[ 1523.910210] [<ffffffff8116b4b3>] ? vma_prio_tree_add+0x43/0x110
<4>[ 1523.916226] [<ffffffff8116b5c1>] ? vma_prio_tree_insert+0x41/0x60
<4>[ 1523.922416] [<ffffffff8117b69c>] ? __vma_link_file+0x4c/0x90
<4>[ 1523.928171] [<ffffffff8117c078>] ? vma_adjust+0xe8/0x570
<4>[ 1523.933579] [<ffffffff8117c641>] ? __split_vma+0x141/0x280
<4>[ 1523.939157] [<ffffffff8117c7a5>] ? split_vma+0x25/0x30
<4>[ 1523.944391] [<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720
<4>[ 1523.949969] [<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0
<4>[ 1523.955900] [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
<4>[ 1523.962347] [<ffffffff8109a653>] ? ia32_sysret+0x0/0x5
<4>[ 1523.967580] [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
<4>[ 1523.974026] ---[ end trace c13483b7eb481afd ]---
<4>[ 1523.978650] vm_area_struct at ffff880120bda508:
<4>[ 1523.983199] ffff88011eb5aa00 00000000f72f3000 00000000f73f0000
ffff88011b8eaa10
<4>[ 1523.990674] ffff88011b8ea228 0000000000000027 00000000000101ff
ffff88011b8ea6b1
<4>[ 1523.998151] ffff88011e390820 ffff88011b8ea260 ffff880120796780
ffff880120bdad40
<4>[ 1524.005624] (null) (null) ffff88011ed5b910
ffff88011ed5b1f0
<4>[ 1524.013103] ffff88011f72b168 ffffffff82427480 ffffffffffffff03
ffff8800793ff0c0
<4>[ 1524.020581] (null) (null) (null)
<4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
<4>[ 1524.031110] ffff88011eb5a300 00000000f72f3000 00000000f7400000
ffff88011f6c6f18
<4>[ 1524.038584] ffff88011b5c9da8 0000000000000027 00000000000101ff
ffff8801206f0c71
<4>[ 1524.046062] ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40
ffff880120bdad40
<4>[ 1524.053536] ffff880120bda558 (null) ffff88011f758ee0
ffff88011f7583a0
<4>[ 1524.061016] ffff88011f556690 ffffffff82427480 ffffffffffffff03
ffff8800793ff0c0
<4>[ 1524.068491] (null) (null) (null)

[1]kdb> pid
KDB current process is iknowthis2(pid=13801)

[1]kdb> btp 13801
Stack traceback for pid 13801
0xffff88011ec35cc0 13801 4516 1 1 R 0xffff88011ec36140 *iknowthis2
<c> ffff88011c5d1d68<c> 0000000000000000<c> ffff88011f7a3eb8<c>
ffff88011c5d1d88<c>
<c> ffffffff8116b3b9<c> ffff88011b8ea730<c> ffff88011b8eaa10<c>
ffff88011c5d1e28<c>
<c> ffffffff8117c0cb<c> 00000000f73f0000<c> ffff88011eb5aa00<c>
ffff88011f72b168<c>
Call Trace:
[<ffffffff8116b3b9>] ? vma_prio_tree_remove+0xc9/0x110
[<ffffffff8117c0cb>] ? vma_adjust+0x13b/0x570
[<ffffffff8117c641>] ? __split_vma+0x141/0x280
[<ffffffff8117c7a5>] ? split_vma+0x25/0x30
[<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720
[<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0
[<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff8109a653>] ? ia32_sysret+0x0/0x5
[<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c

[1]kdb> rd
ax: ffff88011b8ea780 bx: ffff880120796678 cx: ffff8801207966c8
dx: ffff8801207966c8 si: ffff8801207966c8 di: ffff880027d3bec8
bp: ffff88011c5d1d68 sp: ffff88011c5d1d68 r8: 0000000000000000
r9: ffff88011c5d1946 r10: ffff88011c5d1945 r11: 0000000000000000
r12: ffff88011b8eaa10 r13: ffff880120bda508 r14: ffff880027d3bea8
r15: ffff88011eb5aa00 ip: ffffffff8158d935 flags: 00010297 cs: 00000010

--
Robert Święcki

2011-04-05 15:38:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <[email protected]> wrote:
>
> Here it is, I'll leave it in this state (kdb) in case you need some
> remote debugging
>
> <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
> <4>[ 1523.978650] vm_area_struct at ffff880120bda508:
> <4>[ 1523.983199]  ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10
> <4>[ 1523.990674]  ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1
> <4>[ 1523.998151]  ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40
> <4>[ 1524.005624]            (null)           (null) ffff88011ed5b910 ffff88011ed5b1f0
> <4>[ 1524.013103]  ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
> <4>[ 1524.020581]            (null)           (null)           (null)

vma->vm_start/end is 0xf72f3000-0xf73f0000

> <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
> <4>[ 1524.031110]  ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18
> <4>[ 1524.038584]  ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71
> <4>[ 1524.046062]  ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40
> <4>[ 1524.053536]  ffff880120bda558           (null) ffff88011f758ee0 ffff88011f7583a0
> <4>[ 1524.061016]  ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
> <4>[ 1524.068491]            (null)           (null)           (null)

vma->vm_start/end is 0xf72f3000-0xf7400000.

If I read those right, then the vm_pgoff (RADIX_INDEX for the
prio-tree) is ffffffffffffff03 for both cases. That doesn't look good.
How do we get a negative pg_off for a file mapping?

Also, since they have a different size, they should have a different
HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX,
shouldn't that mean that the prio_tree_insert() logic should create a
new node for it?

I dunno. But that odd negative pg_off thing makes me think there is
some overflow issue (ie HEAP_INDEX being pg_off + size ends up
fluctuating between really big and really small). So I'd suspect THAT
as the main reason.

But maybe I'm mis-reading the dump, and the ffffffffffffff03 isn't
vm_pgoff at all.

Hugh?

Linus

2011-04-06 14:47:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 5, 2011 at 8:37 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <[email protected]> wrote:
>>
>> Here it is, I'll leave it in this state (kdb) in case you need some
>> remote debugging
>>
>> <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
>> <4>[ 1523.978650] vm_area_struct at ffff880120bda508:
>> <4>[ 1523.983199]  ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10
>> <4>[ 1523.990674]  ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1
>> <4>[ 1523.998151]  ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40
>> <4>[ 1524.005624]            (null)           (null) ffff88011ed5b910 ffff88011ed5b1f0
>> <4>[ 1524.013103]  ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
>> <4>[ 1524.020581]            (null)           (null)           (null)
>
> vma->vm_start/end is 0xf72f3000-0xf73f0000
>
>> <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
>> <4>[ 1524.031110]  ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18
>> <4>[ 1524.038584]  ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71
>> <4>[ 1524.046062]  ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40
>> <4>[ 1524.053536]  ffff880120bda558           (null) ffff88011f758ee0 ffff88011f7583a0
>> <4>[ 1524.061016]  ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
>> <4>[ 1524.068491]            (null)           (null)           (null)
>
> vma->vm_start/end is 0xf72f3000-0xf7400000.
>
> If I read those right, then the vm_pgoff (RADIX_INDEX for the
> prio-tree) is ffffffffffffff03 for both cases. That doesn't look good.
> How do we get a negative pg_off for a file mapping?

Yes, I think that's probably at the root of it. Robert is using a
fuzzer, and it's a 32-bit executable running on a 64-bit kernel: I
suspect there's somewhere on our compat path where we've not validated
incoming mmap offset properly.

Hmm, but I don't see anything wrong there.

>
> Also, since they have a different size, they should have a different
> HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX,
> shouldn't that mean that the prio_tree_insert() logic should create a
> new node for it?

Yes.

>
> I dunno. But that odd negative pg_off thing makes me think there is
> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
> fluctuating between really big and really small). So I'd suspect THAT
> as the main reason.

Yes, one of the vmas is such that the end offset (pgoff of next page
after) would be 0, and for the other it would be 16. There's sure to
be places, inside the prio_tree code and outside it, where we rely
upon pgoff not wrapping around - wrap should be prevented by original
validation of arguments.

Hugh

2011-04-06 15:33:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <[email protected]> wrote:
>>
>> I dunno. But that odd negative pg_off thing makes me think there is
>> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
>> fluctuating between really big and really small). So I'd suspect THAT
>> as the main reason.
>
> Yes, one of the vmas is such that the end offset (pgoff of next page
> after) would be 0, and for the other it would be 16. ?There's sure to
> be places, inside the prio_tree code and outside it, where we rely
> upon pgoff not wrapping around - wrap should be prevented by original
> validation of arguments.

Well, we _do_ validate them in do_mmap_pgoff(), which is the main
routine for all the mmap() system calls, and the main way to get a new
mapping.

There are other ways, like do_brk(), but afaik that always sets
vm_pgoff to the virtual address (shifted), so again the new mapping
should be fine.

So when a new mapping is created, it should all be ok.

But I think mremap() may end up expanding it without doing the same
overflow check.

Do you see any other way to get this situation? Does the vma dump give
you any hint about where it came from?

Robert - here's a (UNTESTED!) patch to make mremap() be a bit more
careful about vm_pgoff when growing a mapping. Does it make any
difference?

Linus


Attachments:
patch.diff (771.00 B)

2011-04-06 15:43:54

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, 6 Apr 2011, Linus Torvalds wrote:
> On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <[email protected]> wrote:
> >>
> >> I dunno. But that odd negative pg_off thing makes me think there is
> >> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
> >> fluctuating between really big and really small). So I'd suspect THAT
> >> as the main reason.
> >
> > Yes, one of the vmas is such that the end offset (pgoff of next page
> > after) would be 0, and for the other it would be 16. ?There's sure to
> > be places, inside the prio_tree code and outside it, where we rely
> > upon pgoff not wrapping around - wrap should be prevented by original
> > validation of arguments.
>
> Well, we _do_ validate them in do_mmap_pgoff(), which is the main
> routine for all the mmap() system calls, and the main way to get a new
> mapping.
>
> There are other ways, like do_brk(), but afaik that always sets
> vm_pgoff to the virtual address (shifted), so again the new mapping
> should be fine.
>
> So when a new mapping is created, it should all be ok.
>
> But I think mremap() may end up expanding it without doing the same
> overflow check.
>
> Do you see any other way to get this situation? Does the vma dump give
> you any hint about where it came from?
>
> Robert - here's a (UNTESTED!) patch to make mremap() be a bit more
> careful about vm_pgoff when growing a mapping. Does it make any
> difference?

I'd come to the same conclusion: the original page_mapped BUG has itself
suggested that mremap() is getting used.

I was about to send you my own UNTESTED patch: let me append it anyway,
I think it is more correct than yours (it's the offset of vm_end we need
to worry about, and there's the funny old_len,new_len stuff). See what
you think - sorry, I'm going out now.

Hugh

--- 2.6.38/mm/mremap.c 2011-03-14 18:20:32.000000000 -0700
+++ linux/mm/mremap.c 2011-04-06 08:31:46.000000000 -0700
@@ -282,6 +282,12 @@ static struct vm_area_struct *vma_to_res
goto Efault;
}

+ if (vma->vm_file && new_len > old_len) {
+ pgoff_t endoff = linear_page_index(vma, vma->vm_end);
+ if (endoff + ((new_len - old_len) >> PAGE_SHIFT) < endoff)
+ goto Eoverflow;
+ }
+
if (vma->vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
locked = mm->locked_vm << PAGE_SHIFT;
@@ -311,6 +317,8 @@ Enomem:
return ERR_PTR(-ENOMEM);
Eagain:
return ERR_PTR(-EAGAIN);
+Eoverflow:
+ return ERR_PTR(-EOVERFLOW);
}

static unsigned long mremap_to(unsigned long addr,

2011-04-06 16:00:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <[email protected]> wrote:
>
> I was about to send you my own UNTESTED patch: let me append it anyway,
> I think it is more correct than yours (it's the offset of vm_end we need
> to worry about, and there's the funny old_len,new_len stuff).

Umm. That's what my patch did too. The

pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;

is the "offset of the pgoff" from the original mapping, then we do

pgoff += vma->vm_pgoff;

to get the pgoff of the new mapping, and then we do

if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)

to check that the new mapping is ok.

I think yours is equivalent, just a different (and odd - that
linear_page_index() thing will do lots of unnecessary shifts and
hugepage crap) way of writing it.

>?See what you think - sorry, I'm going out now.

I think _yours_ is conceptually buggy, because I think that test for
"vma->vm_file" is wrong.

Yes, new anonymous mappings set vm_pgoff to the virtual address, but
that's not true for mremap() moving them around, afaik.

Admittedly it's really hard to get to the overflow case, because the
address is shifted down, so even if you start out with an anonymous
mmap at a high address (to get a big vm_off), and then move it down
and expand it (to get a big size), I doubt you can possibly overflow.
But I still don't think that the test for vm_file is semantically
sensible, even if it might not _matter_.

But whatever. I suspect both our patches are practically doing the
same thing, and it would be interesting to hear if it actually fixes
the issue. Maybe there is some other way to mess up vm_pgoff that I
can't think of right now.

Linus

2011-04-06 17:54:17

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, Apr 6, 2011 at 5:59 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <[email protected]> wrote:
>>
>> I was about to send you my own UNTESTED patch: let me append it anyway,
>> I think it is more correct than yours (it's the offset of vm_end we need
>> to worry about, and there's the funny old_len,new_len stuff).
>
> Umm. That's what my patch did too. The
>
>   pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>
> is the "offset of the pgoff" from the original mapping, then we do
>
>   pgoff += vma->vm_pgoff;
>
> to get the pgoff of the new mapping, and then we do
>
>   if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
>
> to check that the new mapping is ok.
>
> I think yours is equivalent, just a different (and odd - that
> linear_page_index() thing will do lots of unnecessary shifts and
> hugepage crap) way of writing it.
>
>> See what you think - sorry, I'm going out now.
>
> I think _yours_ is conceptually buggy, because I think that test for
> "vma->vm_file" is wrong.
>
> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
> that's not true for mremap() moving them around, afaik.
>
> Admittedly it's really hard to get to the overflow case, because the
> address is shifted down, so even if you start out with an anonymous
> mmap at a high address (to get a big vm_off), and then move it down
> and expand it (to get a big size), I doubt you can possibly overflow.
> But I still don't think that the test for vm_file is semantically
> sensible, even if it might not _matter_.
>
> But whatever. I suspect both our patches are practically doing the
> same thing, and it would be interesting to hear if it actually fixes
> the issue. Maybe there is some other way to mess up vm_pgoff that I
> can't think of right now.

Testing with Linus' patch. Will let you know in a few hours.

--
Robert Święcki

2011-04-07 12:41:58

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

>>> I was about to send you my own UNTESTED patch: let me append it anyway,
>>> I think it is more correct than yours (it's the offset of vm_end we need
>>> to worry about, and there's the funny old_len,new_len stuff).
>>
>> Umm. That's what my patch did too. The
>>
>>   pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>>
>> is the "offset of the pgoff" from the original mapping, then we do
>>
>>   pgoff += vma->vm_pgoff;
>>
>> to get the pgoff of the new mapping, and then we do
>>
>>   if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
>>
>> to check that the new mapping is ok.
>>
>> I think yours is equivalent, just a different (and odd - that
>> linear_page_index() thing will do lots of unnecessary shifts and
>> hugepage crap) way of writing it.
>>
>>> See what you think - sorry, I'm going out now.
>>
>> I think _yours_ is conceptually buggy, because I think that test for
>> "vma->vm_file" is wrong.
>>
>> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
>> that's not true for mremap() moving them around, afaik.
>>
>> Admittedly it's really hard to get to the overflow case, because the
>> address is shifted down, so even if you start out with an anonymous
>> mmap at a high address (to get a big vm_off), and then move it down
>> and expand it (to get a big size), I doubt you can possibly overflow.
>> But I still don't think that the test for vm_file is semantically
>> sensible, even if it might not _matter_.
>>
>> But whatever. I suspect both our patches are practically doing the
>> same thing, and it would be interesting to hear if it actually fixes
>> the issue. Maybe there is some other way to mess up vm_pgoff that I
>> can't think of right now.
>
> Testing with Linus' patch. Will let you know in a few hours.

Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.

I can add some printk in this condition, and let it run for a few days
(I will not have access to my testing machine throughout that time),
if you think this will confirm your hypothesis.

--
Robert Święcki

2011-04-07 14:17:28

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, 6 Apr 2011, Linus Torvalds wrote:
> On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <[email protected]> wrote:
> >
> > I was about to send you my own UNTESTED patch: let me append it anyway,
> > I think it is more correct than yours (it's the offset of vm_end we need
> > to worry about, and there's the funny old_len,new_len stuff).
>
> Umm. That's what my patch did too. The
>
> pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>
> is the "offset of the pgoff" from the original mapping, then we do
>
> pgoff += vma->vm_pgoff;
>
> to get the pgoff of the new mapping, and then we do
>
> if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
>
> to check that the new mapping is ok.

Right, I was forgetting the semantics for mremap when
addr + old_len < vma->vm_end. It has to move out the
old section and extend it elsewhere, it does not affect
the page just before vma->vm_end at all. So mine was
indeed a more complicated way of doing yours.

>
> I think yours is equivalent, just a different (and odd - that
> linear_page_index() thing will do lots of unnecessary shifts and
> hugepage crap) way of writing it.

I was trying to use the common function provided: but it's
actually wrong, that's a function for getting the value found
in page->index (in units of PAGE_CACHE_SIZE), whereas here we
want the value found in vm_pgoff (in units of PAGE_SIZE).

Of course PAGE_CACHE_SIZE has equalled PAGE_SIZE everywhere but in
some patches by Christoph Lameter a few years back, so there isn't
an effective difference; but I was wrong to use that function.

>
> >?See what you think - sorry, I'm going out now.
>
> I think _yours_ is conceptually buggy, because I think that test for
> "vma->vm_file" is wrong.

Just being cautious: we cannot hit the BUG in prio_tree.c when we're
dealing with an anonymous mapping, and I didn't want to think about
anonymous at the time.

>
> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
> that's not true for mremap() moving them around, afaik.
>
> Admittedly it's really hard to get to the overflow case, because the
> address is shifted down, so even if you start out with an anonymous
> mmap at a high address (to get a big vm_off), and then move it down
> and expand it (to get a big size), I doubt you can possibly overflow.
> But I still don't think that the test for vm_file is semantically
> sensible, even if it might not _matter_.

The strangest case is when a 64-bit kernel execs a 32-bit executable,
preparing the stack with a very high virtual address which goes into
vm_pgoff (shifted by PAGE_SHIFT), then moves that stack down into the
32-bit address space but leaving it with the original high vm_pgoff.

I think you are now excluding some wild anonymous cases which were
allowed before, and gave no trouble - vma_address() looks like a wrap
won't upset it. But they're not cases which anyone is likely to do,
and safer to keep the anon rules in synch with the file rules.

>
> But whatever. I suspect both our patches are practically doing the
> same thing, and it would be interesting to hear if it actually fixes
> the issue. Maybe there is some other way to mess up vm_pgoff that I
> can't think of right now.

Here's yours inline below:

Acked-by: Hugh Dickins <[email protected]>
---

mm/mremap.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 1de98d492ddc..a7c1f9f9b941 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -277,9 +277,16 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
if (old_len > vma->vm_end - addr)
goto Efault;

- if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) {
- if (new_len > old_len)
+ /* Need to be careful about a growing mapping */
+ if (new_len > old_len) {
+ unsigned long pgoff;
+
+ if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
goto Efault;
+ pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgoff += vma->vm_pgoff;
+ if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+ goto Einval;
}

if (vma->vm_flags & VM_LOCKED) {

2011-04-07 14:24:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Thu, 7 Apr 2011, Robert Swiecki wrote:
> >
> > Testing with Linus' patch. Will let you know in a few hours.
>
> Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.
>
> I can add some printk in this condition, and let it run for a few days
> (I will not have access to my testing machine throughout that time),
> if you think this will confirm your hypothesis.

That's great, thanks Robert. If the machine has nothing better to do,
then it would be nice to let it run a little longer (a few days if that's
what suits you), but it does look good so far. Though I'm afraid you'll
now discover something else entirely ;)

Hugh

2011-04-12 09:58:39

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Thu, Apr 7, 2011 at 4:24 PM, Hugh Dickins <[email protected]> wrote:
> On Thu, 7 Apr 2011, Robert Swiecki wrote:
>> >
>> > Testing with Linus' patch. Will let you know in a few hours.
>>
>> Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.
>>
>> I can add some printk in this condition, and let it run for a few days
>> (I will not have access to my testing machine throughout that time),
>> if you think this will confirm your hypothesis.
>
> That's great, thanks Robert.  If the machine has nothing better to do,
> then it would be nice to let it run a little longer (a few days if that's
> what suits you), but it does look good so far.  Though I'm afraid you'll
> now discover something else entirely ;)

Ok, I added printk here:

if (new_len > old_len) {
unsigned long pgoff;

if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
goto Efault;
pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) {
printk("VMA_TO_RESIZE: ADDR:%lx OLD_LEN:%lx
NEW_LEN:%lx PGOFF: %lx VMA->VM_START:%lx VMA->VM_FLAGS:%lx",
addr, old_len, new_len, pgoff,
vma->vm_start, vma->vm_flags);

goto Einval;
}
}


and after a few mins of fuzzing I get:

[ 584.224028] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:c000
PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2321fa
[ 639.777561] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:b000
PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2301f8

So, if this case is not caught later on in the code, I guess it solves
the problem. During the fuzzing I didn't experience any panic's, but
some other problems arose, i.e. cannot read /proc/<pid>/maps for some
processes (sys_read hangs, and such process cannot be killed or
stopped with any signal, still it's running (R state) and using CPU -
I'll submit another report for that).

--
Robert Święcki

2011-04-12 14:22:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 2:58 AM, Robert Święcki <[email protected]> wrote:
>
> So, if this case is not caught later on in the code, I guess it solves
> the problem. During the fuzzing I didn't experience any panic's, but
> some other problems arose, i.e. cannot read /proc/<pid>/maps for some
> processes (sys_read hangs, and such process cannot be killed or
> stopped with any signal, still it's running (R state) and using CPU -
> I'll submit another report for that).

Hmm. Sounds like an endless loop in kernel mode.

Use "perf record -ag" as root, it should show up very clearly in the report.

Linus

2011-04-12 16:19:17

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

>>> So, if this case is not caught later on in the code, I guess it solves
>>> the problem. During the fuzzing I didn't experience any panic's, but
>>> some other problems arose, i.e. cannot read /proc/<pid>/maps for some
>>> processes (sys_read hangs, and such process cannot be killed or
>>> stopped with any signal, still it's running (R state) and using CPU -
>>> I'll submit another report for that).
>>
>> Hmm. Sounds like an endless loop in kernel mode.
>>
>> Use "perf record -ag" as root, it should show up very clearly in the report.
>>
>>                          Linus
>
> I've put some data here -
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2
> - I think it's somewhat connected (sys_mlock appears on both cases).
>
> Attaching perf data (for 2.6.38) + kdb dumpall + procdump for process 14158
>
> Those 3 processes cannot be stopped/killed
>
> 14158 66.2  0.0   8380  3012 ?  RL  /tmp/iknowthis
> 17100 63.6  0.1  18248  4004 ?  RL  /tmp/iknowthis
> 19772 63.8  0.0   4000  1888 ?   RL  /tmp/iknowthis

Also, the system doesn't look usable after such fuzzing (executing a
few times some pretty deterministic program)

root@ise-test:~# gcc -m32 mlock.c -o mlock

root@ise-test:~# ./mlock
./mlock: relocation error: ./mlock: symbol perror, version GLIBC_2.0
not defined in file libc.so.6 with link time reference

root@ise-test:~# ./mlock
mmap: Success
RET: 0xf751f000
mremap: Invalid argument
RET: 0xffffffff

root@ise-test:~# ./mlock
Segmentation fault

root@ise-test:~# dmesg | tail -n 1
[ 5164.961568] mlock[7097]: segfault at 0 ip (null) sp
00000000ff8a00d4 error 14 in mlock[8048000+1000]

--
Robert Święcki

2011-04-12 17:20:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 8:48 AM, Robert Święcki <[email protected]> wrote:
>>
>> Hmm. Sounds like an endless loop in kernel mode.
>>
>> Use "perf record -ag" as root, it should show up very clearly in the report.
>
> I've put some data here -
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2
> - I think it's somewhat connected (sys_mlock appears on both cases).

Ok, so it's definitely sys_mlock.

And I suspect it's due to commit 53a7706d5ed8 somehow looping forever.

One possible cause would be how that commit made things care deeply
about the return value of __get_user_pages(), and in particular what
happens when that return value is zero. It ends up looping forever in
do_mlock_pages() for that case, because it does

nend = nstart + ret * PAGE_SIZE;

so now the next round we'll set "nstart = nend" and start all over.

I see at least one way __get_user_pages() will return zero, and it's
if it is passed a npages of 0 to begin with. Which can easily happen
if you try to mlock() the first page of a stack segment: the code will
jump over that stack segment page, and then have nothing to do, and
return zero. So then do_mlock_pages() will just keep on trying again.

THIS IS A HACKY AND UNTESTED PATCH!

It's ugly as hell, because the real problem is do_mlock_pages() caring
too damn much about the return value, and us hiding the whole stack
page thing in that function. I wouldn't want to commit it as-is, but
if you can easily reproduce the problem, it's a good patch to test out
the theory. Assuming I didn't screw something up.

Again, TOTALLY UNTESTED!

Linus


Attachments:
patch.diff (1.04 kB)

2011-04-12 19:00:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds
<[email protected]> wrote:
>
> THIS IS A HACKY AND UNTESTED PATCH!

.. and here is a rather less hacky, but still equally untested patch.
It moves the stack guard page handling into __get_user_pages() itself,
and thus avoids the whole problem.

This one I could easily see myself committing. Assuming I get some
ack's and testing..

Comments?

Linus


Attachments:
patch.diff (3.01 kB)

2011-04-12 19:02:59

by Robert Święcki

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 8:59 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> THIS IS A HACKY AND UNTESTED PATCH!
>
> .. and here is a rather less hacky, but still equally untested patch.
> It moves the stack guard page handling into __get_user_pages() itself,
> and thus avoids the whole problem.
>
> This one I could easily see myself committing. Assuming I get some
> ack's and testing..

I'm testing currently with the old one, w/o any symptoms of problems
by now, but it's not a meaningful period of time. I can try with the
new one, leave it over(European)night, and let you know tomorrow.

--
Robert Święcki

2011-04-12 19:38:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <[email protected]> wrote:
>
> I'm testing currently with the old one, w/o any symptoms of problems
> by now, but it's not a meaningful period of time. I can try with the
> new one, leave it over(European)night, and let you know tomorrow.

You might as well keep testing the old one, if that gives it better
coverage. No need to disrupt anything you already have running.

The more important input is "was that actually the root cause", rather
than deciding between the ugly or clean way of fixing it.

So if the first patch fixes it, then I'm pretty sure the second one
will too - just in a cleaner manner.

Linus

2011-04-18 21:15:16

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Tue, Apr 12, 2011 at 12:38 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <[email protected]> wrote:
>>
>> I'm testing currently with the old one, w/o any symptoms of problems
>> by now, but it's not a meaningful period of time. I can try with the
>> new one, leave it over(European)night, and let you know tomorrow.
>
> You might as well keep testing the old one, if that gives it better
> coverage. No need to disrupt anything you already have running.
>
> The more important input is "was that actually the root cause", rather
> than deciding between the ugly or clean way of fixing it.
>
> So if the first patch fixes it, then I'm pretty sure the second one
> will too - just in a cleaner manner.

Sorry for the delayed response - I have been traveling abroad in the
last two weeks and until the end of the month.

This second patch looks more attractive than the first, but is also
harder to prove correct. Hugh looked at all gup call sites and
convinced himself that the change was safe, except for the
fault_in_user_writeable() site in futex.c which he asked me to look
at. I am worried that we would have an issue there, as places like
futex_wake_op() or fixup_pi_state_owner() operate on user memory with
page faults disabled, and expect fault_in_user_writeable() to set up
the user page so that they can retry if the initial access failed.
With this proposal, fault_in_user_writeable() would become inoperative
when the address is within the guard page; this could cause some
malicious futex operation to create an infinite loop.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2011-05-05 00:09:46

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

FYI, the attached code causes an infinite loop in kernels that have
the 95042f9eb7 commit:

#include <stdio.h>
#include <string.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

int *get_stack_guard(void)
{
FILE *map;
char buf[1000];

map = fopen("/proc/self/maps", "r");
if (!map)
return NULL;
while(fgets(buf, 1000, map)) {
long a, b;
char c[1000], d[1000], e[1000], f[1000], g[1000];
if (sscanf(buf, "%lx-%lx %s %s %s %s %s", &a, &b, c, d, e, f, g) == 7 &&
!strcmp(g, "[stack]")) {
fclose(map);
return (int *)(a - 4096);
}
}
fclose(map);
return NULL;
}

int main(void)
{
int *uaddr = get_stack_guard();
syscall(SYS_futex, uaddr, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0);
return 0;
}

Linus, I am not sure as to what would be the preferred way to fix
this. One option could be to modify fault_in_user_writeable so that it
passes a non-NULL page pointer, and just does a put_page on it
afterwards. While this would work, this is kinda ugly and would slow
down futex operations somewhat. A more conservative alternative could
be to enable the guard page special case under an new GUP flag, but
this loses much of the elegance of your original proposal...

On Mon, Apr 18, 2011 at 2:15 PM, Michel Lespinasse <[email protected]> wrote:
> This second patch looks more attractive than the first, but is also
> harder to prove correct. Hugh looked at all gup call sites and
> convinced himself that the change was safe, except for the
> fault_in_user_writeable() site in futex.c which he asked me to look
> at. I am worried that we would have an issue there, as places like
> futex_wake_op() or fixup_pi_state_owner() operate on user memory with
> page faults disabled, and expect fault_in_user_writeable() to set up
> the user page so that they can retry if the initial access failed.
> With this proposal, fault_in_user_writeable() would become inoperative
> when the ?address is within the guard page; this could cause some
> malicious futex operation to create an infinite loop.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2011-05-05 00:39:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, May 4, 2011 at 5:09 PM, Michel Lespinasse <[email protected]> wrote:
>
> FYI, the attached code causes an infinite loop in kernels that have
> the 95042f9eb7 commit:

Mmm.

Yes. The atomic fault will never work, and the get_user_pages() thing
won't either, so things will just loop forever.

> Linus, I am not sure as to what would be the preferred way to fix
> this. One option could be to modify fault_in_user_writeable so that it
> passes a non-NULL page pointer, and just does a put_page on it
> afterwards. While this would work, this is kinda ugly and would slow
> down futex operations somewhat.

No, that's just ugly as hell.

> A more conservative alternative could
> be to enable the guard page special case under an new GUP flag, but
> this loses much of the elegance of your original proposal...

How about only doing that only for FOLL_MLOCK?

Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if
the vma isn't mlocked? That looks bogus. Since we have dropped the
mm_semaphore, an unlock may have happened, and afaik we should *not*
try to bring those pages back in at all. There's this whole comment
about that in the caller ("__mlock_vma_pages_range() double checks the
vma flags, so that it won't mlock pages if the vma was already
munlocked."), but despite that it would actually call
__get_user_pages() even if the VM_LOCKED bit had been cleared (it just
wouldn't call it with the FOLL_MLOCK flag).

So maybe something like the attached?

UNTESTED! And maybe there was some really subtle reason to still call
__get_user_pages() without that FOLL_MLOCK thing that I'm missing.

Linus


Attachments:
patch.diff (1.41 kB)

2011-05-05 01:18:15

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, May 4, 2011 at 5:38 PM, Linus Torvalds
<[email protected]> wrote:
>> A more conservative alternative could
>> be to enable the guard page special case under an new GUP flag, but
>> this loses much of the elegance of your original proposal...
>
> How about only doing that only for FOLL_MLOCK?

Sounds reasonable.

> Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if
> the vma isn't mlocked? That looks bogus. Since we have dropped the
> mm_semaphore, an unlock may have happened, and afaik we should *not*
> try to bring those pages back in at all. There's this whole comment
> about that in the caller ("__mlock_vma_pages_range() double checks the
> vma flags, so that it won't mlock pages if the vma was already
> munlocked."), but despite that it would actually call
> __get_user_pages() even if the VM_LOCKED bit had been cleared (it just
> wouldn't call it with the FOLL_MLOCK flag).

There are two reasons VM_LOCKED might be cleared in
__mlock_vma_pages_range(). It could be that one of the VM_SPECIAL
flags were set on the VMA, in which case mlock() won't set VM_LOCKED
but it still must make the pages present. Or, there is an munlock()
executing concurrently with mlock() - in that case, the conservative
thing to do is to give the same results as if the mlock() had
completed before the munlock(). That is, the mlock() would have broken
COW / made the pages present and the munlock() would have cleared the
VM_LOCKED and PageMlocked flags.

> UNTESTED! And maybe there was some really subtle reason to still call
> __get_user_pages() without that FOLL_MLOCK thing that I'm missing.

I think we want the mm/memory.c part of this proposal without the
mm/mlock.c part.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2011-05-05 01:41:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, May 4, 2011 at 6:18 PM, Michel Lespinasse <[email protected]> wrote:
>
> I think we want the mm/memory.c part of this proposal without the
> mm/mlock.c part.

.. but what about mlock not setting the FOLL_MLOCK bit, then?

In that case, you'd get the "mlock extends the stack" problem.

Linus

2011-05-05 03:38:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

Ok, so here's a slightly different approach.

It still makes the FOLL_MLOCK be unconditional in the mlock path, but
it really just pushes down the

- gup_flags = FOLL_TOUCH;
+ gup_flags = FOLL_TOUCH | FOLL_MLOCK;
...
- if (vma->vm_flags & VM_LOCKED)
- gup_flags |= FOLL_MLOCK;

from __mlock_vma_pages_range(), and moves the conditional into
'follow_page()' (which is the only _user_ of that flag) instead:

- if (flags & FOLL_MLOCK) {
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {

so semantically this changes nothing at all.

But now, because __get_user_pages() can look at FOLL_MLOCK to see that
it's a mlock access, we can do that whole "skip stack guard page"
based on that flag:

- if (!pages && stack_guard_page(vma, start))
+ if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))

which means that other uses will try to page in the stack guard page.

I seriously considered making that "skip stack guard page" and the
"mlock lookup" be two separate bits, because conceptually they are
really pretty independent, but right now the only _users_ seem to be
tied together, so I kept it as one single bit (FOLL_MLOCK).

But as far as I can tell, the attached patch is 100% equivalent to
what we do now, except for that "skip stack guard pages only for
mlock" change.

Comments? I like this patch because it seems to make the logic more
straightforward.

But somebody else should double-check my logic.

Linus


Attachments:
patch.diff (1.69 kB)

2011-05-05 04:26:19

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH] mm: fix possible cause of a page_mapped BUG

On Wed, May 4, 2011 at 8:37 PM, Linus Torvalds
<[email protected]> wrote:
> I seriously considered making that "skip stack guard page" and the
> "mlock lookup" be two separate bits, because conceptually they are
> really pretty independent, but right now the only _users_ seem to be
> tied together, so I kept it as one single bit (FOLL_MLOCK).

Yes, this seems best for now.

> But as far as I can tell, the attached patch is 100% equivalent to
> what we do now, except for that "skip stack guard pages only for
> mlock" change.
>
> Comments? I like this patch because it seems to make the logic more
> straightforward.
>
> But somebody else should double-check my logic.

It makes perfect sense.

Reviewed-by: Michel Lespinasse <[email protected]>

(I would argue for it to go to stable trees as well)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.