2010-04-02 18:07:59

by Borislav Petkov

Subject: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

Hi,

I've hit the following oopsie twice now when hibernating. I don't get
it every time I hibernate, only sometimes, say once in a blue moon.

And yeah, I couldn't catch it over serial console so I had to make ugly
pictures. By the way, the numbers in the filenames increment as I scroll
down the whole oops (yep, it hadn't completely frozen and I still could
do Shift->PgUp or Shift->PgDn on the console):

http://www.kernel.org/pub/linux/kernel/people/bp/

So, here's what I could decipher from the oopsie. Someone who's more
knowledgeable in mm, rmap and the anon_vma list traversal should be
able to tell what goes wrong there.

EIP is at page_referenced+0xee

which is

<disasm>
10c4:   41 01 c4        add    %eax,%r12d
10c7:   83 7d cc 00     cmpl   $0x0,-0x34(%rbp)
10cb:   74 19           je     10e6 <page_referenced+0xff>
10cd:   4d 8b 6d 20     mov    0x20(%r13),%r13
10d1:   49 83 ed 20     sub    $0x20,%r13

10d5:   49 8b 45 20     mov    0x20(%r13),%rax   <--------------

10d9:   0f 18 08        prefetcht0 (%rax)
10dc:   49 8d 45 20     lea    0x20(%r13),%rax
10e0:   48 39 45 80     cmp    %rax,-0x80(%rbp)
</disasm>


Corresponding asm:

<asm>
	.loc 1 496 0
	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
.LVL295:
	subq	$32, %r13	#, avc
.LVL296:
.L184:
.LBE1278:
	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <----------------
	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
	leaq	32(%r13), %rax	#, tmp97
	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
	jne	.L187	#,
.L186:
	.loc 1 514 0
	movq	%r14, %rdi	# anon_vma,
	call	page_unlock_anon_vma	#
</asm>


The NULL pointer in question is loaded into %r13 and then 32 is
subtracted from it (I'm guessing container_of()). This is consistent
with the register snapshot, where %r13 contains 0xffffffffffffffe0, which
is -32, and with the code dump in the oops: in CIMG1640.JPG, the code
pointer is at the opcode bytes 49 8b 45 20.

This corresponds to the following piece of code in <mm/rmap.c:page_referenced_anon()>.

<source>

	mapcount = page_mapcount(page);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;

</source>

which tells us that same_anon_vma.next is NULL. Hmm...
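
To make the arithmetic explicit, here is a minimal userspace sketch (not
kernel code) of the container_of() step. The struct avc_sketch layout and
the helper names entry_from_node(), next_field_addr() and null_next_demo()
are assumptions made up for illustration; the layout is only meant to put
same_anon_vma at byte offset 32 on an LP64 build, as the 0x20 displacement
suggests. The address math is done on uintptr_t so that passing 0 stays
well-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical mirror of the chain entry: two pointers and one list_head
 * come before same_anon_vma, so on LP64 the member sits at offset 32.
 */
struct avc_sketch {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * Integer version of container_of(): map the address of the embedded
 * same_anon_vma node to the address of the enclosing entry.
 */
uintptr_t entry_from_node(uintptr_t node_addr)
{
	return node_addr - offsetof(struct avc_sketch, same_anon_vma);
}

/* Address read by "mov 0x20(%r13),%rax": &entry->same_anon_vma.next. */
uintptr_t next_field_addr(uintptr_t entry_addr)
{
	return entry_addr + offsetof(struct avc_sketch, same_anon_vma)
			  + offsetof(struct list_head, next);
}

/* Self-check: a NULL node gives an entry whose .next field is at address 0. */
void null_next_demo(void)
{
	uintptr_t avc = entry_from_node(0);

	assert(next_field_addr(avc) == 0);
}
```

With a NULL .next, entry_from_node(0) wraps around to minus the member
offset (0xffffffffffffffe0 when the offset is 32), and the next load of
->same_anon_vma.next lands back at address 0, which is the faulting access.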

--
Regards/Gruss,
Boris.


2010-04-02 18:14:08

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)


I think this is likely due to the new scalable anon_vma linking by Rik.
Nothing else I can imagine should have introduced anything like it.

Rik: the pictures have the information, but you need to look at several to
see both the oops and the backtrace. Here's a condensed version:

shrink_all_memory ->
do_try_to_free_pages ->
shrink_zone ->
shrink_inactive_list ->
shrink_page_list ->
page_referenced

where page_referenced() oopses due to page_referenced_anon() as per
Borislav's description below.

Added all the usual suspects to the Cc list. Left the full report appended
so that the new people don't have to search for it on lkml.

Linus

On Fri, 2 Apr 2010, Borislav Petkov wrote:
>
> I've got the following oopsie two times now when hibernating - this
> means, I don't get it everytime I hibernate but only sometimes, say once
> in a blue moon.
>
> And yeah, I couldn't catch it over serial console so I had to make ugly
> pictures. By the way, the numbers in the filenames increment as I scroll
> down the whole oops (yep, it hadn't completely frozen and I still could
> do Shift->PgUp or Shift->PgDn on the console):
>
> http://www.kernel.org/pub/linux/kernel/people/bp/
>
> So, here's what I could decipher from the oopsie, someone else who's
> more knowledgeable in mm, rmap and anon_vma's list traversal should be
> able to tell what goes wrong there.
>
> EIP is at page_referenced+0xee
>
> which is
>
> <disasm>
> 10c4: 41 01 c4 add %eax,%r12d
> 10c7: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
> 10cb: 74 19 je 10e6 <page_referenced+0xff>
> 10cd: 4d 8b 6d 20 mov 0x20(%r13),%r13
> 10d1: 49 83 ed 20 sub $0x20,%r13
>
> 10d5: 49 8b 45 20 mov 0x20(%r13),%rax <--------------
>
> 10d9: 0f 18 08 prefetcht0 (%rax)
> 10dc: 49 8d 45 20 lea 0x20(%r13),%rax
> 10e0: 48 39 45 80 cmp %rax,-0x80(%rbp)
> </disasm>
>
>
> Corresponding asm:
>
> <asm>
> .loc 1 496 0
> movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.451
> .LVL295:
> subq $32, %r13 #, avc
> .LVL296:
> .L184:
> .LBE1278:
> movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <----------------
> prefetcht0 (%rax) # <variable>.same_anon_vma.next
> leaq 32(%r13), %rax #, tmp97
> cmpq %rax, -128(%rbp) # tmp97, %sfp
> jne .L187 #,
> .L186:
> .loc 1 514 0
> movq %r14, %rdi # anon_vma,
> call page_unlock_anon_vma #
> </asm>
>
>
> and the NULL pointer in question is being written into %r13 and then 32
> is subtracted from it (I'm guessing container_of()). This is consistent
> with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
> -32 and with the code dump in the oops, in CIMG1640.JPG code points to
> opcode 49 8b 45 20.
>
> Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.
>
> <source>
>
> mapcount = page_mapcount(page);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> unsigned long address = vma_address(page, vma);
> if (address == -EFAULT)
> continue;
>
> </source>
>
> which tells us that same_anon_vma.next is NULL. Hmm...
>
> --
> Regards/Gruss,
> Boris.
>

2010-04-02 18:26:43

by Andrew Morton

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <[email protected]> wrote:

>
> I think this is likely due to the new scalable anon_vma linking by Rik.

Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680

> Nothing else I can imagine should have introduced anything like it.
>
> Rik: the pictures have the information, but you need to look at several to
> see both the oops and the backtrace. Here's a condensed version:
>
> shrink_all_memory ->
> do_try_to_free_pages ->
> shrink_zone ->
> shrink_inactive_list ->
> shrink_page_list ->
> page_referenced
>
> where page_referenced() oopses due to page_referenced_anon() as per
> Borislav's description below.
>
> Added all the usual suspects to the Cc list. Left the full report appended
> so that the new people don't have to search for it on lkml.
>
> Linus

2010-04-02 18:41:47

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Fri, 2 Apr 2010, Andrew Morton wrote:

> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
>
> >
> > I think this is likely due to the new scalable anon_vma linking by Rik.
>
> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680

Yup, looks like the same thing, except that bugzilla entry was due to
swapping rather than hibernation and memory shrinking. But same end
result, just different reasons for why we were trying to shrink the page
lists.

Linus

2010-04-02 22:03:11

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> On Fri, 2 Apr 2010, Andrew Morton wrote:
>> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<[email protected]> wrote:
>>
>>>
>>> I think this is likely due to the new scalable anon_vma linking by Rik.
>>
>> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
>
> Yup, looks like the same thing, except that bugzilla entry was due to
> swapping rather than hibernation and memory shrinking. But same end
> result, just different reasons for why we were trying to shrink the page
> lists.

Interesting that it is a null pointer dereference, given
that we do not zero out the anon_vma_chain structs before
freeing them.

page_referenced_anon() takes the anon_vma->lock before
walking the list. In the three places where we modify the
anon_vma_chain->same_anon_vma list, we also hold the
lock.

No doubt something in mm/ is doing something silly, but
I have not found anything yet :(

If I had to guess, I'd say maybe we got one of the
mprotect & vma_adjust cases wrong. Maybe a page stayed
around in the LRU (and in a process?) after its anon_vma
already got freed?

There has to be a reason why a very heavy AIM7 workload
and some other stress tests did not trigger it, but a few
people are able to trigger it on their systems...

2010-04-03 00:24:19

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Fri, 2 Apr 2010, Rik van Riel wrote:
>
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
>
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list. The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.

So let's look at the individual anon_vma_chain entries instead.

What is the protection of the 'vma->anon_vma_chain' list? In
anon_vma_prepare(), the code implies that it is the page_table_lock, but
what about anon_vma_clone()? If I'm reading it correctly, it is some odd
mix of "mmap_sem held for writing" or "mmap_sem held for reading _and_
page_table_lock". And then we have the exit case that apparently has no
locking at all, but that should hopefully be single-threaded.

That thing is subtle. A few more comments about the locking would be good,
so that people like me wouldn't have to try to guess the rules from
reading the source.

> There has to be a reason why a very heavy AIM7 workload
> and some other stress tests did not trigger it, but a few
> people are able to trigger it on their systems...

I don't think AIM7 is at all a very interesting workload, and not likely
to stress anything at all. Did your AIM7 test actually cause heavy
swapping? I doubt it.

Page swapout is where a lot of the magic happens, since that happens
without mmap_sem held etc.

Linus

2010-04-04 16:13:12

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

Hi, Rik.

On Fri, 2010-04-02 at 18:01 -0400, Rik van Riel wrote:
> On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> > On Fri, 2 Apr 2010, Andrew Morton wrote:
> >> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<[email protected]> wrote:
> >>
> >>>
> >>> I think this is likely due to the new scalable anon_vma linking by Rik.
> >>
> >> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
> >
> > Yup, looks like the same thing, except that bugzilla entry was due to
> > swapping rather than hibernation and memory shrinking. But same end
> > result, just different reasons for why we were trying to shrink the page
> > lists.
>
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
>
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list. The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.
>
> No doubt something in mm/ is doing something silly, but
> I have not found anything yet :(
>
> If I had to guess, I'd say maybe we got one of the
> mprotect & vma_adjust cases wrong. Maybe a page stayed
> around in the LRU (and in a process?) after its anon_vma
> already got freed?

While reviewing the code again because of this BUG, I found something
strange.

In anon_vma_fork(), what happens if anon_vma_clone() succeeds but
anon_vma_alloc() fails? The parent VMA's anon_vmas are left with
anon_vma_chain entries that point to a vma which has been destroyed.
I couldn't find any cleanup routine that removes this garbage.
Am I missing something?

But I don't think it is related to this bug, because the oops point is
not vma_address() but anon_vma_chain.next.



--
Kind regards,
Minchan Kim

2010-04-04 17:25:33

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/04/2010 12:12 PM, Minchan Kim wrote:

> While I review the code again due to this BUG, I found some strange
> thing.
>
> In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is
> failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which
> has vma which is destroyed.
> I couldn't find any clean routine to remove this garbage.
> I am missing something?

Good catch. The parent VMA's anon_vmas will get delinked
eventually, but we need to get rid of the newly allocated
child anon_vmas. You found a hopefully rare memory leak...

We need a call to unlink_anon_vmas(vma) at the error label
to do that.

> But I think it isn't related to this bug because oops point is not
> vma_address but anon_vma_chain.next.

Agreed, it's probably not it.

2010-04-04 23:10:57

by Rik van Riel

Subject: [PATCH] rmap: fix anon_vma_fork() memory leak

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

Reported-by: Minchan Kim <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..fb7ce99 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)

  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
+	unlink_anon_vmas(vma);
  out_error:
 	return -ENOMEM;
 }

2010-04-04 23:56:40

by Minchan Kim

Subject: Re: [PATCH] rmap: fix anon_vma_fork() memory leak

On Mon, Apr 5, 2010 at 8:09 AM, Rik van Riel <[email protected]> wrote:
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-04-05 15:42:06

by Linus Torvalds

Subject: Re: [PATCH] rmap: fix anon_vma_fork() memory leak



On Sun, 4 Apr 2010, Rik van Riel wrote:
>
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> ---
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..fb7ce99 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>
>   out_error_free_anon_vma:
>  	anon_vma_free(anon_vma);
> +	unlink_anon_vmas(vma);
>   out_error:
>  	return -ENOMEM;
>  }

This looks _very_ wrong to me.

Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we
should do it even if the "anon_vma_alloc()" failed, not just if the
"anon_vma_chain_alloc()" failed?

No?

What am I missing?

Linus

2010-04-05 15:54:36

by Minchan Kim

Subject: Re: [PATCH] rmap: fix anon_vma_fork() memory leak

On Tue, Apr 6, 2010 at 12:37 AM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Sun, 4 Apr 2010, Rik van Riel wrote:
>>
>> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
>> anon_vmas attached to the new VMA in case setting up the new anon_vma
>> fails.
>>
>> Reported-by: Minchan Kim <[email protected]>
>> Signed-off-by: Rik van Riel <[email protected]>
>> Reviewed-by: Minchan Kim <[email protected]>
>> ---
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index fcd593c..fb7ce99 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>
>>   out_error_free_anon_vma:
>>       anon_vma_free(anon_vma);
>> +     unlink_anon_vmas(vma);
>>   out_error:
>>       return -ENOMEM;
>>  }
>
> This looks _very_ wrong to me.
>
> Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we
> should do it even if the "anon_vma_alloc()" failed, not just if the
> "anon_vma_chain_alloc()" failed?
>
> No?
>
> What am I missing?

Indeed. You're right.
I should have reviewed it more carefully.



--
Kind regards,
Minchan Kim

2010-04-05 16:05:59

by Rik van Riel

Subject: Re: [PATCH] rmap: fix anon_vma_fork() memory leak

On 04/05/2010 11:37 AM, Linus Torvalds wrote:

> This looks _very_ wrong to me.
>
> Shouldn't the unlink_anon_vmas() be in the "out_error" case?

Indeed it should. I've had my mind somewhere else this weekend :/

New patch in the next mail.

2010-04-05 16:15:20

by Rik van Riel

Subject: [PATCH -v2] rmap: fix anon_vma_fork() memory leak

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

This bug also has the potential to leave behind anon_vma_chain structs
with pointers to invalid memory.

Reported-by: Minchan Kim <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..eaa7a09 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -232,6 +232,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
  out_error:
+	unlink_anon_vmas(vma);
 	return -ENOMEM;
 }
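
For readers following the label ordering, here is a small userspace sketch
of the fall-through unwind pattern the -v2 placement relies on. It is not
the kernel function: fork_unwind_sketch(), struct unwind_stats, enum
fail_point and unwind_demo() are hypothetical names, and the counters merely
stand in for the anon_vma_free() and unlink_anon_vmas() calls. The point is
that cleanup placed under the last label runs on every failure path, while
cleanup placed above it only runs for the later failure.

```c
#include <assert.h>
#include <errno.h>

/* Counters standing in for the real side effects. */
struct unwind_stats {
	int unlinked;	/* times the unlink_anon_vmas() stand-in ran */
	int freed;	/* times the anon_vma_free() stand-in ran */
};

enum fail_point { FAIL_NONE, FAIL_ANON_VMA_ALLOC, FAIL_CHAIN_ALLOC };

int fork_unwind_sketch(enum fail_point fail, struct unwind_stats *st)
{
	/* Stage 0 (assumed done): the clone step linked chains to the VMA. */

	/* Stage 1: stand-in for anon_vma_alloc() failing. */
	if (fail == FAIL_ANON_VMA_ALLOC)
		goto out_error;

	/* Stage 2: stand-in for anon_vma_chain_alloc() failing. */
	if (fail == FAIL_CHAIN_ALLOC)
		goto out_error_free_anon_vma;

	return 0;

out_error_free_anon_vma:
	st->freed++;		/* undo stage 1 only */
out_error:
	st->unlinked++;		/* undo stage 0 on every failure path */
	return -ENOMEM;
}

/* Self-check: the early failure still unlinks, but frees nothing. */
void unwind_demo(void)
{
	struct unwind_stats st = { 0, 0 };

	assert(fork_unwind_sketch(FAIL_ANON_VMA_ALLOC, &st) == -ENOMEM);
	assert(st.unlinked == 1 && st.freed == 0);
}
```

Had the unlink step been placed above the out_error label, as in the first
version of the patch, the FAIL_ANON_VMA_ALLOC path would skip it.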

2010-04-06 08:53:57

by KOSAKI Motohiro

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

>
> I think this is likely due to the new scalable anon_vma linking by Rik.
> Nothing else I can imagine should have introduced anything like it.
>
> Rik: the pictures have the information, but you need to look at several to
> see both the oops and the backtrace. Here's a condensed version:
>
> shrink_all_memory ->
> do_try_to_free_pages ->
> shrink_zone ->
> shrink_inactive_list ->
> shrink_page_list ->
> page_referenced
>
> where page_referenced() oopses due to page_referenced_anon() as per
> Borislav's description below.
>
> Added all the usual suspects to the Cc list. Left the full report appended
> so that the new people don't have to search for it on lkml.

Today I reviewed this patch carefully, but I haven't found any bug.

1) The anon_vma->head list is always protected by anon_vma->lock.
2) Even if someone forgets to take the lock, list_add() and list_del()
   never assign NULL.

So a NULL can only mean one of three possibilities:

a) we see uninitialized data
b) we see data after it has been freed
c) we see memory corruption caused by another bug

But (a) can't happen because

static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;	/* (*) */
}

If an uninitialized entry were linked into the avc list, its new->next
would already have been set to a non-NULL value by then.

(b) is also impossible. SLAB_DESTROY_BY_RCU delays freeing of the page
backing the anon_vma until the next RCU grace period. That means
rcu_read_lock() + page_mapped() can see a kfree()ed object, but that is
safe; no one corrupts it.

So now I suspect (c) ;-)
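
As a rough illustration of point 2), here is a userspace sketch of the
list helpers. The function names (list_init_sketch(), list_add_sketch(),
list_del_sketch(), count_null_links_after_add_del()) are made up for this
example, and the poison constants are stand-ins modelled on the kernel's
LIST_POISON1/LIST_POISON2, so treat the exact values as an assumption. It
only shows that none of the individual stores writes NULL; it says nothing
about what unlocked concurrent updates could do to the list as a whole.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Stand-ins for the kernel's list poison values (assumed, non-NULL). */
#define POISON_NEXT ((struct list_head *)0x00100100)
#define POISON_PREV ((struct list_head *)0x00200200)

void list_init_sketch(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert "new" right after "head"; every store writes a real address. */
void list_add_sketch(struct list_head *new, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = new;
	new->next = next;
	new->prev = head;
	head->next = new;
}

/* Unlink "entry" and poison its links, as list_del() does. */
void list_del_sketch(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = POISON_NEXT;
	entry->prev = POISON_PREV;
}

/* Count NULL links seen after a few add/del operations. */
int count_null_links_after_add_del(void)
{
	struct list_head head, a, b;	/* a and b start uninitialized */
	int nulls = 0;

	list_init_sketch(&head);
	list_add_sketch(&a, &head);
	list_add_sketch(&b, &head);
	assert(head.next == &b && b.next == &a && a.next == &head);
	nulls += !head.next + !head.prev + !a.next + !a.prev + !b.next + !b.prev;

	list_del_sketch(&a);
	nulls += !head.next + !head.prev + !a.next + !a.prev + !b.next + !b.prev;
	return nulls;
}
```

In this model a removed entry ends up holding the poison values rather than
NULL, which is why a NULL .next looks surprising here.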



Also, I ran a stress workload with shrink_all_memory() today, but
I couldn't reproduce the issue. Hmm... (perhaps I'm just not a lucky guy;
I frequently fail to reproduce things)

I'll keep working on it.

2010-04-06 10:09:18

by KOSAKI Motohiro

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
> can see kfree()ed page. but it is safe. noone corrupt it.

By the way: I don't understand why Rik's per-process anon_vma concept
works correctly with KSM. KSM increases anon_vma->ksm_refcount, but it does
not seem guaranteed that vma->anon_vma and page->anon_vma are the same.

But I guess the bug reporter doesn't use KSM; it's a minor feature.


2010-04-06 14:36:20

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 06:09 AM, KOSAKI Motohiro wrote:
>> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
>> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
>> can see kfree()ed page. but it is safe. noone corrupt it.
>
> by the way: I haven't understand why rik's per process anon_vma concept
> works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems
> not guranteed vma->anon_vma and page->anon_vma are the same.

KSM removes the page from its original anon_vma.

If the page gets reinstantiated (copy on write), it will be
created in the vma->anon_vma.

Am I overlooking something?

2010-04-06 14:39:49

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:

> Today, I've reviewed this patch carefully. but I haven't found any bug.

> Also, I've runned stress workload with shrink_all_memory() today. but
> I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy.
> I'm frequently fail to reproduce)
>
> I'll continue to work.

My status with this bug is the same - I have gone through
the code from all angles, but have not found any other bugs
yet (except for that leak - which could leave invalid
pointers behind).

This makes me wonder if perhaps the bug is a side effect
of something Borislav (and the other reproducers) have
in their kernel configuration, which we do not have.

Another (unlikely) thing is that the fix for the leak
makes the bug go away. Yes, very unlikely.

Borislav, could you please send us your .config ?

Also, if you have the time, could you try out the
patch (-v2) I mailed a little further up this thread,
which fixes the memory leak in anon_vma_fork()?

I suspect it should not change anything, but it
could be useful to rule out anyway.

2010-04-06 15:35:13

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 10:38 -0400, Rik van Riel wrote:
> On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:
>
> > Today, I've reviewed this patch carefully. but I haven't found any bug.
>
> > Also, I've runned stress workload with shrink_all_memory() today. but
> > I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy.
> > I'm frequently fail to reproduce)
> >
> > I'll continue to work.
>
> My status with this bug is the same - I have gone through
> the code from all angles, but have not found any other bugs
> yet (except for that leak - which could leave invalid
> pointers behind).

Let's look at unlink_anon_vmas().

1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
2. anon_vma_unlink
3. spin_lock(anon_vma->lock) <-- HERE LOCK.
4. list_del(anon_vma_chain->same_anon_vma);

What if the anon_vma is destroyed and reused by SLAB_XXX_RCU for another
anon_vma object between 2 and 3?
I mean, how do we make sure that 3) locks a valid anon_vma?

I hope this is the culprit.


--
Kind regards,
Minchan Kim

2010-04-06 15:42:11

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 11:34 AM, Minchan Kim wrote:

> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. anon_vma_unlink
> 3. spin_lock(anon_vma->lock)<-- HERE LOCK.
> 4. list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

How can the anon_vma get destroyed and reused, when this
anon_vma_chain still has a reference to it (and the
anon_vma has not been freed yet)?

What combination of circumstances is necessary for
your hypothetical bug to happen?

2010-04-06 15:59:16

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 11:40 -0400, Rik van Riel wrote:
> On 04/06/2010 11:34 AM, Minchan Kim wrote:
>
> > Let's see the unlink_anon_vmas.
> >
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. anon_vma_unlink
> > 3. spin_lock(anon_vma->lock)<-- HERE LOCK.
> > 4. list_del(anon_vma_chain->same_anon_vma);
> >
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma?
> >
> > I hope it is culprit.
>
> How can the anon_vma get destroyed and reused, when this
> anon_vma_chain still has a reference to it (and the

Doesn't anon_vma_chain have a ref counter on anon_vma?

> anon_vma has not been freed yet)?

AFAIK, anon_vma can be reused without free by SLAB_XXX_RCU.
So we always use it carefully by page_lock_anon_vma or manual check
with RCU and page_mapped.

What am I missing?

>
> What combination of circumstances is necessary for
> your bug hypothetical to happen?


CPU A CPU B

unlink_anon_vmas
list_for_each_entry

free_pgtable
anon_vma_unlink
<crazy stall> spin_lock(anon_vma);
list_del(same_anon_vma)
spin_unlock(anon_vma)
anon_vma_unlink
anon_vma_free
reuse for another anon_vma
spin_lock(another anon_vma)
list_del(another anon_vma)

If my assumption is wrong, please correct me.
Thanks, Rik.

--
Kind regards,
Minchan Kim

2010-04-06 16:00:01

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. anon_vma_unlink
> 3. spin_lock(anon_vma->lock) <-- HERE LOCK.
> 4. list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

I don't think so. That isn't the racy case. We're working with an
anon_vma_chain, so the anon_vma is all there.

The racy case is when we look up an anonvma by the page, and the page gets
unmapped at the same time because somebody else is travelling over the LRU
list of the page itself, isn't it?

I do wonder if "page_lock_anon_vma()" should check the whole
"page_mapped()" case _after_ taking the anon_vma lock. Because if the race
happens, we're following an anon_vma list that has nothing to do with that
page (it's still a _valid_ list, since we locked the anon_vma, but will it
be ok?)

IOW, what is it that really keeps the anon_vma list reliable _and_
relevant wrt the page? We know we may get a stale anon_vma, are we ok if
that anon_vma list doesn't actually have anything to do with the page any
more?

I think the first check in "page_address_in_vma()" protects us, but
whatever.

However, that made me look at the PAGE_MIGRATION case. That seems to be
just broken. It's doing that page_anon_vma() + spin_lock without holding
any RCU locks, so there is no guarantee that anon_vma there is at all
valid.

Is that function always called with rcu_read_lock()?

Linus

2010-04-06 16:24:01

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

Hi, Linus.

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > Let's see the unlink_anon_vmas.
> >
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. anon_vma_unlink
> > 3. spin_lock(anon_vma->lock) <-- HERE LOCK.
> > 4. list_del(anon_vma_chain->same_anon_vma);
> >
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma?
> >
> > I hope it is culprit.
>
> I don't think so. That isn't the racy case. We're working with a
> anon_vma_chain, so the anonvma is all there.
>

But the anon_vma is now being used as another anon_vma.
Nonetheless, anon_vma_unlink() still does list_del() on that anon_vma's
same_anon_vma entry. That is what I suspect.

> The racy case is when we look up an anonvma by the page, and the page gets
> unmapped at the same time because somebody else is travelling over the LRU
> list of the page itself, isn't it?

Yes, but I thought the page walk might go through anon_vmas whose
same_anon_vma entries had been deleted by the race.

>
> I do wonder if "page_lock_anon_vma()" should check the whole
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race
> happens, we're following a anon_vma list that has nothing to do with that
> page (it's still a _valid_ list, since we locked the anon_vma, but will it
> be ok?)

That is why we always use it together with vma_address() and
page_check_address(), to validate the anon_vma.
But I don't think that is a good design. I would prefer to take the lock
before checking page_mapped(), but maybe that has a performance cost? I am
not sure.

>
> IOW, what is it that really keeps the anon_vma list reliable _and_
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if
> that anon_vma list doesn't actually have anything to do with the page any
> more?
> I think the first check in "page_address_in_vma()" protects us, but
> whatever.
>
> However, that made me look at the PAGE_MIGRATION case. That seems to be
> just broken. It's doing that page_anon_vma() + spin_lock without holding
> any RCU locks, so there is no guarantee that anon_vma there is at all
> valid.

FYI, there was a recent patch about the migration case.
http://lkml.org/lkml/2010/4/2/145


>
> Is that function always called with rcu_read_lock()?
>
> Linus


--
Kind regards,
Minchan Kim

2010-04-06 16:33:17

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > However, that made me look at the PAGE_MIGRATION case. That seems to be
> > just broken. It's doing that page_anon_vma() + spin_lock without holding
> > any RCU locks, so there is no guarantee that anon_vma there is at all
> > valid.
>
> FYI, there was a recent patch about the migration case.
> http://lkml.org/lkml/2010/4/2/145

No, I'm talking about rmap_walk_anon():

	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	spin_lock(&anon_vma->lock);

which seems to be simply buggy. The anon_vma may not exist any more,
because an RCU event might have really freed the page between looking it
up and locking it.

Linus

2010-04-06 16:36:56

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > I don't think so. That isn't the racy case. We're working with a
> > anon_vma_chain, so the anonvma is all there.
>
> But the anon_vma might already be in use as another anon_vma.

No, that can only happen if somebody has done "anon_vma_free()" on it. And
nobody does that if the anonvma still has a non-empty '&anon_vma->head'.

So as long as the anon_vma has an anon_vma_chain entry associated with it
(or a ksm refcount, but that's a separate issue), it's not going to be
re-allocated for any other use, because it's not going to be free'd.

Linus

2010-04-06 16:45:47

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 09:28 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > >
> > > However, that made me look at the PAGE_MIGRATION case. That seems to be
> > > just broken. It's doing that page_anon_vma() + spin_lock without holding
> > > any RCU locks, so there is no guarantee that anon_vma there is at all
> > > valid.
> >
> > FYI, there was a recent patch about the migration case.
> > http://lkml.org/lkml/2010/4/2/145
>
> No, I'm talking about rmap_walk_anon():
>
> 	anon_vma = page_anon_vma(page);
> 	if (!anon_vma)
> 		return ret;
> 	spin_lock(&anon_vma->lock);
>
> which seems to be simply buggy. The anon_vma may not exist any more,
> because an RCU event might have really freed the page between looking it
> up and locking it.
>
> Linus

unmap_and_move
remove_migration_ptes
rmap_walk
rmap_walk_anon

We always hold rcu_read_lock for anon pages in unmap_and_move.
So I think it's not buggy. What am I missing?


--
Kind regards,
Minchan Kim

2010-04-06 16:54:47

by Minchan Kim

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 09:32 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > >
> > > I don't think so. That isn't the racy case. We're working with a
> > > anon_vma_chain, so the anonvma is all there.
> >
> > But the anon_vma might already be in use as another anon_vma.
>
> No, that can only happen if somebody has done "anon_vma_free()" on it. And
> nobody does that if the anonvma still has a non-empty '&anon_vma->head'.
>
> So as long as the anon_vma has an anon_vma_chain entry associated with it
> (or a ksm refcount, but that's a separate issue), it's not going to be
> re-allocated for any other use, because it's not going to be free'd.
>

> Linus

That's what I am missing.
Thanks, Linus.

I will think over the problem. :)

--
Kind regards,
Minchan Kim

2010-04-06 16:58:18

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> unmap_and_move
> remove_migration_ptes
> rmap_walk
> rmap_walk_anon
>
> We always hold rcu_read_lock for anon pages in unmap_and_move.
> So I think it's not buggy. What am I missing?

Ok, in that case it's fine.

However, it does bring back my comment about all those anonvma changes:
the locking is totally undocumented.

Why isn't there a thing _saying_ that it's ok because of this?

Why is there no comment about the locking of that 'same_vma' /
'vma->anon_vma_chain' except for the totally nonsensical one about
page_table_lock (which doesn't protect _any_ of the other cases)?

Linus

2010-04-06 17:06:09

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Rik van Riel
Date: Tue, Apr 06, 2010 at 10:38:18AM -0400

> This makes me wonder if perhaps the bug is a side effect
> of something Borislav (and the other reproducers) have
> in their kernel configuration, which we do not have.
>
> Another (unlikely) thing is that the fix for the leak
> makes the bug go away. Yes, very unlikely.
>
> Borislav, could you please send us your .config ?

attached.

> Also, if you have the time, could you try out the
> patch (-v2) I mailed in a little up this thread
> that fixes the memory leak in anon_vma_fork?

Sure, building on top of v2.6.34-rc3-288-gab195c5.

Will try to trigger it, but let me remind you that it will take a while
since it doesn't happen every time I suspend.

Any other printks or debug output which might be helpful to slap at the
site, page_referenced_anon() I mean?

> I suspect it should not change anything, but it
> could be useful to rule out anyway.
>

--
Regards/Gruss,
Boris.


Attachments:
(No filename) (975.00 B)
config-2.6.34-rc3 (61.15 kB)

2010-04-06 17:06:21

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 12:53 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Minchan Kim wrote:
>>
>> unmap_and_move
>> remove_migration_ptes
>> rmap_walk
>> rmap_walk_anon
>>
>> We always hold rcu_read_lock for anon pages in unmap_and_move.
>> So I think it's not buggy. What am I missing?
>
> Ok, in that case it's fine.
>
> However, it does bring back my comment about all those anonvma changes:
> the locking is totally undocumented.
>
> Why isn't there a thing _saying_ that it's ok because of this?
>
> Why is there no comment about the locking of that 'same_vma' /
> 'vma->anon_vma_chain' except for the totally nonsensical one about
> page_table_lock (which doesn't protect _any_ of the other cases)?

Which other cases? When do we ever walk the "same_vma" list
not from the context of the process owning the vma?

This bug in page_referenced is walking the "same_anon_vma" list,
which is locked with the anon_vma->lock.

2010-04-06 18:33:45

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Rik van Riel wrote:
>
> Which other cases? When do we ever walk the "same_vma" list
> not from the context of the process owning the vma?

That's the point. What does 'owning the vma' mean? That's exactly what I'm
asking to be documented.

Quite frankly, the thing is a mess. There is _no_ comment on why it's ok
to modify the list or walk the list, except for the one totally misleading
one, since the page_table_lock has at most a _secondary_ meaning in the
whole ownership (ie it is used only when we do _not_ own the vma chain
exclusively).

So your very comment shows the whole confusion. No, we do not "own the
vma" in all cases. Sometimes we just have a read-lock on it.

> This bug in page_referenced is walking the "same_anon_vma" list,
> which is locked with the anon_vma->lock.

Umm. Wake the hell up, Rik!

It's walking a _corrupt_ same_anon_vma list. In other words, we _know_
that the 'anon_vma_chain' entry is crap. We know that exactly because it
contains "impossible" values with regard to the list.

And what's the easiest way to get such a corrupt list, considering that
the locking looks correct for that particular list?

That's right: by having something like anon_vma_clone() do something bad
when it walks the same avc entries using the 'same_vma' list and creates
copies of it.

You can't just say "but but but same_anon_vma list is always locked
properly". Because it doesn't matter if that list is locked properly if
walking _another_ list doesn't work right.

I really don't understand why you keep on harping on that same_anon_vma
list. The fact that that was the corrupt list IN ABSOLUTELY NO WAY implies
that that is the list that caused the corruption.

For example, let's say that the 'anon_vma_chain' list is corrupted. Never
mind how. So what could happen is that you'd have vma->anon_vma pointing
to one thing, and one or more entries on the 'vma->anon_vma_chain' list
pointing to _another_ anon_vma.

What happens then? I have no idea. Maybe nothing bad. But the point is, if
one avc list is corrupted and you may end up referencing those avc's in
unexpected cases, how can you trust the other list that is in the same
data structure?

For example, maybe some list corruption causes us to do that
"anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
"list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
already had "same_anon_vma" on one list.

No, I really don't see how that could happen, but my argument is that a
corrupt list can do odd things. The same entry might end up pointing to
itself, so that you end up freeing it twice or something.

Just as an example of the kind of code that makes me worry:

	void unlink_anon_vmas(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc, *next;

		/* Unlink each anon_vma chained to the VMA. */
		list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
			anon_vma_unlink(avc);
			list_del(&avc->same_vma);
			anon_vma_chain_free(avc);
		}
	}

Now, think about what happens for the *last* entry in that avc chain. It
will call that "anon_vma_unlink()" thing, which will delete perhaps the
last entry in the "same_anon_vma" one, and then it does

	if (empty)
		anon_vma_free(anon_vma);

*before* unlink_anon_vmas() has actually done that

	list_del(&avc->same_vma);

and what we essentially have is a stale anon_vma_chain entry that still
exists on that same_vma list, and points to an anon_vma that already got
deleted.

Does it matter? I really can't see that it does. But that's the kind of
thing that makes me nervous. It makes me _especially_ nervous when the
whole locking for that anon_vma_chain thing isn't entirely obvious.

Linus
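
The ordering concern in the mail above can be modelled in userspace. What follows is a minimal sketch, not kernel code: the list helpers are cut-down stand-ins, the two structs are simplified, and the names unlink_last_entry() and demo_stale_window() are hypothetical. A "freed" flag stands in for anon_vma_free(). The sketch only shows that, with this ordering, the entry is still linked on the vma's same_vma list at the moment the anon_vma is released; it says nothing about whether that window is actually harmful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's list helpers. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    struct list_head *prev = head->prev;

    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
    head->prev = entry;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = NULL;   /* the kernel writes poison values here */
}

/* Hypothetical, simplified models of the two structures. */
struct anon_vma { struct list_head head; bool freed; };
struct anon_vma_chain {
    struct anon_vma *anon_vma;
    struct list_head same_vma;       /* linked from the vma */
    struct list_head same_anon_vma;  /* linked from the anon_vma */
};

/*
 * Mirrors the order quoted above: drop the entry from the anon_vma's
 * list, "free" the anon_vma if that emptied it, and only afterwards
 * take the entry off the vma's same_vma list. Returns true if the
 * entry was still on same_vma when the anon_vma was marked freed.
 */
static bool unlink_last_entry(struct list_head *vma_chain,
                              struct anon_vma_chain *avc)
{
    bool stale_window = false;

    list_del(&avc->same_anon_vma);
    if (list_empty(&avc->anon_vma->head)) {
        avc->anon_vma->freed = true;            /* stands in for anon_vma_free() */
        stale_window = !list_empty(vma_chain);  /* avc still reachable via the vma */
    }
    list_del(&avc->same_vma);
    return stale_window;
}

/* Builds one vma chain holding a single avc and runs the unlink. */
static bool demo_stale_window(void)
{
    struct list_head vma_chain;
    struct anon_vma av = { .freed = false };
    struct anon_vma_chain avc = { .anon_vma = &av };

    init_list_head(&vma_chain);
    init_list_head(&av.head);
    list_add_tail(&avc.same_vma, &vma_chain);
    list_add_tail(&avc.same_anon_vma, &av.head);

    bool stale = unlink_last_entry(&vma_chain, &avc);
    assert(av.freed && list_empty(&vma_chain));
    return stale;
}
```

Swapping the two list_del() calls inside unlink_last_entry() would close the window in this model, at the cost of diverging from the order in the quoted function.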

2010-04-06 19:04:07

by Andrew Morton

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
Linus Torvalds wrote:

> For example, maybe some list corruption causes us to do that
> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> already had "same_anon_vma" on one list.

The lib/list_debug.c stuff might detect such things. I wonder if
either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

2010-04-06 19:15:04

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Andrew Morton wrote:

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds wrote:
>
> > For example, maybe some list corruption causes us to do that
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> > already had "same_anon_vma" on one list.
>
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Well, even without CONFIG_DEBUG_LIST we'd catch _some_ things, and
conversely, even with DEBUG_LIST on we don't catch everything.

For example, doing list_del() twice on the same entry will die with a
really nice pattern due to poisoning even without DEBUG_LIST.

But list_add() twice on the same entry will sadly silently succeed both
with and without list debugging (the list debugging will check the target
list head, but there is no way to check the "new->next/prev" entries).

Anyway, I've not actually found anything wrong in the same_vma locking.
And I'm not at all convinced there is any list corruption there. My point
was really only that
(a) the locking rules seem very unclear and certainly not documented and
(b) corruption of one list could easily be the cause of corruption of
another list of the same structure.
but I don't actually see anything wrong anywhere.

Linus
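
Both list behaviours described above can be reproduced with a small userspace model. This is a sketch under stated assumptions: the helpers imitate the shape of the kernel's list_add_tail() and list_del() but are not the kernel's code, the poison constants are illustrative (the kernel's real values live in include/linux/poison.h and include an architecture-specific offset), and bounded_walk(), demo_double_add() and demo_del_poison() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Illustrative poison values, not the kernel's exact ones. */
#define DEMO_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define DEMO_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    struct list_head *prev = head->prev;

    /* Only head and head->prev are consulted; entry's old links are overwritten. */
    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
    head->prev = entry;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = DEMO_POISON1;   /* a second list_del() would dereference these */
    e->prev = DEMO_POISON2;
}

/*
 * Walk forward from head. Returns the number of entries, or -1 if the
 * walk does not get back to head within max_steps.
 */
static int bounded_walk(const struct list_head *head, int max_steps)
{
    const struct list_head *pos = head->next;
    int n = 0;

    while (pos != head) {
        if (++n > max_steps)
            return -1;
        pos = pos->next;
    }
    return n;
}

/*
 * Adds one node to list a, then (wrongly) to list b as well.
 * Returns the result of walking a; the result of walking b goes to *walk_b.
 */
static int demo_double_add(int *walk_b)
{
    struct list_head a, b, node;

    init_list_head(&a);
    init_list_head(&b);
    list_add_tail(&node, &a);
    list_add_tail(&node, &b);   /* succeeds silently */

    *walk_b = bounded_walk(&b, 16);
    return bounded_walk(&a, 16);
}

/* Deletes a node once and reports whether its pointers were poisoned. */
static int demo_del_poison(void)
{
    struct list_head a, node;

    init_list_head(&a);
    list_add_tail(&node, &a);
    list_del(&node);
    return node.next == DEMO_POISON1 && node.prev == DEMO_POISON2;
}
```

In this model list b looks perfectly healthy after the second add, while a walk of list a wanders into b and never returns to a's head. That is one way a correctly locked list can end up corrupt because of a mistake made through a different list.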

2010-04-06 19:37:26

by Steinar H. Gunderson

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, Apr 06, 2010 at 12:03:15PM -0700, Andrew Morton wrote:
>> For example, maybe some list corruption causes us to do that
>> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
>> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
>> already had "same_anon_vma" on one list.
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Not set on my kernel.

/* Steinar */
--
Homepage: http://www.sesse.net/

2010-04-06 19:39:57

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Linus Torvalds wrote:
>
> Anyway, I've not actually found anything wrong in the same_vma locking.
> And I'm not at all convinced there is any list corruption there. My point
> was really only that
> (a) the locking rules seem very unclear and certainly not documented and
> (b) corruption of one list could easily be the cause of corruption of
> another list of the same structure.
> but I don't actually see anything wrong anywhere.

I _have_ found what looks like a few clues, though.

In particular, the disassembly in Steinar Gunderson's case looks much more
like the disassembly I get, and if I read that correctly, it's actually
the _first_ iteration of the for_each_entry() loop that crashes.

Why do I think so?

In Steinar's oops, we have "RAX: ffff880169111fc8", which is clearly a
kernel pointer. However, the code from Steinar's oops decodes to:

0: 3b 56 10 cmp 0x10(%rsi),%edx
3: 73 1e jae 0x23
5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx
9: 74 18 je 0x23
b: 4d 89 f8 mov %r15,%r8
e: 48 8d 4d cc lea -0x34(%rbp),%rcx
12: 4c 89 e7 mov %r12,%rdi
15: e8 44 f2 ff ff callq 0xfffffffffffff25e
1a: 41 01 c5 add %eax,%r13d
1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
21: 74 19 je 0x3c
23: 48 8b 43 20 mov 0x20(%rbx),%rax
27: 48 8d 58 e0 lea -0x20(%rax),%rbx
2b:* 48 8b 43 20 mov 0x20(%rbx),%rax <-- trapping instruction
2f: 0f 18 08 prefetcht0 (%rax)
32: 48 8d 43 20 lea 0x20(%rbx),%rax
36: 48 39 45 88 cmp %rax,-0x78(%rbp)
3a: 75 a7 jne 0xffffffffffffffe3
3c: 41 fe 06 incb (%r14)
3f: e9 .byte 0xe9

which matches my code pretty well, and the point is, _if_ it went through
the loop, then %rbx should be %rax-0x20. And it's not.

IOW, the code you see above before the trapping instruction is the end of
the loop: it's the

		referenced += page_referenced_one(page, vma, address,
						  &mapcount, vm_flags);
		if (!mapcount)
			break;
	}

part (the "callq" and "add %eax" is that "referenced +=", and %r13d is
"referenced").

What you cannot see from the code decode is the loop setup and _entry_,
which looks like this for me:

movl 12(%rbx), %eax # <variable>.D.11299._mapcount.counter, D.33294
xorl %r12d, %r12d # referenced
incl %eax # tmp89
movl %eax, -52(%rbp) # tmp89, mapcount
leaq 48(%r14), %rax #,
movq 48(%r14), %r13 # <variable>.head.next, <variable>.head.next
movq %rax, -128(%rbp) #, %sfp
subq $32, %r13 #, avc
jmp .L167 #

where that "L167" is actually the oopsing instruction (ie the "while" loop
has been turned around, and we jump to the end of the loop that does the
loop end test).

In other words, what is NULL here is not an anon_vma_chain entry, but
actually the initial "anon_vma->head.next" pointer.

The whole _head_ of the list has never been initialized, in other words.

So we can entirely ignore the 'anon_vma_chain' issues. We need to look at
the initializations of the 'anon_vma's themselves.

Linus
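
The address arithmetic behind that conclusion can be checked in userspace without dereferencing anything. This is a sketch, with assumptions: the struct below is a simplified stand-in whose field order is chosen so that same_anon_vma lands at offset 0x20 on a 64-bit build, matching the disassembly, and the helper names are hypothetical. A zero-filled list head models the "never initialized" case; whether the real head was zeroed or simply never set up is not established in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct list_head { struct list_head *next, *prev; };

/* Simplified stand-in; only the field order matters here. */
struct anon_vma_chain {
    void *vma;
    void *anon_vma;
    struct list_head same_vma;
    struct list_head same_anon_vma;
};

#define LINK_OFFSET offsetof(struct anon_vma_chain, same_anon_vma)

/*
 * The arithmetic container_of() performs: link address minus the
 * member offset. Kept in uintptr_t so nothing is dereferenced.
 */
static uintptr_t avc_from_link(uintptr_t link_addr)
{
    return link_addr - LINK_OFFSET;
}

/* Address the loop loads from to fetch avc->same_anon_vma.next. */
static uintptr_t next_field_addr(uintptr_t avc_addr)
{
    return avc_addr + LINK_OFFSET + offsetof(struct list_head, next);
}

/*
 * Models the first iteration over a list head that was zero-filled
 * instead of being initialized to point at itself. Returns the
 * address that the first load would touch.
 */
static uintptr_t first_load_addr_zeroed_head(void)
{
    struct list_head head;

    memset(&head, 0, sizeof(head));   /* head.next is a null pointer */
    return next_field_addr(avc_from_link((uintptr_t)head.next));
}
```

With 8-byte pointers, avc_from_link(0) is 0xffffffffffffffe0, the value seen in %r13 in Borislav's oops, and the first load then touches address 0, which matches CR2.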

2010-04-06 19:42:53

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Andrew Morton
Date: Tue, Apr 06, 2010 at 12:03:15PM -0700

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds wrote:
>
> > For example, maybe some list corruption causes us to do that
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> > already had "same_anon_vma" on one list.
>
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

No, it is off in my .config. I'll turn it on and retest to see whether
it screams something. In the meantime, I've been testing current git
(v2.6.34-rc3-288-gab195c5), and especially Rik's mem leak fix which
Linus already committed (4946d54cb55e86a156216fcfeed5568514b0830f) and
tried to retrigger the bug by hibernating the machine several times.

Now, this machine has 8G of memory, so I thought that starting several
assorted guests on it might put some pressure on the anon_vma lists.
But no, the machine hibernated happily, creating an almost 600MB
hibernation image with all three guests loaded.

Then, I said, well, let's have one last test run and started firefox
which went into reloading the last session. And I remember that firefox
still hadn't finished loading all pages when I hibernated and boom, it
oopsed.

So, it definitely is some anon_vma lists concurrency issue ... The good
thing is, I was able to catch the oops in its sheer magnificence over
netconsole this time:


[ 2995.478125] PM: Preallocating image memory...
[ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0
[ 2995.714001] Oops: 0000 [#1] PREEMPT SMP
[ 2995.714001] last sysfs file: /sys/power/state
[ 2995.714001] CPU 0
[ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.714001]
[ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.714001] RIP: 0010:[<ffffffff810c194d>] [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP: 0018:ffff88022fa038b8 EFLAGS: 00010283
[ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000
[ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520
[ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000
[ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000
[ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00
[ 2995.714001] FS: 00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 2995.714001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0
[ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.714001] Stack:
[ 2995.714001] ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416
[ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60
[ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98
[ 2995.714001] Call Trace:
[ 2995.714001] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.714001] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.714001] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.714001] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.714001] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.714001] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.714001] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.714001] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.714001] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.714001] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.714001] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.714001] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.714001] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.714001] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.714001] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.714001] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.714001] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.714001] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.714001] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.714001] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.714001] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.714001] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.714001] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.714001] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.714001] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 2995.714001] RIP [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP <ffff88022fa038b8>
[ 2995.714001] CR2: 0000000000000000
[ 2995.729717] ---[ end trace 92c25d74e4800968 ]---
[ 2995.729862] note: hib.sh[7440] exited with preempt_count 2
[ 2995.730022] BUG: scheduling while atomic: hib.sh/7440/0x10000003
[ 2995.730170] INFO: lockdep is turned off.
[ 2995.730319] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.731749] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1
[ 2995.732003] Call Trace:
[ 2995.732158] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.732305] [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.732454] [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.732603] [<ffffffff81030301>] __cond_resched+0x18/0x24
[ 2995.732751] [<ffffffff813fa12e>] _cond_resched+0x2c/0x37
[ 2995.732900] [<ffffffff810b8a21>] unmap_vmas+0x6ce/0x893
[ 2995.733053] [<ffffffff810bd0f5>] exit_mmap+0xd7/0x182
[ 2995.733206] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.733356] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.733505] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.733653] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.733802] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.733950] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.734102] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.734255] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.734407] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.734556] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.734705] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.734854] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.735008] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.735161] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.735313] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.735463] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.735612] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.735761] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.735910] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.736062] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.736216] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.736368] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.736518] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.736666] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.736816] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.736965] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.737117] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.737270] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.737422] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.737570] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.737719] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.737868] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.738020] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.738175] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.738326] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.738475] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.738623] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.738772] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.738920] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.739073] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.739226] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.739378] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.739526] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.739940] BUG: unable to handle kernel paging request at 00007faf064ff1f0
[ 2995.740220] IP: [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740441] PGD 0
[ 2995.740646] Oops: 0000 [#2] PREEMPT SMP
[ 2995.740685] last sysfs file: /sys/power/state
[ 2995.740685] CPU 1
[ 2995.740685] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.740685]
[ 2995.740685] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.740685] RIP: 0010:[<ffffffff8119c0d0>] [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685] RSP: 0018:ffff88022fa03438 EFLAGS: 00010292
[ 2995.740685] RAX: ffff88022fb32520 RBX: 00007faf064ff1f0 RCX: 0000000000000000
[ 2995.740685] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007faf064ff1f0
[ 2995.740685] RBP: ffff88022fa03438 R08: 0000000000000002 R09: 0000000000000000
[ 2995.740685] R10: dead000000100100 R11: ffffffff810d26f5 R12: 00007faf064ff208
[ 2995.740685] R13: fffffffffffffff0 R14: ffff88022d747068 R15: 00007f4da81fa000
[ 2995.740685] FS: 00007f4da8b966f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 2995.740685] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.740685] CR2: 00007faf064ff1f0 CR3: 0000000001646000 CR4: 00000000000006e0
[ 2995.740685] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.740685] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.740685] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.740685] Stack:
[ 2995.740685] ffff88022fa03468 ffffffff813fc6c3 ffffffff810c1ae3 ffff8801cbfde880
[ 2995.740685] <0> ffff88022fb32510 00007faf064ff1f0 ffff88022fa034a8 ffffffff810c1ae3
[ 2995.740685] <0> ffff88022fa034a8 ffff88022d747000 0000000000000000 0000000000000000
[ 2995.740685] Call Trace:
[ 2995.740685] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.740685] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.740685] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.740685] [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.740685] [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.740685] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.740685] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.740685] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.740685] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.740685] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.740685] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.740685] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.740685] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.740685] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.740685] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.740685] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.740685] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.740685] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.740685] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.740685] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.740685] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.740685] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.740685] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.740685] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.740685] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.740685] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.740685] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.740685] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.740685] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.740685] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.740685] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.740685] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.740685] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.740685] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.740685] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.740685] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.740685] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.740685] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.740685] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.740685] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.740685] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.740685] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.740685] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.740685] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.740685] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.740685] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.740685] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.740685] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.740685] Code: c7 c7 90 16 67 81 e8 79 f1 25 00 48 c7 c7 90 16 67 81 e8 e1 e7 25 00 48 c7 c7 30 18 67 81 e8 d5 e7 25 00 c9 c3 90 90 55 48 89 e5 <0f> b7 07 38 e0 8d 90 00 01 00 00 75 05 f0 66 0f b1 17 0f 94 c2
[ 2995.740685] RIP [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685] RSP <ffff88022fa03438>
[ 2995.740685] CR2: 00007faf064ff1f0
[ 2995.762521] ---[ end trace 92c25d74e4800969 ]---
[ 2995.762686] Fixing recursive fault but reboot is needed!
[ 2995.762855] BUG: scheduling while atomic: hib.sh/7440/0x00000005
[ 2995.763026] INFO: lockdep is turned off.
[ 2995.763203] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.764799] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1
[ 2995.765080] Call Trace:
[ 2995.765256] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.765429] [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.765600] [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.765771] [<ffffffff8103b7f7>] do_exit+0xcf/0x6a2
[ 2995.765941] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.766115] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.766295] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.766462] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.766632] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.766806] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.766977] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.767161] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.767330] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.767501] [<ffffffff810a9eca>] ? release_pages+0x1ee/0x200
[ 2995.767673] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.767842] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.768017] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.768202] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768373] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.768544] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768716] [<ffffffff8119c0d0>] ? do_raw_spin_trylock+0x4/0x3a
[ 2995.768888] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.769064] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.769246] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.769415] [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.769586] [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.769756] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.769925] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.770099] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.770279] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.770447] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.770616] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.770785] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.770955] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.771141] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.771311] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.771482] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.771653] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.771824] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.771994] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.772179] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.772352] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.772524] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.772696] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.772867] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.773043] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.773226] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.773398] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.773572] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.773742] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.773916] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.774093] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.774274] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.774444] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.774617] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.774786] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.774958] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.775144] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.775314] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.775487] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.775657] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.775828] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.775998] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.776182] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.776350] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.776521] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.776690] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.776863] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.777038] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

--
Regards/Gruss,
Boris.

2010-04-06 20:07:04

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Borislav Petkov wrote:
>
> [ 2995.478125] PM: Preallocating image memory...
> [ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0
> [ 2995.714001] Oops: 0000 [#1] PREEMPT SMP
> [ 2995.714001] last sysfs file: /sys/power/state
> [ 2995.714001] CPU 0
> [ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
> [ 2995.714001]
> [ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
> [ 2995.714001] RIP: 0010:[<ffffffff810c194d>] [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001] RSP: 0018:ffff88022fa038b8 EFLAGS: 00010283
> [ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000
> [ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520
> [ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000
> [ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000
> [ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00
> [ 2995.714001] FS: 00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
> [ 2995.714001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0
> [ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
> [ 2995.714001] Stack:
> [ 2995.714001] ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416
> [ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60
> [ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98
> [ 2995.714001] Call Trace:
> [ 2995.714001] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
> [ 2995.714001] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
> [ 2995.714001] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
> [ 2995.714001] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
> [ 2995.714001] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
> [ 2995.714001] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
> [ 2995.714001] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
> [ 2995.714001] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
> [ 2995.714001] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
> [ 2995.714001] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
> [ 2995.714001] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
> [ 2995.714001] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
> [ 2995.714001] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
> [ 2995.714001] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
> [ 2995.714001] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
> [ 2995.714001] [<ffffffff813f95b5>] ? printk+0x41/0x44
> [ 2995.714001] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
> [ 2995.714001] [<ffffffff8107632c>] hibernate+0xce/0x172
> [ 2995.714001] [<ffffffff81075099>] state_store+0x5c/0xd3
> [ 2995.714001] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
> [ 2995.714001] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
> [ 2995.714001] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
> [ 2995.714001] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
> [ 2995.714001] [<ffffffff810d6863>] sys_write+0x4a/0x71
> [ 2995.714001] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
> [ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
> [ 2995.714001] RIP [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001] RSP <ffff88022fa038b8>
> [ 2995.714001] CR2: 0000000000000000
> [ 2995.729717] ---[ end trace 92c25d74e4800968 ]---

So again, I can show that the code has never actually been through the
loop. The above code decodes to:

0: 3b 56 10 cmp 0x10(%rsi),%edx
3: 73 1e jae 0x23
5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx
9: 74 18 je 0x23
b: 48 8d 4d cc lea -0x34(%rbp),%rcx
f: 4d 89 f8 mov %r15,%r8
12: 48 89 df mov %rbx,%rdi
15: e8 4d f2 ff ff callq 0xfffffffffffff267
1a: 41 01 c4 add %eax,%r12d
1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
21: 74 19 je 0x3c
23: 4d 8b 6d 20 mov 0x20(%r13),%r13
27: 49 83 ed 20 sub $0x20,%r13
2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction
2f: 0f 18 08 prefetcht0 (%rax)
32: 49 8d 45 20 lea 0x20(%r13),%rax
36: 48 39 45 80 cmp %rax,-0x80(%rbp)
3a: 75 aa jne 0xffffffffffffffe6
3c: 4c 89 f7 mov %r14,%rdi
3f: e8 .byte 0xe8

and in your case, if we had gone through the loop, then %rax would still
contain the return value from page_referenced_one().

But %rax is a kernel pointer, and %r12d is 0.

So again, it's actually anon_vma.head.next that is NULL, not any of the
entries on the list itself.
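
To make the arithmetic concrete, here is a small userspace sketch (not
kernel code) of what the first iteration of list_for_each_entry()
computes when the list head's next pointer is NULL. The struct layout
below is an assumption chosen so that same_anon_vma sits at offset 32
on a 64-bit build, matching the 0x20 displacement in the disassembly;
the names anon_vma_chain_sketch, first_entry_addr and
faulting_load_addr are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical stand-in for the chain entry. Two pointers followed by
 * two list_heads put same_anon_vma at offset 32 on an LP64 target.
 */
struct anon_vma_chain_sketch {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * The container_of() step: take the value loaded from head->next and
 * subtract the member offset. Done on integers so that a NULL input
 * does not invoke undefined pointer arithmetic.
 */
static uintptr_t first_entry_addr(uintptr_t head_next)
{
	return head_next - offsetof(struct anon_vma_chain_sketch, same_anon_vma);
}

/*
 * The loop then loads entry->same_anon_vma.next for the prefetch,
 * which is the address that actually faults.
 */
static uintptr_t faulting_load_addr(uintptr_t head_next)
{
	return first_entry_addr(head_next) +
	       offsetof(struct anon_vma_chain_sketch, same_anon_vma);
}
```

With head_next == 0 the first function yields the -32 seen in %r13 and
the second yields address 0, which is what CR2 reports.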

Now, I can see several cases for this:

- the obvious one: anon_vma just wasn't correctly initialized, and is
missing an INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
don't have a whole lot of coverage of constructors), or somebody
allocated an anon_vma without using the anon_vma_cachep.

- Related to the above: perhaps the RCU freeing isn't working, or
slub/slab/slob ends up reusing the allocations for something else than
anonvma's, so together with the race _and_ an unlucky re-use, you get
some odd crud.

I haven't looked at the kernel config files: do they perhaps share the
same (odd?) SLUB/SLAB/SLOB config?

- anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
with the low bit set. That sounds unlikely, but who knows. The ksm code
sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"

Did people have KSM enabled?

.. and probably other things I haven't even thought about.
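
For readers unfamiliar with the low-bit trick mentioned in the last
case, the following is a hedged userspace sketch of how flag bits can
share a word with an aligned pointer. The flag values 1 and 2 are what
I recall PAGE_MAPPING_ANON and PAGE_MAPPING_KSM being in kernels of
this era; treat them, and the helper names classify_mapping and
mapping_pointer, as assumptions for illustration only.

```c
#include <stdint.h>

/* Assumed flag values; the kernel defines these in its mm headers. */
#define PAGE_MAPPING_ANON  ((uintptr_t)1)
#define PAGE_MAPPING_KSM   ((uintptr_t)2)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

enum mapping_kind { MAPPING_FILE, MAPPING_ANON, MAPPING_KSM };

/* Decide what the pointer part refers to by looking at the low bits. */
static enum mapping_kind classify_mapping(uintptr_t mapping)
{
	switch (mapping & PAGE_MAPPING_FLAGS) {
	case 1: /* only the anon bit: pointer part is an anon_vma */
		return MAPPING_ANON;
	case 3: /* both bits: pointer part is a KSM stable node */
		return MAPPING_KSM;
	default: /* treated as a file mapping in this sketch */
		return MAPPING_FILE;
	}
}

/* Strip the flag bits to recover the aligned pointer value. */
static uintptr_t mapping_pointer(uintptr_t mapping)
{
	return mapping & ~PAGE_MAPPING_FLAGS;
}
```

If a word with the low bit set is not really an anon_vma pointer, the
stripped value would be dereferenced as one anyway, which is the
"crud" scenario described above.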

Linus

2010-04-06 20:47:17

by Steinar H. Gunderson

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

http://storage.sesse.net/config-crashing-2.6.34-rc2

> Did people have KSM enabled?

No KSM for me.

/* Steinar */
--
Homepage: http://www.sesse.net/

2010-04-06 20:51:36

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Linus Torvalds
Date: Tue, Apr 06, 2010 at 01:02:35PM -0700

> So again, I can show that the code has never actually been through the
> loop. The above code decodes to:
>
> 0: 3b 56 10 cmp 0x10(%rsi),%edx
> 3: 73 1e jae 0x23
> 5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx
> 9: 74 18 je 0x23
> b: 48 8d 4d cc lea -0x34(%rbp),%rcx
> f: 4d 89 f8 mov %r15,%r8
> 12: 48 89 df mov %rbx,%rdi
> 15: e8 4d f2 ff ff callq 0xfffffffffffff267
> 1a: 41 01 c4 add %eax,%r12d
> 1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
> 21: 74 19 je 0x3c
> 23: 4d 8b 6d 20 mov 0x20(%r13),%r13
> 27: 49 83 ed 20 sub $0x20,%r13
> 2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction
> 2f: 0f 18 08 prefetcht0 (%rax)
> 32: 49 8d 45 20 lea 0x20(%r13),%rax
> 36: 48 39 45 80 cmp %rax,-0x80(%rbp)
> 3a: 75 aa jne 0xffffffffffffffe6
> 3c: 4c 89 f7 mov %r14,%rdi
> 3f: e8 .byte 0xe8
>
> and in your case, if we had gone through the loop, then %rax would still
> contain the return value from page_referenced_one().
>
> But %rax is a kernel pointer, and %r12d is 0.
>
> So again, it's actually anon_vma.head.next that is NULL, not any of the
> entries on the list itself.
>
> Now, I can see several cases for this:
>
> - the obvious one: anon_vma just wasn't correctly initialized, and is
> missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
> don't have a whole lot of coverage of constructors), or somebody
> allocated an anon_vma without using the anon_vma_cachep.

I've added code to verify this and am suspend/resuming now... Wait a
minute, Linus, you're good! :) :

[ 873.083074] PM: Preallocating image memory...
[ 873.254359] NULL anon_vma->head.next, page 2182681

This is the page_to_pfn number.

Now, how do we track back to the place which is missing anon_vma->head
init? Can we use the struct page *page arg to page_referenced_anon()
somehow?

[ 873.254654] Pid: 3642, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5-dirty #3
[ 873.254904] Call Trace:
[ 873.255063] [<ffffffff810c0c28>] page_referenced+0xd3/0x219
[ 873.255212] [<ffffffff810c5fb0>] ? swapcache_free+0x37/0x3c
[ 873.255364] [<ffffffff810ab782>] shrink_page_list+0x14a/0x477
[ 873.255512] [<ffffffff810aa6e0>] ? isolate_pages_global+0xc4/0x1f0
[ 873.255662] [<ffffffff813f8a76>] ? _raw_spin_unlock_irq+0x30/0x58
[ 873.255811] [<ffffffff810abe06>] shrink_inactive_list+0x357/0x5e5
[ 873.255960] [<ffffffff810ab626>] ? shrink_active_list+0x232/0x244
[ 873.256112] [<ffffffff810ac39e>] shrink_zone+0x30a/0x3d4
[ 873.256264] [<ffffffff810acf79>] do_try_to_free_pages+0x176/0x27f
[ 873.256416] [<ffffffff810ad117>] shrink_all_memory+0x95/0xc4
[ 873.256564] [<ffffffff810aa61c>] ? isolate_pages_global+0x0/0x1f0
[ 873.256713] [<ffffffff81076e4c>] ? count_data_pages+0x65/0x79
[ 873.256862] [<ffffffff810770b3>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 873.257036] [<ffffffff813f4f75>] ? printk+0x41/0x44
[ 873.257186] [<ffffffff81075a53>] hibernation_snapshot+0x36/0x1e1
[ 873.257337] [<ffffffff81075ccc>] hibernate+0xce/0x172
[ 873.257485] [<ffffffff81074a39>] state_store+0x5c/0xd3
[ 873.257634] [<ffffffff81184eff>] kobj_attr_store+0x17/0x19
[ 873.257783] [<ffffffff81125d43>] sysfs_write_file+0x108/0x144
[ 873.257932] [<ffffffff810d560f>] vfs_write+0xb2/0x153
[ 873.258084] [<ffffffff81063bd9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 873.258237] [<ffffffff810d5773>] sys_write+0x4a/0x71
[ 873.258388] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b


> - Related to the above: perhaps the RCU freeing isn't working, or
> slub/slab/slob ends up reusing the allocations for something else than
> anonvma's, so together with the race _and_ an unlucky re-use, you get
> some odd crud.
>
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

what is an odd SL[AOU]B config?

> - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
> with the low bit set. That sounds unlikely, but who knows. The ksm code
> sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"
>
> Did people have KSM enabled?

Nope, KSM is off here.

--
Regards/Gruss,
Boris.

2010-04-06 21:00:47

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Steinar H. Gunderson wrote:

> On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
> > I haven't looked at the kernel config files: do they perhaps share the
> > same (odd?) SLUB/SLAB/SLOB config?
>
> http://storage.sesse.net/config-crashing-2.6.34-rc2

Ok, CONFIG_SLUB, which is the common case. Not likely to be buggy.

> > Did people have KSM enabled?
>
> No KSM for me.

Ok, not anything odd there either, and you're not using any odd RCU setup
either. Nothing odd at all strikes me about your config, in fact. Lots and
lots of modules, but I guess it comes from some distro default config..

Linus

2010-04-06 21:05:28

by Steinar H. Gunderson

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, Apr 06, 2010 at 01:56:19PM -0700, Linus Torvalds wrote:
>>> Did people have KSM enabled?
>> No KSM for me.
> Ok, not anything odd there either, and you're not using any odd RCU setup
> either. Nothing odd at all strikes me about your config, in fact. Lots and
> lots of modules, but I guess it comes from some distro default config..

I think it was originally some distro config, yes, but that «config fork» was
at 2.6.16 or something...

/* Steinar */
--
Homepage: http://www.sesse.net/

2010-04-06 21:32:33

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Borislav Petkov wrote:
> > So again, it's actually anon_vma.head.next that is NULL, not any of the
> > entries on the list itself.
> >
> > Now, I can see several cases for this:
> >
> > - the obvious one: anon_vma just wasn't correctly initialized, and is
> > missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
> > don't have a whole lot of coverage of constructors), or somebody
> > allocated an anon_vma without using the anon_vma_cachep.
>
> I've added code to verify this and am suspend/resuming now... Wait a
> minute, Linus, you're good! :) :
>
> [ 873.083074] PM: Preallocating image memory...
> [ 873.254359] NULL anon_vma->head.next, page 2182681

Yeah, I was pretty sure of that thing.

I still don't see _how_ it happens, though. That 'struct anon_vma' is very
simple, and contains literally just the lock and that list_head.

Now, 'head.next' is kind of magical, because it contains that magic
low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in
mm/mmap.c). But I'm not seeing anything else touching it.
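
As a rough illustration of that low-bit marker (a userspace sketch
under my own naming, not the code in mm/mmap.c): because list nodes
are at least pointer-aligned, bit 0 of head.next is normally zero and
can be borrowed as a flag. The helpers head_mark, head_unmark and
head_is_marked are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* True if bit 0 of head->next is set, i.e. the head has been tagged. */
static bool head_is_marked(const struct list_head *head)
{
	return ((uintptr_t)head->next & 1) != 0;
}

/* Tag the head by setting bit 0 of its next pointer. */
static void head_mark(struct list_head *head)
{
	head->next = (struct list_head *)((uintptr_t)head->next | 1);
}

/* Clear the tag so the pointer is usable for list traversal again. */
static void head_unmark(struct list_head *head)
{
	head->next = (struct list_head *)((uintptr_t)head->next & ~(uintptr_t)1);
}
```

Note that a tagged head still holds a non-NULL value (the original
pointer with bit 0 set), so this mechanism alone would not explain a
head.next that is exactly zero.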

And if you allocate an anon_vma the proper way, the SLUB constructor should
have made sure that the head is initialized. And no normal list operation
ever sets any list pointer to zero, although a "list_del()" on the first
list entry could do it if that first list entry had a NULL next pointer.

> Now, how do we track back to the place which is missing anon_vma->head
> init? Can we use the struct page *page arg to page_referenced_anon()
> somehow?

You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and
then make the "object_err()" function in mm/slub.c be non-static. You
could call it when you see the problem, perhaps.

Or you could just add tests to both anon_vma_alloc() and anon_vma_free()
to check that 'list_empty(&anon_vma->head)' is true. I dunno.
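
A minimal userspace sketch of such a check might look like the
following. It only models the list head, not the real allocator; the
names init_list_head, list_is_empty and check_head, and the enum, are
placeholders I made up, and in the kernel the result would feed a
printk() or WARN rather than a return code.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* What a slab constructor is expected to have done to the head. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* An initialized, empty list head points back at itself. */
static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

enum head_state { HEAD_OK_EMPTY, HEAD_NULL, HEAD_NOT_EMPTY };

/*
 * Classify the head at allocation or free time. HEAD_NULL is the
 * "never initialized or zeroed later" case seen in the oops;
 * HEAD_NOT_EMPTY means entries (or garbage) are still linked.
 */
static enum head_state check_head(const struct list_head *h)
{
	if (h->next == NULL)
		return HEAD_NULL;
	if (!list_is_empty(h))
		return HEAD_NOT_EMPTY;
	return HEAD_OK_EMPTY;
}
```

Separating the NULL case from the merely non-empty case tells you
whether the object was zeroed or whether it was freed or reused with
entries still attached.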

> > I haven't looked at the kernel config files: do they perhaps share the
> > same (odd?) SLUB/SLAB/SLOB config?
>
> what is an odd SL[AOU]B config?

Probably anything but the default SLUB these days. But Steinar already
said he had SLUB, so it's unlikely to be something odd.

> > - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
> > with the low bit set. That sounds unlikely, but who knows. The ksm code
> > sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"
> >
> > Did people have KSM enabled?
>
> Nope, KSM is off here.

Yeah, wasn't for Steinar either. So it doesn't look like it's any odd
corner case that depends on some odd configuration.

Linus

2010-04-06 22:59:40

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Linus Torvalds
Date: Tue, Apr 06, 2010 at 02:27:37PM -0700

> On Tue, 6 Apr 2010, Borislav Petkov wrote:
> > > So again, it's actually anon_vma.head.next that is NULL, not any of the
> > > entries on the list itself.
> > >
> > > Now, I can see several cases for this:
> > >
> > > - the obvious one: anon_vma just wasn't correctly initialized, and is
> > > missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
> > > don't have a whole lot of coverage of constructors), or somebody
> > > allocated an anon_vma without using the anon_vma_cachep.
> >
> > I've added code to verify this and am suspend/resuming now... Wait a
> > minute, Linus, you're good! :) :
> >
> > [ 873.083074] PM: Preallocating image memory...
> > [ 873.254359] NULL anon_vma->head.next, page 2182681
>
> Yeah, I was pretty sure of that thing.
>
> I still don't see _how_ it happens, though. That 'struct anon_vma' is very
> simple, and contains literally just the lock and that list_head.
>
> Now, 'head.next' is kind of magical, because it contains that magic
> low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in
> mm/mmap.c). But I'm not seeing anything else touching it.
>
> And if you allocate a anon_vma the proper way, the SLUB constructor should
> have made sure that the head is initialized. And no normal list operation
> ever sets any list pointer to zero, although a "list_del()" on the first
> list entry could do it if that first list entry had a NULL next pointer.
>
> > Now, how do we track back to the place which is missing anon_vma->head
> > init? Can we use the struct page *page arg to page_referenced_anon()
> > somehow?
>
> You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and
> then make the "object_err()" function in mm/slub.c be non-static. You
> could call it when you see the problem, perhaps.
>
> Or you could just add tests to both alloc_anon_vma() and free_anon_vma()
> to check that 'list_empty(&anon_vma->head)' is true. I dunno.

Ok, I tried doing all you suggested and here's what came out. Please,
take this with a grain of salt because I'm almost falling asleep - even
the coffee is not working anymore so it could just as well be that I've
made a mistake somewhere (the new OOPS is a #GP, by the way), just
watch:

Source changes locally:

--
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4884462..0c11dfb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);

+void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason);
+
/*
* Please use this macro to create slab caches. Simply specify the
* name of the structure and maybe some flags that are listed above.
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..7b35b3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -66,11 +66,24 @@ static struct kmem_cache *anon_vma_chain_cachep;

static inline struct anon_vma *anon_vma_alloc(void)
{
- return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+ struct anon_vma *ret;
+ ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+
+ if (!ret->head.next) {
+ printk("%s NULL anon_vma->head.next\n", __func__);
+ dump_stack();
+ }
+
+ return ret;
}

void anon_vma_free(struct anon_vma *anon_vma)
{
+ if (!anon_vma->head.next) {
+ printk("%s NULL anon_vma->head.next\n", __func__);
+ dump_stack();
+ }
+
kmem_cache_free(anon_vma_cachep, anon_vma);
}

@@ -494,6 +507,18 @@ static int page_referenced_anon(struct page *page,
return referenced;

mapcount = page_mapcount(page);
+
+ if (!anon_vma->head.next) {
+ printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
+ page_to_pfn(page));
+
+ object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");
+
+ dump_stack();
+
+ return referenced;
+ }
+
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
diff --git a/mm/slub.c b/mm/slub.c
index b364844..bcf5416 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -477,7 +477,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
dump_stack();
}

-static void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason)
{
slab_bug(s, "%s", reason);

---

Then I did the same exercise: start several guests, shut them down, and
hibernate at the same time. After shutting down the guests, start
firefox, let it load a big HTML page, and hibernate while it is
loading: boom!

[ 269.104940] Freezing user space processes ... (elapsed 0.03 seconds) done.
[ 269.141953] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 269.155115] PM: Preallocating image memory...
[ 269.423811] general protection fault: 0000 [#1] PREEMPT SMP
[ 269.424003] last sysfs file: /sys/power/state
[ 269.424003] CPU 0
[ 269.424003] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core
[ 269.424003]
[ 269.424003] Pid: 2617, comm: hib.sh Tainted: G W 2.6.34-rc3-00288-gab195c5-dirty #4 M3A78 PRO/System Product Name
[ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[ 269.424003] RSP: 0018:ffff88022a1218b8 EFLAGS: 00010246
[ 269.424003] RAX: ffff8802126fa468 RBX: ffffea000700b210 RCX: 0000000000000000
[ 269.424003] RDX: ffff8802126fa429 RSI: ffff8802126fa440 RDI: ffff88022dc3cb80
[ 269.424003] RBP: ffff88022a121938 R08: 0000000000000002 R09: 0000000000000000
[ 269.424003] R10: 0000000000000246 R11: ffff88021a030478 R12: 0000000000000000
[ 269.424003] R13: 002e2e2e002e2e0e R14: ffff8802126fa428 R15: ffff88022a121a00
[ 269.424003] FS: 00007fe2799796f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 269.424003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 269.424003] CR2: 00007fffdefb3880 CR3: 00000002171c0000 CR4: 00000000000006f0
[ 269.424003] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[ 269.424003] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 269.424003] Process hib.sh (pid: 2617, threadinfo ffff88022a120000, task ffff88022dc3cb80)
[ 269.424003] Stack:
[ 269.424003] ffff8802126fa468 00000000813f8cfc ffffffff8165ae28 00000000000042e7
[ 269.424003] <0> ffff88022a1218f8 ffffffff810c6051 ffffea0006f968c8 ffffea0006f968c8
[ 269.424003] <0> ffff88022a121938 00000002810ab275 0000000006f96890 ffffea000700b238
[ 269.424003] Call Trace:
[ 269.424003] [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[ 269.424003] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[ 269.424003] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[ 269.424003] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[ 269.424003] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[ 269.424003] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[ 269.424003] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[ 269.424003] [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[ 269.424003] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[ 269.424003] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 269.424003] [<ffffffff813f5135>] ? printk+0x41/0x44
[ 269.424003] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[ 269.424003] [<ffffffff81075ce4>] hibernate+0xce/0x172
[ 269.424003] [<ffffffff81074a51>] state_store+0x5c/0xd3
[ 269.424003] [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[ 269.424003] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[ 269.424003] [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[ 269.424003] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 269.424003] [<ffffffff810d590b>] sys_write+0x4a/0x71
[ 269.424003] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 269.424003] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 1e f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 269.424003] RIP [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[ 269.424003] RSP <ffff88022a1218b8>
[ 269.438405] ---[ end trace ad5b4172ee94398e ]---
[ 269.438553] note: hib.sh[2617] exited with preempt_count 2
[ 269.438709] BUG: scheduling while atomic: hib.sh/2617/0x10000003
[ 269.438858] INFO: lockdep is turned off.
[ 269.439075] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core
[ 269.440875] Pid: 2617, comm: hib.sh Tainted: G D W 2.6.34-rc3-00288-gab195c5-dirty #4
[ 269.441137] Call Trace:
[ 269.441288] [<ffffffff81063107>] ? __debug_show_held_locks+0x1b/0x24
[ 269.441440] [<ffffffff8102d3c0>] __schedule_bug+0x72/0x77
[ 269.441590] [<ffffffff813f553e>] schedule+0xd9/0x730
[ 269.441741] [<ffffffff8103022c>] __cond_resched+0x18/0x24
[ 269.441891] [<ffffffff813f5c62>] _cond_resched+0x2c/0x37
[ 269.442045] [<ffffffff810b7d7d>] unmap_vmas+0x6ce/0x893
[ 269.442205] [<ffffffff810bc42f>] exit_mmap+0xd7/0x182
[ 269.442352] [<ffffffff81035951>] mmput+0x48/0xb9
[ 269.442502] [<ffffffff81039c21>] exit_mm+0x110/0x11d
[ 269.442652] [<ffffffff8103b663>] do_exit+0x1c5/0x691
[ 269.442802] [<ffffffff81038d0d>] ? kmsg_dump+0x13b/0x155
[ 269.442953] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 269.443107] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 269.443262] [<ffffffff81006313>] die+0x5a/0x63
[ 269.443414] [<ffffffff81003eaf>] do_general_protection+0x134/0x13c
[ 269.443566] [<ffffffff813f90f0>] ? irq_return+0x0/0x2
[ 269.443716] [<ffffffff813f92cf>] general_protection+0x1f/0x30
[ 269.443867] [<ffffffff810c0cb4>] ? page_referenced+0x147/0x232
[ 269.444021] [<ffffffff810c0bf0>] ? page_referenced+0x83/0x232
[ 269.444176] [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[ 269.444328] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[ 269.444479] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[ 269.444630] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[ 269.444782] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[ 269.444933] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[ 269.445087] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[ 269.445243] [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[ 269.445396] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[ 269.445547] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 269.445698] [<ffffffff813f5135>] ? printk+0x41/0x44
[ 269.445848] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[ 269.445999] [<ffffffff81075ce4>] hibernate+0xce/0x172
[ 269.446160] [<ffffffff81074a51>] state_store+0x5c/0xd3
[ 269.446307] [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[ 269.446457] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[ 269.446607] [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[ 269.446757] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 269.446908] [<ffffffff810d590b>] sys_write+0x4a/0x71
[ 269.447063] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b


This time we have

[ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232

which is offset 0x1104 in the disassembly:

10eb: 48 89 df mov %rbx,%rdi
10ee: e8 00 00 00 00 callq 10f3 <page_referenced+0x136>
10f3: 41 01 c4 add %eax,%r12d
10f6: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
10fa: 74 19 je 1115 <page_referenced+0x158>
10fc: 4d 8b 6d 20 mov 0x20(%r13),%r13
1100: 49 83 ed 20 sub $0x20,%r13
1104: 49 8b 45 20 mov 0x20(%r13),%rax <-------------------------
1108: 0f 18 08 prefetcht0 (%rax)
110b: 49 8d 45 20 lea 0x20(%r13),%rax
110f: 48 39 45 80 cmp %rax,-0x80(%rbp)
1113: 75 aa jne 10bf <page_referenced+0x102>
1115: 4c 89 f7 mov %r14,%rdi

and asm is

.loc 1 522 0
movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.454
.LVL295:
subq $32, %r13 #, avc
.LVL296:
.L186:
.LBE1224:
movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <--------------
prefetcht0 (%rax) # <variable>.same_anon_vma.next
leaq 32(%r13), %rax #, tmp104
cmpq %rax, -128(%rbp) # tmp104, %sfp
jne .L189 #,
.L188:
.loc 1 540 0
movq %r14, %rdi # anon_vma,
call page_unlock_anon_vma #

and %r13 contains some funny stuff, could be some mangled SLUB debug
poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
the #GP.

But yes, even if the oopsing instruction is

movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next

this is not same_anon_vma.next because we've come to the above
instruction through the ".L186:" label, before which we have %r13
already loaded with anon_vma->head.next.

To be continued...

--
Regards/Gruss,
Boris.

2010-04-06 23:32:21

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Borislav Petkov wrote:
>
> Ok, I tried doing all you suggested and here's what came out. Please,
> take this with a grain of salt because I'm almost falling asleep - even
> the coffee is not working anymore so it could be just as well that I've
> made a mistake somewhere (the new OOPS is a #GP, by the way), just
> watch:

Hey ho, yeah.

The reason it's a #GP fault is that it's not a NULL pointer dereference
any more, but a wild pointer that is not in the legal region of pointers
on x86-64. That is also why your debugging code didn't catch it: the
pointer isn't NULL, so you got the #GP fault on the same old instruction:

2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction

for all the same old reasons.

But now %r13 has a non-zero value: 0x002e2e2e002e2e0e, which I do _not_
recognize as any of the normal poison values.

> and %r13 contains some funny stuff, could be some mangled SLUB debug
> poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
> the #GP.

Correct. You don't get a page fault if the pointer was totally bogus.

> But yes, even if the oopsing instruction is
>
> movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next
>
> this is not same_anon_vma.next because we've come to the above
> instruction through the ".L186:" label, before which we have %r13
> already loaded with anon_vma->head.next.

No, you're mis-reading the asm. It's again the first iteration, and the
code above it is again the end of the loop. And %rax is once more a kernel
pointer, not the return value of 'page_referenced_one()'.

So it once more is 'anon_vma->head.next' that is crap, but now it's not
NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
anything either.

Linus
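
The arithmetic in the message above can be checked with a small userspace sketch. This is not kernel code: the helper names `avc_from_list_ptr` and `is_canonical_48` are made up for illustration, the 0x20 offset is taken from the disassembly, and the canonical-address test assumes 48-bit virtual addresses.

```c
#include <stdint.h>

/* Offset of same_anon_vma inside the chain entry, as seen in
 * "mov 0x20(%r13),%r13; sub $0x20,%r13". */
#define SAME_ANON_VMA_OFFSET UINT64_C(0x20)

/* What container_of() boils down to: list pointer minus member offset.
 * Unsigned arithmetic wraps modulo 2^64, like the register does. */
static uint64_t avc_from_list_ptr(uint64_t list_ptr)
{
	return list_ptr - SAME_ANON_VMA_OFFSET;
}

/* Assumes 48-bit virtual addresses: an address is canonical when
 * bits 63..47 are all zeros or all ones. */
static int is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the upper 17 bits */

	return top == 0 || top == UINT64_C(0x1ffff);
}
```

Feeding in a NULL 'head.next' gives 0xffffffffffffffe0 in %r13 (the first oops), while 0x002e2e2e002e2e2e gives 0x002e2e2e002e2e0e (this one). The address actually loaded, %r13 + 0x20, is canonical in the first case and non-canonical in the second, which fits a page fault versus a #GP.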

2010-04-06 23:42:23

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Borislav Petkov wrote:
> +
> + if (!anon_vma->head.next) {
> + printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
> + page_to_pfn(page));
> +
> + object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");

Oh, and since the debugging code never triggered ('head.next' wasn't
actually NULL), you never got here, but the 'page' you passed in to
object_err() should be the page of the slab allocation, not the page
associated with the anon_vma.

So it should be something like "virt_to_head_page(anon_vma)" that you pass
in to object_err().

Not that it matters. I assume it is the fact that SLAB debugging is on
that actually turns the NULL into a non-NULL thing. Poisoning is not
active for SLUB caches with constructors or RCU-freeing, but things like
redzoning still are. So enabling SLUB debugging will change the offsets
within the pages of all the SLUB allocations. I wonder if that's just
what caused it to now have that 0x002e2e2e002e2e2e instead of NULL.

Linus

2010-04-06 23:56:55

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 05:27 PM, Linus Torvalds wrote:

> I still don't see _how_ it happens, though. That 'struct anon_vma' is very
> simple, and contains literally just the lock and that list_head.

It gets more fun. It looks like the anon_vma is only
allocated through anon_vma_alloc() and only handled
by the functions in rmap.c

By themselves, all of those functions look alright.

However, I think I may have found a possible bug in
the interplay between anon_vma_prepare() and vma_adjust(),
across several mprotect invocations.

Let me explain what I think may be going on in small
steps, since it is quite subtle (assuming I am right).

1) a process forks, creating a second "layer" of
anon_vma objects for the VMAs that have anon pages

2) a new VMA is created adjacent to an existing one,
with different permissions

3) anon_vma_prepare is called on the new VMA, this
only links the "top" anon_vma to the new VMA, since
that is the anon_vma where all new pages get
instantiated anyway (this would be part of the bug)

4) mprotect changes the permission of one of the VMAs,
causing the old and the new VMAs to get merged

5) vma_adjust calls anon_vma_merge, causing the anon_vma
chain of one of the VMAs to get nuked - with bad luck,
this is the original one, leaving just the new anon_vma
attached to the VMA

6) if the parent process quits, the old anon_vma structs
get freed

7) meanwhile, we may still have some anonymous pages
stick around in memory that have their page->mapping
point to a freed anon_vma struct

Does this look like it could happen?

If so, I'll cook up a patch to change anon_vma_prepare
and find_mergeable_anon_vma to attach the whole chain
of anon_vmas to the new VMA, using anon_vma_clone().

2010-04-06 23:57:05

by Rik van Riel

Subject: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

---
include/linux/mm.h | 2 +-
mm/mmap.c | 6 +++---
mm/rmap.c | 20 +++++++++++++-------
3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
* anon_vmas being allocated, preventing vma merge in subsequent
* mprotect.
*/
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *near;
unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
can_vma_merge_before(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff +
((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ return near;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -875,7 +875,7 @@ try_prev:
mpol_equal(vma_policy(near), vma_policy(vma)) &&
can_vma_merge_after(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ return near;
none:
/*
* There's no absolute need to look only at touching neighbours:
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..60616db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma)
might_sleep();
if (unlikely(!anon_vma)) {
struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *merge_vma;
struct anon_vma *allocated;

+ merge_vma = find_mergeable_anon_vma(vma);
+ if (merge_vma) {
+ if (anon_vma_clone(vma, merge_vma))
+ goto out_enomem;
+ return 0;
+ }
+
avc = anon_vma_chain_alloc();
if (!avc)
goto out_enomem;

- anon_vma = find_mergeable_anon_vma(vma);
allocated = NULL;
- if (!anon_vma) {
- anon_vma = anon_vma_alloc();
- if (unlikely(!anon_vma))
- goto out_enomem_free_avc;
- allocated = anon_vma;
- }
+ anon_vma = anon_vma_alloc();
+ if (unlikely(!anon_vma))
+ goto out_enomem_free_avc;
+ allocated = anon_vma;
+
spin_lock(&anon_vma->lock);

/* page_table_lock to protect against threads */

2010-04-07 00:14:42

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Tue, 6 Apr 2010, Rik van Riel wrote:
>
> It gets more fun. It looks like the anon_vma is only
> allocated through anon_vma_alloc() and only handled
> by the functions in rmap.c
>
> By themselves, all of those functions look alright.

Yes. Very trivially so, in fact.

> However, I think I may have found a possible bug in
> the interplay between anon_vma_prepare() and vma_adjust(),
> across several mprotect invocations.
>
> Let me explain what I think may be going on in small
> steps, since it is quite subtle (assuming I am right).

Sounds at least possible. Way more likely than any of the "trivially
obvious" code being buggy, or the SLUB layer suddenly having a serious bug
that only the new user could trigger.

That said, the code that _really_ confuses me is the stuff that uses
"anon_vma_clone()". Could you please also explain the code flow of
vma_adjust() to mere mortals, please?

I suspect Borislav is sleeping. But at least we have a patch for him to
test when he wakes up ;)

Linus

2010-04-07 01:20:17

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/06/2010 08:10 PM, Linus Torvalds wrote:

> That said, the code that _really_ confuses me is the stuff that uses
> "anon_vma_clone()". Could you please also explain the code flow of
> vma_adjust() to mere mortals, please?

That's easier said than done. I spent 3 days with pen and paper,
going over that code before I made the anon_vma changes, first
verifying that the code is indeed correct and then figuring out
how I could make the anon_vma changes safely.

I am not happy with the complexity of the code around vma_adjust,
but could not find a way to simplify it and still keep merging
VMAs the way we do.

My largest change to vma_adjust was moving some code closer to
the beginning of the function, so I could bail out if the
allocation failed, without making change to the vma...

> I suspect Borislav is sleeping. But at least we have a patch for him to
> test when he wakes up ;)

I am looking forward to the test results.

2010-04-07 07:01:12

by KOSAKI Motohiro

Subject: Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
>
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.

Ahhhh, I'm ashamed of myself. Sure, the neighbor vma might have lots of avcs ;-)
A few comments are below.

>
> Reported-by: Borislav Petkov <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
>
> ---
> include/linux/mm.h | 2 +-
> mm/mmap.c | 6 +++---
> mm/rmap.c | 20 +++++++++++++-------
> 3 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e70f21b..90ac50e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
> struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> struct mempolicy *);
> -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> +extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
> extern int split_vma(struct mm_struct *,
> struct vm_area_struct *, unsigned long addr, int new_below);
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 75557c6..bf0600c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> * anon_vmas being allocated, preventing vma merge in subsequent
> * mprotect.
> */
> -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
> +struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
> {
> struct vm_area_struct *near;
> unsigned long vm_flags;
> @@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
> can_vma_merge_before(near, vm_flags,
> NULL, vma->vm_file, vma->vm_pgoff +
> ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
> - return near->anon_vma;
> + return near;
> try_prev:
> /*
> * It is potentially slow to have to call find_vma_prev here.
> @@ -875,7 +875,7 @@ try_prev:
> mpol_equal(vma_policy(near), vma_policy(vma)) &&
> can_vma_merge_after(near, vm_flags,
> NULL, vma->vm_file, vma->vm_pgoff))
> - return near->anon_vma;
> + return near;
> none:
> /*
> * There's no absolute need to look only at touching neighbours:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eaa7a09..60616db 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma)
> might_sleep();
> if (unlikely(!anon_vma)) {
> struct mm_struct *mm = vma->vm_mm;
> + struct vm_area_struct *merge_vma;
> struct anon_vma *allocated;
>
> + merge_vma = find_mergeable_anon_vma(vma);
> + if (merge_vma) {
> + if (anon_vma_clone(vma, merge_vma))
> + goto out_enomem;
> + return 0;
> + }
> +

Hmm.. probably I'm a moron.
I'm also confused by this locking rule, same as Linus said.

After this patch, the new locking order is

down_read(mmap_sem)
anon_vma_clone(vma, merge_vma)
list_add(&avc->same_vma, &vma->anon_vma_chain);
spin_lock(&anon_vma->lock);
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
spin_unlock(&anon_vma->lock);
spin_lock(&anon_vma->lock);
spin_lock(&mm->page_table_lock);

So, why can the mmap_sem read lock protect vma->anon_vma_chain?
Another thread seems to be able to change the avc list concurrently and freely.

Plus, why don't we need a "vma->anon_vma = merge_vma->anon_vma" assignment?
If vma->anon_vma stays NULL, I think anon_vma_prepare() will call anon_vma_clone()
multiple times.
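
The second concern can be illustrated with a toy userspace model. Everything here is a hypothetical stand-in rather than kernel code: `struct vma_model`, `prepare_as_posted()` and `prepare_with_assignment()` are made-up names, cloning is reduced to adding chain lengths, and the second variant is only a sketch of the assignment being asked about, not a claim about the eventual fix.

```c
#include <stddef.h>

/* Toy stand-ins, not the kernel structures. */
struct anon_vma_model { int unused; };

struct vma_model {
	struct anon_vma_model *anon_vma;	/* NULL until prepared */
	int chain_len;				/* entries on anon_vma_chain */
};

/* Models the early-return path added by the patch: the neighbour's
 * chain is cloned, but vma->anon_vma is left untouched. */
static int prepare_as_posted(struct vma_model *vma,
			     const struct vma_model *merge_vma)
{
	if (vma->anon_vma)
		return 0;	/* already prepared */
	vma->chain_len += merge_vma->chain_len;	/* "anon_vma_clone()" */
	return 0;
}

/* Hypothetical variant with the assignment asked about above. */
static int prepare_with_assignment(struct vma_model *vma,
				   const struct vma_model *merge_vma)
{
	if (vma->anon_vma)
		return 0;
	vma->chain_len += merge_vma->chain_len;
	vma->anon_vma = merge_vma->anon_vma;
	return 0;
}
```

In this model, calling `prepare_as_posted()` twice on the same vma clones the neighbour's chain twice and still leaves `anon_vma` NULL, while the variant with the assignment turns the second call into a no-op.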


> avc = anon_vma_chain_alloc();
> if (!avc)
> goto out_enomem;
>
> - anon_vma = find_mergeable_anon_vma(vma);
> allocated = NULL;
> - if (!anon_vma) {
> - anon_vma = anon_vma_alloc();
> - if (unlikely(!anon_vma))
> - goto out_enomem_free_avc;
> - allocated = anon_vma;
> - }
> + anon_vma = anon_vma_alloc();
> + if (unlikely(!anon_vma))
> + goto out_enomem_free_avc;
> + allocated = anon_vma;
> +
> spin_lock(&anon_vma->lock);
>
> /* page_table_lock to protect against threads */


2010-04-07 07:22:21

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Rik van Riel <[email protected]>
Date: Tue, Apr 06, 2010 at 09:18:28PM -0400

Hi Rik,

I think your patch needs a bit more baking, see below :)

> >I suspect Borislav is sleeping. But at least we have a patch for him to
> >test when he wakes up ;)
>
> I am looking forward to the test results.

This happens when starting X, I haven't even started hibernating.

[By the way, further testing will have to wait till tonight since I
have a job, you know :) ]

Also, mm/rmap.c:745 is

BUG_ON(!anon_vma);

in __page_set_anon_rmap().

---
[ 43.142371] ------------[ cut here ]------------
[ 43.142411] kernel BUG at mm/rmap.c:745!
[ 43.142436] invalid opcode: 0000 [#1] PREEMPT SMP
[ 43.142514] last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent
[ 43.142537] CPU 0
[ 43.142559] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[ 43.142997]
[ 43.143012] Pid: 1940, comm: console-kit-dae Not tainted 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[ 43.143012] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 43.143012] RSP: 0000:ffff88022c019da8 EFLAGS: 00010246
[ 43.143012] RAX: 0000000000000000 RBX: ffffea000774ff78 RCX: 000000002ce900f4
[ 43.143012] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[ 43.143012] RBP: ffff88022c019dc8 R08: 00007f29e3cfd928 R09: 000000000062c318
[ 43.143012] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88022bbad960
[ 43.143012] R13: 00007f29e3cfd928 R14: 00007f29e3cfd928 R15: 80000002216d9067
[ 43.143012] FS: 00007f29e3d0f790(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 43.143012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 43.143012] CR2: 00007f29e3cfd928 CR3: 000000022dfd3000 CR4: 00000000000006f0
[ 43.143012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 43.143012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 43.143012] Process console-kit-dae (pid: 1940, threadinfo ffff88022c018000, task ffff88022ce90000)
[ 43.143012] Stack:
[ 43.143012] ffffffff810b8802 ffff88022bbad960 ffff88022ea3c600 ffff88022bb6d7e8
[ 43.143012] <0> ffff88022c019e48 ffffffff810b8823 ffff88022ea3c6b8 0000000000000246
[ 43.143012] <0> ffffea000774ff78 0000000000000001 00000001e3cfd928 ffff88022fdb58f0
[ 43.143012] Call Trace:
[ 43.143012] [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[ 43.143012] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[ 43.143012] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 43.143012] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[ 43.143012] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[ 43.143012] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 43.143012] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 43.143012] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 43.143012] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24
[ 43.143012] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 43.143012] RSP <ffff88022c019da8>
[ 43.145276] ---[ end trace d6305f6e826dbd53 ]---
[ 43.145314] note: console-kit-dae[1940] exited with preempt_count 1
[ 73.644201] ------------[ cut here ]------------
[ 73.644218] kernel BUG at mm/rmap.c:745!
[ 73.644226] invalid opcode: 0000 [#2] PREEMPT SMP
[ 73.644266] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[ 73.644278] CPU 0
[ 73.644287] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[ 73.644509]
[ 73.644520] Pid: 2018, comm: iceowl-bin Tainted: G D 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[ 73.644534] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 73.644553] RSP: 0000:ffff88022cd37da8 EFLAGS: 00010246
[ 73.644562] RAX: 0000000000000000 RBX: ffffea000764dfa8 RCX: 0000000000000002
[ 73.644572] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[ 73.644589] RBP: ffff88022cd37dc8 R08: 00007f2ce0aab928 R09: 0000000000000000
[ 73.644603] R10: 0000000000000000 R11: 000000000011da32 R12: ffff88022d5894b0
[ 73.644615] R13: 00007f2ce0aab928 R14: 00007f2ce0aab928 R15: 800000021cd23067
[ 73.644628] FS: 00007f2cee88b7b0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 73.644639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 73.644652] CR2: 00007f2ce0aab928 CR3: 000000022b1b5000 CR4: 00000000000006f0
[ 73.644664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 73.644675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 73.644690] Process iceowl-bin (pid: 2018, threadinfo ffff88022cd36000, task ffff88022a74a5c0)
[ 73.644701] Stack:
[ 73.644708] ffffffff810b8802 ffff88022d5894b0 ffff88022ce41e00 ffff88022d4b0558
[ 73.644745] <0> ffff88022cd37e48 ffffffff810b8823 ffff88022ce41eb8 0000000000000246
[ 73.644801] <0> ffffea000764dfa8 0000000000000001 00000001e0aab928 ffff88022c0a4828
[ 73.644862] Call Trace:
[ 73.644874] [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[ 73.644885] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[ 73.644895] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 73.644909] [<ffffffff810be3c2>] ? do_mmap_pgoff+0x290/0x2f3
[ 73.644921] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 73.644932] [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 73.644943] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 73.644952] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 73.644963] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24
[ 73.645001] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 73.645001] RSP <ffff88022cd37da8>
[ 73.645610] ---[ end trace d6305f6e826dbd54 ]---
[ 73.645621] note: iceowl-bin[2018] exited with preempt_count 1
[ 77.562222] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[ 78.014120] SysRq : Emergency Sync
[ 78.016864] Emergency Sync complete
[ 78.585045] SysRq : Emergency Remount R/O
[ 78.663367] Emergency Remount complete
[ 79.098126] SysRq : Resetting

--
Regards/Gruss,
Boris.

2010-04-07 07:29:12

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Linus Torvalds <[email protected]>
Date: Tue, Apr 06, 2010 at 04:27:42PM -0700

> No, you're mis-reading the asm. It's again the first iteration, and the
> code above it is again the end of the loop. And %rax is once more a kernel
> pointer, not the return value of 'page_referenced_one()'.
>
> So it once more is 'anon_vma->head.next' that is crap, but now it's not
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

No, maybe I expressed myself wrong (it was late an' all) - I was
basically trying to confirm your assessment that anon_vma->head.next
is crap but the code had changed since I had added the debugging 'if
(!anon_vma->head.next)' and that was the value that was already in %r13
before iterating over the list chain.

Yeah, just a minor nitpick and not that it matters. Never mind though,
we're on the same page.

--
Regards/Gruss,
Boris.

2010-04-07 08:38:33

by Peter Zijlstra

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> Just as an example of the kind of code that makes me worry:
>
> void unlink_anon_vmas(struct vm_area_struct *vma)
> {
> struct anon_vma_chain *avc, *next;
>
> /* Unlink each anon_vma chained to the VMA. */
> list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> anon_vma_unlink(avc);
> list_del(&avc->same_vma);
> anon_vma_chain_free(avc);
> }
> }
>
> Now, think about what happens for the *last* entry in that avc chain. It
> will call that "anon_vma_unlink()" thing, which will delete perhaps the
> last entry in the "same_anon_vma" one, and then it does
>
> if (empty)
> anon_vma_free(anon_vma);
>
> *before* unlink_anon_vmas() has actually done that
>
> list_del(&avc->same_vma);
>
> and what we essentially have is a stale anon_vma_chain entry that still
> exists on that same_vma list, and points to an anon_vma that already got
> deleted.
>
> Does it matter? I really can't see that it does.

I think it does, the anon_vma thing has an RCU destroyed slab, but that
doesn't mean the anon_vma object itself is rcu delayed. The moment we
free it it can be re-used. So the above use after free is a bug.

2010-04-07 08:39:44

by Peter Zijlstra

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
> I do wonder if "page_lock_anon_vma()" should check the whole
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race
> happens, we're following an anon_vma list that has nothing to do with that
> page (it's still a _valid_ list, since we locked the anon_vma, but will it
> be ok?)
>
> IOW, what is it that really keeps the anon_vma list reliable _and_
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if
> that anon_vma list doesn't actually have anything to do with the page any
> more?

When doing the whole "make i_mmap_lock/anon_vma->lock a mutex" thing last
week I ran into the same issue, and it's on my todo list to find out wth
is happening there.

So yes I think we should move that validation check inside
page_lock_anon_vma().

I'll cook up a patch once I'm done staring at the various funny arch
mmu_gather implementations.

2010-04-07 08:44:16

by Peter Zijlstra

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Tue, 2010-04-06 at 13:02 -0700, Linus Torvalds wrote:
> - Related to the above: perhaps the RCU freeing isn't working, or
> slub/slab/slob ends up reusing the allocations for something else than
> anonvma's, so together with the race _and_ an unlucky re-use, you get
> some odd crud.
>
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

Right, so anon_vma uses SLAB_DESTROY_BY_RCU and as the huge comment in
rmap.c explains, that doesn't mean the objects themselves get RCU grace
period delays in freeing, only the SLAB that backs these objects does.

So the moment you do kmem_cache_free() on the anon_vma it can be re-used
for another allocation. The only guarantee given by RCU is that the
backing storage doesn't go away and hence you can 'safely' deref
pointers, but you still very much have to revalidate that you got the
object you were looking for.
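
The "storage stays, identity may change" rule can be sketched in plain userspace C. This is a toy model under loud assumptions: there is no RCU or locking here, the cache has a single slot, and `slot_alloc()`, `slot_free()` and `slot_revalidate()` are hypothetical names, not kernel APIs.

```c
#include <stddef.h>

/* Toy model of a type-stable cache with a single slot: the storage
 * never goes away, but the object living in it can change. */
struct slot {
	unsigned long owner;	/* identity of the current user */
	int live;		/* 0 after slot_free() */
};

static struct slot cache[1];

static struct slot *slot_alloc(unsigned long owner)
{
	if (cache[0].live)
		return NULL;	/* the only slot is taken */
	cache[0].owner = owner;
	cache[0].live = 1;
	return &cache[0];
}

static void slot_free(struct slot *s)
{
	s->live = 0;	/* memory stays valid, may be reused at once */
}

/* Dereferencing a stale pointer is 'safe', but the caller must check
 * that the slot still holds the object it was looking for. */
static struct slot *slot_revalidate(struct slot *s, unsigned long expected)
{
	if (!s->live || s->owner != expected)
		return NULL;
	return s;
}
```

If owner 1 frees its slot and owner 2 immediately allocates, both hold the same address; a reader still carrying the old pointer can dereference it, but `slot_revalidate(ptr, 1)` fails, which is the kind of check being asked for after taking the anon_vma lock.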

2010-04-07 09:16:49

by Johannes Weiner

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> >
> > void unlink_anon_vmas(struct vm_area_struct *vma)
> > {
> > struct anon_vma_chain *avc, *next;
> >
> > /* Unlink each anon_vma chained to the VMA. */
> > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > anon_vma_unlink(avc);
> > list_del(&avc->same_vma);
> > anon_vma_chain_free(avc);
> > }
> > }
> >
> > Now, think about what happens for the *last* entry in that avc chain. It
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > last entry in the "same_anon_vma" one, and then it does
> >
> > if (empty)
> > anon_vma_free(anon_vma);
> >
> > *before* unlink_anon_vmas() has actually done that
> >
> > list_del(&avc->same_vma);
> >
> > and what we essentially have is a stale anon_vma_chain entry that still
> > exists on that same_vma list, and points to an anon_vma that already got
> > deleted.
> >
> > Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

It frees avc->anon_vma, not avc. So the sequence is

free(avc->anon_vma) in anon_vma_unlink()
list_del(&avc->same_vma) in unlink_anon_vmas()

It's not a use-after-free. A problem would be if somebody should find the
avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
pointer is invalid.

I don't think this can happen, however. Both the unlinking and the looking
at the list happen under vma->vm_mm's mmap_sem held for writing.

2010-04-07 09:39:07

by Peter Zijlstra

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On Wed, 2010-04-07 at 11:16 +0200, Johannes Weiner wrote:
> On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > > Just as an example of the kind of code that makes me worry:
> > >
> > > void unlink_anon_vmas(struct vm_area_struct *vma)
> > > {
> > > struct anon_vma_chain *avc, *next;
> > >
> > > /* Unlink each anon_vma chained to the VMA. */
> > > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > > anon_vma_unlink(avc);
> > > list_del(&avc->same_vma);
> > > anon_vma_chain_free(avc);
> > > }
> > > }
> > >
> > > Now, think about what happens for the *last* entry in that avc chain. It
> > > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > > last entry in the "same_anon_vma" one, and then it does
> > >
> > > if (empty)
> > > anon_vma_free(anon_vma);
> > >
> > > *before* unlink_anon_vmas() has actually done that
> > >
> > > list_del(&avc->same_vma);
> > >
> > > and what we essentially have is a stale anon_vma_chain entry that still
> > > exists on that same_vma list, and points to an anon_vma that already got
> > > deleted.
> > >
> > > Does it matter? I really can't see that it does.
> >
> > I think it does, the anon_vma thing has an RCU destroyed slab, but that
> > doesn't mean the anon_vma object itself is rcu delayed. The moment we
> > free it it can be re-used. So the above use after free is a bug.
>
> It frees avc->anon_vma, not avc.

Sure, freeing avc does not involve RCU in any way.

> So the sequence is
>
> free(avc->anon_vma) in anon_vma_unlink()
> list_del(&avc->same_vma) in unlink_anon_vmas()
>
> It's not a use-after-free. A problem would be if somebody should find the
> avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
> pointer is invalid.
>
> I don't think this can happen, however. Both the unlinking and the looking
> at the list happen under vma->vm_mm's mmap_sem held for writing.

What I was worried about was it freeing anon_vma and then still having
the avc on the list. But I guess that cannot happen because it only frees
if it's actually empty.

2010-04-07 10:09:22

by Pekka Enberg

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

Hi Linus,

On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
<[email protected]> wrote:
> Sounds at least possible. Way more likely than any of the "trivially
> obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> that only the new user could trigger.

I haven't followed the discussion at all but if someone wants to
investigate that angle more, the most likely suspects are the recent
per-cpu changes. That said, I'd expect the problem to be more
widespread if SLUB is to blame here.

Pekka

2010-04-07 10:12:46

by KOSAKI Motohiro

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

> Hi Linus,
>
> On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
> <[email protected]> wrote:
> > Sounds at least possible. Way more likely than any of the "trivially
> > obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> > that only the new user could trigger.
>
> I haven't followed the discussion at all but if someone wants to
> investigate that angle more, the most likely suspects are the recent
> per-cpu changes. That said, I'd expect the problem to be more
> widespread if SLUB is to blame here.

Nope. We don't doubt SLUB nor per-cpu anymore. Rik found the bug in his patch.

thanks.


2010-04-07 14:05:56

by Paulo Marques

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

Linus Torvalds wrote:
> [...]
> So it once more is 'anon_vma->head.next' that is crap, but now it's not
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
>
> What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
> anything either.

Just a wild shot in the dark: it can be a couple of gray pixels with
intensity 0x2e at some 32 bits per pixel mode. I say this because of the
zero bytes there and someone mentioning seeing the problem when starting X.

--
Paulo Marques - http://www.grupopie.com

"Don't worry, you'll be fine; I saw it work in a cartoon once..."

2010-04-07 14:13:50

by Rik van Riel

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

On 04/07/2010 04:36 AM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:

>> if (empty)
>> anon_vma_free(anon_vma);
>>
>> *before* unlink_anon_vmas() has actually done that
>>
>> list_del(&avc->same_vma);
>>
>> and what we essentially have is a stale anon_vma_chain entry that still
>> exists on that same_vma list, and points to an anon_vma that already got
>> deleted.
>>
>> Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Peter, the avc is an anon_vma_chain, which is a different
object than the anon_vma itself. There is no use after free
of an anon_vma object in unlink_anon_vmas + anon_vma_unlink.

2010-04-07 14:14:34

by Borislav Petkov

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)

From: Paulo Marques <[email protected]>
Date: Wed, Apr 07, 2010 at 03:05:50PM +0100

> Linus Torvalds wrote:
> > [...]
> > So it once more is 'anon_vma->head.next' that is crap, but now it's not
> > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
> >
> > What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
> > anything either.
>
> Just a wild shot in the dark: it can be a couple of gray pixels with
> intensity 0x2e at some 32 bits per pixel mode. I say this because of the
> zero bytes there and someone mentioning seeing the problem when starting X.

I don't think those are related: the problem when X was starting happens
with Rik's newest patch, and the funny %r13 value showed up after enabling
SLUB debugging last night.

Thanks.

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
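
The %r13 arithmetic discussed above can be checked with a small userspace
sketch. It assumes a 64-bit layout in which same_anon_vma sits at offset
0x20 inside the chain entry, as the disassembly at the top of the thread
suggests. The struct and function names below are made up for illustration
and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width stand-in for struct list_head on a 64-bit kernel. */
struct list_head_sketch {
	uint64_t next;
	uint64_t prev;
};

/* Hypothetical stand-in for struct anon_vma_chain; field order is assumed. */
struct avc_sketch {
	uint64_t vma;
	uint64_t anon_vma;
	struct list_head_sketch same_vma;
	struct list_head_sketch same_anon_vma;
};

/*
 * What a container_of()-style conversion does to a same_anon_vma.next
 * pointer: subtract the member offset to get the enclosing entry.
 */
uint64_t avc_from_next(uint64_t next)
{
	/* The whole sketch relies on this layout assumption. */
	assert(offsetof(struct avc_sketch, same_anon_vma) == 0x20);
	return next - offsetof(struct avc_sketch, same_anon_vma);
}
```

With this layout, a NULL next pointer turns into 0xffffffffffffffe0 (that
is, -32), and the 0x002e2e2e002e2e2e pattern turns into a value whose low
byte is 0x0e, which matches the description of %r13 in the quoted text.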

2010-04-07 14:49:55

by Rik van Riel

Subject: Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/07/2010 03:00 AM, KOSAKI Motohiro wrote:

> Hmm.. probably I'm moron.

Someone might be, but it's not you :)

> I'm also confused by this locking rule, same as Linus said.
>
> After this patch, the new locking order is

> So why can the mmap_sem read lock protect vma->anon_vma_chain?
> Another thread seems to be able to change the avc list concurrently and freely.

You are right, the code needs to take the page_table_lock
around the call to anon_vma_clone, so other threads
get locked out.

This means the locking order has now been inverted,
with the page_table_lock on the outside and the
anon_vma locks on the inside.

I have checked all the other call sites to the
anon_vma code. The direct callers of anon_vma_clone
and anon_vma_fork already hold the mmap_sem for
write. The callers of anon_vma_prepare hold the
mmap_sem for read - so excluding other callers of
anon_vma_prepare with the page_table_lock is enough.

mm_take_all_locks has the mmap_sem for write.

There seem to be no other traversals of the same_vma
list, so changing the locking order to have the
page_table_lock on the outside of the anon_vma locks
works.

> Plus, why don't we need the "vma->anon_vma = merge_vma->anon_vma" assignment?
> If vma->anon_vma stays NULL, I think anon_vma_prepare() calls anon_vma_clone()
> multiple times.

Added in the new version. See the next email.

2010-04-07 14:56:13

by Rik van Riel

Subject: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

---
v2:
- fix the locking issues spotted by Kosaki Motohiro
- set vma->anon_vma correctly

include/linux/mm.h | 2 +-
mm/mmap.c | 6 +++---
mm/rmap.c | 27 ++++++++++++++++++---------
3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
* anon_vmas being allocated, preventing vma merge in subsequent
* mprotect.
*/
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *near;
unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
can_vma_merge_before(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff +
((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ return near;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -875,7 +875,7 @@ try_prev:
mpol_equal(vma_policy(near), vma_policy(vma)) &&
can_vma_merge_after(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ return near;
none:
/*
* There's no absolute need to look only at touching neighbours:
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..abe7aa5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -119,24 +119,33 @@ int anon_vma_prepare(struct vm_area_struct *vma)
might_sleep();
if (unlikely(!anon_vma)) {
struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *merge_vma;
struct anon_vma *allocated;

+ merge_vma = find_mergeable_anon_vma(vma);
+ if (merge_vma) {
+ int ret;
+ spin_lock(&mm->page_table_lock);
+ ret = anon_vma_clone(vma, merge_vma);
+ if (!ret)
+ vma->anon_vma = merge_vma->anon_vma;
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+ }
+
avc = anon_vma_chain_alloc();
if (!avc)
goto out_enomem;

- anon_vma = find_mergeable_anon_vma(vma);
allocated = NULL;
- if (!anon_vma) {
- anon_vma = anon_vma_alloc();
- if (unlikely(!anon_vma))
- goto out_enomem_free_avc;
- allocated = anon_vma;
- }
- spin_lock(&anon_vma->lock);
+ anon_vma = anon_vma_alloc();
+ if (unlikely(!anon_vma))
+ goto out_enomem_free_avc;
+ allocated = anon_vma;

/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
+ spin_lock(&anon_vma->lock);
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
avc->anon_vma = anon_vma;
@@ -145,9 +154,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
list_add(&avc->same_anon_vma, &anon_vma->head);
allocated = NULL;
}
+ spin_unlock(&anon_vma->lock);
spin_unlock(&mm->page_table_lock);

- spin_unlock(&anon_vma->lock);
if (unlikely(allocated)) {
anon_vma_free(allocated);
anon_vma_chain_free(avc);

2010-04-07 15:36:04

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> - fix the locking issues spotted by Kosaki Motohiro

No, they're broken.

And Rik, please explain the locking rather than make even more of these
kinds of random ad-hoc locking rules.

I've said this now _three_ times, but let me repeat once more:

- the locking rules for that anon_vma_chain are very unclear. I _think_
you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held
for reading and page_table_lock held", but nowhere is that actually
documented.

Why is it so hard for you to just admit that? Especially after you
yourself got it wrong.

> + merge_vma = find_mergeable_anon_vma(vma);
> + if (merge_vma) {
> + int ret;
> + spin_lock(&mm->page_table_lock);
> + ret = anon_vma_clone(vma, merge_vma);
> + if (!ret)
> + vma->anon_vma = merge_vma->anon_vma;
> + spin_unlock(&mm->page_table_lock);
> + return ret;
> + }

Rik, the above is obviously total crap.

anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL.
You can't do that with a spinlock held.

Linus
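
The objection above (no sleeping allocation while a spinlock is held) can
be illustrated with a toy userspace model. This is only a sketch under
stated assumptions: the toy_* names are hypothetical, the "spinlock" is
just a counter, and toy_alloc_may_sleep() stands in for a GFP_KERNEL
allocation. It contrasts allocating under the lock with the
allocate-first, recheck-under-lock shape that anon_vma_prepare() already
uses for its own allocation in the patch context above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int toy_atomic_depth;      /* > 0 while a toy spinlock is held */
static bool toy_slept_in_atomic;  /* set when a sleeping call ran under it */

void toy_reset(void) { toy_atomic_depth = 0; toy_slept_in_atomic = false; }
bool toy_saw_sleep_in_atomic(void) { return toy_slept_in_atomic; }

void toy_spin_lock(void) { toy_atomic_depth++; }

void toy_spin_unlock(void)
{
	assert(toy_atomic_depth > 0);
	toy_atomic_depth--;
}

/* Stand-in for an allocation that may sleep (GFP_KERNEL in the kernel). */
void *toy_alloc_may_sleep(size_t size)
{
	if (toy_atomic_depth > 0)
		toy_slept_in_atomic = true; /* the kernel would complain here */
	return malloc(size);
}

struct toy_vma { void *anon_vma; };

/* Problematic shape: the allocation happens with the lock held. */
int toy_prepare_alloc_under_lock(struct toy_vma *vma)
{
	toy_spin_lock();
	if (!vma->anon_vma)
		vma->anon_vma = toy_alloc_may_sleep(16);
	toy_spin_unlock();
	return vma->anon_vma ? 0 : -1;
}

/* Allocate first, recheck under the lock, free the spare if we lost a race. */
int toy_prepare_alloc_first(struct toy_vma *vma)
{
	void *allocated = toy_alloc_may_sleep(16);

	if (!allocated)
		return -1;
	toy_spin_lock();
	if (!vma->anon_vma) {
		vma->anon_vma = allocated;
		allocated = NULL;
	}
	toy_spin_unlock();
	free(allocated); /* free(NULL) is a no-op */
	return 0;
}
```

Only the first function trips the toy check. The second shape does not
carry over directly to anon_vma_clone(), because the number of chain
entries to allocate is not known up front.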

2010-04-07 15:50:45

by Linus Torvalds

Subject: Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)



On Wed, 7 Apr 2010, Peter Zijlstra wrote:

> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> >
> > void unlink_anon_vmas(struct vm_area_struct *vma)
> > {
> > 	struct anon_vma_chain *avc, *next;
> >
> > 	/* Unlink each anon_vma chained to the VMA. */
> > 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > 		anon_vma_unlink(avc);
> > 		list_del(&avc->same_vma);
> > 		anon_vma_chain_free(avc);
> > 	}
> > }
> >
> > Now, think about what happens for the *last* entry in that avc chain. It
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > last entry in the "same_anon_vma" one, and then it does
> >
> > if (empty)
> > anon_vma_free(anon_vma);
> >
> > *before* unlink_anon_vmas() has actually done that
> >
> > list_del(&avc->same_vma);
> >
> > and what we essentially have is a stale anon_vma_chain entry that still
> > exists on that same_vma list, and points to an anon_vma that already got
> > deleted.
> >
> > Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Well, it's not really a "use after free" - it's just that a stale pointer
still exists in a live data structure that is linked into the list. I
don't think there is a real bug there, simply because I don't think
anybody will be accessing that list (we should hopefully have all the
sufficient mutual exclusion in place).

So I just think it is bad form to potentially free something before we get
rid of all pointers to it. So to me it's a cleanliness issue: good code
shouldn't do things like that, and it would be much cleaner to remove the
AVC entry that has a pointer to the anon_vma _before_ we might be freeing
the anon_vma.

Maybe I'm just anal.

Linus
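
The ordering concern in the message above can be modelled in userspace.
The sketch below is a toy, not kernel code: the toy_* types and functions
are hypothetical, and "freeing" only sets a flag so that the stale-pointer
window can be observed safely. It compares dropping the anon_vma before
unlinking the chain entry with unlinking first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_anon_vma { int users; bool freed; };
struct toy_avc { struct toy_anon_vma *anon_vma; bool on_list; };

/* True if a still-linked entry points at an object that was already freed. */
bool toy_has_stale_pointer(const struct toy_avc *avc)
{
	return avc->on_list && avc->anon_vma && avc->anon_vma->freed;
}

/* Drop one user; the last user "frees" the object (flag only). */
void toy_put_anon_vma(struct toy_anon_vma *av)
{
	assert(av->users > 0);
	if (--av->users == 0)
		av->freed = true;
}

/* Order in the quoted loop: drop first, unlink second. */
bool toy_unlink_drop_first(struct toy_avc *avc)
{
	bool stale;

	toy_put_anon_vma(avc->anon_vma);
	stale = toy_has_stale_pointer(avc); /* window where the pointer is stale */
	avc->on_list = false;
	return stale;
}

/* Suggested order: take the entry off the list, then drop the anon_vma. */
bool toy_unlink_unlink_first(struct toy_avc *avc)
{
	struct toy_anon_vma *av = avc->anon_vma;

	avc->on_list = false;
	avc->anon_vma = NULL;
	toy_put_anon_vma(av);
	return toy_has_stale_pointer(avc);
}
```

Both orders end in the same state. The only difference is whether a
linked entry ever points at freed memory in between, which matters only
if something can walk the list during that window.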

2010-04-07 15:54:10

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/07/2010 11:30 AM, Linus Torvalds wrote:

> I've said this now _three_ times, but let me repeat once more:
>
> - the locking rules for that anon_vma_chain are very unclear. I _think_
> you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held
> for reading and page_table_lock held", but nowhere is that actually
> documented.

> Why is it so hard for you to just admit that? Especially after you
> yourself got it wrong.

You are right, the idea was to continue using the locking that
the anon_vma code was already using, without introducing any
new locking with the anon_vma patches.

However, it has become clear that this is no longer possible,
due to the need to hold a secondary lock across anon_vma_clone,
when we come from a code path that holds the mmap_sem for read.

>> + merge_vma = find_mergeable_anon_vma(vma);
>> + if (merge_vma) {
>> + int ret;
>> + spin_lock(&mm->page_table_lock);
>> + ret = anon_vma_clone(vma, merge_vma);
>> + if (!ret)
>> + vma->anon_vma = merge_vma->anon_vma;
>> + spin_unlock(&mm->page_table_lock);
>> + return ret;
>> + }
>
> Rik, the above is obviously total crap.
>
> anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL.
> You can't do that with a spinlock held.

Looks like we'll either have to introduce a per-mm semaphore for
the same_vma anon_vma chains, or move the complexity of solving
this bug to anon_vma_merge, where we can ensure that the resulting
VMA has the sum of the anon_vmas of each VMA.

2010-04-07 15:55:31

by Minchan Kim

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

Hi, Rik.

On Wed, Apr 7, 2010 at 11:54 PM, Rik van Riel <[email protected]> wrote:
> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
>
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.
>
> Reported-by: Borislav Petkov <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>

At last, you may have found the culprit.

AFAIU your description, don't we have to care about the vma_merge case, too?
Sorry if it is a dumb question.

--
Kind regards,
Minchan Kim

2010-04-07 17:01:33

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> You are right, the idea was to continue using the locking that
> the anon_vma code was already using, without introducing any
> new locking with the anon_vma patches.
>
> However, it has become clear that this is no longer possible,
> due to the need to hold a secondary lock across anon_vma_clone,
> when we come from a code path that holds the mmap_sem for read.

I do wonder if we could possibly simplify this a _lot_ by just requiring
that the anon_vma gets allocated at vma creation time (ie mmap), rather
than doing it on-demand when we actually do the page fault.

That would make all of this crap happen under mmap_sem held for writing,
and it would simplify the faulting code (which is the much more critical
code) a lot.

And it would make all your locking problems go away. Now all anon_vma code
really _would_ run with mmap_sem held exclusively, without any races.

When I tried to do a "fill in multiple page table entries in one go"
patch, that annoying anon_vma issue was a problem as well. Allocating the
anon_vma up-front would have simplified that code too.

I can't imagine that we ever really have mappings without an anon_vma in
practice _anyway_, so why delay the allocation until page fault time?

Maybe I'm missing something subtle.

Linus

2010-04-07 21:24:29

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> I do wonder if we could possibly simplify this a _lot_ by just requiring
> that the anon_vma gets allocated at vma creation time (ie mmap), rather
> than doing it on-demand when we actually do the page fault.
>
> That would make all of this crap happen under mmap_sem held for writing,
> and it would simplify the faulting code (which is the much more critical
> code) a lot.

Here is a patch that boots for me (but has had _zero_ serious testing:
caveat emptor etc etc).

It basically moves "anon_vma_prepare()" to be called in vma_link and in
__insert_vm_struct() - which I _think_ should cover all normal vma
creation events. I did a "WARN_ONCE(!vma->anon_vma)" just to check, I
haven't triggered one yet.

Now, this clearly will create anon_vma's that may never get used at all,
ie for things like shared mappings etc that never have anonymous memory
associated with them. But that structure is pretty small, so I don't find
it in myself to care too deeply.

And with this, all the anon_vma games should happen with mmap_sem held
for writing, which should hopefully simplify things a lot. Rik, can you
use this to make a new version of your fixing patch?

Comments?

Linus

---
mm/memory.c | 10 +---------
mm/mmap.c | 17 ++++-------------
2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..0abefd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
-
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
@@ -3115,6 +3106,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

+ WARN_ONCE(!vma->anon_vma, "No anonvma");
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..c14284b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

mm->map_count++;
validate_mm(mm);
+
+ anon_vma_prepare(vma);
}

/*
@@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
BUG_ON(__vma && __vma->vm_start < vma->vm_end);
__vma_link(mm, vma, prev, rb_link, rb_parent);
mm->map_count++;
+
+ anon_vma_prepare(vma);
}

static inline void
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
anon_vma_lock(vma);

/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
{
int error;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
-
address &= PAGE_MASK;
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)

2010-04-07 21:53:50

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/07/2010 05:19 PM, Linus Torvalds wrote:

> Comments?

I remember there being an "unfixable" spot with this
approach when I originally wrote the new anon_vma
linking code.

However, I can't for the life of me find that spot.
I am starting to believe I made it fixable as a side
effect of one of the changes I made :)

One of the issues with your patch is that anon_vma_prepare
can fail and this patch ignores its return value.

Having anon_vma_prepare fail after an mremap or mprotect
might result in messing up the VMAs of a process, or having
to undo the VMA changes that were made.

In fact, this may be the problem I was running into - not
wanting to add even more complex error paths to the vma
shuffling code.

2010-04-07 22:14:50

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> One of the issues with your patch is that anon_vma_prepare
> can fail and this patch ignores its return value.

Yes. The failure point is too late to do anything really interesting with,
and the old code also just causes a SIGBUS. My intention was to change the

WARN_ONCE(!vma->anon_vma);

into returning that SIGBUS - which is not wonderful, but is no different
from old failures.

In the long run, it would be nicer to actually return an error from the
mmap() that fails, but that's more complicated, and as mentioned, it's not
what the old code used to do either (since the failure point was always at
the page fault stage).

> Having anon_vma_prepare fail after an mremap or mprotect
> might result in messing up the VMAs of a process, or having
> to undo the VMA changes that were made.

We really aren't any worse off than we have always been.

If anon_vma_prepare() fails, the vma list will be valid, but no new pages
can be added to that vma. That used to be true before too.

Linus

2010-04-07 22:20:29

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> In the long run, it would be nicer to actually return an error from the
> mmap() that fails, but that's more complicated, and as mentioned, it's not
> what the old code used to do either (since the failure point was always at
> the page fault stage).

Put another way: I'm not proud of it, but the new code isn't any worse
than what we used to have, and I think the new code is _fixable_.

The easiest way to do that would likely be to pre-allocate the anon_vma
struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
way anon_vma_prepare() itself can never fail, and all we need to do is a
simple allocation earlier in the call-chain.

Linus
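
The pre-allocation idea above can be sketched in userspace as follows.
All names (toy_prealloc, toy_prepare_nofail and so on) are hypothetical
and the structures are heavily simplified. The sketch only shows the
shape: do the fallible allocation early, then hand the objects to a
prepare step that has no failure path. It covers the case of a single
anon_vma and a single chain entry and says nothing about cases that need
more entries.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_anon_vma { int unused; };
struct toy_avc { struct toy_anon_vma *anon_vma; };
struct toy_vma { struct toy_anon_vma *anon_vma; struct toy_avc *chain; };

/* Objects allocated up front, before any VMA is modified. */
struct toy_prealloc { struct toy_anon_vma *anon_vma; struct toy_avc *avc; };

/* The only step that can fail; call it early, where errors are easy. */
int toy_prealloc_init(struct toy_prealloc *p)
{
	p->anon_vma = malloc(sizeof(*p->anon_vma));
	p->avc = malloc(sizeof(*p->avc));
	if (!p->anon_vma || !p->avc) {
		free(p->anon_vma);
		free(p->avc);
		p->anon_vma = NULL;
		p->avc = NULL;
		return -1;
	}
	return 0;
}

/* Release whatever the prepare step did not consume. */
void toy_prealloc_release(struct toy_prealloc *p)
{
	free(p->anon_vma);
	free(p->avc);
	p->anon_vma = NULL;
	p->avc = NULL;
}

/* Cannot fail: it only wires up objects that already exist. */
void toy_prepare_nofail(struct toy_vma *vma, struct toy_prealloc *p)
{
	assert(p->anon_vma && p->avc);
	if (vma->anon_vma)
		return; /* already prepared; caller releases p */
	p->avc->anon_vma = p->anon_vma;
	vma->anon_vma = p->anon_vma;
	vma->chain = p->avc;
	p->anon_vma = NULL; /* ownership moved to the vma */
	p->avc = NULL;
}
```

A caller would run toy_prealloc_init() before touching the VMA layout,
return an error if it fails, and call toy_prealloc_release() afterwards
to drop anything left unused.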

2010-04-07 23:42:36

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> Yes. The failure point is too late to do anything really interesting with,
> and the old code also just causes a SIGBUS. My intention was to change the
>
> WARN_ONCE(!vma->anon_vma);
>
> into returning that SIGBUS - which is not wonderful, but is no different
> from old failures.

Not SIGBUS, but VM_FAULT_OOM, of course.

IOW, something like this should be no worse than what we have now, and has
the much nicer locking semantics.

Having done some more digging, I can point to a downside: we do end up
having about twice as many anon_vma entries. It seems about half of the
vma's never need an anon_vma entry, probably because they end up being
read-only file mappings, and thus never trigger the anonvma case.

That said:

- I don't really think you can fix the locking problem you have in a
saner way

- the anon_vma entry is much smaller than the vm_area_struct, so we're
still using much less memory for them than for vma's.

- We _could_ avoid allocating anonvma entries for shared mappings or for
mappings that are read-only. That might force us to allocate some of
them at mprotect time, and/or when doing a forced COW event with
ptrace, but we have the mmap_sem for writing for the one case, and we
could decide to get it for the other.

So it's not a _fundamental_ problem if we decide we want to recover
most of the memory lost by doing unconditional allocations.

There are alternative models. For example, the VM layer _could_ decide to
just release the mmap_sem, and re-do it and take it for writing if the vma
doesn't have an anon_vma.

I dunno. I like how this patch makes things so much less subtle, though.
For example: with this in place, we could further simplify
anon_vma_prepare(), since it would now never have the re-entrancy issue
and wouldn't need to worry about taking that page_table_lock and
re-testing vma->anon_vma for races.

Linus

---
mm/memory.c | 12 +++---------
mm/mmap.c | 17 ++++-------------
2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..b5efe76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
-
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

+ if (!vma->anon_vma)
+ return VM_FAULT_OOM;
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..c14284b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

mm->map_count++;
validate_mm(mm);
+
+ anon_vma_prepare(vma);
}

/*
@@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
BUG_ON(__vma && __vma->vm_start < vma->vm_end);
__vma_link(mm, vma, prev, rb_link, rb_parent);
mm->map_count++;
+
+ anon_vma_prepare(vma);
}

static inline void
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
anon_vma_lock(vma);

/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
{
int error;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
-
address &= PAGE_MASK;
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)

2010-04-08 00:40:44

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/07/2010 06:15 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Linus Torvalds wrote:
>>
>> In the long run, it would be nicer to actually return an error from the
>> mmap() that fails, but that's more complicated, and as mentioned, it's not
>> what the old code used to do either (since the failure point was always at
>> the page fault stage).
>
> Put another way: I'm not proud of it, but the new code isn't any worse
> than what we used to have, and I think the new code is _fixable_.

Agreed, it is no worse than what we had before.

As to fixable, I suspect both situations are fixable.
The new code by getting the error paths right, the
old code by completely bailing out of the page fault
and retrying it (the pageout code should trigger an
OOM kill at some point, if we are really out of memory).

> The easiest way to do that would likely be to pre-allocate the anon_vma
> struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
> way anon_vma_prepare() itself can never fail, and all we need to do is a
> simple allocation earlier in the call-chain.

That may not work, because we may want to merge the anon_vma
with the anon_vma in an adjacent VMA ... and that adjacent
VMA could be chained onto multiple anon_vmas.

That means allocating a single anon_vma_chain may not be
enough.

2010-04-08 02:03:40

by KOSAKI Motohiro

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

Hi

Wow, your patch is very cool. I'm surprised that such a 20-line patch
simplifies things so much.

> On Wed, 7 Apr 2010, Linus Torvalds wrote:
> >
> > Yes. The failure point is too late to do anything really interesting with,
> > and the old code also just causes a SIGBUS. My intention was to change the
> >
> > WARN_ONCE(!vma->anon_vma);
> >
> > into returning that SIGBUS - which is not wonderful, but is no different
> > from old failures.
>
> Not SIGBUS, but VM_FAULT_OOM, of course.

The page fault path doesn't insert the anon_vma anymore, right? If so, SIGBUS is better.
SIGBUS and VM_FAULT_OOM now lead to different results.

SIGBUS -> kill current task
VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

If the current task can't get a proper anon_vma, we should just kill the current
task instead of a random process with the highest badness. Otherwise the !anon_vma
process will keep invoking the oom-killer at random.

Perhaps, I'm missing something.
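
The difference described above can be written down as a small dispatch
table. This is a userspace illustration only: the enum names and values
are made up and do not correspond to the kernel's VM_FAULT_* constants.
The mapping is taken directly from the two arrows in the message above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fault results, loosely modelled on the codes named above. */
enum toy_fault { TOY_FAULT_NONE, TOY_FAULT_OOM, TOY_FAULT_SIGBUS };

/* What happens as a consequence of each result. */
enum toy_action { TOY_CONTINUE, TOY_KILL_CURRENT, TOY_INVOKE_OOM_KILLER };

enum toy_action toy_fault_action(enum toy_fault fault)
{
	switch (fault) {
	case TOY_FAULT_NONE:
		return TOY_CONTINUE;
	case TOY_FAULT_SIGBUS:
		return TOY_KILL_CURRENT;      /* only the faulting task dies */
	case TOY_FAULT_OOM:
		return TOY_INVOKE_OOM_KILLER; /* some other task may be picked */
	}
	assert(!"unknown fault code");
	return TOY_CONTINUE;
}

/* Early check at fault time: which result to report if anon_vma is missing. */
enum toy_fault toy_check_anon_vma(const void *anon_vma, enum toy_fault on_missing)
{
	return anon_vma ? TOY_FAULT_NONE : on_missing;
}
```

With on_missing set to TOY_FAULT_OOM, every fault on the broken mapping
would lead to another oom-killer invocation. With TOY_FAULT_SIGBUS, the
consequence stays with the task that owns the mapping.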


2010-04-08 02:37:28

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Thu, 8 Apr 2010, KOSAKI Motohiro wrote:
>
> The page fault path doesn't insert the anon_vma anymore, right? If so, SIGBUS is better.
> SIGBUS and VM_FAULT_OOM now lead to different results.
>
> SIGBUS -> kill current task
> VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

Yeah, maybe VM_FAULT_SIGBUS works ok instead of VM_FAULT_OOM. But the
cause of it is the system having been oom when the mapping was created, so
I think either is fine.

> If the current task can't get a proper anon_vma, we should just kill the current
> task instead of a random process with the highest badness. Otherwise the !anon_vma
> process will keep invoking the oom-killer at random.

Yes, that is a good point.

Anyway, I think it might be interesting to test my anon_vma_prepare()
locking change patch together with Rik's _first_ version of his "fix
anon_vma_prepare" thing (the one without the spinlock). They should apply
independently of each other, and maybe it all even works together.

Linus

2010-04-08 05:53:13

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Wed, Apr 07, 2010 at 07:33:01PM -0700

> Anyway, I think it might be interesting to test my anon_vma_prepare()
> locking change patch together with Rik's _first_ version of his "fix
> anon_vma_prepare" thing (the one without the spinlock). They should apply
> independently of each other, and maybe it all even works together.

There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
mappings while we might sleep in anon_vma_prepare():

[ 9.386929] BUG: sleeping function called from invalid context at mm/rmap.c:119
[ 9.387188] in_atomic(): 1, irqs_disabled(): 0, pid: 1068, name: modprobe
[ 9.387343] 3 locks held by modprobe/1068:
[ 9.387524] #0: (&p->cred_guard_mutex){+.+.+.}, at: [<ffffffff810d97fc>] prepare_bprm_creds+0x29/0x5a
[ 9.387959] #1: (&mm->mmap_sem){++++++}, at: [<ffffffff81110ee2>] elf_map+0x70/0x190
[ 9.388416] #2: (&(&inode->i_data.i_mmap_lock)->rlock){+.+...}, at: [<ffffffff810bcbdf>] vma_adjust+0x190/0x3ca
[ 9.388848] Pid: 1068, comm: modprobe Not tainted 2.6.34-rc3-00290-ge4b2849 #6
[ 9.389102] Call Trace:
[ 9.389256] [<ffffffff810630f6>] ? __debug_show_held_locks+0x22/0x24
[ 9.389418] [<ffffffff8102c288>] __might_sleep+0x117/0x11b
[ 9.389570] [<ffffffff810c0f2e>] anon_vma_prepare+0x30/0x132
[ 9.389722] [<ffffffff810bcd95>] vma_adjust+0x346/0x3ca
[ 9.389874] [<ffffffff810bcf68>] __split_vma+0x14f/0x1b9
[ 9.390027] [<ffffffff810bd143>] do_munmap+0x171/0x315
[ 9.390181] [<ffffffff81110ee2>] ? elf_map+0x70/0x190
[ 9.390335] [<ffffffff81110f9d>] elf_map+0x12b/0x190
[ 9.390493] [<ffffffff81111b35>] load_elf_binary+0xb33/0x170e
[ 9.390645] [<ffffffff8102d529>] ? sub_preempt_count+0xa3/0xb6
[ 9.390800] [<ffffffff810d945a>] search_binary_handler+0x166/0x30e
[ 9.390952] [<ffffffff810d92ab>] ? copy_strings+0x1d4/0x1e5
[ 9.391111] [<ffffffff81111002>] ? load_elf_binary+0x0/0x170e
[ 9.391265] [<ffffffff810dadff>] do_execve+0x1fc/0x2f5
[ 9.391424] [<ffffffff8100a379>] sys_execve+0x43/0x61
[ 9.391576] [<ffffffff810025fa>] stub_execve+0x6a/0xc0


--
Regards/Gruss,
Boris.

2010-04-08 14:16:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Thu, 8 Apr 2010, Borislav Petkov wrote:
>
> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
> mappings while we might sleep in anon_vma_prepare():

Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in
__insert_vm_struct.

It should be simple enough to just move it into the caller, just after it
releases that lock. There's only one user of that __insert_vm_struct()
anyway. You can do it yourself, or you can replace my previous patch with
this..
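
A rough user-space sketch of that reordering, as a toy model rather than kernel
code. Every name in it (model_spin_lock, model_prepare, the two adjust_*
functions) is hypothetical; model_prepare only stands in for a call that may
sleep, such as anon_vma_prepare(), and the counter stands in for what
__might_sleep() complained about in the trace above.

```c
#include <assert.h>

/* Toy model only: none of these names exist in the kernel. */
static int atomic_depth;      /* > 0 while a modelled spinlock is held */
static int sleep_violations;  /* sleeping calls made in atomic context */

static void model_spin_lock(void)
{
    atomic_depth++;
}

static void model_spin_unlock(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

/* Stands in for a call that may allocate, and therefore may sleep. */
static void model_prepare(void)
{
    if (atomic_depth > 0)
        sleep_violations++;   /* the case the debug check reports */
}

/* Ordering seen in the trace: prepare while the lock is still held. */
static int adjust_prepare_under_lock(void)
{
    sleep_violations = 0;
    model_spin_lock();
    model_prepare();
    model_spin_unlock();
    return sleep_violations;
}

/* Ordering after the move: drop the lock first, then prepare. */
static int adjust_prepare_after_unlock(void)
{
    sleep_violations = 0;
    model_spin_lock();
    /* ... updates that really need the lock would go here ... */
    model_spin_unlock();
    model_prepare();
    return sleep_violations;
}
```

In this model adjust_prepare_under_lock() reports one violation and
adjust_prepare_after_unlock() reports none, which is the whole point of moving
the call into the caller.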

[ The patch below also makes it warn once and return SIGBUS for the case
where there is no anon_vma. I decided I still want to hear about it if
there might be some path that tries to insert a vma on its own ]
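
For anyone unsure what "warn once and return SIGBUS" amounts to, here is a
minimal user-space model, under stated assumptions. warn_once_model and
fault_model are hypothetical names, and -1 merely stands in for
VM_FAULT_SIGBUS. One simplification: the real WARN_ONCE keeps a separate
"already warned" flag per call site, while this sketch shares a single flag.

```c
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;  /* hypothetical counter, for illustration only */

/* Returns cond on every call, but prints only the first time cond is true. */
static bool warn_once_model(bool cond, const char *msg)
{
    static bool warned;

    if (cond && !warned) {
        warned = true;
        warnings_printed++;
        fprintf(stderr, "WARNING: %s\n", msg);
    }
    return cond;
}

/* Shape of the new check: refuse the fault when there is no anon_vma. */
static int fault_model(const void *anon_vma)
{
    if (warn_once_model(anon_vma == NULL, "Mapping with no anon_vma"))
        return -1;            /* stands in for VM_FAULT_SIGBUS */
    return 0;
}
```

The thing to notice is that the fault keeps failing on every call with a NULL
anon_vma even though the message shows up only once, so a single warning in the
log can hide many failing faults.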

Linus

---
mm/memory.c | 12 +++---------
mm/mmap.c | 17 ++++-------------
2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
-
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

+ if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+ return VM_FAULT_SIGBUS;
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..82392c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

mm->map_count++;
validate_mm(mm);
+
+ anon_vma_prepare(vma);
}

/*
@@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ anon_vma_prepare(vma);
+
if (remove_next) {
if (file) {
fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
anon_vma_lock(vma);

/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
{
int error;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
-
address &= PAGE_MASK;
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)

2010-04-08 18:26:52

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/08/2010 10:11 AM, Linus Torvalds wrote:
>
>
> On Thu, 8 Apr 2010, Borislav Petkov wrote:
>>
>> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
>> mappings while we might sleep in anon_vma_prepare():
>
> Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in
> __insert_vm_struct.
>
> It should be simple enough to just move it into the caller, just after it
> releases that lock. There's only one user of that __insert_vm_struct()
> anyway. You can do it yourself, or you can replace my previous patch with
> this..
>
> [ The patch below also makes it warn once and return SIGBUS for the case
> where there is no anon_vma. I decided I still want to hear about it if
> there might be some path that tries to insert a vma on its own ]

Reviewed-by: Rik van Riel <[email protected]>

I haven't seen any places that insert VMAs by themselves.
Several strange places that allocate them, but they
all appear to use the standard functions to insert them.

2010-04-08 18:37:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Thu, 8 Apr 2010, Rik van Riel wrote:
>
> Reviewed-by: Rik van Riel <[email protected]>

Yeah, I think I'll commit it as-is, assuming we get confirmation that it
(along with your patch) actually ends up fixing the original problem.

I had actually had lockdep etc on with that patch, but for some reason I'd
overlooked the SPINLOCK_SLEEP debugging, so I hadn't seen the stupid issue
that Borislav pointed out. I wonder if LOCKDEP or spinlock debugging should
just select it. Small detail, but I should have caught that obvious bug
myself.

> I haven't seen any places that insert VMAs by themselves.
> Several strange places that allocate them, but they
> all appear to use the standard functions to insert them.

Yeah, it's complicated enough to add a vma with all the rbtree etc stuff
that I hope nobody actually cooks their own. But I too grepped for vma
allocations, and there were more of them than I expected, so...

Linus

2010-04-08 20:38:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Thu, Apr 08, 2010 at 11:32:06AM -0700

Here we go, another night of testing starts... got more caffeine this
time :)

> > I haven't seen any places that insert VMAs by themselves.
> > Several strange places that allocate them, but they
> > all appear to use the standard functions to insert them.
>
> Yeah, it's complicated enough to add a vma with all the rbtree etc stuff
> that I hope nobody actually cooks their own. But I too grepped for vma
> allocations, and there were more of them than I expected, so...

... and of course, I just hit that WARN_ONCE on the first suspend (it did
suspend ok though):

[ 88.078958] ------------[ cut here ]------------
[ 88.079007] WARNING: at mm/memory.c:3110 handle_mm_fault+0x56/0x67c()
[ 88.079032] Hardware name: System Product Name
[ 88.079056] Mapping with no anon_vma
[ 88.079082] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod k10temp 8250_pnp 8250 serial_core edac_core ohci_hcd pcspkr
[ 88.079637] Pid: 1965, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9 #7
[ 88.079676] Call Trace:
[ 88.079713] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[ 88.079744] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[ 88.079774] [<ffffffff810b857d>] handle_mm_fault+0x56/0x67c
[ 88.079805] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 88.079838] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[ 88.079866] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[ 88.079898] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 88.079929] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 88.079960] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 88.079988] ---[ end trace 154dd7f6249e1cc3 ]---

and then sysfs triggered that lockdep circular locking warning - I
thought it was fixed already :(


[ 256.831204] =======================================================
[ 256.831210] [ INFO: possible circular locking dependency detected ]
[ 256.831216] 2.6.34-rc3-00290-g2156db9 #7
[ 256.831221] -------------------------------------------------------
[ 256.831226] hib.sh/2464 is trying to acquire lock:
[ 256.831231] (s_active#80){++++.+}, at: [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[ 256.831250]
[ 256.831252] but task is already holding lock:
[ 256.831256] (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80
[ 256.831271]
[ 256.831273] which lock already depends on the new lock.
[ 256.831275]
[ 256.831278]
[ 256.831280] the existing dependency chain (in reverse order) is:
[ 256.831284]
[ 256.831286] -> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
[ 256.831294] [<ffffffff8106790a>] __lock_acquire+0x1306/0x169f
[ 256.831305] [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[ 256.831314] [<ffffffff813f727a>] down_read+0x4c/0x91
[ 256.831323] [<ffffffff8131c9f3>] lock_policy_rwsem_read+0x4f/0x80
[ 256.831332] [<ffffffff8131ca5c>] show+0x38/0x71
[ 256.831341] [<ffffffff81125ef0>] sysfs_read_file+0xb9/0x13e
[ 256.831348] [<ffffffff810d5901>] vfs_read+0xaf/0x150
[ 256.831357] [<ffffffff810d5a65>] sys_read+0x4a/0x71
[ 256.831364] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 256.831375]
[ 256.831376] -> #0 (s_active#80){++++.+}:
[ 256.831385] [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f
[ 256.831385] [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[ 256.831385] [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6
[ 256.831385] [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[ 256.831385] [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d
[ 256.831385] [<ffffffff8118522e>] kobject_del+0x16/0x37
[ 256.831385] [<ffffffff8118528d>] kobject_release+0x3e/0x66
[ 256.831385] [<ffffffff811860d9>] kref_put+0x43/0x4d
[ 256.831385] [<ffffffff811851a9>] kobject_put+0x47/0x4b
[ 256.831385] [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241
[ 256.831385] [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f
[ 256.831385] [<ffffffff8105846b>] notifier_call_chain+0x37/0x63
[ 256.831385] [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10
[ 256.831385] [<ffffffff813e6091>] _cpu_down+0x98/0x2a6
[ 256.831385] [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d
[ 256.831385] [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1
[ 256.831385] [<ffffffff81075ccc>] hibernate+0xce/0x172
[ 256.831385] [<ffffffff81074a39>] state_store+0x5c/0xd3
[ 256.831385] [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19
[ 256.831385] [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144
[ 256.831385] [<ffffffff810d56c7>] vfs_write+0xb2/0x153
[ 256.831385] [<ffffffff810d582b>] sys_write+0x4a/0x71
[ 256.831385] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 256.831385]
[ 256.831385] other info that might help us debug this:
[ 256.831385]
[ 256.831385] 6 locks held by hib.sh/2464:
[ 256.831385] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff81125d2f>] sysfs_write_file+0x3c/0x144
[ 256.831385] #1: (s_active#49){.+.+.+}, at: [<ffffffff81125dda>] sysfs_write_file+0xe7/0x144
[ 256.831385] #2: (pm_mutex){+.+.+.}, at: [<ffffffff81075c1a>] hibernate+0x1c/0x172
[ 256.831385] #3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810395d1>] cpu_maps_update_begin+0x17/0x19
[ 256.831385] #4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039616>] cpu_hotplug_begin+0x2c/0x53
[ 256.831385] #5: (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80
[ 256.831385]
[ 256.831385] stack backtrace:
[ 256.831385] Pid: 2464, comm: hib.sh Tainted: G W 2.6.34-rc3-00290-g2156db9 #7
[ 256.831385] Call Trace:
[ 256.831385] [<ffffffff810643c3>] print_circular_bug+0xae/0xbd
[ 256.831385] [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f
[ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[ 256.831385] [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[ 256.831385] [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6
[ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[ 256.831385] [<ffffffff81063d12>] ? trace_hardirqs_on+0xd/0xf
[ 256.831385] [<ffffffff81126f3d>] ? release_sysfs_dirent+0x89/0xa9
[ 256.831385] [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[ 256.831385] [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d
[ 256.831385] [<ffffffff8118522e>] kobject_del+0x16/0x37
[ 256.831385] [<ffffffff8118528d>] kobject_release+0x3e/0x66
[ 256.831385] [<ffffffff8118524f>] ? kobject_release+0x0/0x66
[ 256.831385] [<ffffffff811860d9>] kref_put+0x43/0x4d
[ 256.831385] [<ffffffff811851a9>] kobject_put+0x47/0x4b
[ 256.831385] [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241
[ 256.831385] [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f
[ 256.831385] [<ffffffff8105846b>] notifier_call_chain+0x37/0x63
[ 256.831385] [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10
[ 256.831385] [<ffffffff813e6091>] _cpu_down+0x98/0x2a6
[ 256.831385] [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d
[ 256.831385] [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1
[ 256.831385] [<ffffffff81075ccc>] hibernate+0xce/0x172
[ 256.831385] [<ffffffff81074a39>] state_store+0x5c/0xd3
[ 256.831385] [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19
[ 256.831385] [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144
[ 256.831385] [<ffffffff810d56c7>] vfs_write+0xb2/0x153
[ 256.831385] [<ffffffff81063cda>] ? trace_hardirqs_on_caller+0x120/0x14b
[ 256.831385] [<ffffffff810d582b>] sys_write+0x4a/0x71
[ 256.831385] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

--
Regards/Gruss,
Boris.

2010-04-08 21:08:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Thu, Apr 08, 2010 at 07:11:11AM -0700

> [ The patch below also makes it warn once and return SIGBUS for the case
> where there is no anon_vma. I decided I still want to hear about it if
> there might be some path that tries to insert a vma on its own ]

And this happens quite often - I changed the WARN_ONCE to WARN and can't
start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
upon boot too:

[ 55.814570] ------------[ cut here ]------------
[ 55.814623] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[ 55.814648] Hardware name: System Product Name
[ 55.814671] Mapping with no anon_vma
[ 55.814693] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[ 55.815249] Pid: 1936, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #8
[ 55.815290] Call Trace:
[ 55.815327] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[ 55.815362] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[ 55.815391] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[ 55.815420] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 55.815452] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[ 55.815483] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[ 55.815518] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 55.815553] [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 55.815585] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 55.815613] ---[ end trace fa59f67cbfeeca44 ]---
[ 60.801651] ------------[ cut here ]------------
[ 60.801672] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[ 60.801681] Hardware name: System Product Name
[ 60.801689] Mapping with no anon_vma
[ 60.801702] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[ 60.802156] Pid: 2008, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8
[ 60.802169] Call Trace:
[ 60.802181] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[ 60.802191] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[ 60.802203] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[ 60.802213] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 60.802225] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[ 60.802235] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[ 60.802268] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 60.802279] [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 60.802290] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 60.802305] ---[ end trace fa59f67cbfeeca45 ]---
[ 92.123350] ------------[ cut here ]------------
[ 92.123402] WARNING: at kernel/sched.c:3555 add_preempt_count+0x9c/0xcb()
[ 92.123428] Hardware name: System Product Name
[ 92.123451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[ 92.123902] Pid: 2111, comm: kvm Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8
[ 92.123940] Call Trace:
[ 92.123973] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[ 92.124002] [<ffffffff81037ed4>] warn_slowpath_null+0x14/0x16
[ 92.124031] [<ffffffff8102d5d8>] add_preempt_count+0x9c/0xcb
[ 92.124061] [<ffffffff813f7ee9>] _raw_spin_lock_nest_lock+0x21/0x7a
[ 92.124090] [<ffffffff810bc079>] ? mm_take_all_locks+0xf9/0x150
[ 92.124118] [<ffffffff810bc079>] mm_take_all_locks+0xf9/0x150
[ 92.124146] [<ffffffff810cc48d>] ? do_mmu_notifier_register+0xd3/0x19d
[ 92.124174] [<ffffffff810cc495>] do_mmu_notifier_register+0xdb/0x19d
[ 92.124202] [<ffffffff810cc57c>] mmu_notifier_register+0x13/0x15
[ 92.124256] [<ffffffffa00c67e3>] kvm_dev_ioctl+0x2c8/0x495 [kvm]
[ 92.124318] [<ffffffff810e24ff>] vfs_ioctl+0x32/0xa6
[ 92.124357] [<ffffffff810e2a91>] do_vfs_ioctl+0x495/0x4db
[ 92.124390] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 92.124425] [<ffffffff813f8fad>] ? retint_swapgs+0xe/0x13
[ 92.124458] [<ffffffff810e2b1e>] sys_ioctl+0x47/0x6a
[ 92.124498] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 92.124527] ---[ end trace fa59f67cbfeeca46 ]---
[ 92.213834] ------------[ cut here ]------------
[ 92.213888] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[ 92.213913] Hardware name: System Product Name
[ 92.213937] Mapping with no anon_vma
[ 92.213959] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[ 92.214529] Pid: 2111, comm: kvm Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8
[ 92.214571] Call Trace:
[ 92.214612] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[ 92.214647] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[ 92.214683] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[ 92.214718] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 92.214751] [<ffffffff810be3ab>] ? do_mmap_pgoff+0x290/0x2f3
[ 92.214787] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 92.214821] [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 92.214857] [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 92.214896] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 92.214928] ---[ end trace fa59f67cbfeeca47 ]---

--
Regards/Gruss,
Boris.

2010-04-08 23:20:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Thu, 8 Apr 2010, Borislav Petkov wrote:
>
> And this happens quite often - I changed the WARN_ONCE to WARN and can't
> start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
> upon boot too:

Hmm. I tried console-kit-daemon, which I had installed, but didn't get
anything like that. Probably some setup difference.

I also went through every user of 'vm_area_cachep', and saw nothing
suspicious at least for the mmu case (I didn't check the nommu.c code). I
must have missed something.

One thing you could do is to add some more debugging info when that "no
anon_vma" warning happens. In particular, if you still have the SLUB
debugging on, you could try to do that

page = virt_to_head_page(vma);
object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma");

and it should give you _which_ routine did the kmem_cache_alloc() for the
vma that doesn't have an anon_vma.
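
The idea behind that report can be sketched in user space: if the allocator
remembers who asked for each object, a later check can name the allocating
routine. This is only an illustration under stated assumptions, not how SLUB
implements its tracking; struct track, tracked_alloc, alloc_site_of,
tracked_free, struct vma_model and mmap_region_model are all hypothetical
names, and the sketch records a function name where a real allocator would
more likely record a caller address.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical bookkeeping stored directly in front of each object. */
struct track {
    const char *alloc_site;   /* name of the function that allocated */
};

static void *tracked_alloc_at(size_t size, const char *site)
{
    struct track *t = malloc(sizeof(*t) + size);

    if (!t)
        return NULL;
    t->alloc_site = site;
    memset(t + 1, 0, size);   /* hand back a zeroed object */
    return t + 1;             /* the object starts right after the header */
}

/* Records the name of whichever function expands this macro. */
#define tracked_alloc(size) tracked_alloc_at((size), __func__)

static const char *alloc_site_of(const void *obj)
{
    return ((const struct track *)obj - 1)->alloc_site;
}

static void tracked_free(void *obj)
{
    if (obj)
        free((struct track *)obj - 1);
}

/* Minimal stand-in for a vma whose anon_vma was never set up. */
struct vma_model {
    void *anon_vma;
};

static struct vma_model *mmap_region_model(void)
{
    return tracked_alloc(sizeof(struct vma_model));
}
```

With this in place, a check that finds a NULL anon_vma can call
alloc_site_of() on the offending object and learn that it came from
mmap_region_model, much as the debug report is expected to name the real
allocating routine.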

Linus

2010-04-08 23:53:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Thu, Apr 08, 2010 at 04:16:23PM -0700

> > And this happens quite often - I changed the WARN_ONCE to WARN and can't
> > start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
> > upon boot too:
>
> Hmm. I tried console-kit-daemon, which I had installed, but didn't get
> anything like that. Probably some setup difference.
>
> I also went through every user of 'vm_area_cachep', and saw nothing
> suspicious at least for the mmu case (I didn't check the nommu.c code). I
> must have missed something.
>
> One thing you could do is to add some more debugging info when that "no
> anon_vma" warning happens. In particular, if you still have the SLUB
> debugging on, you could try to do that
>
> page = virt_to_head_page(vma);
> object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma");
>
> and it should give you _which_ routine did the kmem_cache_alloc() for the
> vma that doesn't have an anon_vma.

Yep, looks good: it's mmap_region()...


[ 88.237326] ------------[ cut here ]------------
[ 88.237377] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab()
[ 88.237403] Hardware name: System Product Name
[ 88.237428] Mapping with no anon_vma
[ 88.237451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[ 88.237938] Pid: 1978, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #9
[ 88.237980] Call Trace:
[ 88.239269] [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94
[ 88.239320] [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43
[ 88.239378] [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab
[ 88.239440] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[ 88.239471] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[ 88.239517] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[ 88.239548] [<ffffffff813f9463>] ? error_sti+0x5/0x6
[ 88.239597] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 88.239626] [<ffffffff813f927f>] page_fault+0x1f/0x30
[ 88.239674] ---[ end trace 42d53170a0d3ccef ]---
[ 88.239699] =============================================================================
[ 88.239750] BUG vm_area_struct: NULL anon_vma
[ 88.239790] -----------------------------------------------------------------------------
[ 88.239794]
[ 88.239805] INFO: Allocated in mmap_region+0x23d/0x500 age=2 cpu=0 pid=1978
[ 88.239815] INFO: Slab 0xffffea0007a0f0e8 objects=17 used=1 fp=0xffff88022dfbb0f0 flags=0x80000000000000c2
[ 88.239823] INFO: Object 0xffff88022dfbb000 @offset=0 fp=0xffff88022dfbb0f0
[ 88.239827]
[ 88.239832] Object 0xffff88022dfbb000: 00 32 53 2b 02 88 ff ff 00 20 ab 29 d1 7f 00 00 .2S+..ÿÿ..«)Ñ...
[ 88.239861] Object 0xffff88022dfbb010: 00 30 ac 29 d1 7f 00 00 e0 81 2b 2c 02 88 ff ff .0¬)Ñ...à.+,..ÿÿ
[ 88.239886] Object 0xffff88022dfbb020: 25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s.......
[ 88.239910] Object 0xffff88022dfbb030: 10 82 2b 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ..+,..ÿÿ........
[ 88.239966] Object 0xffff88022dfbb040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 88.240016] Object 0xffff88022dfbb050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 88.240077] Object 0xffff88022dfbb060: 00 00 00 00 00 00 00 00 10 a0 1c 2c 02 88 ff ff ...........,..ÿÿ
[ 88.240160] Object 0xffff88022dfbb070: 10 a0 1c 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ...,..ÿÿ........
[ 88.240225] Object 0xffff88022dfbb080: 00 00 00 00 00 00 00 00 b2 9a 12 fd 07 00 00 00 ........²..ý....
[ 88.240294] Object 0xffff88022dfbb090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 88.240352] Object 0xffff88022dfbb0a0: 00 00 00 00 00 00 00 00 ........
[ 88.240442] Redzone 0xffff88022dfbb0a8: cc cc cc cc cc cc cc cc ÌÌÌÌÌÌÌÌ
[ 88.240509] Padding 0xffff88022dfbb0e8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[ 88.240567] Pid: 1978, comm: console-kit-dae Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9
[ 88.240578] Call Trace:
[ 88.240593] [<ffffffff810cd802>] print_trailer+0x139/0x142
[ 88.240607] [<ffffffff810cd845>] object_err+0x3a/0x42
[ 88.240617] [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab
[ 88.240641] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[ 88.240652] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[ 88.240663] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[ 88.240685] [<ffffffff813f9463>] ? error_sti+0x5/0x6
[ 88.240695] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 88.240707] [<ffffffff813f927f>] page_fault+0x1f/0x30
[ 93.841666] ------------[ cut here ]------------
[ 93.841716] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab()
[ 93.841741] Hardware name: System Product Name
[ 93.841766] Mapping with no anon_vma
[ 93.841793] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[ 93.842339] Pid: 2050, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9
[ 93.842383] Call Trace:
[ 93.842424] [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94
[ 93.842457] [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43
[ 93.842492] [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab
[ 93.842527] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[ 93.842561] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[ 93.842593] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[ 93.842627] [<ffffffff813f9463>] ? error_sti+0x5/0x6
[ 93.842660] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 93.842694] [<ffffffff813f927f>] page_fault+0x1f/0x30
[ 93.842724] ---[ end trace 42d53170a0d3ccf0 ]---
[ 93.842750] =============================================================================
[ 93.842794] BUG vm_area_struct: NULL anon_vma
[ 93.842822] -----------------------------------------------------------------------------
[ 93.842827]
[ 93.842889] INFO: Allocated in mmap_region+0x23d/0x500 age=1 cpu=2 pid=2050
[ 93.842918] INFO: Slab 0xffffea00079b84b8 objects=17 used=7 fp=0xffff88022c6f1690 flags=0x80000000000000c2
[ 93.842961] INFO: Object 0xffff88022c6f15a0 @offset=1440 fp=0xffff88022c6f1690
[ 93.842965]
[ 93.843005] Bytes b4 0xffff88022c6f1590: 48 d9 fc ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a HÙüÿ....ZZZZZZZZ
[ 93.843466] Object 0xffff88022c6f15a0: 00 78 b4 2e 02 88 ff ff 00 80 ce 49 5f 7f 00 00 .x´...ÿÿ..ÎI_...
[ 93.843877] Object 0xffff88022c6f15b0: 00 90 4e 4a 5f 7f 00 00 c0 13 6f 2c 02 88 ff ff ..NJ_...À.o,..ÿÿ
[ 93.844391] Object 0xffff88022c6f15c0: 25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s.......
[ 93.844794] Object 0xffff88022c6f15d0: e0 94 4a 2c 02 88 ff ff 00 00 00 00 00 00 00 00 à.J,..ÿÿ........
[ 93.845198] Object 0xffff88022c6f15e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 93.845665] Object 0xffff88022c6f15f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 93.846076] Object 0xffff88022c6f1600: 00 00 00 00 00 00 00 00 30 2d ec 2a 02 88 ff ff ........0-ì*..ÿÿ
[ 93.846518] Object 0xffff88022c6f1610: 30 2d ec 2a 02 88 ff ff 00 00 00 00 00 00 00 00 0-ì*..ÿÿ........
[ 93.846931] Object 0xffff88022c6f1620: 00 00 00 00 00 00 00 00 e8 9c f4 f5 07 00 00 00 ........è.ôõ....
[ 93.847372] Object 0xffff88022c6f1630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 93.847787] Object 0xffff88022c6f1640: 00 00 00 00 00 00 00 00 ........
[ 93.848194] Redzone 0xffff88022c6f1648: cc cc cc cc cc cc cc cc ÌÌÌÌÌÌÌÌ
[ 93.848635] Padding 0xffff88022c6f1688: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[ 93.849036] Pid: 2050, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9
[ 93.849078] Call Trace:
[ 93.849111] [<ffffffff810cd802>] print_trailer+0x139/0x142
[ 93.849142] [<ffffffff810cd845>] object_err+0x3a/0x42
[ 93.849174] [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab
[ 93.849204] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[ 93.849237] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[ 93.849301] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[ 93.849337] [<ffffffff813f9463>] ? error_sti+0x5/0x6
[ 93.849370] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 93.849418] [<ffffffff813f927f>] page_fault+0x1f/0x30


--
Regards/Gruss,
Boris.

2010-04-09 00:54:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> Yep, looks good: it's mmap_region()...

Can you double-check your current diffs - maybe something got corrupted.

mmap_region installs the vma with vma_link(), and the last thing
vma_link() does with my patch is that "anon_vma_prepare()".

Maybe with all the patches flying around, you had a reject or something,
and you lost that one anon_vma_prepare()?

Or maybe I screwed up somewhere and sent you the wrong patch. Here it is
again, just in case.

[ I have a horrible cold, and can hardly think straight. So who knows,
maybe I'm missing something. But if you have lost one of the
'anon_vma_prepare()' call sites, that would certainly explain why you
get NULL anon_vma's ]

Linus

---
mm/memory.c | 12 +++---------
mm/mmap.c | 17 ++++-------------
2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
-
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

+ if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+ return VM_FAULT_SIGBUS;
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..82392c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

mm->map_count++;
validate_mm(mm);
+
+ anon_vma_prepare(vma);
}

/*
@@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ anon_vma_prepare(vma);
+
if (remove_next) {
if (file) {
fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
anon_vma_lock(vma);

/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
{
int error;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
-
address &= PAGE_MASK;
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)

2010-04-09 01:30:19

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Thu, Apr 08, 2010 at 05:50:21PM -0700

> > Yep, looks good: its mmap_region()...
>
> Can you double-check your current diffs - maybe something got corrupted.
>
> mmap_region installs the vma with vma_link(), and the last thing
> vma_link() does with my patch is that "anon_vma_prepare()".

Right, it looks like it. I'll add some more debugging calls there
tomorrow - it might give us more clues in case someone hasn't caught it
until then.

> Maybe with all the patches flying around, you had a reject or something,
> and you lost that one anon_vma_prepare()?
>
> Or maybe I screwed up somewhere and sent you the wrong patch. Here it is
> again, just in case.

Doesn't look like it - here's the diff between yours and what I have
applied here (yep, only minor fuzz but no code differences). Also, I've
added my version at the end:

--- a.diff 2010-04-09 03:03:35.000000000 +0200
+++ b.diff 2010-04-09 03:03:52.000000000 +0200
@@ -1,8 +1,8 @@
diff --git a/mm/memory.c b/mm/memory.c
-index 1d2ea39..bd7ea7f 100644
+index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
-@@ -2224,9 +2224,6 @@ reuse:
+@@ -2223,9 +2223,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

@@ -12,7 +12,7 @@ index 1d2ea39..bd7ea7f 100644
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
-@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

@@ -21,7 +21,7 @@ index 1d2ea39..bd7ea7f 100644
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
-@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
@@ -32,7 +32,7 @@ index 1d2ea39..bd7ea7f 100644
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
-@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

@@ -43,7 +43,7 @@ index 1d2ea39..bd7ea7f 100644

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
-index bf0600c..4592a93 100644
+index 75557c6..82392c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

> [ I have a horrible cold, and can hardly think straight. So who knows,
> maybe I'm missing something. But if you have lost one of the
> 'anon_vma_prepare()' call sites, that would certainly explain why you
> get NULL anon_vma's ]

Oh, sorry to hear that. Ok, let's stop for today - it is 3am here and
even if some would say, "well, this is just getting interesting" :), I
think it would be best to "sleep on it." :)

Thanks.

--
commit 2156db98fd84d07e3b86564f429fcc8c6b7d61df
Author: Linus Torvalds
Date: Thu Apr 8 22:09:53 2010 +0200

rmap: preallocate anon VMAs

On Thu, 8 Apr 2010, Borislav Petkov wrote:
>
> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
> mappings while we might sleep in anon_vma_prepare():

Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in
__insert_vm_struct.

It should be simple enough to just move it into the caller, just after it
releases that lock. There's only one user of that __insert_vm_struct()
anyway. You can do it yourself, or you can replace my previous patch with
this..

[ The patch below also makes it warn once and return SIGBUS for the case
where there is no anon_vma. I decided I still want to hear about it if
there might be some path that tries to insert a vma on its own ]

Linus

diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..bd7ea7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2224,9 +2224,6 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
-
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
pte_unmap(page_table);

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
if (!page) {
@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd;
pte_t *pte;

+ if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+ return VM_FAULT_SIGBUS;
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index bf0600c..4592a93 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

mm->map_count++;
validate_mm(mm);
+
+ anon_vma_prepare(vma);
}

/*
@@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ anon_vma_prepare(vma);
+
if (remove_next) {
if (file) {
fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
anon_vma_lock(vma);

/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
{
int error;

- /*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
- */
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
-
address &= PAGE_MASK;
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)

--
Regards/Gruss,
Boris.

2010-04-09 01:45:23

by KOSAKI Motohiro

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

>
>
> On Fri, 9 Apr 2010, Borislav Petkov wrote:
> >
> > Yep, looks good: its mmap_region()...
>
> Can you double-check your current diffs - maybe something got corrupted.
>
> mmap_region installs the vma with vma_link(), and the last thing
> vma_link() does with my patch is that "anon_vma_prepare()".

I agree, and at least your patch works fine on my box. I'll continue digging.



>
> Maybe with all the patches flying around, you had a reject or something,
> and you lost that one anon_vma_prepare()?
>
> Or maybe I screwed up somewhere and sent you the wrong patch. Here it is
> again, just in case.
>
> [ I have a horrible cold, and can hardly think straight. So who knows,
> maybe I'm missing something. But if you have lost one of the
> 'anon_vma_prepare()' call sites, that would certainly explain why you
> get NULL anon_vma's ]
>
> Linus

2010-04-09 10:38:56

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Borislav Petkov
Date: Fri, Apr 09, 2010 at 03:30:12AM +0200

> > Maybe with all the patches flying around, you had a reject or something,
> > and you lost that one anon_vma_prepare()?
> >
> > Or maybe I screwed up somewhere and sent you the wrong patch. Here it is
> > again, just in case.
>
> Doesn't look like it - here's the diff between yours and what I have
> applied here (yep, only minor fuzz but no code differences) Also, I've
> added my version at the end:

So I went and reapplied the three patches (3rd is the object_err export
for SLUB debugging) on a new branch of today's git - same results, the
same processes crap up in the WARN(!vma->anon_vma) check so it should be
something else we're missing.

More code staring later...

--
Regards/Gruss,
Boris.

2010-04-09 16:40:27

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> So I went and reapplied the three patches (3rd is the object_err export
> for SLUB debugging) on a new branch of today's git - same results, the
> same processes crap up in the WARN(!vma->anon_vma) check so it should be
> something else we're missing.
>
> More code staring later...

Can you try with _just_ my patch? Or add a

vma->anon_vma = merge_vma->anon_vma;

to Rik's "merge_vma" case in anon_vma_prepare().

Because I'm staring at Rik's patch, and one thing strikes me: it does that
"anon_vma_clone()" in anon_vma_prepare(), and maybe I'm blind, but I don't
see where that actually sets vma->anon_vma.

As far as I can tell, anon_vma_clone() was designed purely for the fork()
case, which has done

*new = *vma;

which will set new->anon_vma to the same vma. But Rik's patch never does
that for the anon_vma_prepare() case.

And maybe we should do it in anon_vma_clone() itself, just to make it
impossible to mistakenly leave it out, the way I think Rik's patch did.
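
To make the point above concrete, here is a toy model in user-space C. It is
only a sketch: the struct layouts and the toy_* helpers are invented for
illustration and are not the kernel's definitions. It shows why copying the
whole struct (as fork() does) carries the anon_vma pointer along, while a
clone that only links chain entries leaves it NULL unless it sets it itself.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct anon_vma {
	int id;
};

struct vm_area_struct {
	struct anon_vma *anon_vma;	/* the pointer the fault path checks */
	int nr_chain;			/* stand-in for the anon_vma_chain list */
};

/* Models the fork() case: "*new = *vma" copies the anon_vma pointer too. */
static void toy_fork_copy(struct vm_area_struct *new,
			  const struct vm_area_struct *vma)
{
	*new = *vma;
}

/* Models a clone that only links chain entries and never touches
 * dst->anon_vma (the behaviour described above, as an assumption). */
static int toy_clone_links_only(struct vm_area_struct *dst,
				const struct vm_area_struct *src)
{
	dst->nr_chain = src->nr_chain;
	return 0;
}

/* Same clone, but it also sets the pointer itself before the success
 * return, so a caller cannot forget to do it. */
static int toy_clone_and_set(struct vm_area_struct *dst,
			     const struct vm_area_struct *src)
{
	dst->nr_chain = src->nr_chain;
	dst->anon_vma = src->anon_vma;
	return 0;
}
```

With a destination whose anon_vma starts out NULL, toy_clone_links_only()
leaves it NULL, which is the state the WARN_ONCE check trips over, whereas
toy_clone_and_set() does not depend on the caller having copied the struct
first.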

Anyway, I'm still groggy from all the flu medication, so take everything I
say with a grain of salt.

In fact, the more I look at this, the less I think I like Rik's patch in
the first place. I think the real bug that Rik tried to fix is that
apparently anon_vma_merge() doesn't necessarily merge everything right.
From Rik's bug-explanation, step 5:

>> 5) vma_adjust calls anon_vma_merge, causing the anon_vma
>> chain of one of the VMAs to get nuked - with bad luck,
>> this is the original one, leaving just the new anon_vma
>> attached to the VMA

and I think that _this_ is the real bug to begin with. The real fix should
be in vma_adjust/anon_vma_merge, not in how we set up the anon_vma in the
first place. I do _not_ think we should require that we always merged
things at mmap() time, because we may _never_ be able to merge perfectly
(ie start out with two disjoint mmaps, and fill in the middle).

Linus

2010-04-09 17:49:40

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Fri, Apr 09, 2010 at 09:35:15AM -0700

> Can you try with _just_ my patch?

Yep, yours along with the SLUB debugging piece just survived one
hibernation cycle without a problem. Also, no SIGBUS-killed processes,
all seems fine. Will continue stressing it though...

Let me know what you want me to do next.

--
Regards/Gruss,
Boris.

2010-04-09 17:54:56

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> From: Linus Torvalds
> Date: Fri, Apr 09, 2010 at 09:35:15AM -0700
>
> > Can you try with _just_ my patch?
>
> Yep, yours along with the SLUB debugging piece just survived one
> hibernation cycle without a problem. Also, no SIGBUS-killed processes,
> all seems fine. Will continue stressing it though...
>
> Let me know what you want me to do next.

Continue stress-testing it. I don't think my patch on its own should fix
the original problem, but at least we now know why you got those NULL
anon_vma's.

So what I _think_ will happen is that you'll be able to re-create the
problem that started this all. But I'd like to verify that, just because
I'm anal and I'd like these things to be tested independently.

So assuming that the original problem happens again, if you can then apply
Rik's patch, but add a

dst->anon_vma = src->anon_vma;

to just before the success case (the "return 0") in anon_vma_clone(),
that would be good.

Linus

2010-04-09 19:19:41

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Fri, Apr 09, 2010 at 10:50:23AM -0700

> Continue stress-testing it. I don't think my patch on its own should fix
> the original problem, but at least we now know why you got those NULL
> anon_vma's.
>
> So what I _think_ will happen is that you'll be able to re-create the
> problem that started this all. But I'd like to verify that, just because
> I'm anal and I'd like these things to be tested independently.

Heh, that was easy. Third hibernate cycle is a charm^Wboom :)

> So assuming that the original problem happens again, if you can then apply
> Rik's patch, but add a
>
> dst->anon_vma = src->anon_vma;
>
> to just before the success case (the "return 0") in anon_vma_clone(),
> that would be good.

It looks like this way we mangle the anon_vma chains somehow. From
what I can see and if I'm not mistaken, we save the anon_vmas alright
but end up in what seems like an endless list_for_each_entry()
loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
can't seem to yield it through page_unlock_anon_vma() at the end of
page_referenced_anon() so it has to be that code in between iterating
over each list entry...

I could be completely wrong though...
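
For illustration only, here is a user-space sketch of how such an endless
walk could come about. It is a guess at the failure mode, not a diagnosis:
the trace itself only shows the CPU spinning in do_raw_spin_lock() under
page_lock_anon_vma(). The toy_list type and helpers are made up and merely
mimic the shape of the kernel's circular list_head. A forward walk ends only
when the cursor gets back to the head, so a chain whose next pointers no
longer lead there never terminates; the step limit below exists only so the
example can report that instead of hanging.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, in the style of list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *head)
{
	head->next = head;
	head->prev = head;
}

static void toy_list_add_tail(struct toy_list *node, struct toy_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Walk forward from head until the cursor returns to head, the same
 * termination rule a list_for_each()-style loop relies on.
 * Returns the number of entries, or -1 if more than max_steps nodes
 * were visited without getting back to head.
 */
static long toy_list_count_bounded(const struct toy_list *head, long max_steps)
{
	const struct toy_list *pos;
	long n = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (++n > max_steps)
			return -1;
	}
	return n;
}
```

On a well-formed list the walk returns the entry count. Once one next
pointer is redirected so that the cycle bypasses the head, only the step
limit stops it, and a loop without such a limit would keep running with the
lock held.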


[ 373.683545] PM: Syncing filesystems ... done.
[ 373.950289] Freezing user space processes ... (elapsed 0.04 seconds) done.
[ 373.998878] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 374.011121] PM: Preallocating image memory...
[ 439.161126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[ 439.161315] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 439.162302] irq event stamp: 0
[ 439.162302] hardirqs last enabled at (0): [<(null)>] (null)
[ 439.162302] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 439.163297] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 439.163297] softirqs last disabled at (0): [<(null)>] (null)
[ 439.163297] CPU 1
[ 439.163297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 439.165297]
[ 439.165297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[ 439.165297] RIP: 0010:[<ffffffff8118b731>] [<ffffffff8118b731>] delay_tsc+0x0/0xca
[ 439.165297] RSP: 0018:ffff8801f68b77f0 EFLAGS: 00000202
[ 439.166300] RAX: 0000000000000000 RBX: ffff8801f68b77f8 RCX: 000000000000f100
[ 439.166300] RDX: 0000000000000001 RSI: ffff8801f68b7848 RDI: 0000000000000001
[ 439.166300] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[ 439.166300] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 000000000000f100
[ 439.166300] R13: 00000000cc444700 R14: 0000000000000001 R15: 0000000000000000
[ 439.166300] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 439.167296] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 439.167296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[ 439.167296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[ 439.167296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 439.167296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[ 439.167296] Stack:
[ 439.168297] ffffffff8118b72f ffff8801f68b7848 ffffffff8119a1ca ffff880214972868
[ 439.168297] <0> 0000000000000001 ffff880100000000 ffff880214972850 ffff880214972868
[ 439.168297] <0> ffff8801f68b7cf8 ffff8801f68b7b78 ffff8801f68b7a00 ffff8801f68b7878
[ 439.169298] Call Trace:
[ 439.169298] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 439.169298] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 439.169298] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 439.170299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 439.170299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 439.170299] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 439.170299] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 439.170299] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 439.170299] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 439.170299] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 439.171296] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 439.171296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 439.171296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 439.171296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 439.171296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 439.171296] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 439.171296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 439.172298] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 439.172298] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 439.172298] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 439.172298] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 439.172298] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 439.172298] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 439.173296] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 439.173296] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 439.173296] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 439.173296] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 439.173296] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[ 439.173296] Code: ff c8 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 c7 05 12 35 4e 00 31 b7 18 81 c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 15 01 35 4e 00 c9 c3 <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 0f 1f 44 00
[ 439.176296] Call Trace:
[ 439.177297] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 439.177297] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 439.177297] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 439.177297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 439.177297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 439.177297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 439.177297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 439.178295] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 439.178295] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 439.178295] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 439.178295] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 439.178295] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 439.178295] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 439.178295] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 439.179299] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 439.179299] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 439.179299] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 439.179299] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 439.179299] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 439.179299] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 439.180296] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 439.180296] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 439.180296] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 439.180296] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 439.180296] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 439.180296] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 439.180296] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 439.181297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[ 504.659125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[ 504.659126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 504.660297] irq event stamp: 0
[ 504.660297] hardirqs last enabled at (0): [<(null)>] (null)
[ 504.660297] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 504.661298] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 504.661298] softirqs last disabled at (0): [<(null)>] (null)
[ 504.661298] CPU 1
[ 504.661298] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 504.663297]
[ 504.663297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[ 504.663297] RIP: 0010:[<ffffffff8118b775>] [<ffffffff8118b775>] delay_tsc+0x44/0xca
[ 504.663297] RSP: 0018:ffff8801f68b77b8 EFLAGS: 00000206
[ 504.663297] RAX: 00000000a4911fed RBX: ffff8801f68b77e8 RCX: 000000000000f100
[ 504.664326] RDX: 00000000000000f1 RSI: ffff8801f68b7848 RDI: 0000000000000001
[ 504.664326] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[ 504.664326] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010
[ 504.664326] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8
[ 504.664326] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 504.664326] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 504.665296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[ 504.665296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[ 504.665296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 504.665296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[ 504.665296] Stack:
[ 504.665296] 0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160
[ 504.666297] <0> ffff88022a2e83a8 000000005486e668 ffff8801f68b77f8 ffffffff8118b72f
[ 504.666297] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001
[ 504.667298] Call Trace:
[ 504.667298] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 504.667298] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 504.667298] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 504.667298] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 504.668288] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 504.668298] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 504.668298] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 504.668298] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 504.668298] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 504.668298] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 504.668298] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 504.669296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 504.669296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 504.669296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 504.669296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 504.669296] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 504.669296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 504.669296] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 504.670302] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 504.670302] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 504.670302] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 504.670302] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 504.670302] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 504.670302] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 504.670302] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 504.671297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 504.671297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 504.673315] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[ 504.674350] Code: bf 01 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 <0f> 31 41 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00
[ 504.677299] Call Trace:
[ 504.677299] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 504.677299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 504.677299] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 504.677299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 504.678287] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 504.678296] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 504.678296] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 504.678296] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 504.678296] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 504.678296] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 504.678296] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 504.679297] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 504.679297] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 504.679297] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 504.679297] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 504.679297] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 504.679297] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 504.679297] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 504.680303] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 504.680303] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 504.680303] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 504.680303] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 504.680303] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 504.680303] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 504.680303] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 504.681297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 504.681297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 504.681297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[ 570.157125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[ 570.157126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 570.158283] irq event stamp: 0
[ 570.158283] hardirqs last enabled at (0): [<(null)>] (null)
[ 570.158283] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 570.159297] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[ 570.159297] softirqs last disabled at (0): [<(null)>] (null)
[ 570.159297] CPU 1
[ 570.159297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 570.161297]
[ 570.161297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[ 570.161297] RIP: 0010:[<ffffffff8118b777>] [<ffffffff8118b777>] delay_tsc+0x46/0xca
[ 570.161297] RSP: 0018:ffff8801f68b77b8 EFLAGS: 00000206
[ 570.161297] RAX: 000000007cdde43c RBX: ffff8801f68b77e8 RCX: 000000000000f100
[ 570.162296] RDX: 000000000000011f RSI: ffff8801f68b7848 RDI: 0000000000000001
[ 570.162296] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[ 570.162296] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010
[ 570.162296] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8
[ 570.162296] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 570.162296] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 570.163296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[ 570.163296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[ 570.163296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 570.163296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[ 570.163296] Stack:
[ 570.163296] 0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160
[ 570.164335] <0> ffff88022a2e83a8 000000007f0025c7 ffff8801f68b77f8 ffffffff8118b72f
[ 570.164335] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001
[ 570.165299] Call Trace:
[ 570.165299] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 570.165299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 570.165299] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 570.165299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 570.165299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 570.166297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 570.166297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 570.166297] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 570.166297] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 570.166297] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 570.166297] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 570.167296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 570.167296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 570.167296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 570.167296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 570.167296] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 570.167296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 570.167296] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 570.168286] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 570.168286] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 570.168286] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 570.168286] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 570.168286] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 570.168286] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 570.168286] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 570.169297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 570.169297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 570.169297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[ 570.169297] Code: 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 <41> 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00 e8 07
[ 570.172299] Call Trace:
[ 570.172299] [<ffffffff8118b72f>] ? __delay+0xf/0x11
[ 570.172299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[ 570.173297] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[ 570.173297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 570.173297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[ 570.173297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[ 570.173297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[ 570.173297] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[ 570.174329] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[ 570.174329] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[ 570.174329] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[ 570.174329] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[ 570.174329] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[ 570.174329] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[ 570.174329] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[ 570.175297] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[ 570.175297] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[ 570.175297] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[ 570.175297] [<ffffffff813f5285>] ? printk+0x41/0x44
[ 570.175297] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[ 570.175297] [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[ 570.175297] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[ 570.176298] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[ 570.176298] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[ 570.176298] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[ 570.176298] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 570.176298] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[ 570.176298] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b


--
Regards/Gruss,
Boris.

2010-04-09 19:37:06

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Borislav Petkov wrote:
> >
> > So what I _think_ will happen is that you'll be able to re-create the
> > problem that started this all. But I'd like to verify that, just because
> > I'm anal and I'd like these things to be tested independently.
>
> Heh, that was easy. Third hibernate cycle is a charm^Wboom :)

Ok, good to know that I'm still tracking ok on the issue.

> > So assuming that the original problem happens again, if you can then apply
> > Rik's patch, but add a
> >
> > dst->anon_vma = src->anon_vma;
> >
> > to just before the success case (the "return 0") in anon_vma_clone(),
> > that would be good.
>
> It looks like this way we mangle the anon_vma chains somehow. From
> what I can see and if I'm not mistaken, we save the anon_vmas alright
> but end up in what seems like an endless list_for_each_entry()
> loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
> can't seem to yield it through page_unlock_anon_vma() at the end of
> page_referenced_anon() so it has to be that code in between iterating
> over each list entry...

Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up.

Rik? I think it's back to you. I'm not going to bother committing the
change to the anon_vma locking unless you actually need the locking
guarantees for anon_vma_prepare().

And I've got the feeling that the proper fix is in the vma_adjust()
handling if your original idea was right.

Anybody?

We're at the point where I've already delayed -rc4 several days because
it's pointless cutting it without fixing this. One option is to just say
"f*ck it, we'll revert it all and try again later". But it feels so
close..

Linus

2010-04-09 20:05:10

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/09/2010 03:32 PM, Linus Torvalds wrote:

> Rik? I think it's back to you. I'm not going to bother committing the
> change to the anon_vma locking unless you actually need the locking
> guarantees for anon_vma_prepare().

> And I've got the feeling that the proper fix is in the vma_adjust()
> handling if your original idea was right.

We can fix it on the other side, by changing anon_vma_merge
to actually link all the anon_vma structs into the VMA.

An added benefit is that we are already holding the required
lock (mmap_sem) exclusively in that code path.

I'll cook up a patch and I'll mail it out after a little
testing.

2010-04-09 20:43:44

by Johannes Weiner

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Fri, Apr 09, 2010 at 12:32:30PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 9 Apr 2010, Borislav Petkov wrote:
> > >
> > > So what I _think_ will happen is that you'll be able to re-create the
> > > problem that started this all. But I'd like to verify that, just because
> > > I'm anal and I'd like these things to be tested independently.
> >
> > Heh, that was easy. Third hibernate cycle is a charm^Wboom :)
>
> Ok, good to know that I'm still tracking ok on the issue.
>
> > > So assuming that the original problem happens again, if you can then apply
> > > Rik's patch, but add a
> > >
> > > dst->anon_vma = src->anon_vma;
> > >
> > > to just before the success case (the "return 0") in anon_vma_clone(),
> > > that would be good.
> >
> > It looks like this way we mangle the anon_vma chains somehow. From
> > what I can see and if I'm not mistaken, we save the anon_vmas alright
> > but end up in what seems like an endless list_for_each_entry()
> > loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
> > can't seem to yield it through page_unlock_anon_vma() at the end of
> > page_referenced_anon() so it has to be that code in between iterating
> > over each list entry...
>
> Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up.
>
> Rik? I think it's back to you. I'm not going to bother committing the
> change to the anon_vma locking unless you actually need the locking
> guarantees for anon_vma_prepare().
>
> And I've got the feeling that the proper fix is in the vma_adjust()
> handling if your original idea was right.
>
> Anybody?

Okay, I think I got it working. I first thought we would need an
m^n loop to properly merge the anon_vma_chains, but we can actually
be cleverer than that:

---
Subject: mm: properly merge anon_vma_chains when merging vmas

Merging can happen when two VMAs were split from one root VMA or
a mergeable VMA was instantiated and reused a nearby VMA's anon_vma.

In both cases, none of the VMAs can grow any more anon_vmas and forked
VMAs can no longer get merged due to differing primary anon_vmas for
their private COW-broken pages.

In the split case, the anon_vma_chains are equal and we can just drop
the one of the VMA that is going away.

In the other case, the VMA that was instantiated later has only one
anon_vma on its chain: the primary anon_vma of its merge partner (due
to anon_vma_prepare()).

If the VMA that came later is going away, its anon_vma_chain is a
subset of the one that is staying, so it can be dropped like in the
split case.

Only if the VMA that came first is going away, its potential parent
anon_vmas need to be migrated to the VMA that is staying.

Signed-off-by: Johannes Weiner <[email protected]>
---

It compiles and boots but I have not really exercised this code.
Boris, could you give it a spin? Thanks!

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..ecef882 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -114,13 +114,7 @@ int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
-
-static inline void anon_vma_merge(struct vm_area_struct *vma,
-				  struct vm_area_struct *next)
-{
-	VM_BUG_ON(vma->anon_vma != next->anon_vma);
-	unlink_anon_vmas(next);
-}
+void anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
 
 /*
  * rmap interfaces called when adding or removing pte of page
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..498a46e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -268,6 +268,58 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	}
 }
 
+void anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
+{
+	VM_BUG_ON(vma->anon_vma != next->anon_vma);
+	/*
+	 * 1. case: vma and next are split parts of one root vma.
+	 * Their anon_vma_chain is equal and we can drop that of next.
+	 *
+	 * 2. case: one vma was instantiated as mergeable with the
+	 * other one and inherited the other one's primary anon_vma as
+	 * the singleton in its chain.
+	 *
+	 * If next came after vma, vma's chain is already an unstrict
+	 * superset of next's and we can treat it like case 1.
+	 *
+	 * If vma has the singleton chain, we have to copy next's
+	 * unique anon_vmas over.
+	 */
+	if (!list_is_singular(&vma->anon_vma_chain)) {
+		unlink_anon_vmas(next);
+		return;
+	}
+	while (!list_empty(&next->anon_vma_chain)) {
+		struct anon_vma_chain *avc;
+
+		avc = list_first_entry(&next->anon_vma_chain,
+				struct anon_vma_chain, same_vma);
+		if (avc->anon_vma == vma->anon_vma) {
+			/*
+			 * The shared one that vma inherited in
+			 * anon_vma_prepare. Don't copy it, we
+			 * already have it.
+			 */
+			spin_lock(&avc->anon_vma->lock);
+			list_del(&avc->same_anon_vma);
+			spin_unlock(&avc->anon_vma->lock);
+
+			list_del(&avc->same_vma);
+			anon_vma_chain_free(avc);
+		} else {
+			/*
+			 * One of the parent anon_vmas, move it over.
+			 * Make sure nobody walks the vma list while
+			 * the entries are in flux.
+			 */
+			spin_lock(&avc->anon_vma->lock);
+			avc->vma = vma;
+			list_move_tail(&avc->same_vma, &vma->anon_vma_chain);
+			spin_unlock(&avc->anon_vma->lock);
+		}
+	}
+}
+
 static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;

2010-04-09 20:59:16

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/09/2010 04:43 PM, Johannes Weiner wrote:

> Okay, I think I got it working. I first thought we would need an
> m^n loop to properly merge the anon_vma_chains, but we can actually
> be cleverer than that:

I've looked it over 5 times, can't find anything wrong
with it. Your approach looks like it should work just
fine.

Certainly easier than the things Linus and I tried :)

> Signed-off-by: Johannes Weiner<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2010-04-09 21:39:43

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Johannes Weiner <[email protected]>
Date: Fri, Apr 09, 2010 at 10:43:28PM +0200

Hi Hannes :) ,

> ---
> Subject: mm: properly merge anon_vma_chains when merging vmas
>
> Merging can happen when two VMAs were split from one root VMA or
> a mergeable VMA was instantiated and reused a nearby VMA's anon_vma.
>
> In both cases, none of the VMAs can grow any more anon_vmas and forked
> VMAs can no longer get merged due to differing primary anon_vmas for
> their private COW-broken pages.
>
> In the split case, the anon_vma_chains are equal and we can just drop
> the one of the VMA that is going away.
>
> In the other case, the VMA that was instantiated later has only one
> anon_vma on its chain: the primary anon_vma of its merge partner (due
> to anon_vma_prepare()).
>
> If the VMA that came later is going away, its anon_vma_chain is a
> subset of the one that is staying, so it can be dropped like in the
> split case.
>
> Only if the VMA that came first is going away, its potential parent
> anon_vmas need to be migrated to the VMA that is staying.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
>
> It compiles and boots but I have not really exercised this code.
> Boris, could you give it a spin? Thanks!

OK, I applied this on top of mainline (no other patches from this thread),
but unfortunately it breaks at the same spot under heavy page reclaim
when trying to hibernate while booting 3 guests.

[ 322.171120] PM: Preallocating image memory...
[ 322.477374] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 322.477376] IP: [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[ 322.477376] PGD 2014e8067 PUD 221b4e067 PMD 0
[ 322.477376] Oops: 0000 [#1] PREEMPT SMP
[ 322.477376] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[ 322.477376] CPU 3
[ 322.477376] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core
[ 322.477376]
[ 322.477376] Pid: 2750, comm: hib.sh Tainted: G W 2.6.34-rc3-00411-ga7247b6 #13 M3A78 PRO/System Product Name
[ 322.477376] RIP: 0010:[<ffffffff810c0c87>] [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[ 322.477376] RSP: 0018:ffff88020936d8b8 EFLAGS: 00010283
[ 322.477376] RAX: ffff88022de91af0 RBX: ffffea0006dcb488 RCX: 0000000000000000
[ 322.477376] RDX: ffff88020936dcf8 RSI: ffff88022de91ac8 RDI: ffff88022ced0000
[ 322.477376] RBP: ffff88020936d938 R08: 0000000000000002 R09: 0000000000000000
[ 322.477376] R10: 0000000000000246 R11: 0000000000000003 R12: 0000000000000000
[ 322.477376] R13: ffffffffffffffe0 R14: ffff88022de91ab0 R15: ffff88020936da00
[ 322.477376] FS: 00007f286493e6f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[ 322.477376] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 322.477376] CR2: 0000000000000000 CR3: 00000001f8354000 CR4: 00000000000006e0
[ 322.477376] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[ 322.477376] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 322.477376] Process hib.sh (pid: 2750, threadinfo ffff88020936c000, task ffff88022ced0000)
[ 322.477376] Stack:
[ 322.477376] ffff88022de91af0 00000000813f8eec ffffffff8165ce28 000000000000002e
[ 322.477376] <0> ffff88020936d8f8 ffffffff810c60bc ffffea0006dcb450 ffffea0006dcb450
[ 322.477376] <0> ffff88020936d938 00000002810ab29d 0000000006f316b0 ffffea0006dcb4b0
[ 322.477376] Call Trace:
[ 322.477376] [<ffffffff810c60bc>] ? swapcache_free+0x37/0x3c
[ 322.477376] [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477
[ 322.477376] [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5
[ 322.477376] [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244
[ 322.477376] [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6
[ 322.477376] [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f
[ 322.477376] [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4
[ 322.477376] [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0
[ 322.477376] [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79
[ 322.477376] [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 322.477376] [<ffffffff813f5325>] ? printk+0x41/0x44
[ 322.477376] [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1
[ 322.477376] [<ffffffff81075cfc>] hibernate+0xce/0x172
[ 322.477376] [<ffffffff81074a69>] state_store+0x5c/0xd3
[ 322.477376] [<ffffffff81185043>] kobj_attr_store+0x17/0x19
[ 322.477376] [<ffffffff81125e87>] sysfs_write_file+0x108/0x144
[ 322.477376] [<ffffffff810d580f>] vfs_write+0xb2/0x153
[ 322.477376] [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 322.477376] [<ffffffff810d5973>] sys_write+0x4a/0x71
[ 322.477376] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 322.477376] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 77 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 322.477376] RIP [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[ 322.477376] RSP <ffff88020936d8b8>
[ 322.477376] CR2: 0000000000000000
[ 322.491359] ---[ end trace 520a5274d8859b71 ]---
[ 322.491509] note: hib.sh[2750] exited with preempt_count 2
[ 322.491663] BUG: scheduling while atomic: hib.sh/2750/0x10000003
[ 322.491810] INFO: lockdep is turned off.
[ 322.491956] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core
[ 322.493364] Pid: 2750, comm: hib.sh Tainted: G D W 2.6.34-rc3-00411-ga7247b6 #13
[ 322.493622] Call Trace:
[ 322.493768] [<ffffffff8106311f>] ? __debug_show_held_locks+0x1b/0x24
[ 322.493919] [<ffffffff8102d3d0>] __schedule_bug+0x72/0x77
[ 322.494070] [<ffffffff813f572e>] schedule+0xd9/0x730
[ 322.494223] [<ffffffff8103023c>] __cond_resched+0x18/0x24
[ 322.494378] [<ffffffff813f5e52>] _cond_resched+0x2c/0x37
[ 322.494527] [<ffffffff810b7da5>] unmap_vmas+0x6ce/0x893
[ 322.494678] [<ffffffff813f8e86>] ? _raw_spin_unlock_irqrestore+0x38/0x69
[ 322.494829] [<ffffffff810bc457>] exit_mmap+0xd7/0x182
[ 322.494978] [<ffffffff81035969>] mmput+0x48/0xb9
[ 322.495131] [<ffffffff81039c39>] exit_mm+0x110/0x11d
[ 322.495280] [<ffffffff8103b67b>] do_exit+0x1c5/0x691
[ 322.495521] [<ffffffff81038d25>] ? kmsg_dump+0x13b/0x155
[ 322.495668] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 322.495816] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 322.495964] [<ffffffff8101ed95>] no_context+0x1fc/0x20b
[ 322.496118] [<ffffffff8101ef30>] __bad_area_nosemaphore+0x18c/0x1af
[ 322.496267] [<ffffffff8101f16b>] ? do_page_fault+0xa8/0x32d
[ 322.496484] [<ffffffff8101ef66>] bad_area_nosemaphore+0x13/0x15
[ 322.496630] [<ffffffff8101f236>] do_page_fault+0x173/0x32d
[ 322.496780] [<ffffffff813f96e3>] ? error_sti+0x5/0x6
[ 322.496928] [<ffffffff81062bc7>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 322.497082] [<ffffffff813f80d2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 322.497232] [<ffffffff813f94ff>] page_fault+0x1f/0x30
[ 322.497392] [<ffffffff810c0c87>] ? page_referenced+0xee/0x1dc
[ 322.497541] [<ffffffff810c0c19>] ? page_referenced+0x80/0x1dc
[ 322.497690] [<ffffffff810c60bc>] ? swapcache_free+0x37/0x3c
[ 322.497839] [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477
[ 322.497989] [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5
[ 322.498141] [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244
[ 322.498291] [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6
[ 322.498444] [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f
[ 322.498594] [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4
[ 322.498743] [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0
[ 322.498892] [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79
[ 322.499046] [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 322.499195] [<ffffffff813f5325>] ? printk+0x41/0x44
[ 322.499344] [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1
[ 322.499498] [<ffffffff81075cfc>] hibernate+0xce/0x172
[ 322.499647] [<ffffffff81074a69>] state_store+0x5c/0xd3
[ 322.499795] [<ffffffff81185043>] kobj_attr_store+0x17/0x19
[ 322.499944] [<ffffffff81125e87>] sysfs_write_file+0x108/0x144
[ 322.500097] [<ffffffff810d580f>] vfs_write+0xb2/0x153
[ 322.500246] [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 322.500399] [<ffffffff810d5973>] sys_write+0x4a/0x71
[ 322.500547] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
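
For anyone cross-checking the register dump: R13 = ffffffffffffffe0 together with CR2 = 0 is what a container_of()-style subtraction on a NULL list pointer would produce. Below is a minimal userspace sketch of that arithmetic. It assumes an LP64 target and a mock struct laid out like anon_vma_chain, with same_anon_vma at byte offset 32; mock_avc, avc_from_link() and faulting_address() are made-up names for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock list node, same shape as the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Mock of struct anon_vma_chain (assumed layout, LP64): two pointers,
 * then same_vma, then same_anon_vma at byte offset 32.
 */
struct mock_avc {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * What container_of() boils down to: subtract the member offset from
 * the address of the embedded list node. Done on uintptr_t so that a
 * NULL input stays well-defined in this demo.
 */
static uintptr_t avc_from_link(uintptr_t link)
{
	return link - offsetof(struct mock_avc, same_anon_vma);
}

/* Address the next loop step loads from: &avc->same_anon_vma.next. */
static uintptr_t faulting_address(uintptr_t avc)
{
	return avc + offsetof(struct mock_avc, same_anon_vma)
		   + offsetof(struct list_head, next);
}
```

Feeding 0 into avc_from_link() yields (uintptr_t)-32, and faulting_address() of that value wraps back to 0, which lines up with R13 and CR2 in the dump above.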

--
Regards/Gruss,
Boris.

2010-04-09 23:26:51

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Johannes Weiner wrote:
> + /*
> + * 1. case: vma and next are split parts of one root vma.
> + * Their anon_vma_chain is equal and we can drop that of next.
> + *
> + * 2. case: one vma was instantiated as mergeable with the
> + * other one and inherited the other one's primary anon_vma as
> + * the singleton in its chain.
> + *
> + * If next came after vma, vma's chain is already an unstrict
> + * superset of next's and we can treat it like case 1.
> + *
> + * If vma has the singleton chain, we have to copy next's
> + * unique anon_vmas over.
> + */

This comment makes my head hurt. In fact, the whole anon_vma thing hurts
my head.

Can we have some better high-level documentation on what happens for all
the cases?

- split (mprotect, or munmap in the middle):

anon_vma_clone: the two vma's will have the same anon_vma, and the
anon_vma chains will be equivalent.

- merge (mprotect that creates a mergeable state):

anon_vma_merge: we're supposed to have an anon_vma_chain that is
a superset of the two chains of the merged entries.

- fork:

anon_vma_fork: each new vma will have a _new_ anon_vma as its
primary one, and will link to the old primary through the
anon_vma_chain. It's doing this with an anon_vma_clone() followed
by adding an extra entry to the new anon_vma, and setting
vma->anon_vma to the new one.

- create/mmap:

anon_vma_prepare: find a mergeable anon_vma and use that as a
singleton, because the other entries on the anon_vma chain won't
matter, since they cannot be associated with any pages associated
with the newly created vma.

Correct?

Quite frankly, just looking at that, I can't see how we get to your rules.
At least not trivially. Especially with multiple merges, I don't see
how "singleton" is such a special case.

Linus

2010-04-09 23:46:37

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/09/2010 07:22 PM, Linus Torvalds wrote:
>
>
> On Fri, 9 Apr 2010, Johannes Weiner wrote:
>> + /*
>> + * 1. case: vma and next are split parts of one root vma.
>> + * Their anon_vma_chain is equal and we can drop that of next.
>> + *
>> + * 2. case: one vma was instantiated as mergeable with the
>> + * other one and inherited the other one's primary anon_vma as
>> + * the singleton in its chain.
>> + *
>> + * If next came after vma, vma's chain is already an unstrict
>> + * superset of next's and we can treat it like case 1.
>> + *
>> + * If vma has the singleton chain, we have to copy next's
>> + * unique anon_vmas over.
>> + */
>
> This comment makes my head hurt. In fact, the whole anon_vma thing hurts
> my head.
>
> Can we have some better high-level documentation on what happens for all
> the cases?
>
> - split (mprotect, or munmap in the middle):
>
> anon_vma_clone: the two vma's will have the same anon_vma, and the
> anon_vma chains will be equivalent.
>
> - merge (mprotect that creates a mergeable state):
>
> anon_vma_merge: we're supposed to have an anon_vma_chain that is
> a superset of the two chains of the merged entries.
>
> - fork:
>
> anon_vma_fork: each new vma will have a _new_ anon_vma as its
> primary one, and will link to the old primary through the
> anon_vma_chain. It's doing this with an anon_vma_clone() followed
> by adding an extra entry to the new anon_vma, and setting
> vma->anon_vma to the new one.
>
> - create/mmap:
>
> anon_vma_prepare: find a mergeable anon_vma and use that as a
> singleton, because the other entries on the anon_vma chain won't
> matter, since they cannot be associated with any pages associated
> with the newly created vma.
>
> Correct?

This is indeed correct.

> Quite frankly, just looking at that, I can't see how we get to your rules.
> At least not trivially. Especially with multiple merges, I don't see
> how "singleton" is such a special case.

The trick is in the fact that anon_vma_merge is only called
when vma->anon_vma == vma1->anon_vma.

If the top anon_vmas are different, then anon_vma_merge will
not be called.

This means that VMAs which have recently passed through fork
will not be passed to anon_vma_merge, because their top
anon_vmas are different.

That leaves just the split & create cases, which will be
passed to anon_vma_merge when they are merged.

In case of split, they will have identical anon_vma chains.

In case of create + merge, one of the two VMAs will have
the whole anon_vma chain, while the other one has just
the top anon_vma.

2010-04-09 23:54:41

by Johannes Weiner

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Fri, Apr 09, 2010 at 04:22:19PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 9 Apr 2010, Johannes Weiner wrote:
> > + /*
> > + * 1. case: vma and next are split parts of one root vma.
> > + * Their anon_vma_chain is equal and we can drop that of next.
> > + *
> > + * 2. case: one vma was instantiated as mergeable with the
> > + * other one and inherited the other one's primary anon_vma as
> > + * the singleton in its chain.
> > + *
> > + * If next came after vma, vma's chain is already an unstrict
> > + * superset of next's and we can treat it like case 1.
> > + *
> > + * If vma has the singleton chain, we have to copy next's
> > + * unique anon_vmas over.
> > + */
>
> This comment makes my head hurt. In fact, the whole anon_vma thing hurts
> my head.

I can relate ;)

> Can we have some better high-level documentation on what happens for all
> the cases?
>
> - split (mprotect, or munmap in the middle):
>
> anon_vma_clone: the two vma's will have the same anon_vma, and the
> anon_vma chains will be equivalent.
>
> - merge (mprotect that creates a mergeable state):
>
> anon_vma_merge: we're supposed to have an anon_vma_chain that is
> a superset of the two chains of the merged entries.
>
> - fork:
>
> anon_vma_fork: each new vma will have a _new_ anon_vma as its
> primary one, and will link to the old primary through the
> anon_vma_chain. It's doing this with an anon_vma_clone() followed
> by adding an extra entry to the new anon_vma, and setting
> vma->anon_vma to the new one.
>
> - create/mmap:
>
> anon_vma_prepare: find a mergeable anon_vma and use that as a
> singleton, because the other entries on the anon_vma chain won't
> matter, since they cannot be associated with any pages associated
> with the newly created vma.
>
> Correct?
>
> Quite frankly, just looking at that, I can't see how we get to your rules.
> At least not trivially. Especially with multiple merges, I don't see
> how "singleton" is such a special case.

The key is that merging is only possible if the primary anon_vmas are
equivalent.

This only happens if we split a vma in two and clone the old vma's
anon_vma_chain into the new vma. So the chains are equivalent.

Or anon_vma_prepare() finds a mergeable anon_vma, in which case this
will be the singleton on the vma's chain.

If a split vma is merged, the old anon_vma_chains are equivalent, we
drop one completely and the one that stays has not changed.

If a mergeable vma (singleton anon_vma) is merged into another one,
this singleton is the primary anon_vma of the swallowing vma, thus
already linked and the swallowing vma's anon_vma_chain stays unchanged.

If it's the other way round and the singleton vma swallows the other
one, every anon_vma of the vanishing vma is moved over (except your
singleton anon_vma, you already have that). The result should look
exactly like the chain we swallowed. So in all this merging, no
unique and new combination of anon_vma_chains should have been
created! Thus you can merge as much as you want, either you swallow
singletons and don't change yourself or you are the singleton and
after the merger have an equivalent anon_vma_chain to the vma you
swallowed.

Again: no new anon_vmas should enter the game for mergeable vmas and
no _new_ anon_vma_chains should be created while merging. Thus it
is always true that you either merge with a singleton or the chains
are equivalent.
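
The case analysis above can be sketched as a toy model. This is a hedged illustration only: chains are reduced to bitmasks (bit i stands for anon_vma number i), and chain_t, chain_is_singleton() and merge_chains() are hypothetical names, not anything from mm/rmap.c. It mirrors the control flow of the proposed anon_vma_merge() under the assumptions stated above and says nothing about locking.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a chain is a bitmask, bit i set means anon_vma #i is linked. */
typedef uint32_t chain_t;

/* Nonzero if exactly one bit is set, i.e. the chain is a "singleton". */
static int chain_is_singleton(chain_t c)
{
	return c != 0 && (c & (c - 1)) == 0;
}

/*
 * Model of the proposed merge: 'vma' stays, 'next' goes away.
 * If vma's chain is not a singleton, next's chain is simply dropped.
 * Otherwise every entry on next's chain is moved over to vma (the
 * shared primary is already there, so OR-ing it in changes nothing).
 */
static chain_t merge_chains(chain_t vma, chain_t next)
{
	if (!chain_is_singleton(vma))
		return vma;
	return vma | next;
}
```

With equal chains (0x7 and 0x7) or with either side a singleton (0x7 and 0x1 in either order), the surviving chain is 0x7, as the reasoning above expects. If the assumptions do not hold, say two non-singleton chains 0x3 and 0x5 sharing the primary in bit 0, the model returns 0x3 and bit 2 is lost. Whether something like that is what Boris' test hits is not established by this sketch.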

At least those are my assumptions. Maybe they are crap, but I don't
see how right now.

And according to Boris' test, we still drop anon_vmas somewhere
while pages in the field keep pointing at them.

Hannes

2010-04-10 00:00:48

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Linus Torvalds wrote:
>
> Can we have some better high-level documentation on what happens for all
> the cases?
>
> - split (mprotect, or munmap in the middle):
>
> anon_vma_clone: the two vma's will have the same anon_vma, and the
> anon_vma chains will be equivalent.
>
> - merge (mprotect that creates a mergeable state):
>
> anon_vma_merge: we're supposed to have an anon_vma_chain that is
> a superset of the two chains of the merged entries.
>
> - fork:
>
> anon_vma_fork: each new vma will have a _new_ anon_vma as its
> primary one, and will link to the old primary through the
> anon_vma_chain. It's doing this with an anon_vma_clone() followed
> by adding an extra entry to the new anon_vma, and setting
> vma->anon_vma to the new one.
>
> - create/mmap:
>
> anon_vma_prepare: find a mergeable anon_vma and use that as a
> singleton, because the other entries on the anon_vma chain won't
> matter, since they cannot be associated with any pages associated
> with the newly created vma.
>
> Correct?

Ok, so I don't know if the above is correct, but if it is, let's ignore
the "merge" case as being complex, and look at the other cases.

With fork, the main anon_vma becomes different, so let's ignore that. That
always means that the resulting list is not comparable or compatible, and
we'll never mix them up.

If we make one very _simple_ rule for the create/mmap case, namely that we
only re-use another _singleton_ anon_vma, then split and create case will
look exactly the same. And in particular, we get a very simple and
powerful rule: if the anon_vma matches, then the _list_ will also always
match.

And that, in turn, would make 'merge' trivial too: you really can always
drop the side that goes away. There's never any question about how to
merge the lists, or which to pick, because every single operation that
leaves the anon_vma the same will guarantee that the list will be
identical too.

So now the simple rule is that if the anon_vma is the same, then the list
of associated anon_vma's will always be the same - across all of merge,
split and create.

Isn't that a _much_ simpler model to think about?
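
A toy model of that rule, for illustration only. It assumes chains can be represented as bitmasks and uses hypothetical names (toy_vma, toy_split(), toy_fork(), toy_create_reuse(), toy_invariant_holds()) that do not exist in the kernel. The singleton_only flag switches between reusing any neighbouring anon_vma and reusing only one whose chain is a singleton.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_vma {
	int primary;	/* index of the primary anon_vma, -1 if none */
	uint32_t chain;	/* bit i set: anon_vma #i is on the chain */
};

static bool chain_is_singleton(uint32_t c)
{
	return c != 0 && (c & (c - 1)) == 0;
}

/* split: the new vma is an exact clone, primary and chain included. */
static struct toy_vma toy_split(struct toy_vma v)
{
	return v;
}

/* fork: the child links to the parent's chain and gets a fresh primary. */
static struct toy_vma toy_fork(struct toy_vma parent, int fresh)
{
	struct toy_vma child = { fresh, parent.chain | (1u << fresh) };
	return child;
}

/*
 * create: try to reuse the primary of a neighbouring vma as a singleton.
 * Returns false when reuse is refused (a real implementation would then
 * allocate a new anon_vma instead).
 */
static bool toy_create_reuse(struct toy_vma near, bool singleton_only,
			     struct toy_vma *out)
{
	if (near.primary < 0)
		return false;
	if (singleton_only && !chain_is_singleton(near.chain))
		return false;
	out->primary = near.primary;
	out->chain = 1u << near.primary;
	return true;
}

/* The rule: the same primary anon_vma implies the same chain. */
static bool toy_invariant_holds(struct toy_vma a, struct toy_vma b)
{
	return a.primary != b.primary || a.chain == b.chain;
}
```

Starting from a root vma with chain 0x1, a forked child has primary 1 and chain 0x3. Reusing the child's primary without the singleton check gives a new vma with the same primary but chain 0x2, so the invariant fails. With the check, that reuse is refused, and only the root (a singleton) can be reused, which keeps primary and chain in step.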

So _instead_ of all the patches that have floated about, I would suggest
this simple change to "find_mergeable_anon_vma()" instead..

Oh, and maybe it's the meds talking again. I'm feeling better than
yesterday, but am still a bit lightheaded.

Linus

---
mm/mmap.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..462a8ca 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
 	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
 
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
+	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
+	    vma->vm_end == near->vm_start &&
 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
 			can_vma_merge_before(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff +
@@ -871,7 +872,8 @@ try_prev:
 	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
 	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
 
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
+	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
+	    near->vm_end == vma->vm_start &&
 			mpol_equal(vma_policy(near), vma_policy(vma)) &&
 			can_vma_merge_after(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff))
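
For readers not familiar with the helper the patch relies on: list_is_singular() is true only when a circular doubly-linked list holds exactly one entry. A minimal userspace sketch of that check follows. It re-implements just enough of a list_head for the demo; init_list_head(), list_add_tail_node(), list_is_empty() and list_is_singular_node() are stand-in names chosen here, not the kernel's own functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link 'node' in just before 'head', i.e. at the tail of the list. */
static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	node->next = head;
	node->prev = prev;
	prev->next = node;
	head->prev = node;
}

static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Exactly one entry: the list is not empty, and the first and last
 * entries are the same node.
 */
static bool list_is_singular_node(const struct list_head *head)
{
	return !list_is_empty(head) && head->next == head->prev;
}
```

In terms of the patch, a neighbouring vma would only donate its anon_vma while its anon_vma_chain is in the one-entry state; an empty chain or a chain with two or more entries fails the test.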

2010-04-10 00:08:36

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Fri, 9 Apr 2010, Rik van Riel wrote:
>
> The trick is in the fact that anon_vma_merge is only called
> when vma->anon_vma == vma1->anon_vma.

Sure sure. I still think it's _way_ too complex. See my previous email
where I suggested one single simple additional rule that I think makes
things _much_ simpler.

> If the top anon_vmas are different, then anon_vma_merge will
> not be called.

Right. The case of different anon_vma's is the trivial one. I don't worry
about that.

> That leaves just the split & create cases, which will be
> passed to anon_vma_merge when they are merged.
>
> In case of split, they will have identical anon_vma chains.

And yes, split is fundamentally simple. Split guarantees that the chains
look identical.

But:

> In case of create + merge, one of the two VMAs will have
> the whole anon_vma chain, while the other one has just
> the top anon_vma.

THIS is where I think you simplified a lot and said "and magic happens".

The thing is, in the case of create, we create a different chain. That
simple fact just makes merging fundamentally complicated. And we now have
two different chains, and both of those can split, so those differences
can "spread out". And you need to guarantee that "merge" really works. It
didn't work in your original code, and quite frankly, I do _not_ think
it's entirely obvious that it works in Johannes' code either.

Don't get me wrong: _maybe_ Johannes' code works fine. I just don't think
it's obvious at all. And if it doesn't work fine, now you're just
spreading the differences even further.

This is why I suggest that we limit the "re-use an existing vma for a new
case" to the singleton case, which means that now you _never_ have
differences at all. There's no spreading on splitting. Merging is trivial.
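
A minimal userspace sketch of the singleton rule described above, for illustration only. The list helpers mirror the kernel's list.h semantics, while `struct vma_sketch`, `may_reuse_anon_vma()` and `singleton_rule_demo()` are hypothetical names that do not exist in the kernel. It assumes a freshly set-up VMA has exactly one link on its anon_vma_chain and that a VMA that went through a fork has more than one.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True only when the list holds exactly one entry. */
static int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}

/* Hypothetical, stripped-down VMA: only the fields the rule looks at. */
struct vma_sketch {
	void *anon_vma;                   /* NULL until the first anonymous fault */
	struct list_head anon_vma_chain;  /* one link per attached anon_vma */
};

/* The proposed rule: reuse a neighbour's anon_vma only for singleton chains. */
int may_reuse_anon_vma(const struct vma_sketch *near)
{
	return near->anon_vma != NULL && list_is_singular(&near->anon_vma_chain);
}

/* Exercises three cases: no anon_vma, one link, two links. */
void singleton_rule_demo(void)
{
	int dummy_anon_vma;
	struct list_head own, parent;
	struct vma_sketch vma = { NULL, { NULL, NULL } };

	INIT_LIST_HEAD(&vma.anon_vma_chain);
	assert(!may_reuse_anon_vma(&vma));      /* nothing to reuse yet */

	vma.anon_vma = &dummy_anon_vma;
	list_add_tail(&own, &vma.anon_vma_chain);
	assert(may_reuse_anon_vma(&vma));       /* one link: the simple case */

	list_add_tail(&parent, &vma.anon_vma_chain);
	assert(!may_reuse_anon_vma(&vma));      /* two links: chain is "complex" */
}
```

Calling `singleton_rule_demo()` from a test harness runs the three cases. The point of the restriction is that a reused anon_vma always comes with a one-entry chain, so there is nothing for a later merge to reconcile.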

Now, admittedly, I'm really hopped up on cough medication, so the feeling
of this solving all the problems in the universe may not be entirely
accurate. But it feels so _right_.

I hope it feels right when I'm off my meds too.

Linus

2010-04-10 00:12:37

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/09/2010 08:03 PM, Linus Torvalds wrote:

> This is why I suggest that we limit the "re-use an existing vma for a new
> case" to the singleton case, which means that now you _never_ have
> differences at all. There's no spreading on splitting. Merging is trivial.

That looks like it should work.

> Now, admittedly, I'm really hopped up on cough medication, so the feeling
> of this solving all the problems in the universe may not be entirely
> accurate. But it feels so _right_.
>
> I hope it feels right when I'm off my meds too.

I am not on any cough meds, and your patch looks right.
OTOH, maybe I should be on some kind of cold meds, because
I haven't been feeling right all week...

2010-04-10 00:20:25

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/09/2010 07:56 PM, Linus Torvalds wrote:

> So _instead_ of all the patches that have floated about, I would suggest
> this simple change to "find_mergeable_anon_vma()" instead..

Boris, this is your chance to really ruin our week :)

If the bug persists with Linus's patch, we've been fixing
the wrong bug all week long, and you are experiencing
something else...

I'm getting really curious now.

> ---
> mm/mmap.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 75557c6..462a8ca 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
> vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
> vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
>
> - if (near->anon_vma && vma->vm_end == near->vm_start &&
> + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
> + vma->vm_end == near->vm_start &&
> mpol_equal(vma_policy(vma), vma_policy(near)) &&
> can_vma_merge_before(near, vm_flags,
> NULL, vma->vm_file, vma->vm_pgoff +
> @@ -871,7 +872,8 @@ try_prev:
> vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
> vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
>
> - if (near->anon_vma && near->vm_end == vma->vm_start &&
> + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
> + near->vm_end == vma->vm_start &&
> mpol_equal(vma_policy(near), vma_policy(vma)) &&
> can_vma_merge_after(near, vm_flags,
> NULL, vma->vm_file, vma->vm_pgoff))

2010-04-10 00:31:27

by Johannes Weiner

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Fri, Apr 09, 2010 at 04:56:13PM -0700, Linus Torvalds wrote:
> So _instead_ of all the patches that have floated about, I would suggest
> this simple change to "find_mergeable_anon_vma()" instead..

That leaves the chance that my code was correct and we leave a conceptual
error around somewhere that can materialize again. But I am at a point
where simplification never sounded more blissful, so yeah, I like it :)

Let's hope it fixes Boris's issue.

2010-04-10 00:37:09

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Johannes Weiner wrote:
>
> That leaves the chance that my code was correct and we leave a conceptual
> error around somewhere that can materialize again.

Absolutely. I really don't know whether your merge routine works or not.
I'd just rather not have to even _try_ to understand it.

I have a fairly simple rule for most of the code I see: if I have a hard
time understanding why it should work, I don't really want to rely on it.

> But I am at a point where simplification never sounded more blissful, so
> yeah, I like it :)

Exactly. This is the "let's limit things a bit to keep them much simpler" rule.

> Let's hope it fixes Boris's issue.

I'm going to just guess that it won't, and that Boris' issue was actually
due to something else entirely, and we've all been staring at totally the
wrong code.

But we can hope.

Linus

2010-04-10 07:27:25

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Fri, Apr 09, 2010 at 05:32:36PM -0700

> Exactly. This is the "let's limit things a bit to keep them much simpler" rule.

You gotta love that rule :)

> > Let's hope it fixes Boris's issue.
>
> I'm going to just guess that it won't, and that Boris' issue was actually
> due to something else entirely, and we've all been staring at totally the
> wrong code.
>
> But we can hope.

Now why would you go and jinx it like that... :)

Hibernation runs back-to-back:

1. light system load after boot... ok
2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok [ this was the fireproof way to trigger the bug, btw]
3. kvm guests down, firefox loading a 4Mb html page... ok
4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok
5. ubuntu guest booting done, firefox done, play video... ok
6. video broken after resume due to:

[AO_ALSA] Pcm in suspend mode, trying to resume. 212% 2% 1.7% 1 0
[AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented

i.e., unrelated... still ok

7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok
8. all guests off, firefox off, back to light load... ok

No oopsies or problems in dmesg except the old lockdep sysfs warning.

I will keep running that kernel in the next couple of days and keep you
informed in case this is the fix we're gonna use.

--
Regards/Gruss,
Boris.

2010-04-10 13:40:06

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Borislav Petkov
Date: Sat, Apr 10, 2010 at 09:27:14AM +0200

> Now why would you go and jinx it like that... :)
>
> Hibernation runs back-to-back:
>
> 1. light system load after boot... ok
> 2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok [ this was the fireproof way to trigger the bug, btw]
> 3. kvm guests down, firefox loading a 4Mb html page... ok
> 4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok
> 5. ubuntu guest booting done, firefox done, play video... ok
> 6. video broken after resume due to:
>
> [AO_ALSA] Pcm in suspend mode, trying to resume. 212% 2% 1.7% 1 0
> [AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented
>
> i.e., unrelated... still ok
>
> 7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok
> 8. all guests off, firefox off, back to light load... ok
>
> No oopsies or problems in dmesg except the old lockdep sysfs warning.
>
> I will keep running that kernel in the next couple of days and keep you
> informed in case this is the fix we're gonna use.

Yep, you jinxed it :)

This time we got stuck on the anon_vma->lock (yep, we've seen that
oopsie before). So, it might be that we _really_ are staring at the
wrong code... Back to square one.


[18969.797126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:5605]
[18969.797126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029] irq event stamp: 0
[18969.798029] hardirqs last enabled at (0): [<(null)>] (null)
[18969.798029] hardirqs last disabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last enabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last disabled at (0): [<(null)>] (null)
[18969.798029] CPU 1
[18969.798029] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029]
[18969.798029] Pid: 5605, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #1 M3A78 PRO/System Product Name
[18969.798029] RIP: 0010:[<ffffffff8118b7f4>] [<ffffffff8118b7f4>] delay_tsc+0x33/0xca
[18969.798029] RSP: 0018:ffff8801aebdf7b8 EFLAGS: 00000206
[18969.798029] RAX: 00000000fc6fc9e8 RBX: ffff8801aebdf7e8 RCX: 0000000000001200
[18969.798029] RDX: 0000000000002806 RSI: ffff8801aebdf848 RDI: 0000000000000001
[18969.798029] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[18969.798029] R10: ffff8801aebdf8a8 R11: 0000000000000001 R12: 0000000000000014
[18969.798029] R13: ffff88000a200000 R14: ffff8801aebde000 R15: ffff8801aebdffd8
[18969.798029] FS: 00007f2c86c656f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[18969.798029] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[18969.798029] CR2: 00007fd515101870 CR3: 000000022bd9a000 CR4: 00000000000006e0
[18969.798029] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18969.798029] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18969.798029] Process hib.sh (pid: 5605, threadinfo ffff8801aebde000, task ffff88022e194b80)
[18969.798029] Stack:
[18969.798029] 0000000000000001 ffff88022d2db720 ffff88022e194b80 00000000b3477260
[18969.798029] <0> ffff88022e194f28 000000002a5200c6 ffff8801aebdf7f8 ffffffff8118b7bf
[18969.798029] <0> ffff8801aebdf848 ffffffff8119a296 ffff88022d2db738 0000000000000001
[18969.798029] Call Trace:
[18969.798029] [<ffffffff8118b7bf>] ? __delay+0xf/0x11
[18969.798029] [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c
[18969.798029] [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73
[18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029] [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac
[18969.798029] [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc
[18969.798029] [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c
[18969.798029] [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477
[18969.798029] [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5
[18969.798029] [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244
[18969.798029] [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6
[18969.798029] [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f
[18969.798029] [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4
[18969.798029] [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0
[18969.798029] [<ffffffff81076e80>] ? count_data_pages+0x65/0x79
[18969.798029] [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[18969.798029] [<ffffffff813f5445>] ? printk+0x41/0x44
[18969.798029] [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1
[18969.798029] [<ffffffff81075d00>] ? hibernate+0xce/0x172
[18969.798029] [<ffffffff81074a6d>] ? state_store+0x5c/0xd3
[18969.798029] [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19
[18969.798029] [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144
[18969.798029] [<ffffffff810d5807>] ? vfs_write+0xb2/0x153
[18969.798029] [<ffffffff81063c0d>] ? trace_hardirqs_on_caller+0x1f/0x14b
[18969.798029] [<ffffffff810d596b>] ? sys_write+0x4a/0x71
[18969.798029] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[18969.798029] Code: 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 49 89 fc bf 01 00 00 00 e8 88 1d ea ff e8 db f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 <89> c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 41 89
[18969.798029] Call Trace:
[18969.798029] [<ffffffff8118b7bf>] ? __delay+0xf/0x11
[18969.798029] [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c
[18969.798029] [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73
[18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029] [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac
[18969.798029] [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc
[18969.798029] [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c
[18969.798029] [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477
[18969.798029] [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5
[18969.798029] [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244
[18969.798029] [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6
[18969.798029] [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f
[18969.798029] [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4
[18969.798029] [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0
[18969.798029] [<ffffffff81076e80>] ? count_data_pages+0x65/0x79
[18969.798029] [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[18969.798029] [<ffffffff813f5445>] ? printk+0x41/0x44
[18969.798029] [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1
[18969.798029] [<ffffffff81075d00>] ? hibernate+0xce/0x172
[18969.798029] [<ffffffff81074a6d>] ? state_store+0x5c/0xd3
[18969.798029] [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19
[18969.798029] [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144
[18969.798029] [<ffffffff810d5807>] ? vfs_write+0xb2/0x153
[18969.798029] [<ffffffff81063c0d>] ? trace_hardirqs_on_caller+0x1f/0x14b
[18969.798029] [<ffffffff810d596b>] ? sys_write+0x4a/0x71
[18969.798029] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[19005.426655] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[19005.663484] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[19007.018563] SysRq : Emergency Sync
[19007.018969] Emergency Sync complete
[19007.582218] SysRq : Emergency Remount R/O
[19008.251934] SysRq : Power Off
[19010.076146] SysRq : Resetting


--
Regards/Gruss,
Boris.

2010-04-10 14:47:50

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/10/2010 07:26 AM, Borislav Petkov wrote:

> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

This is a different bug, though.

If the null pointer dereference is gone, Linus's patch
fixed that bug and we can move forward to fixing the
anon_vma->lock bug.

I'll start auditing the code to see if we forget to
unlock the anon_vma in some unlikely error path...

2010-04-10 15:28:48

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > I will keep running that kernel in the next couple of days and keep you
> > informed in case this is the fix we're gonna use.
>
> Yep, you jinxed it :)
>
> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

No, I think we're good. I suspect this is a different issue. Do you have
lockdep enabled, along with mutex and spinlock debugging etc? That might
help pinpoint what triggers this.

But I think the fact that you are apparently not able to get the list
corruption is a good sign. Of course, it might just be harder to trigger,
and these things could all be a sign of a different bug, but my gut feel
is that we did fix something, and you are just damn good at stressing the
new code. Kudos.

Linus

2010-04-10 16:45:05

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sat, Apr 10, 2010 at 08:24:02AM -0700

> No, I think we're good. I suspect this is a different issue. Do you have
> lockdep enabled, along with mutex and spinlock debugging etc? That might
> help pinpoint what triggers this.

I had pretty much all lock debugging options enabled except PROVE_RCU.

> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Yep, even my mom says I'm good at breaking things :) But seriously,
thanks - means a lot coming from you.

And I got an oops again, this time the #GP from a couple of days ago.

<thinking out loud>

I'm starting to think that maybe there could be something wrong with the
machine I'm running it on. Especially since there are only two people
who reported this issue, Steinar and me, so how probable is it that
maybe those two machines have failing RAM module somewhere? Or some
other data corrupting thing? Although I should be getting mchecks...
Hmm...

</thinking out loud>

I'm going to run the stress test on 2.6.33.2 to verify whether this is
actually software-related. Just in case. Oh, yes, I almost forgot, the
latest and greatest in the world of oopsies:


[ 452.351588] general protection fault: 0000 [#1] PREEMPT SMP
[ 452.352119] last sysfs file: /sys/power/state
[ 452.352131] CPU 1
[ 452.352131] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp
[ 452.352131]
[ 452.352131] Pid: 2929, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #4 M3A78 PRO/System Product Name
[ 452.352131] RIP: 0010:[<ffffffff810c5f00>] [<ffffffff810c5f00>] page_referenced+0xee/0x1dc
[ 452.352131] RSP: 0018:ffff88022adb18b8 EFLAGS: 00010206
[ 452.352131] RAX: ffff88022ad5c468 RBX: ffffea0007598558 RCX: 0000000000000000
[ 452.352131] RDX: ffff88022adb1cf8 RSI: ffff88022ad5c440 RDI: ffff88022e7d38a0
[ 452.352131] RBP: ffff88022adb1938 R08: 0000000000000002 R09: 0000000000000000
[ 452.352131] R10: ffff88022be83868 R11: ffffffff00000012 R12: 0000000000000000
[ 452.352131] R13: 0032323200323212 R14: ffff88022ad5c428 R15: ffff88022adb1a00
[ 452.352131] FS: 00007f056a1e36f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 452.352131] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 452.352131] CR2: 000000000250e408 CR3: 000000022983f000 CR4: 00000000000006e0
[ 452.352131] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[ 452.352131] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 452.352131] Process hib.sh (pid: 2929, threadinfo ffff88022adb0000, task ffff88022e7d38a0)
[ 452.352131] Stack:
[ 452.352131] ffff88022ad5c468 00000000810c5c1f ffff88022adb1918 ffffffff810c5d88
[ 452.352131] <0> ffff88022adb18f8 ffffffff00000001 ffffea00075c89c0 ffffea00075984e8
[ 452.352131] <0> ffffea00075984e8 000000022adb1cf8 ffffea00075984e8 ffffea0007598580
[ 452.352131] Call Trace:
[ 452.352131] [<ffffffff810c5d88>] ? try_to_unmap_anon+0xa2/0xb4
[ 452.352131] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 452.352131] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 452.352131] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 452.352131] [<ffffffff8140f856>] ? _raw_spin_unlock_irq+0x30/0x58
[ 452.352131] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 452.352131] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 452.352131] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 452.352131] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 452.352131] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 452.352131] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 452.352131] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 452.352131] [<ffffffff8140bbd4>] ? printk+0x41/0x45
[ 452.352131] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 452.352131] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 452.352131] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 452.352131] [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19
[ 452.352131] [<ffffffff8112e288>] sysfs_write_file+0x108/0x144
[ 452.352131] [<ffffffff810db4ff>] vfs_write+0xb2/0x153
[ 452.352131] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 452.352131] [<ffffffff810db663>] sys_write+0x4a/0x71
[ 452.352131] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 452.352131] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 452.352131] RIP [<ffffffff810c5f00>] page_referenced+0xee/0x1dc
[ 452.352131] RSP <ffff88022adb18b8>
[ 452.368192] ---[ end trace a9c84cb81ab9fd41 ]---
[ 452.368372] note: hib.sh[2929] exited with preempt_count 2
[ 452.368564] BUG: scheduling while atomic: hib.sh/2929/0x10000003
[ 452.368742] INFO: lockdep is turned off.
[ 452.368915] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp
[ 452.370749] Pid: 2929, comm: hib.sh Tainted: G D 2.6.34-rc3-00501-gefb57c0 #4
[ 452.371051] Call Trace:
[ 452.371239] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[ 452.371425] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[ 452.371608] [<ffffffff8140bfe8>] schedule+0xe3/0x7ff
[ 452.371788] [<ffffffff810bd066>] ? unmap_vmas+0x88e/0x893
[ 452.371973] [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[ 452.372168] [<ffffffff8140c7d1>] _cond_resched+0x2c/0x37
[ 452.372348] [<ffffffff810bcea6>] unmap_vmas+0x6ce/0x893
[ 452.372531] [<ffffffff8140f8b6>] ? _raw_spin_unlock_irqrestore+0x38/0x69
[ 452.372721] [<ffffffff810c1604>] exit_mmap+0xd7/0x182
[ 452.372903] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 452.373088] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 452.373284] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 452.373464] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[ 452.373645] [<ffffffff8100616b>] ? oops_end+0x47/0x93
[ 452.373826] [<ffffffff810061b2>] oops_end+0x8e/0x93
[ 452.374006] [<ffffffff810063a3>] die+0x5a/0x63
[ 452.374198] [<ffffffff81003eef>] do_general_protection+0x134/0x13c
[ 452.374382] [<ffffffff8140fdb0>] ? irq_return+0x0/0x2
[ 452.374565] [<ffffffff8140ff8f>] general_protection+0x1f/0x30
[ 452.374754] [<ffffffff810c5f00>] ? page_referenced+0xee/0x1dc
[ 452.374940] [<ffffffff810c5e92>] ? page_referenced+0x80/0x1dc
[ 452.375147] [<ffffffff810c5d88>] ? try_to_unmap_anon+0xa2/0xb4
[ 452.375335] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 452.375519] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 452.375703] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 452.375888] [<ffffffff8140f856>] ? _raw_spin_unlock_irq+0x30/0x58
[ 452.376080] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 452.376284] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 452.376476] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 452.376664] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 452.376852] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 452.377038] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 452.377238] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 452.377429] [<ffffffff8140bbd4>] ? printk+0x41/0x45
[ 452.377611] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 452.377794] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 452.377975] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 452.378170] [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19
[ 452.378351] [<ffffffff8112e288>] sysfs_write_file+0x108/0x144
[ 452.378533] [<ffffffff810db4ff>] vfs_write+0xb2/0x153
[ 452.378714] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 452.378898] [<ffffffff810db663>] sys_write+0x4a/0x71
[ 452.379084] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b

--
Regards/Gruss,
Boris.

2010-04-10 16:46:38

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an anon_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.
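
A rough userspace sketch of the anon_vma-sharing compatibility test, kept separate from the merge decision, might look like the following. It covers only adjacency, the flag comparison and the page-offset arithmetic; the memory-policy and vm_file comparisons from the patch are left out. `struct vma_sketch`, `anon_vma_compatible_sketch()` and `anon_vma_compatible_demo()` are hypothetical names, and `PAGE_SHIFT` of 12 (4 KiB pages) is an assumption.

```c
#include <assert.h>

#define PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* Hypothetical VMA with only the fields compared here (no policy, no file). */
struct vma_sketch {
	unsigned long vm_start, vm_end;  /* byte addresses, end exclusive */
	unsigned long vm_flags;
	unsigned long vm_pgoff;          /* offset in pages */
};

/* 'a' must sit directly below 'b' in the address space. */
int anon_vma_compatible_sketch(const struct vma_sketch *a,
			       const struct vma_sketch *b)
{
	unsigned long mprotect_bits = VM_READ | VM_WRITE | VM_EXEC;

	return a->vm_end == b->vm_start &&
	       /* XOR leaves the differing bits; only mprotect bits may differ */
	       !((a->vm_flags ^ b->vm_flags) & ~mprotect_bits) &&
	       /* b's page offset must continue where a's range ends */
	       b->vm_pgoff == a->vm_pgoff +
			((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

void anon_vma_compatible_demo(void)
{
	struct vma_sketch a = { 0x10000, 0x12000, VM_READ | VM_WRITE, 16 };
	struct vma_sketch b = { 0x12000, 0x13000, VM_READ, 18 };

	assert(anon_vma_compatible_sketch(&a, &b));   /* differs only in VM_WRITE */

	b.vm_pgoff = 19;                              /* offset no longer lines up */
	assert(!anon_vma_compatible_sketch(&a, &b));

	b.vm_pgoff = 18;
	b.vm_flags |= VM_SHARED;                      /* differs outside mprotect bits */
	assert(!anon_vma_compatible_sketch(&a, &b));

	b.vm_flags = VM_READ;
	b.vm_start = 0x13000;                         /* gap: not adjacent */
	b.vm_end = 0x14000;
	assert(!anon_vma_compatible_sketch(&a, &b));
}
```

Nothing in this check looks at vm_ops or any other merge-only condition, which is the separation the paragraph above argues for.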

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).
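
The re-load concern can be sketched in userspace as follows. The macro matches the kernel's historical `ACCESS_ONCE()` definition and relies on the GCC/Clang `__typeof__` extension; `struct vma_sketch`, `snapshot_anon_vma()` and `access_once_demo()` are hypothetical names. The single-threaded demo cannot show the race itself, only the "load once, then use the snapshot" pattern.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Force exactly one load through a volatile-qualified lvalue.
 * Relies on the GCC/Clang __typeof__ extension.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct vma_sketch {
	void *anon_vma;   /* may flip from NULL to non-NULL concurrently */
};

/*
 * Take one snapshot and use only that snapshot afterwards, so the NULL
 * check and the returned value can never disagree.
 */
void *snapshot_anon_vma(struct vma_sketch *vma)
{
	void *anon_vma = ACCESS_ONCE(vma->anon_vma);

	if (anon_vma)
		return anon_vma;
	return NULL;
}

void access_once_demo(void)
{
	int dummy;
	struct vma_sketch vma = { NULL };

	assert(snapshot_anon_vma(&vma) == NULL);
	vma.anon_vma = &dummy;
	assert(snapshot_anon_vma(&vma) == &dummy);
}
```

Without the volatile access a compiler would be allowed to read `vma->anon_vma` once for the test and again for the return value, which is the NULL-then-non-NULL mismatch described above.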

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.
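
To illustrate why the check cannot fault, here is a small userspace sketch. The two helpers follow the kernel's list.h definitions; `garbage_pointer_demo()` and the junk addresses are made up for the example. The pointers stored in the list head are never followed, only compared.

```c
#include <assert.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Reads head->next and head->prev, but never follows either pointer. */
static int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}

void garbage_pointer_demo(void)
{
	struct list_head head;
	/* Address-like garbage; dereferencing either value would fault. */
	struct list_head *junk_a = (struct list_head *)(uintptr_t)0x1000;
	struct list_head *junk_b = (struct list_head *)(uintptr_t)0x2000;

	head.next = junk_a;
	head.prev = junk_b;
	assert(!list_is_singular(&head));   /* next != prev: not a singleton */

	head.prev = junk_a;
	assert(list_is_singular(&head));    /* next == prev: looks like a singleton */
}
```

With garbage in the head the answer itself may be meaningless, but the check only touches memory inside the head it was handed, so it cannot oops.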

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

Linus

---
mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
}

/*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ return a->vm_end == b->vm_start &&
+ mpol_equal(vma_policy(a), vma_policy(b)) &&
+ a->vm_file == b->vm_file &&
+ !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+ b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ if (anon_vma_compatible(a, b)) {
+ struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+ if (anon_vma && list_is_singular(&old->anon_vma_chain))
+ return anon_vma;
+ }
+ return NULL;
+}
+
+/*
* find_mergeable_anon_vma is used by anon_vma_prepare, to check
* neighbouring vmas for a suitable anon_vma, before it goes off
* to allocate a new anon_vma. It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
*/
struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
+ struct anon_vma *anon_vma;
struct vm_area_struct *near;
- unsigned long vm_flags;

near = vma->vm_next;
if (!near)
goto try_prev;

- /*
- * Since only mprotect tries to remerge vmas, match flags
- * which might be mprotected into each other later on.
- * Neither mlock nor madvise tries to remerge at present,
- * so leave their flags as obstructing a merge.
- */
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && vma->vm_end == near->vm_start &&
- mpol_equal(vma_policy(vma), vma_policy(near)) &&
- can_vma_merge_before(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff +
- ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, vma, near);
+ if (anon_vma)
+ return anon_vma;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
if (!near)
goto none;

- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && near->vm_end == vma->vm_start &&
- mpol_equal(vma_policy(near), vma_policy(vma)) &&
- can_vma_merge_after(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, near, vma);
+ if (anon_vma)
+ return anon_vma;
none:
/*
* There's no absolute need to look only at touching neighbours:
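
For readers following the singleton test in reusable_anon_vma(): list_is_singular() is true only when the list holds exactly one entry, so an empty chain and a chain that has been through a fork both fail it. What follows is a minimal userspace sketch, assuming a hand-rolled struct list_head that mimics the kernel one; chain_is_singular() is a hypothetical helper for illustration, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'entry' right after 'head', in the manner of the kernel's list_add(). */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True only when the list holds exactly one entry. */
static int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}

/*
 * Hypothetical helper: build a chain with 'n' links (0 <= n <= 4) and
 * report whether it would pass the singleton test.
 */
int chain_is_singular(int n)
{
	struct list_head head, links[4];
	int i;

	assert(n >= 0 && n <= 4);
	init_list_head(&head);
	for (i = 0; i < n; i++)
		list_add(&links[i], &head);
	return list_is_singular(&head);
}
```

Calling chain_is_singular() with 0, 1 and 2 links gives false, true and false respectively, which is why only a vma with a single chain entry is offered for merging.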

2010-04-10 17:10:14

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA


On Sat, 10 Apr 2010, Borislav Petkov wrote:
>
> And I got an oops again, this time the #GP from a couple of days ago.

Oh damn. So the list corruption really does happen still.

And the pattern is similar, but not the same: now it's 0032323200323232,
rather than 002e2e2e002e2e2e. Very intriguing. 0x32 instead of 0x2e, but
the same pattern of duplicated bytes. And not very helpful in that it
still doesn't actually make any sense.

> <thinking out loud>
>
> I'm starting to think that maybe there could be something wrong with the
> machine I'm running it on. Especially since there are only two people
> who reported this issue, Steinar and me, so how probable is it that
> maybe those two machines have failing RAM module somewhere? Or some
> other data corrupting thing? Although I should be getting mchecks...
> Hmm...

No. Just the fact that there are two people who reported the same
thing is already a pretty strong sign that it's real. Also, hardware
problems don't tend to be as consistent in the details as yours have
been.

And in fact I have seen it personally (but couldn't reproduce it) on the
kids' Mac mini after you reported it.

So I'm convinced the problem is real, and just not so easily
triggered, and you're being a great tester.

Linus
--
Here's the one I've seen, in case you care. I haven't posted it, because
it doesn't really add anything new.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c02850cf>] page_referenced+0xd6/0x199
*pde = 21d73067 *pte = 00000000
Oops: 0000 [#2] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sda/uevent
Modules linked in: [last unloaded: scsi_wait_scan]

Pid: 14440, comm: firefox Tainted: G D 2.6.34-rc2-00391-gfc1203c #3 Mac-F4208EC8/Macmini1,1
EIP: 0060:[<c02850cf>] EFLAGS: 00210287 CPU: 1
EIP is at page_referenced+0xd6/0x199
EAX: f59e65d4 EBX: c10b5480 ECX: 00000000 EDX: fffffff0
ESI: f59e65d0 EDI: 00000000 EBP: d8f77cd8 ESP: d8f77ca0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process firefox (pid: 14440, ti=d8f76000 task=cb795440 task.ti=d8f76000)
Stack:
f59e65d4 00000000 fffffff0 c15ba000 d8f77cbc c02885b8 c07972c4 d8f77cdc
c0276712 00000000 00000001 c10b5498 c10b5480 d8f77e94 d8f77d58 c0276b53
d8f77d48 00000000 00000000 00000000 0000001d d8f77de8 00000001 c07972c4
Call Trace:
[<c02885b8>] ? swapcache_free+0x1b/0x24
[<c0276712>] ? __remove_mapping+0x90/0xb2
[<c0276b53>] ? shrink_page_list+0x109/0x3ba
[<c0277099>] ? shrink_inactive_list+0x295/0x48e
[<c0273d68>] ? determine_dirtyable_memory+0x34/0x4b
[<c0273dd0>] ? get_dirty_limits+0x16/0x26d
[<c027750c>] ? shrink_zone+0x27a/0x327
[<c03c55a5>] ? i915_gem_shrink+0x67/0x22c
[<c0277e6d>] ? do_try_to_free_pages+0x17d/0x292
[<c0278078>] ? try_to_free_pages+0x6a/0x72
[<c0275cd7>] ? isolate_pages_global+0x0/0x1bd
[<c0273210>] ? __alloc_pages_nodemask+0x2c2/0x447
[<c027f1c1>] ? handle_mm_fault+0x188/0x605
[<c02192c3>] ? do_page_fault+0x253/0x269
[<c0219070>] ? do_page_fault+0x0/0x269
[<c05b9e82>] ? error_code+0x66/0x6c
[<c05b0000>] ? azx_probe+0x5e8/0x8ae
[<c0219070>] ? do_page_fault+0x0/0x269
Code: f9 f2 74 18 ff 75 08 8d 45 f0 50 89 d8 e8 62 f6 ff ff 01 c7 59 83 7d f0 00 58 74 20 8b 55 d0 8b 42 10 83 e8 10 89 45 d0 8b 55 d0 <8b> 42 10 0f 18 00 90 89 d0 83 c0 10 39 45 c8 75 ab fe 06 e9 90
EIP: [<c02850cf>] page_referenced+0xd6/0x199 SS:ESP 0068:d8f77ca0
CR2: 0000000000000000
---[ end trace 890710798f4c0070 ]---

2010-04-10 17:15:00

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Borislav Petkov
Date: Sat, Apr 10, 2010 at 06:38:28PM +0200

> I'm going to run the stress test on 2.6.33.2 to verify whether this is
> actually software-related. Just in case.

Just did a bunch of hibernation runs - 2.6.33.2 feels rock solid - no
issues whatsoever. So in the face of such results a hw failure is kinda
improbable... Hmm...

--
Regards/Gruss,
Boris.

2010-04-10 18:27:01

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > > And I got an oops again, this time the #GP from a couple of days ago.
>
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how that
relates to the freeing of the 'anon_vma' that the page->mapping points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

tlb = tlb_gather_mmu(mm,..)
..
unmap_vmas(&tlb, vma ..
..
free_pgtables()
..
tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stays around until after the
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the
pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just
been released back to the SLUB caches, the page itself might have been
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the patch below might fix it if I'm not just raving. I
really might be missing something here.

Linus

---
include/asm-generic/tlb.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
#define _ASM_GENERIC__TLB_H

#include <linux/swap.h>
+#include <linux/rcupdate.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)

tlb->fullmm = full_mm_flush;

+ rcu_read_lock();
return tlb;
}

@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
/* keep the page table cache within bounds */
check_pgt_cache();

+ rcu_read_unlock();
put_cpu_var(mmu_gathers);
}

2010-04-10 18:31:58

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> I dunno. Does the above sound at all sane? Or am I just raving?
>
> Something hacky like the above might fix it if I'm not just raving. I
> really might be missing something here.

Btw, if this turns out to be accurate, the real fix is to probably just
have a separate phase at the very end to actually release all the vma's,
rather than do it in "free_pgtables()". We don't want to make the
tlb-gather any more atomic than it already is. In fact, Nick is trying to
make it preemptible.

So the patch included in that mail was meant very much as a "let's test my
crazy theory" patch, rather than as the real solution.

The patch is also untested. Maybe it doesn't work at all and introduces
new bugs. Caveat emptor.

Linus
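
The ordering argument in the two mails above can be pictured with a toy timeline. This is only a sketch under loud assumptions: the toy_* types and count_dangling_steps() are made up for illustration, the three steps stand in for unmap_vmas(), free_pgtables() and tlb_finish_mmu(), and the model ignores the RCU grace period entirely (it treats the anon_vma as gone the moment it is unlinked, which is the worst case being worried about). With release_vmas_last set it models the suggested alternative of releasing the vma side only after the pages are freed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_anon_vma {
	int freed;
};

struct toy_page {
	struct toy_anon_vma *mapping;	/* like page->mapping for an anon page */
	int mapped;
};

enum teardown_step {
	STEP_UNMAP_VMAS,	/* pages unmapped, page->mapping left alone */
	STEP_FREE_PGTABLES,	/* anon_vma unlinked and (in this model) freed */
	STEP_TLB_FINISH_MMU	/* pages really freed, page->mapping cleared */
};

static void run_step(enum teardown_step step, struct toy_page *page,
		     struct toy_anon_vma *av)
{
	switch (step) {
	case STEP_UNMAP_VMAS:
		page->mapped = 0;
		break;
	case STEP_FREE_PGTABLES:
		av->freed = 1;
		break;
	case STEP_TLB_FINISH_MMU:
		page->mapping = NULL;
		break;
	}
}

/*
 * Count the steps after which page->mapping still points at an
 * anon_vma that has already been freed.
 */
int count_dangling_steps(int release_vmas_last)
{
	static const enum teardown_step current_order[3] = {
		STEP_UNMAP_VMAS, STEP_FREE_PGTABLES, STEP_TLB_FINISH_MMU
	};
	static const enum teardown_step reordered[3] = {
		STEP_UNMAP_VMAS, STEP_TLB_FINISH_MMU, STEP_FREE_PGTABLES
	};
	const enum teardown_step *order =
		release_vmas_last ? reordered : current_order;
	struct toy_anon_vma av = { 0 };
	struct toy_page page = { &av, 1 };
	int i, dangling = 0;

	for (i = 0; i < 3; i++) {
		run_step(order[i], &page, &av);
		if (page.mapping != NULL && page.mapping->freed)
			dangling++;
	}
	return dangling;
}
```

In this model the current order leaves one step during which the page still refers to a freed anon_vma, and the reordered sequence leaves none.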

2010-04-10 19:00:01

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sat, Apr 10, 2010 at 11:21:39AM -0700

> On Sat, 10 Apr 2010, Linus Torvalds wrote:
> > On Sat, 10 Apr 2010, Borislav Petkov wrote:
> > >
> > > > And I got an oops again, this time the #GP from a couple of days ago.
> >
> > Oh damn. So the list corruption really does happen still.
>
> Ho humm.
>
> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".
>
> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
> tlb = tlb_gather_mmu(mm,..)
> ..
> unmap_vmas(&tlb, vma ..
> ..
> free_pgtables()
> ..
> tlb_finish_mmu(tlb, start, end);
>
> and we actually unmap all the pages in "unmap_vmas()", and then _after_
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
> "free_pgtables()". Fine so far - the anon_vma stays around until after the
> page has been happily unmapped.
>
> But "unmapped all the pages" is _not_ actually the same as "free'd all the
> pages". The actual _freeing_ of the page happens generally in
> tlb_finish_mmu(), because we can free the page only after we've flushed
> any TLB entries.
>
> So what we have in that tlb_gather structure is a list of _pending_ pages
> to be freed, while we already actually free'd the anon_vmas earlier!
>
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
> we use a per-cpu variable), but as far as I can tell it is _not_ an
> RCU-safe region.
>
> So I think we might actually get a real RCU freeing event while this all
> happens. So now the 'anon_vma' that 'page->mapping' points to has not just
> been released back to the SLUB caches, the page itself might have been
> released too.

So, if I understand you correctly, the list_head anon_vma gets freed
_before_ the page descriptor itself, therefore we still get a valid
page->mapping in page_lock_anon_vma(). Maybe that explains the funny
patterns in %r13. But how do they come to exist when the anon_vma is
freed, shouldn't there be LIST_POISON or something recognizable?

Anyways, testing...

--
Regards/Gruss,
Boris.

2010-04-10 19:05:01

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Borislav Petkov
Date: Sat, Apr 10, 2010 at 08:51:45PM +0200

> Anyways, testing...

Nope, still b0rked. And this time it's not a funny pattern but the
ffffffffffffffe0 we had originally.

[ 521.306972] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 521.307126] IP: [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[ 521.307126] PGD 22d952067 PUD 2291db067 PMD 0
[ 521.307126] Oops: 0000 [#1] PREEMPT SMP
[ 521.307126] last sysfs file: /sys/power/state
[ 521.307126] CPU 1
[ 521.307126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp
[ 521.307126]
[ 521.307126] Pid: 2896, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0-dirty #5 M3A78 PRO/System Product Name
[ 521.307126] RIP: 0010:[<ffffffff810c60b4>] [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[ 521.307126] RSP: 0018:ffff88022bd9f8b8 EFLAGS: 00010283
[ 521.307126] RAX: ffff88022af8c338 RBX: ffffea00067e2998 RCX: 0000000000000000
[ 521.307126] RDX: ffff88022bd9fcf8 RSI: ffff88022af8c310 RDI: ffff88022c0c5e60
[ 521.307126] RBP: ffff88022bd9f938 R08: 0000000000000002 R09: 0000000000000000
[ 521.307126] R10: ffff88022b4454d8 R11: ffffffff00000012 R12: 0000000000000000
[ 521.307126] R13: ffffffffffffffe0 R14: ffff88022af8c2f8 R15: ffff88022bd9fa00
[ 521.307126] FS: 00007ff70fb586f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 521.307126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 521.307126] CR2: 0000000000000000 CR3: 000000022e19c000 CR4: 00000000000006e0
[ 521.307126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 521.307126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 521.307126] Process hib.sh (pid: 2896, threadinfo ffff88022bd9e000, task ffff88022c0c5e60)
[ 521.307126] Stack:
[ 521.307126] ffff88022af8c338 00000000810c5dd3 ffff88022bd9f918 ffffffff810c5f3c
[ 521.307126] <0> ffff880200000000 ffffffff00000001 ffff88022bd9ffd8 ffffea00067d2cf0
[ 521.307126] <0> ffffea00067d2cf0 000000022bd9fcf8 ffffea00067d2cf0 ffffea00067e29c0
[ 521.307126] Call Trace:
[ 521.307126] [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4
[ 521.307126] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 521.307126] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 521.307126] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 521.307126] [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58
[ 521.307126] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 521.307126] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 521.307126] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 521.307126] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 521.307126] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 521.307126] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 521.307126] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 521.307126] [<ffffffff8140bde4>] ? printk+0x41/0x45
[ 521.307126] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 521.307126] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 521.307126] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 521.307126] [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19
[ 521.307126] [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144
[ 521.307126] [<ffffffff810db6b3>] vfs_write+0xb2/0x153
[ 521.307126] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 521.307126] [<ffffffff810db817>] sys_write+0x4a/0x71
[ 521.307126] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 521.307126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 521.307126] RIP [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[ 521.307126] RSP <ffff88022bd9f8b8>
[ 521.307126] CR2: 0000000000000000
[ 521.320888] ---[ end trace 023d26183296e92e ]---
[ 521.321033] note: hib.sh[2896] exited with preempt_count 2
[ 521.321206] BUG: scheduling while atomic: hib.sh/2896/0x10000003
[ 521.321355] INFO: lockdep is turned off.
[ 521.321500] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp
[ 521.322884] Pid: 2896, comm: hib.sh Tainted: G D 2.6.34-rc3-00501-gefb57c0-dirty #5
[ 521.323139] Call Trace:
[ 521.323288] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[ 521.323440] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[ 521.323587] [<ffffffff8140c1f8>] schedule+0xe3/0x7ff
[ 521.323735] [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[ 521.323882] [<ffffffff8140c9e1>] _cond_resched+0x2c/0x37
[ 521.324029] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[ 521.324207] [<ffffffff810c1781>] exit_mmap+0x102/0x1e4
[ 521.324356] [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4
[ 521.324503] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 521.324651] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 521.324798] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 521.324945] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[ 521.325093] [<ffffffff8100616b>] ? oops_end+0x47/0x93
[ 521.325244] [<ffffffff810061b2>] oops_end+0x8e/0x93
[ 521.325396] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[ 521.325544] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[ 521.325691] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[ 521.325839] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[ 521.325987] [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[ 521.326138] [<ffffffff81082b84>] ? __call_rcu+0x11d/0x130
[ 521.326289] [<ffffffff814103e3>] ? error_sti+0x5/0x6
[ 521.326437] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 521.326586] [<ffffffff8140ed0b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 521.326737] [<ffffffff814101ff>] page_fault+0x1f/0x30
[ 521.326885] [<ffffffff810c60b4>] ? page_referenced+0xee/0x1dc
[ 521.327034] [<ffffffff810c6046>] ? page_referenced+0x80/0x1dc
[ 521.327185] [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4
[ 521.327336] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 521.327483] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 521.327632] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 521.327780] [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58
[ 521.327928] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 521.328079] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 521.328232] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 521.328387] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 521.328535] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 521.328683] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 521.328831] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 521.328979] [<ffffffff8140bde4>] ? printk+0x41/0x45
[ 521.329130] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 521.329283] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 521.329432] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 521.329580] [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19
[ 521.329727] [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144
[ 521.329875] [<ffffffff810db6b3>] vfs_write+0xb2/0x153
[ 521.330022] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 521.330174] [<ffffffff810db817>] sys_write+0x4a/0x71
[ 521.330326] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
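
The R13 value in the oops above is what container_of() arithmetic gives when the list pointer it starts from is NULL. A minimal sketch, assuming a struct whose field order mirrors the anon_vma_chain of that kernel (two pointers followed by two list_heads); struct avc_sketch and both helper functions are hypothetical names, and the arithmetic is done on integers to avoid doing pointer arithmetic on NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Assumed layout, mirroring the anon_vma_chain fields of that era. */
struct avc_sketch {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * container_of() expressed on addresses: step back from the embedded
 * same_anon_vma list_head to the start of the enclosing structure.
 */
uintptr_t avc_addr_from_link(uintptr_t link_addr)
{
	return link_addr - offsetof(struct avc_sketch, same_anon_vma);
}

/* Address of avc->same_anon_vma.next, i.e. what the faulting load reads. */
uintptr_t next_field_addr(uintptr_t avc_addr)
{
	return avc_addr + offsetof(struct avc_sketch, same_anon_vma)
			+ offsetof(struct list_head, next);
}
```

On a 64-bit build the offset of same_anon_vma is 32, so a NULL link turns into 0xffffffffffffffe0, and loading same_anon_vma.next from that bogus structure touches address 0, which matches CR2 in the log. On 32-bit the same layout gives an offset of 16, matching the fffffff0 in the Mac mini oops.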

--
Regards/Gruss,
Boris.

2010-04-10 19:38:07

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/10/2010 02:21 PM, Linus Torvalds wrote:

> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".

Which is also where they are removed from the LRU.
The plot thickens...

> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
> tlb = tlb_gather_mmu(mm,..)
> ..
> unmap_vmas(&tlb, vma ..
> ..
> free_pgtables()
> ..
> tlb_finish_mmu(tlb, start, end);

Looks like we should move the anon_vma freeing from free_pgtables
over to remove_vma?

This code is just below the tlb_finish_mmu in exit_mmap:

/*
* Walk the list again, actually closing and freeing it,
* with preemption enabled, without holding any MM locks.
*/
while (vma)
vma = remove_vma(vma);

This comment in free_pgtables is a little suspect:

/*
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
unlink_anon_vmas(vma);
unlink_file_vma(vma);

After all, the rmap code will quickly notice that there either are
no page tables, or the page tables no longer have anything in them.

It looks like we may have had this use-after-free bug in the VM for
quite a while... I am not entirely sure what exposed the bug, but
I can see how it works.

2010-04-10 20:10:12

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Borislav Petkov
> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200
>
> > Anyways, testing...
>
> Nope, still b0rked. And this time it's not a funny pattern but the
> ffffffffffffffe0 we had originally.

Ok, I think that just depends on who happens to re-use the allocation and
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already
mentioned to Rik earlier - the code is very hard to understand, and it's
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is
apart from the actual unmapping sequence the only other place that
actually free's anon_vmas. So there are reasons to be very suspicious of
that code.

And I think that code can actually lose an anon_vma chain. It's totally
screwing up the "import anonvma" case: when it does

if (anon_vma_clone(importer, vma)) {
return -ENOMEM;
}
importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma".

In which case we actually end up with an _empty_ chain (because importer
didn't have a chain to begin with!) but "importer->anon_vma" points to an
anon_vma.

And then when we do that "remove_next", we actually get rid of the only
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone()
only makes sense if importer is set, and it is only ever set _inside_ the
earlier if-statement, so the whole code should be moved inside there), nor
do some of the comments.

This patch is scary and untested, but the more I look at that code, the
more convinced I am that vma_adjust was _really_ badly screwed up. The
patch below may make things worse. I'll test it myself too, but I'm
sending it out first, since I was writing the email as I was looking at
the piece of cr*p.

Linus

---
mm/mmap.c | 24 ++++++++----------------
1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct address_space *mapping = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
- struct anon_vma *anon_vma = NULL;
long adjust_next = 0;
int remove_next = 0;

if (next && !insert) {
+ struct vm_area_struct *exporter = NULL;
+
if (end >= next->vm_end) {
/*
* vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end > next->vm_start) {
/*
@@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 5 shifting the boundary up.
*/
adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end < vma->vm_end) {
/*
@@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 4 shifting the boundary down.
*/
adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
- anon_vma = next->anon_vma;
+ exporter = vma;
importer = next;
}
- }

- /*
- * When changing only vma->vm_end, we don't really need anon_vma lock.
- */
- if (vma->anon_vma && (insert || importer || start != vma->vm_start))
- anon_vma = vma->anon_vma;
- if (anon_vma) {
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
* shrinking vma had, to cover any anon pages imported.
*/
- if (importer && !importer->anon_vma) {
- /* Block reverse map lookups until things are set up. */
- if (anon_vma_clone(importer, vma)) {
+ if (exporter && exporter->anon_vma && !importer->anon_vma) {
+ if (anon_vma_clone(importer, exporter))
return -ENOMEM;
- }
- importer->anon_vma = anon_vma;
+ importer->anon_vma = exporter->anon_vma;
}
}

2010-04-10 20:17:43

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> This patch is scary and untested, but the more I look at that code, the
> more convinced I am that vma_adjust was _really_ badly screwed up. The
> patch below may make things worse. I'll test it myself too, but I'm
> sending it out first, since I was writing the email as I was looking at
> the piece of cr*p.

Ok, it boots. Which means it must be bug-free and perfect. And I really am
convinced that the old vma_adjust() use of anon_vma_clone() was _totally_
broken, so this really could explain everything.

The RCU grace period thing for the TLB flush does look like a real bug
too, but it's one that is probably impossible to hit in practice.

A broken vma_adjust(), however, would seem to be trivial to hit once you
just get the right memory freeing patterns going, because the anon_vma
would easily be _loong_ gone because we didn't create a chain to it at
all, so the anon_vma code decided that it's not used any more.

So I'm actually pretty optimistic that this really is it.

Linus

2010-04-10 20:25:28

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/10/2010 04:05 PM, Linus Torvalds wrote:

> And vma_adjust is the one place that does that anon_vma_merge(), which is
> apart from the actual unmapping sequence the only other place that
> actually free's anon_vmas. So there are reasons to be very suspicious of
> that code.

It frees anon_vma_chain structures, but not actual anon_vmas.

Walking the anon_vma (from rmap) requires the anon_vma->lock,
which is taken in anon_vma_merge whenever a chain is unlinked.

> And I think that code can actually lose an anon_vma chain. It's totally
> screwing up the "import anonvma" case: when it does
>
> if (anon_vma_clone(importer, vma)) {
> return -ENOMEM;
> }
> importer->anon_vma = anon_vma;
>
> we can actually have "importer == vma", but "anon_vma = next->anon_vma".

A few lines up from that code, we have:

if (vma->anon_vma && (insert || importer || start !=
vma->vm_start))
anon_vma = vma->anon_vma;

So anon_vma should always be vma->anon_vma.

If we have already imported an anon_vma, we will not
do so twice, because of the !importer->anon_vma check.

What am I overlooking?

> In which case we actually end up with an _empty_ chain (because importer
> didn't have a chain to begin with!) but "importer->anon_vma" points to an
> anon_vma.

If we import a chain, from vma to importer, importer->anon_vma
will be equal to vma->anon_vma.

I do not see how 'importer' could get a state different from 'vma'.

> Also, the conditional nesting makes no sense (the whole anon_vma_clone()
> only makes sense if importer is set, and it is only ever set _inside_ the
> earlier if-statement, so the whole code should be moved inside there), nor
> do some of the comments.

No argument there, vma_adjust is very hard to read and it took
me a few days to convince myself that my changes kept things
equivalent to how they were before.

2010-04-10 20:34:23

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/10/2010 04:05 PM, Linus Torvalds wrote:

> This patch is scary and untested, but the more I look at that code, the
> more convinced I am that vma_adjust was _really_ badly screwed up. The
> patch below may make things worse. I'll test it myself too, but I'm
> sending it out first, since I was writing the email as I was looking at
> the piece of cr*p.

Your patch looks correct. Gotta love how before,
"vma" could be either exporter or importer!

I'm guessing that it did not break before my
changes, because of plain old luck...

Acked-by: Rik van Riel

2010-04-10 20:39:48

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Rik van Riel wrote:

> On 04/10/2010 04:05 PM, Linus Torvalds wrote:
>
> > And vma_adjust is the one place that does that anon_vma_merge(), which is
> > apart from the actual unmapping sequence the only other place that
> > actually free's anon_vmas. So there are reasons to be very suspicious of
> > that code.
>
> It frees anon_vma_chain structures, but not actual anon_vmas.

Rik, I think you're ignoring the fact that the anon_vma_chain is also the
implicit refcount.

So when you don't create the chains, you implicitly end up freeing the
anon_vma too early. In fact, it might well happen at that
'anon_vma_merge()': when it does the unlink_anon_vmas(), it may be
unlinking the last remaining anon_vma ref, and then anon_vma_unlink
_will_ in fact free the anon_vma.

Even though we have a 'vma->anon_vma' pointer that points to it - because
the chains weren't set up correctly.

> Walking the anon_vma (from rmap) requires the anon_vma->lock,
> which is taken in anon_vma_merge whenever a chain is unlinked.

None of that matters. If the dang thing got free'd, the lock isn't
reliable any more.

> A few lines up from that code, we have:
>
> if (vma->anon_vma && (insert || importer || start != vma->vm_start))
> anon_vma = vma->anon_vma;
>
> So anon_vma should always be vma->anon_vma.

No. vma->anon_vma is NULL, so the above lines are total no-ops. We're
trying to _fill_ it. But we're doing it wrong.

So we end up with:

anon_vma = next->anon_vma
importer = vma

and we do:

if (anon_vma_clone(importer, vma)) {
return -ENOMEM;
}
importer->anon_vma = anon_vma;

do you see?

The "anon_vma_clone(importer, vma)" does NOTHING, because it is cloning
from the wrong source (from 'vma', rather than from 'next'), so it leaves
the vma chains empty.

And then, despite having empty chains, we do that

importer->anon_vma = anon_vma;

which sets the anon_vma to the (non-NULL) next->anon_vma.

And then, a bit later, we'll do

anon_vma_merge(vma, next);

which will happily notice that the anon_vma's of both vma and next match
(because we just _set_ them to match), and then frees the ONLY REMAINING
CHAIN - the one in next. The one we DID NOT CORRECTLY COPY, because we got
our sources completely screwed up.
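
To make the sequence above concrete, here is a toy model of "the chain is the implicit refcount". It is a sketch only: the toy_* structures and functions are invented for illustration and are far simpler than the real anon_vma_chain code. toy_clone() stands in for anon_vma_clone(), toy_unlink_all() for the unlinking done when 'next' is removed, and the anon_vma counts as freed once its last link goes away.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 4

/* Toy anon_vma: alive only while at least one chain link references it. */
struct toy_anon_vma {
	int nr_links;
	int freed;
};

/* Toy vma: a cached anon_vma pointer plus the chain links it owns. */
struct toy_vma {
	struct toy_anon_vma *anon_vma;
	struct toy_anon_vma *links[TOY_MAX_LINKS];
	int nr;
};

static int toy_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
	if (vma->nr == TOY_MAX_LINKS)
		return -1;
	vma->links[vma->nr++] = av;
	av->nr_links++;
	return 0;
}

/* Copy every link of 'src' onto 'dst' (stand-in for anon_vma_clone()). */
static int toy_clone(struct toy_vma *dst, const struct toy_vma *src)
{
	int i, n = src->nr;	/* snapshot, in case dst == src */

	for (i = 0; i < n; i++)
		if (toy_link(dst, src->links[i]))
			return -1;
	return 0;
}

/* Drop all links of 'vma'; losing the last link "frees" the anon_vma. */
static void toy_unlink_all(struct toy_vma *vma)
{
	while (vma->nr > 0) {
		struct toy_anon_vma *av = vma->links[--vma->nr];

		if (--av->nr_links == 0)
			av->freed = 1;
	}
}

/*
 * 'vma' expands over 'next' and absorbs it. Returns 1 if vma->anon_vma
 * ends up pointing at a freed anon_vma, 0 if it is still alive.
 */
int toy_merge_dangles(int clone_from_next)
{
	struct toy_anon_vma av = { 0, 0 };
	struct toy_vma vma = { NULL, { NULL }, 0 };
	struct toy_vma next = { NULL, { NULL }, 0 };

	next.anon_vma = &av;
	if (toy_link(&next, &av))
		return -1;

	/* Import step: the old code cloned from 'vma' itself, a no-op. */
	if (toy_clone(&vma, clone_from_next ? &next : &vma))
		return -1;
	vma.anon_vma = next.anon_vma;

	/* 'next' goes away and takes its chain links with it. */
	toy_unlink_all(&next);

	return vma.anon_vma->freed;
}
```

With clone_from_next set to 0 the importer is left holding a pointer to a freed anon_vma; with 1 (cloning from the exporter, as the patch does) the importer's own link keeps it alive.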

> What am I overlooking?

Can you see it now?

> If we import a chain, from vma to importer, importer->anon_vma
> will be equal to vma->anon_vma.

The thing you seem to miss is that we aren't supposed to import the chain
from 'vma' AT ALL. The anon_vma came from _next_, not from 'vma'!

> I do not see how 'importer' could get a state different from 'vma'.

Stop worrying about 'vma'. Start worrying about 'next'.

Linus

2010-04-10 20:45:00

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sat, Apr 10, 2010 at 01:12:46PM -0700

> So I'm actually pretty optimistic that this really is it.

Ok, let me verify what should be tested, and in which order, before I test
something wrongly. The RCU-safe fix for the TLB flush can stay for
correctness reasons, and so does this last patch, obviously. What happens
with the find_mergeable_anon_vma() changes to use only singleton lists for
merging? Should I keep those too?

--
Regards/Gruss,
Boris.

2010-04-10 20:45:36

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Linus Torvalds
> Date: Sat, Apr 10, 2010 at 01:12:46PM -0700
>
> > So I'm actually pretty optimistic that this really is it.
>
> Ok, let me verify what should be tested, and in which order, before I test
> something wrongly. The RCU-safe fix for the TLB flush can stay for
> correctness reasons, and so does this last patch, obviously. What happens
> with the find_mergeable_anon_vma() changes to use only singleton lists for
> merging? Should I keep those too?

Yes. So the patches I actually think are important are:

- the RCU fix is real, although admittedly the race window is probably
too small to ever really hit.

- the simplification rule to find_mergeable_anon_vma's is required,
because otherwise our anon_vma_merge() will do the wrong thing (maybe
Johannes' patch would be an alternative, but quite frankly, I think we
want the simpler code, and I don't think we even _want_ to share
anon_vma's that are complex due to forking)

I like my "cleanup" version (the bigger one with lots of comments) more
than the two-liner version, but they should be equivalent.

- the vma_adjust() fix is the one that I think may actually end up fixing
your problems for good. Knock wood.

So I think they are all required, but I suspect that the vma_adjust() one
is finally the most direct explanation of the problem you've seen.
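
The "singleton" rule from the second item can be pictured with a small userspace sketch. It is modeled on the kernel's list_is_singular() check, but everything here (toy_vma, list_has_one_entry, toy_reusable_anon_vma, toy_reusable_with_n_chains) is a hypothetical stand-in written for illustration, not the actual find_mergeable_anon_vma() code. The idea is that a neighbouring vma's anon_vma is only offered for reuse when that vma has exactly one chain entry, i.e. it has not become "complex due to forking".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, in the style of the kernel's. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail_entry(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* True when the list is non-empty and its first entry is also its last. */
static bool list_has_one_entry(const struct list_head *head)
{
	return head->next != head && head->next == head->prev;
}

/* Toy vma: only the two fields that matter for the rule. */
struct toy_vma {
	void *anon_vma;
	struct list_head anon_vma_chain;
};

/* Offer old's anon_vma for reuse only if old has a single chain entry. */
static void *toy_reusable_anon_vma(const struct toy_vma *old)
{
	if (old->anon_vma && list_has_one_entry(&old->anon_vma_chain))
		return old->anon_vma;
	return NULL;
}

/* Build a toy vma with n chain entries (n <= 4) and apply the rule. */
static bool toy_reusable_with_n_chains(int n)
{
	struct list_head links[4];
	struct toy_vma old;
	int dummy_anon_vma = 0;

	old.anon_vma = &dummy_anon_vma;
	list_init(&old.anon_vma_chain);
	for (int i = 0; i < n && i < 4; i++)
		list_add_tail_entry(&links[i], &old.anon_vma_chain);
	return toy_reusable_anon_vma(&old) != NULL;
}
```

In this model a vma with zero or with several chain entries is never used as a source, which keeps the later merge logic down to the simple case.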

Linus

2010-04-10 20:45:38

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/10/2010 04:34 PM, Linus Torvalds wrote:

>> What am I overlooking?
>
> Can you see it now?

Yeah, after reading through your patch it became obvious.
It's the code above this code that sets up the problem.

It's a small miracle it worked before...

2010-04-10 21:35:03

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sat, Apr 10, 2010 at 01:40:39PM -0700

> Yes. So the patches I actually think are important are:
>
> - the RCU fix is real, although admittedly the race window is probably
> too small to ever really hit.
>
> - the simplification rule to find_mergeable_anon_vma's is required,
> because otherwise our anon_vma_merge() will do the wrong thing (maybe
> Johannes' patch would be an alternative, but quite frankly, I think we
> want the simpler code, and I don't think we even _want_ to share
> anon_vma's that are complex due to forking)
>
> I like my "cleanup" version (the bigger one with lots of comments) more
> than the two-liner version, but they should be equivalent.
>
> - the vma_adjust() fix is the one that I think may actually end up fixing
> your problems for good. Knock wood.
>
> So I think they are all required, but I suspect that the vma_adjust() one
> is finally the most direct explanation of the problem you've seen.

Damn, nope, still no joy :(. It looked like it was fixed but one of the
tests was to hibernate right after the 3 kvm guests were shut down and I
guess the mem freeing pattern kinda hits it where it most hurts.

Anyways, I'm going to bed soon, will test whatever you guys come up with
tomorrow morning when I can think again.

By the way, do we want to create a new thread - the mailchain is off the
screen limits of my netbook :)

Thanks.

p.s. Oopsie:


[ 647.288638] PM: Syncing filesystems ... done.
[ 647.307459] Freezing user space processes ... (elapsed 0.01 seconds) done.
[ 647.320981] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 647.334152] PM: Preallocating image memory...
[ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[ 647.493001] PGD 22a1d1067 PUD 1cb6a9067 PMD 0
[ 647.493001] Oops: 0000 [#1] PREEMPT SMP
[ 647.493001] last sysfs file: /sys/power/state
[ 647.493001] CPU 0
[ 647.493001] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core
[ 647.493001]
[ 647.493001] Pid: 3231, comm: hib.sh Not tainted 2.6.34-rc3-00503-g8b3334b #6 M3A78 PRO/System Product Name
[ 647.493001] RIP: 0010:[<ffffffff810c60a0>] [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[ 647.493001] RSP: 0018:ffff880223b6f8b8 EFLAGS: 00010283
[ 647.493001] RAX: ffff88022aa316c8 RBX: ffffea0006882fc0 RCX: 0000000000000000
[ 647.493001] RDX: ffff880223b6fcf8 RSI: ffff88022aa316a0 RDI: ffff88022de6de60
[ 647.493001] RBP: ffff880223b6f938 R08: 0000000000000002 R09: 0000000000000000
[ 647.493001] R10: ffff880228cb03a8 R11: ffffffff00000012 R12: 0000000000000000
[ 647.493001] R13: ffffffffffffffe0 R14: ffff88022aa31688 R15: ffff880223b6fa00
[ 647.493001] FS: 00007f0eea2086f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 647.493001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 647.493001] CR2: 0000000000000000 CR3: 0000000223df5000 CR4: 00000000000006f0
[ 647.493001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 647.493001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 647.493001] Process hib.sh (pid: 3231, threadinfo ffff880223b6e000, task ffff88022de6de60)
[ 647.493001] Stack:
[ 647.493001] ffff88022aa316c8 00000000810c5dbf ffff880223b6f918 ffffffff810c5f28
[ 647.493001] <0> ffff880223b6f8f8 ffffffff00000001 ffffea0006867570 ffffea0006889070
[ 647.493001] <0> ffffea0006889070 0000000223b6fcf8 ffffea0006889070 ffffea0006882fe8
[ 647.493001] Call Trace:
[ 647.493001] [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4
[ 647.493001] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 647.493001] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 647.493001] [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6
[ 647.493001] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 647.493001] [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79
[ 647.493001] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 647.493001] [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c
[ 647.493001] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 647.493001] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 647.493001] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 647.493001] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 647.493001] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 647.493001] [<ffffffff8140bdd4>] ? printk+0x41/0x45
[ 647.493001] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 647.493001] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 647.493001] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 647.493001] [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19
[ 647.493001] [<ffffffff8112e490>] sysfs_write_file+0x108/0x144
[ 647.493001] [<ffffffff810db69f>] vfs_write+0xb2/0x153
[ 647.493001] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 647.493001] [<ffffffff810db803>] sys_write+0x4a/0x71
[ 647.493001] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 647.493001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 647.493001] RIP [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[ 647.493001] RSP <ffff880223b6f8b8>
[ 647.493001] CR2: 0000000000000000
[ 647.508991] ---[ end trace 91f57fb5ef398fd2 ]---
[ 647.509150] note: hib.sh[3231] exited with preempt_count 2
[ 647.509311] BUG: scheduling while atomic: hib.sh/3231/0x10000003
[ 647.509462] INFO: lockdep is turned off.
[ 647.509610] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core
[ 647.511093] Pid: 3231, comm: hib.sh Tainted: G D 2.6.34-rc3-00503-g8b3334b #6
[ 647.511353] Call Trace:
[ 647.511504] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[ 647.511658] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[ 647.511811] [<ffffffff8140c1e8>] schedule+0xe3/0x7ff
[ 647.511962] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[ 647.512191] [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[ 647.512337] [<ffffffff8140c9d1>] _cond_resched+0x2c/0x37
[ 647.512550] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[ 647.512697] [<ffffffff810c1781>] exit_mmap+0x102/0x1e4
[ 647.512911] [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4
[ 647.513082] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 647.513233] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 647.513387] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 647.513538] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[ 647.513690] [<ffffffff8100616b>] ? oops_end+0x47/0x93
[ 647.513859] [<ffffffff810061b2>] oops_end+0x8e/0x93
[ 647.514009] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[ 647.514172] [<ffffffff8118b72b>] ? cfq_insert_request+0x7a/0x3b1
[ 647.514321] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[ 647.514473] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[ 647.514625] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[ 647.514777] [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[ 647.514929] [<ffffffff814103a3>] ? error_sti+0x5/0x6
[ 647.515084] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 647.515242] [<ffffffff8140ecfb>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 647.515397] [<ffffffff814101bf>] page_fault+0x1f/0x30
[ 647.515549] [<ffffffff810c60a0>] ? page_referenced+0xee/0x1dc
[ 647.515701] [<ffffffff810c6032>] ? page_referenced+0x80/0x1dc
[ 647.515853] [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4
[ 647.516010] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 647.516167] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 647.516323] [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6
[ 647.516474] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 647.516627] [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79
[ 647.516780] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 647.516931] [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c
[ 647.517086] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 647.517243] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 647.517398] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 647.517551] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 647.517703] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 647.517856] [<ffffffff8140bdd4>] ? printk+0x41/0x45
[ 647.518011] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 647.518168] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 647.518322] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 647.518473] [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19
[ 647.518625] [<ffffffff8112e490>] sysfs_write_file+0x108/0x144
[ 647.518777] [<ffffffff810db69f>] vfs_write+0xb2/0x153
[ 647.518928] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 647.519084] [<ffffffff810db803>] sys_write+0x4a/0x71
[ 647.519240] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 699.648857] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[ 700.234923] SysRq : Emergency Sync
[ 700.235341] Emergency Sync complete
[ 700.982072] SysRq : Emergency Remount R/O
[ 701.600802] SysRq : Resetting
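
The register dump above fits the container_of() theory from the start of the thread: R13 holds ffffffffffffffe0 and CR2 is 0. A minimal sketch of that arithmetic follows. The field order of anon_vma_chain is an assumption made for illustration (the disassembly only establishes that same_anon_vma sits at offset 32 on x86-64), and entry_address() and next_field_address() are hypothetical helpers. The math is done on integers, since doing it on a real NULL pointer would be undefined behaviour in userspace C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Assumed layout; only the offset of same_anon_vma matters here. */
struct anon_vma_chain {
	void *vma;			/* stand-in for struct vm_area_struct * */
	void *anon_vma;			/* stand-in for struct anon_vma * */
	struct list_head same_vma;
	struct list_head same_anon_vma;	/* offset 32 with 8-byte pointers */
};

/* What container_of(node, struct anon_vma_chain, same_anon_vma) computes. */
static uintptr_t entry_address(uintptr_t list_node)
{
	return list_node - offsetof(struct anon_vma_chain, same_anon_vma);
}

/* Address read when the loop then fetches entry->same_anon_vma.next. */
static uintptr_t next_field_address(uintptr_t entry)
{
	return entry + offsetof(struct anon_vma_chain, same_anon_vma)
		     + offsetof(struct list_head, next);
}
```

On a 64-bit build entry_address(0) is 0xffffffffffffffe0, the value in R13, and next_field_address() of that is 0, the faulting address in CR2. So a NULL ->next pointer in the list is enough to produce exactly this oops.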

--
Regards/Gruss,
Boris.

2010-04-10 21:35:30

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sat, 10 Apr 2010, Borislav Petkov wrote:
>
> Damn, nope, still no joy :(. It looked like it was fixed but one of the
> tests was to hibernate right after the 3 kvm guests were shut down and I
> guess the mem freeing pattern kinda hits it where it most hurts.

Damn, I really hoped that was it. Three independent bugs found and fixed,
and still no joy? Oh well.

> By the way, do we want to create a new thread - the mailchain is off the
> screen limits of my netbook :)

I prefer to keep it in one thread so that they all show up together if I
need to, but feel free to start a new one. Not a biggie.

> [ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc

Well, it sure is consistent. I'll start to think about what else could go
wrong..

Linus

2010-04-10 22:00:04

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sat, Apr 10, 2010 at 02:30:49PM -0700

> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > Damn, nope, still no joy :(. It looked like it was fixed but one of the
> > tests was to hibernate right after the 3 kvm guests were shut down and I
> > guess the mem freeing pattern kinda hits it where it most hurts.
>
> Damn, I really hoped that was it. Three independent bugs found and fixed,
> and still no joy? Oh well.

Yep, I'll redo the testing tomorrow, so that we are sure that even with
the _three_ bugs fixed we still hit the funky list element issue.

> > By the way, do we want to create a new thread - the mailchain is off the
> > screen limits of my netbook :)
>
> I prefer to keep it in one thread so that they all show up together if I
> need to, but feel free to start a new one. Not a biggie.

I'll keep the thread then - I didn't know it mattered. Mine was just a
suggestion, nevermind.

> > [ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> > [ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
>
> Well, it sure is consistent. I'll start to think about what else could go
> wrong..

Which could mean that even with those issues fixed, the real issue is
yet something else. Because obviously the fixes you throw at it don't
seem to change it - even the traces remain consistent across tests.
And if it is a use-after-free case, the funny patterns could be some
shifted SLUB poison values which we happen to "see" through the dangling
pointer... I dunno.

Hmm.

--
Regards/Gruss,
Boris.

2010-04-10 22:49:34

by Johannes Weiner

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Sat, Apr 10, 2010 at 09:41:52AM -0700, Linus Torvalds wrote:
> [...]
>
> It also splits off the decision of whether we can reuse an anon_vma from
> the decision of whether we can merge the vma's - the two are kind of
> related, but they are not really the same, and they have different issues.
> I think it's good to try to keep separate issues separate.
>
> [...]
>
> + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
> + * we can merge the two vma's. For example, we refuse to merge a vma if
> + * there is a vm_ops->close() function, because that indicates that the
> + * driver is doing some kind of reference counting. But that doesn't
> + * really matter for the anon_vma sharing case.

I am all in favor of only doing singletons, so that we don't have to
inflict my psycho-active merging routine on civilians.

I am not convinced it's a good idea to share an anon_vma, however, when
we know beforehand the vmas will never merge, because it will increase
rmap overhead of walking unrelated vmas for every page in every vma that
is part of the reused anon_vma.

So we usually take that as a trade-off when there is a chance the vmas
could still reunite and we don't want to spoil that through differing
anon_vmas.

But if it's already clear that they won't, it appears to me it would
be more efficient in the long run to just allocate our own anon_vma.

Did you have something in mind that I missed?

Hannes

2010-04-10 23:35:58

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Johannes Weiner wrote:
>
> Did you have something in mind that I missed?

Mostly that the corner cases will never matter, and I'd prefer to keep the
code simpler than to care deeply.

For example, the only case you'd see vm_ops->close() is for special device
mappings. It's true that they cannot have their vma's merged, but it's
also true that they (a) will seldom have anon_vma's anyway and (b) would
never get mapped very many times so that anon_vma merging would be an
issue.

In other words, it's a "don't care" situation, where to keep the code
simpler we just document that we don't care.

Linus

2010-04-11 13:16:30

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Borislav Petkov
Date: Sat, Apr 10, 2010 at 11:51:15PM +0200

> > Damn, I really hoped that was it. Three independent bugs found and fixed,
> > and still no joy? Oh well.
>
> Yep, I'll redo the testing tomorrow, so that we are sure that even with
> the _three_ bugs fixed we still hit the funky list element issue.

Ok, I could verify that the three patches we were talking about still
can't fix the issue. However, just to make sure, I'm sending the versions
of the patches I used for you guys to check.

[ 529.667108] PM: Preallocating image memory...
[ 529.930881] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 529.931275] IP: [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[ 529.931377] PGD 22e33d067 PUD 22ddc1067 PMD 0
[ 529.931377] Oops: 0000 [#1] PREEMPT SMP
[ 529.931377] last sysfs file: /sys/power/state
[ 529.931377] CPU 3
[ 529.931377] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[ 529.931377]
[ 529.931377] Pid: 3354, comm: hib.sh Tainted: G W 2.6.34-rc3-00503-g0fcc334 #1 M3A78 PRO/System Product Name
[ 529.931377] RIP: 0010:[<ffffffff810c603c>] [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[ 529.931377] RSP: 0018:ffff880105a118b8 EFLAGS: 00010283
[ 529.931377] RAX: ffff88022dc896c8 RBX: ffffea0007a15e10 RCX: 0000000000000000
[ 529.931377] RDX: ffff880105a11cf8 RSI: ffff88022dc896a0 RDI: ffff88022b760000
[ 529.931377] RBP: ffff880105a11938 R08: 0000000000000002 R09: 0000000000000000
[ 529.931377] R10: 0000000000000000 R11: ffffffff00000012 R12: 0000000000000000
[ 529.931377] R13: ffffffffffffffe0 R14: ffff88022dc89688 R15: ffff880105a11a00
[ 529.931377] FS: 00007f21045876f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[ 529.931377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 529.931377] CR2: 0000000000000000 CR3: 000000022b33f000 CR4: 00000000000006e0
[ 529.931377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 529.931377] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 529.931377] Process hib.sh (pid: 3354, threadinfo ffff880105a10000, task ffff88022b760000)
[ 529.931377] Stack:
[ 529.931377] ffff88022dc896c8 00000000810b0082 0000000000000000 0000000000000000
[ 529.931377] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000020
[ 529.931377] <0> 0000000000000000 0000000200000000 7fffffffffffffff ffffea0007a15e38
[ 529.931377] Call Trace:
[ 529.931377] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 529.931377] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 529.931377] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 529.931377] [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 529.931377] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 529.931377] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[ 529.931377] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 529.931377] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 529.931377] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 529.931377] [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30
[ 529.931377] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 529.931377] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 529.931377] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 529.931377] [<ffffffff8140bd74>] ? printk+0x41/0x45
[ 529.931377] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 529.931377] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 529.931377] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 529.931377] [<ffffffff8118f573>] kobj_attr_store+0x17/0x19
[ 529.931377] [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144
[ 529.931377] [<ffffffff810db63b>] vfs_write+0xb2/0x153
[ 529.931377] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 529.931377] [<ffffffff810db79f>] sys_write+0x4a/0x71
[ 529.931377] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 529.931377] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 529.931377] RIP [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[ 529.931377] RSP <ffff880105a118b8>
[ 529.931377] CR2: 0000000000000000
[ 529.945250] ---[ end trace caa5471c993e6461 ]---
[ 529.945558] note: hib.sh[3354] exited with preempt_count 2
[ 529.945710] BUG: scheduling while atomic: hib.sh/3354/0x10000003
[ 529.945858] INFO: lockdep is turned off.
[ 529.946005] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[ 529.947595] Pid: 3354, comm: hib.sh Tainted: G D W 2.6.34-rc3-00503-g0fcc334 #1
[ 529.947848] Call Trace:
[ 529.947993] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[ 529.948147] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[ 529.948296] [<ffffffff8140c188>] schedule+0xe3/0x7ff
[ 529.948449] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[ 529.948599] [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[ 529.948748] [<ffffffff8140c971>] _cond_resched+0x2c/0x37
[ 529.948896] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[ 529.949049] [<ffffffff8140f01e>] ? _raw_spin_lock_irqsave+0x1e/0x85
[ 529.949199] [<ffffffff8105a878>] ? up+0x14/0x3e
[ 529.949347] [<ffffffff810c171f>] exit_mmap+0x102/0x1e4
[ 529.949639] [<ffffffff810c1686>] ? exit_mmap+0x69/0x1e4
[ 529.949787] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 529.949935] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 529.950087] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 529.950236] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[ 529.950525] [<ffffffff8100616b>] ? oops_end+0x47/0x93
[ 529.950671] [<ffffffff810061b2>] oops_end+0x8e/0x93
[ 529.950819] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[ 529.950967] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[ 529.951120] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[ 529.951276] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[ 529.951572] [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[ 529.951719] [<ffffffff81410363>] ? error_sti+0x5/0x6
[ 529.951867] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 529.952018] [<ffffffff8140ec9b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 529.952170] [<ffffffff8141017f>] page_fault+0x1f/0x30
[ 529.952319] [<ffffffff810c603c>] ? page_referenced+0xee/0x1dc
[ 529.952615] [<ffffffff810c5fce>] ? page_referenced+0x80/0x1dc
[ 529.952762] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 529.952911] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 529.953065] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 529.953214] [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 529.953363] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 529.953627] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[ 529.953775] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 529.953924] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 529.954077] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 529.954226] [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30
[ 529.954486] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 529.954632] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 529.954782] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 529.954931] [<ffffffff8140bd74>] ? printk+0x41/0x45
[ 529.955083] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 529.955233] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 529.955457] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 529.955604] [<ffffffff8118f573>] kobj_attr_store+0x17/0x19
[ 529.955752] [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144
[ 529.955900] [<ffffffff810db63b>] vfs_write+0xb2/0x153
[ 529.956053] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 529.956202] [<ffffffff810db79f>] sys_write+0x4a/0x71
[ 529.956351] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 537.634362] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[ 538.129750] SysRq : Emergency Sync
[ 538.130161] Emergency Sync complete
[ 538.902386] SysRq : Emergency Remount R/O
[ 539.328830] SysRq : Resetting

--
Regards/Gruss,
Boris.

2010-04-11 13:20:06

by Borislav Petkov

Subject: [PATCH 1/3] mm: make page freeing path RCU-safe

From: Linus Torvalds

On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > And I got an oops again, this time the #GP from couple of days ago.
>
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how that
relates to the freeing of the 'anon_vma' that the page->mapping points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

tlb = tlb_gather_mmu(mm,..)
..
unmap_vmas(&tlb, vma ..
..
free_pgtables()
..
tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stays around until after the
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the
pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just
been released back to the SLUB caches, the page itself might have been
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the above might fix it if I'm not just raving. I
really might be missing something here.
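
The ordering argument can be written down as a toy state machine. Everything in this sketch is hypothetical (toy_state and the toy_* helpers are invented names), and it assumes, purely for illustration, that holding the RCU read lock postpones the reclaim of the anon_vma until tlb_finish_mmu(). Whether the real allocator behaves that way is exactly the open question in this mail, so treat it as a picture of the suspected window, not as a statement about the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	bool page_freed;		/* page handed back to the allocator */
	bool anon_vma_unlinked;		/* unlink_anon_vmas() has run */
	bool anon_vma_reclaimed;	/* memory behind page->mapping is gone */
	bool rcu_held;			/* toy stand-in for rcu_read_lock() */
};

/* Models tlb_gather_mmu(), optionally with the patch's rcu_read_lock(). */
static void toy_gather(struct toy_state *s, bool take_rcu)
{
	s->rcu_held = take_rcu;
}

/* Models unmap_vmas(): pages are unmapped and queued, but not freed yet. */
static void toy_unmap_vmas(struct toy_state *s)
{
	(void)s;
}

/* Models free_pgtables(): the anon_vma is unlinked here. */
static void toy_free_pgtables(struct toy_state *s)
{
	s->anon_vma_unlinked = true;
	if (!s->rcu_held)	/* assumption: nothing delays the reclaim */
		s->anon_vma_reclaimed = true;
}

/* Models tlb_finish_mmu(): queued pages are freed after the TLB flush. */
static void toy_finish(struct toy_state *s)
{
	s->page_freed = true;
	s->rcu_held = false;
	if (s->anon_vma_unlinked)
		s->anon_vma_reclaimed = true;
}

/* A page that still exists while its page->mapping target is gone. */
static bool toy_stale_mapping(const struct toy_state *s)
{
	return !s->page_freed && s->anon_vma_reclaimed;
}

/* Run the teardown sequence and report whether the window ever opens. */
static bool toy_window_exists(bool take_rcu)
{
	struct toy_state s = { false, false, false, false };
	bool seen = false;

	toy_gather(&s, take_rcu);
	toy_unmap_vmas(&s);
	seen = seen || toy_stale_mapping(&s);
	toy_free_pgtables(&s);
	seen = seen || toy_stale_mapping(&s);
	toy_finish(&s);
	seen = seen || toy_stale_mapping(&s);
	return seen;
}
```

Without the read lock the model shows a stale page->mapping between free_pgtables() and tlb_finish_mmu(); with it, the window closes in this simplified picture.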

Linus
---
include/asm-generic/tlb.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
#define _ASM_GENERIC__TLB_H

#include <linux/swap.h>
+#include <linux/rcupdate.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)

tlb->fullmm = full_mm_flush;

+ rcu_read_lock();
return tlb;
}

@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
/* keep the page table cache within bounds */
check_pgt_cache();

+ rcu_read_unlock();
put_cpu_var(mmu_gathers);
}

--
1.7.0.3

2010-04-11 13:21:31

by Borislav Petkov

Subject: [PATCH 3/3] mm: fixup vma_adjust

From: Linus Torvalds

On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Borislav Petkov
> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200
>
> > Anyways, testing...
>
> Nope, still b0rked. And this time it is not a funny pattern but the
> ffffffffffffffe0 we had originally.

Ok, I think that just depends on who happens to re-use the allocation and
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already
mentioned to Rik earlier - the code is very hard to understand, and it's
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is,
apart from the actual unmapping sequence, the only other place that
actually frees anon_vmas. So there are reasons to be very suspicious of
that code.

And I think that code can actually lose an anon_vma chain. It's totally
screwing up the "import anonvma" case: when it does

if (anon_vma_clone(importer, vma)) {
	return -ENOMEM;
}
importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma".

In which case we actually end up with an _empty_ chain (because importer
didn't have a chain to begin with!) but "importer->anon_vma" points to an
anon_vma.

And then when we do that "remove_next", we actually get rid of the only
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone()
only makes sense if importer is set, and it is only ever set _inside_ the
earlier if-statement, so the whole code should be moved inside there), nor
do some of the comments.

This patch is scary and untested, but the more I look at that code, the
more convinced I am that vma_adjust was _really_ badly screwed up. The
patch below may make things worse. I'll test it myself too, but I'm
sending it out first, since I was writing the email as I was looking at
the piece of cr*p.

Linus
---
mm/mmap.c | 24 ++++++++----------------
1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct address_space *mapping = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
- struct anon_vma *anon_vma = NULL;
long adjust_next = 0;
int remove_next = 0;

if (next && !insert) {
+ struct vm_area_struct *exporter = NULL;
+
if (end >= next->vm_end) {
/*
* vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end > next->vm_start) {
/*
@@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 5 shifting the boundary up.
*/
adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end < vma->vm_end) {
/*
@@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 4 shifting the boundary down.
*/
adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
- anon_vma = next->anon_vma;
+ exporter = vma;
importer = next;
}
- }

- /*
- * When changing only vma->vm_end, we don't really need anon_vma lock.
- */
- if (vma->anon_vma && (insert || importer || start != vma->vm_start))
- anon_vma = vma->anon_vma;
- if (anon_vma) {
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
* shrinking vma had, to cover any anon pages imported.
*/
- if (importer && !importer->anon_vma) {
- /* Block reverse map lookups until things are set up. */
- if (anon_vma_clone(importer, vma)) {
+ if (exporter && exporter->anon_vma && !importer->anon_vma) {
+ if (anon_vma_clone(importer, exporter))
return -ENOMEM;
- }
- importer->anon_vma = anon_vma;
+ importer->anon_vma = exporter->anon_vma;
}
}

--
1.7.0.3

2010-04-11 13:24:13

by Borislav Petkov

Subject: [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity

From: Linus Torvalds

On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an anon_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).
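
A minimal userspace sketch of that point, assuming a GCC-compatible
compiler for __typeof__. The macro has the same shape as the kernel's
ACCESS_ONCE(); struct vma_model, struct anon_vma_model and
reusable_model() are hypothetical names, and the singleton test is
reduced to a plain flag.

```c
#include <assert.h>
#include <stddef.h>

/* Force exactly one load of x by going through a volatile lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct anon_vma_model { int unused; };

struct vma_model {
    struct anon_vma_model *anon_vma;  /* may go NULL -> non-NULL under us */
    int chain_is_singular;            /* stand-in for list_is_singular() */
};

/*
 * Take one snapshot of the pointer and use only that snapshot for both
 * the NULL test and the return value, so the two can never disagree.
 */
static struct anon_vma_model *reusable_model(struct vma_model *old)
{
    struct anon_vma_model *anon_vma;

    assert(old != NULL);
    anon_vma = ACCESS_ONCE(old->anon_vma);
    if (anon_vma && old->chain_is_singular)
        return anon_vma;
    return NULL;
}
```

Without the snapshot, a compiler would be free to load old->anon_vma
once for the test and again for the return, which is the NULL then
non-NULL re-load the paragraph above is talking about.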

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.
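
To make the "compares, never dereferences" claim concrete, here is a
userspace sketch. list_empty() and list_is_singular() are written the
way I understand the kernel's list.h versions to look; list_init() and
list_add_tail() are just small helpers for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h)
{
    assert(h != NULL);
    h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/*
 * Only head->next and head->prev are read, and they are only compared.
 * Nothing behind those pointers is ever touched, so garbage values give
 * a possibly wrong answer but never a fault.
 */
static int list_is_singular(const struct list_head *head)
{
    return !list_empty(head) && (head->next == head->prev);
}
```

A head whose next/prev hold junk addresses such as 0x10 and 0x20 simply
makes list_is_singular() return 0. In the worst case the answer is
wrong, which here only means a mergeable anon_vma gets rejected.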

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

Linus
---
mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
}

/*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ return a->vm_end == b->vm_start &&
+ mpol_equal(vma_policy(a), vma_policy(b)) &&
+ a->vm_file == b->vm_file &&
+ !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+ b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ if (anon_vma_compatible(a, b)) {
+ struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+ if (anon_vma && list_is_singular(&old->anon_vma_chain))
+ return anon_vma;
+ }
+ return NULL;
+}
+
+/*
* find_mergeable_anon_vma is used by anon_vma_prepare, to check
* neighbouring vmas for a suitable anon_vma, before it goes off
* to allocate a new anon_vma. It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
*/
struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
+ struct anon_vma *anon_vma;
struct vm_area_struct *near;
- unsigned long vm_flags;

near = vma->vm_next;
if (!near)
goto try_prev;

- /*
- * Since only mprotect tries to remerge vmas, match flags
- * which might be mprotected into each other later on.
- * Neither mlock nor madvise tries to remerge at present,
- * so leave their flags as obstructing a merge.
- */
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && vma->vm_end == near->vm_start &&
- mpol_equal(vma_policy(vma), vma_policy(near)) &&
- can_vma_merge_before(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff +
- ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, vma, near);
+ if (anon_vma)
+ return anon_vma;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
if (!near)
goto none;

- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && near->vm_end == vma->vm_start &&
- mpol_equal(vma_policy(near), vma_policy(vma)) &&
- can_vma_merge_after(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, near, vma);
+ if (anon_vma)
+ return anon_vma;
none:
/*
* There's no absolute need to look only at touching neighbours:
--
1.7.0.3

2010-04-11 17:12:31

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Borislav Petkov wrote:
>
> Ok, I could verify that the three patches we were talking about still
> can't fix the issue. However, just to make sure I'm sending the versions
> of the patches I used for you guys to check.

Yup, the patches are the ones I wanted you to try.

So either my fixes were buggy (possible, especially for the vma_adjust
case), or there are other bugs still lurking.

The scary part is that the _old_ anon_vma code didn't really care about
the anon_vma all that deeply. It was just a placeholder, if you got some
of it wrong the worst that would probably happen would be that a page
could never find all the mappings it had. So it was a possible swap
efficiency problem when we cannot get rid of all mapped pages, but if it
only happens for some small and unusual special case, nobody would ever
have noticed.

With the new code, when you have a page that is associated with a stale
anon_vma, you get the page_referenced() oops instead.

And I can't find the bug. Everything I've looked at looks fine. So I'm
going to ask you to start applying "validation patches" - code to check
some internal consistency, and seeing if we break that internal
consistency somewhere.

It may be that Rik has some patches like this from his development work,
but here's the first one. This patch should have caught the vma_adjust()
problem, but all it caught for me was that "anon_vma_clone()" ended up
cloning the avc entries in the wrong order so the lists didn't actually
look exactly the same.

The patch fixes that case, so if this triggers any warnings for you, I
think it's a real bug.
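
Why a forward walk can clone the entries in the wrong order is easy to
show in a toy model. The sketch below assumes that the linking step
inserts each new entry at the head of the destination list (list_add()
style); all names here (struct link, chain_add, chain_clone, chain_ids)
are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct link {
    int anon_vma_id;              /* stands in for avc->anon_vma */
    struct link *next, *prev;
};

static void chain_init(struct link *head)
{
    head->next = head->prev = head;
}

/* Insert right after the head, the way list_add() does. */
static void chain_add(struct link *n, struct link *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Clone src onto dst, walking src head-to-tail (forward) or tail-to-head. */
static int chain_clone(struct link *dst, const struct link *src, int reverse)
{
    const struct link *p = reverse ? src->prev : src->next;

    while (p != src) {
        struct link *n = malloc(sizeof(*n));
        if (!n)
            return -1;
        n->anon_vma_id = p->anon_vma_id;
        chain_add(n, dst);
        p = reverse ? p->prev : p->next;
    }
    return 0;
}

/* Copy the ids in list order into out[]; returns how many were written. */
static size_t chain_ids(const struct link *head, int *out, size_t max)
{
    size_t n = 0;
    const struct link *p;

    assert(out != NULL);
    for (p = head->next; p != head && n < max; p = p->next)
        out[n++] = p->anon_vma_id;
    return n;
}
```

With a source chain 1, 2, 3, the forward walk produces 3, 2, 1 because
every entry is pushed in front of the previous one, while the reverse
walk produces 1, 2, 3. That matters for verify_vma(), which expects the
first chain entry to match vma->anon_vma.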

But I'm pretty sure that the problem is that we have a "page->mapping"
that points to an anon_vma that no longer exists, and you can easily get
that while still having valid vma chains - they just aren't necessarily
the complete _set_ of chains they should be.

[ In particular, I think that the _real_ problem is that we don't clear
"page->mapping" when we unmap a page.

See the comment at the end of page_remove_rmap(), and it also explains
the test for "page_mapped()" in page_lock_anon_vma().

But I think the bug you see might be exactly the race between
page_mapped() and actually getting the anon_vma spinlock. I'd have
expected that window to be too small to ever hit, though, which is why I
find it a bit unlikely. But it would explain why you _sometimes_
actually get a hung spinlock too - you never get the spinlock at all,
and somebody replaced the data with something that the spinlock code
thinks is a locked spinlock - but is no longer a spinlock at all ]

Linus

---
mm/mmap.c | 18 ++++++++++++++++++
mm/rmap.c | 2 +-
2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..890c169 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,6 +1565,22 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

EXPORT_SYMBOL(get_unmapped_area);

+static void verify_vma(struct vm_area_struct *vma)
+{
+ if (vma->anon_vma) {
+ struct anon_vma_chain *avc;
+ if (WARN_ONCE(list_empty(&vma->anon_vma_chain), "vma has anon_vma but empty chain"))
+ return;
+ /* The first entry of the avc chain should match! */
+ avc = list_entry(vma->anon_vma_chain.next, struct anon_vma_chain, same_vma);
+ WARN_ONCE(avc->anon_vma != vma->anon_vma, "anon_vma entry doesn't match anon_vma_chain");
+ WARN_ONCE(avc->vma != vma, "vma entry doesn't match anon_vma_chain");
+ } else {
+ WARN_ONCE(!list_empty(&vma->anon_vma_chain), "vma has no anon_vma but has chain");
+ }
+}
+
+
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
@@ -1598,6 +1614,8 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
mm->mmap_cache = vma;
}
}
+ if (vma)
+ verify_vma(vma);
return vma;
}

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;

- list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+ list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;

2010-04-11 17:20:47

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Linus Torvalds wrote:
>
> But I think the bug you see might be exactly the race between
> page_mapped() and actually getting the anon_vma spinlock. I'd have
> expected that window to be too small to ever hit, though, which is why I
> find it a bit unlikely. But it would explain why you _sometimes_
> actually get a hung spinlock too - you never get the spinlock at all,
> and somebody replaced the data with something that the spinlock code
> thinks is a locked spinlock - but is no longer a spinlock at all ]

Actually, so if it's that race, then we might get rid of the oops with
this total hack.

NOTE! If this is the race, then the hack really is just a hack, because it
doesn't really solve anything. We still take the spinlock, and if bad
things have happened, _that_ can still very much fail, and you get the
watchdog lockup message instead. So this doesn't really fix anything.

But if this patch changes behavior, and you no longer see the oops, that
tells us _something_. I'm not sure how useful that "something" is, but it
at least means that there are no _mapped_ pages that have that stale
anon_vma pointer in page->mapping.

Conversely, if you still see the oops (rather than the watchdog), that
means that we actually have pages that are still marked mapped, and that
despite that mapped state have a stale page->mapping pointer. I actually
find that the more likely case, because otherwise the window is _so_ small
that I don't see how you can hit the oops so reliably.

Anyway - probably worth testing, along with the verify_vma() patch. If
nothing else, if there is no new behavior, even that tells us something.
Even if that "something" is not a huge piece of information.
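
The check/lock/re-check shape of the hack can be modelled in a single
thread. This is a sketch only: the structures are toy stand-ins, the
"lock" is a plain flag, and the race_hook callback is a hypothetical
device that plays the part of another CPU running inside the window.

```c
#include <assert.h>
#include <stddef.h>

struct anon_vma_model { int locked; };

struct page_model {
    int mapcount;                      /* > 0 means "mapped" */
    struct anon_vma_model *mapping;    /* may be stale once mapcount is 0 */
};

/* Hypothetical hook standing in for "another CPU runs here". */
typedef void (*race_hook)(struct page_model *);

static struct anon_vma_model *lock_anon_vma_model(struct page_model *page,
                                                  race_hook between)
{
    struct anon_vma_model *av;

    assert(page != NULL);
    if (page->mapcount <= 0)           /* the existing page_mapped() test */
        return NULL;
    av = page->mapping;
    if (between)
        between(page);                 /* the window under discussion */
    av->locked = 1;                    /* models spin_lock() */
    if (page->mapcount > 0)            /* the re-check the hack adds */
        return av;
    av->locked = 0;                    /* models spin_unlock(), give up */
    return NULL;
}

/* Example hook: the last mapping goes away inside the window. */
static void unmap_last(struct page_model *page)
{
    page->mapcount = 0;
}
```

With no hook the caller gets the locked object back. With unmap_last as
the hook the function returns NULL after dropping the lock. Note that
the model still writes to av->locked before re-checking, which mirrors
the caveat above: the lock itself is taken on possibly stale memory.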

Linus

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)

anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
- return anon_vma;
+
+ if (page_mapped(page))
+ return anon_vma;
+
+ spin_unlock(&anon_vma->lock);
out:
rcu_read_unlock();
return NULL;

2010-04-11 18:55:19

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sun, Apr 11, 2010 at 10:16:10AM -0700

> Conversely, if you still see the oops (rather than the watchdog), that
> means that we actually have pages that are still marked mapped, and that
> despite that mapped state have a stale page->mapping pointer. I actually
> find that the more likely case, because otherwise the window is _so_ small
> that I don't see how you can hit the oops so reliably.

Ok, I did test with all 5 patches applied. It oopsed with the same
trace, see below. Except for one kernel/sched.c:3555 warning (the check
for the spinlock count overflowing), nothing else. :(

I tried to see whether the page->mapping pointer is stale, I dunno,
maybe there could be something in the register dump which could tell us
what's happening. This is how I see it, I could very well be wrong and
missing something though:


So, yes, we oops at the same place, however, a bit earlier we do

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return referenced;

which compiles here to

.loc 1 496 0
movq %rbx, %rdi # page,
call page_lock_anon_vma #
.LVL288:
.loc 1 497 0
testq %rax, %rax # anon_vma
.LVL289:
.loc 1 496 0
movq %rax, %r14 #, anon_vma

and I checked that on the path before the instruction where we oops we
don't touch %r14, so the value in the register dump below should be that
anon_vma, which looks like a valid kernel pointer. We dereference it
later to get anon_vma->head.next with

.loc 1 501 0
movq 64(%r14), %r13 # <variable>.head.next, <variable>.head.next
.LBE1287:
leaq 64(%r14), %rax #,
movq %rax, -128(%rbp) #, %sfp
.LBB1288:
subq $32, %r13 #, avc

which ends up in %r13 as ffffffffffffffe0.

So, it really looks like at least that list_head in anon_vma is
bollocks, or even the whole anon_vma. So if this is correct, it is
highly likely that the anon_vma is already freed material or not
initialized at all.
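
The arithmetic behind that reading of the registers can be checked with
a small sketch. It assumes the anon_vma_chain layout implied by the
"$32" in the asm above (two pointers, then the same_vma list_head, then
same_anon_vma). Pointers are modelled as uint64_t so the offsets match
x86-64 on any host; the struct and function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit layout model: every pointer is spelled as a uint64_t. */
struct list_head64 { uint64_t next, prev; };

struct avc64 {
    uint64_t vma;
    uint64_t anon_vma;
    struct list_head64 same_vma;
    struct list_head64 same_anon_vma;    /* lands at offset 32 */
};

/* container_of() on a same_anon_vma pointer: subtract the member offset. */
static uint64_t avc_from_list_ptr(uint64_t list_ptr)
{
    assert(offsetof(struct avc64, same_anon_vma) == 32);
    return list_ptr - offsetof(struct avc64, same_anon_vma);
}

/* Address read by "movq 32(%r13), %rax", i.e. avc->same_anon_vma.next. */
static uint64_t next_field_addr(uint64_t avc)
{
    return avc + offsetof(struct avc64, same_anon_vma)
               + offsetof(struct list_head64, next);
}
```

Feeding a NULL head.next through avc_from_list_ptr() gives
0xffffffffffffffe0, the value seen in %r13, and the following load then
hits address 0, which matches CR2 in the oops below.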

Hm...


[ 616.317201] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 616.329964] PM: Preallocating image memory...
[ 616.586463] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 616.586851] IP: [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[ 616.587045] PGD 225dcf067 PUD 22627f067 PMD 0
[ 616.587126] Oops: 0000 [#1] PREEMPT SMP
[ 616.587126] last sysfs file: /sys/power/state
[ 616.587126] CPU 1
[ 616.587126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp
[ 616.587126]
[ 616.587126] Pid: 3453, comm: hib.sh Tainted: G W 2.6.34-rc3-00505-g1d9bb34 #1 M3A78 PRO/System Product Name
[ 616.587126] RIP: 0010:[<ffffffff810c614f>] [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[ 616.587126] RSP: 0018:ffff88022b3258b8 EFLAGS: 00010283
[ 616.587126] RAX: ffff880200ba4b88 RBX: ffffea00076b2b30 RCX: ffff88022eacaa58
[ 616.587126] RDX: ffffffff810c5e7a RSI: ffff880200ba4b60 RDI: ffff88022fa492e0
[ 616.587126] RBP: ffff88022b325938 R08: 0000000000000002 R09: 0000000000000000
[ 616.587126] R10: ffff88022eacaa30 R11: 0000000000000001 R12: 0000000000000000
[ 616.587126] R13: ffffffffffffffe0 R14: ffff880200ba4b48 R15: ffff88022b325a00
[ 616.587126] FS: 00007f0b140306f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 616.587126] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 616.587126] CR2: 0000000000000000 CR3: 000000022c44f000 CR4: 00000000000006e0
[ 616.587126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 616.587126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 616.587126] Process hib.sh (pid: 3453, threadinfo ffff88022b324000, task ffff88022fa492e0)
[ 616.587126] Stack:
[ 616.587126] ffff880200ba4b88 00000000810c5e5f ffff88022b325918 ffffffff810c5fd7
[ 616.587126] <0> ffff880200000000 ffffffff00000001 ffff88022b325fd8 ffffea00076c1a80
[ 616.587126] <0> ffffea00076c1a80 000000022b325cf8 ffffea00076c1a80 ffffea00076b2b58
[ 616.587126] Call Trace:
[ 616.587126] [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4
[ 616.587126] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 616.587126] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 616.587126] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 616.587126] [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58
[ 616.587126] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 616.587126] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[ 616.587126] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 616.587126] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 616.587126] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 616.587126] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 616.587126] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 616.587126] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 616.587126] [<ffffffff8140be84>] ? printk+0x41/0x45
[ 616.587126] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 616.587126] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 616.587126] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 616.587126] [<ffffffff8118f687>] kobj_attr_store+0x17/0x19
[ 616.587126] [<ffffffff8112e540>] sysfs_write_file+0x108/0x144
[ 616.587126] [<ffffffff810db74f>] vfs_write+0xb2/0x153
[ 616.587126] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 616.587126] [<ffffffff810db8b3>] sys_write+0x4a/0x71
[ 616.587126] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 616.587126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 02 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 616.587126] RIP [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[ 616.587126] RSP <ffff88022b3258b8>
[ 616.587126] CR2: 0000000000000000
[ 616.600838] ---[ end trace 0ea0c6b4ead21c8f ]---
[ 616.600984] note: hib.sh[3453] exited with preempt_count 2
[ 616.601282] BUG: scheduling while atomic: hib.sh/3453/0x10000003
[ 616.601431] INFO: lockdep is turned off.
[ 616.601584] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp
[ 616.603115] Pid: 3453, comm: hib.sh Tainted: G D W 2.6.34-rc3-00505-g1d9bb34 #1
[ 616.603460] Call Trace:
[ 616.603605] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[ 616.603755] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[ 616.603903] [<ffffffff8140c298>] schedule+0xe3/0x7ff
[ 616.604051] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[ 616.604230] [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[ 616.604381] [<ffffffff8140ca81>] _cond_resched+0x2c/0x37
[ 616.604529] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[ 616.604678] [<ffffffff810c16c0>] exit_mmap+0x102/0x1e4
[ 616.604826] [<ffffffff810c1627>] ? exit_mmap+0x69/0x1e4
[ 616.604975] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 616.605124] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 616.605280] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 616.605430] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[ 616.605579] [<ffffffff8100616b>] ? oops_end+0x47/0x93
[ 616.605727] [<ffffffff810061b2>] oops_end+0x8e/0x93
[ 616.605875] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[ 616.606023] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[ 616.606176] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[ 616.606330] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[ 616.606479] [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[ 616.606628] [<ffffffff81410463>] ? error_sti+0x5/0x6
[ 616.606776] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 616.606926] [<ffffffff8140edab>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 616.607076] [<ffffffff8141027f>] page_fault+0x1f/0x30
[ 616.607227] [<ffffffff810c5e7a>] ? page_lock_anon_vma+0x0/0xbb
[ 616.607381] [<ffffffff810c614f>] ? page_referenced+0xee/0x1dc
[ 616.607530] [<ffffffff810c60e1>] ? page_referenced+0x80/0x1dc
[ 616.607678] [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4
[ 616.607827] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[ 616.607976] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[ 616.608131] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[ 616.608284] [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58
[ 616.608435] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[ 616.608585] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[ 616.608734] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[ 616.608883] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[ 616.609031] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[ 616.609183] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[ 616.609337] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[ 616.609486] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 616.609636] [<ffffffff8140be84>] ? printk+0x41/0x45
[ 616.609784] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[ 616.609933] [<ffffffff81078a08>] hibernate+0xce/0x172
[ 616.610080] [<ffffffff81077775>] state_store+0x5c/0xd3
[ 616.610233] [<ffffffff8118f687>] kobj_attr_store+0x17/0x19
[ 616.610383] [<ffffffff8112e540>] sysfs_write_file+0x108/0x144
[ 616.610532] [<ffffffff810db74f>] vfs_write+0xb2/0x153
[ 616.610680] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 616.610830] [<ffffffff810db8b3>] sys_write+0x4a/0x71
[ 616.610978] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 682.501863] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[ 683.552767] SysRq : Emergency Sync
[ 683.553147] Emergency Sync complete
[ 684.180708] SysRq : Emergency Remount R/O
[ 684.927560] SysRq : Resetting

--
Regards/Gruss,
Boris.

2010-04-11 19:50:28

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> NOTE! If this is the race, then the hack really is just a hack, because it
> doesn't really solve anything. We still take the spinlock, and if bad
> things have happened, _that_ can still very much fail, and you get the
> watchdog lockup message instead. So this doesn't really fix anything.

Looking around the code some more, zap_pte_range()
calls page_remove_rmap(), which leaves the
page->mapping in place and has this comment:

	/*
	 * It would be tidy to reset the PageAnon mapping here,
	 * but that might overwrite a racing page_add_anon_rmap
	 * which increments mapcount after us but sets mapping
	 * before us: so leave the reset to free_hot_cold_page,
	 * and remember that it's only reliable while mapped.
	 * Leaving it set also helps swapoff to reinstate ptes
	 * faster for those pages still in swapcache.
	 */

I wonder if we can clear page->mapping here, if
list_is_singular(anon_vma->head). That way we
will not leave stale pointers behind.

Adding another VMA to the anon_vma can happen
at fork time - which will not happen simultaneously
with exit or munmap, because the mmap_sem is taken
for write during either code path.

Am I overlooking something here?
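
For readers unfamiliar with the helper mentioned above, here is a minimal user-space sketch of what a list_is_singular() test means on a circular doubly linked list. The struct and function names mirror the kernel's list API, but this is a stand-alone model written for illustration, not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal user-space stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert new_node right after head. */
static void list_add(struct list_head *new_node, struct list_head *head)
{
	assert(new_node && head);
	new_node->next = head->next;
	new_node->prev = head;
	head->next->prev = new_node;
	head->next = new_node;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True when the list holds exactly one entry besides the head itself. */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}
```

In this model the test is true only when exactly one entry hangs off the head, which corresponds to an anon_vma with a single VMA attached.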

2010-04-11 21:47:00

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> Actually, so if it's that race, then we might get rid of the oops with
> this total hack.

Another thing I just thought of.

The anon_vma struct will not be reused for something completely
different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
is created with.

The anon_vma_chain structs are allocated from a slab without that
flag, so they can be reused for something else in the middle of
an RCU section.

Is that something worth fixing, or is this so subtle that we'd
rather not have the code rely on this kind of behaviour at all?
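
The type-stable reuse described above can be modelled in user space. The sketch below is only an illustration of the idea: the pool, the generation counter and all names are hypothetical and are not how the slab allocator or anon_vma actually track identity. It shows that a freed slot can come back at the same address as the same type, so a holder of an old pointer has to revalidate rather than trust the address.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical model object; the generation field exists only in this sketch. */
struct anon_vma_model {
	unsigned long generation;	/* bumped on every allocation */
	int in_use;
};

static struct anon_vma_model pool[POOL_SIZE];

/* Hand out the first free slot; memory is only ever reused for this type. */
static struct anon_vma_model *pool_alloc(void)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (!pool[i].in_use) {
			pool[i].in_use = 1;
			pool[i].generation++;
			return &pool[i];
		}
	}
	return NULL;
}

static void pool_free(struct anon_vma_model *obj)
{
	assert(obj && obj->in_use);
	obj->in_use = 0;
}

/* A holder of a possibly stale pointer must revalidate identity. */
static int still_same_object(const struct anon_vma_model *obj,
			     unsigned long generation_seen)
{
	return obj->in_use && obj->generation == generation_seen;
}
```

A pointer comparison alone cannot tell the old object from its replacement here, which is why code relying on this kind of cache re-checks state after taking the lock.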

2010-04-12 00:18:12

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Borislav Petkov wrote:
>
> > Conversely, if you still see the oops (rather than the watchdog), that
> > means that we actually have pages that are still marked mapped, and that
> > despite that mapped state have a stale page->mapping pointer. I actually
> > find that the more likely case, because otherwise the window is _so_ small
> > that I don't see how you can hit the oops so reliably.
>
> Ok, I did test with all 5 patches applied. It oopsed with the same
> trace, see below. Except one kernel/sched.c:3555 warning checking
> spinlock count overflowing, nothing else. :(

Ok, that preempt-count thing is a real problem, but should be unrelated to
your issues.

Anyway, so this all means that we definitely have lost sight of an
'anon_vma', even if page->mapping still points to it, and even though the
page is still mapped.

I'll see if I can come up with a patch to do the same kind of validation
on page->mapping as on the anon-vma chains themselves.

> I tried to see whether the page->mapping pointer is stale, I dunno,
> maybe there could be something in the register dump which could tell us
> what's happening.

Sadly, you cannot tell by the pointer. A stale pointer still is a
perfectly fine kernel pointer, it's just that we've long since released
the anon_vma it used to point to, and now it points to some random other
data structure.

> So, it really looks like at least that list_head in anon_vma is
> bollocks, or even the whole anon_vma. So if this is correct, it is
> highly likely that the anon_vma is already freed material or not
> initialized at all.

Yes, it's pretty certain it is long free'd, and re-allocated to something
else.

Linus

2010-04-12 01:09:10

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Linus Torvalds wrote:
>
> I'll see if I can come up with a patch to do the same kind of validation
> on page->mapping as on the anon-vma chains themselves.

Ok, this may or may not work. It hasn't triggered for me, which may be
because it's broken, but maybe it's because I'm not doing whatever it is
you are doing to break our VM.

It checks each anonymous page at unmap time against the vma it gets
unmapped from. It depends on the previous vma_verify debugging patch, and
it would be interesting to hear whether this patch causes any new warnings
for you.

If the warnings do happen, they are not going to be printing out any
hugely informative data apart from the fact that the bad case happened at
all. But if they do trigger, I can try to improve on them - it's just not
worth trying to make them any more interesting if they never trigger.

Linus

---
mm/memory.c | 21 +++++++++++++++++++++
mm/mmap.c | 2 +-
2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..5d2df59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -890,6 +890,25 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }

+extern void verify_vma(struct vm_area_struct *);
+
+static void verify_anon_page(struct vm_area_struct *vma, struct page *page)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma *need_anon_vma = page_anon_vma(page);
+	struct anon_vma_chain *avc;
+
+	verify_vma(vma);
+	if (WARN_ONCE(!anon_vma, "anonymous page in vma without anon_vma"))
+		return;
+	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+		WARN_ONCE(avc->vma != vma, "anon_vma_chain vma entry doesn't match");
+		if (avc->anon_vma == need_anon_vma)
+			return;
+	}
+	WARN_ONCE(1, "page->mapping does not exist in vma chain");
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -940,6 +959,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			if (PageAnon(page))
+				verify_anon_page(vma, page);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
diff --git a/mm/mmap.c b/mm/mmap.c
index 890c169..461f59c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,7 +1565,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

 EXPORT_SYMBOL(get_unmapped_area);

-static void verify_vma(struct vm_area_struct *vma)
+void verify_vma(struct vm_area_struct *vma)
 {
 	if (vma->anon_vma) {
 		struct anon_vma_chain *avc;

2010-04-12 07:21:07

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds
Date: Sun, Apr 11, 2010 at 06:04:39PM -0700

> It checks each anonymous page at unmap time against the vma it gets
> unmapped from. It depends on the previous vma_verify debugging patch, and
> it would be interesting to hear whether this patch causes any new warnings
> for you.
>
> If the warnings do happen, they are not going to be printing out any
> hugely informative data apart from the fact that the bad case happened at
> all. But if they do trigger, I can try to improve on them - it's just not
> worth trying to make them any more interesting if they never trigger.

Haa, I think you're gonna want to improve them :)

WARN_ONCE(1, "page->mapping does not exist in vma chain");

triggered on the first resume, showing a rather messy 4 WARN_ONCEs. Had I
more cores, there might have been more of them :) The WARN path may need
locking if clean output is of interest (see below).

So, anyway, if I can read this correctly, there is a page->mapping
anon_vma which is _not_ in the anon_vmas chain of the vma
(avc->same_vma).
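
To make the failed check concrete, here is a small user-space model of what the debugging patch tests. It is a sketch under simplifying assumptions: the *_model types and the singly linked chain are hypothetical stand-ins for vm_area_struct, anon_vma_chain and page, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct anon_vma_model { int id; };

/* One link between a vma and an anon_vma, chained per vma. */
struct avc_model {
	struct anon_vma_model *anon_vma;
	struct avc_model *next_same_vma;
};

struct vma_model {
	struct avc_model *chain;	/* all anon_vmas this vma is linked to */
};

struct page_model {
	struct anon_vma_model *mapping;	/* anon_vma the page points at */
};

/* Return true when the page's anon_vma is reachable from the vma's chain. */
static bool page_mapping_in_vma_chain(const struct vma_model *vma,
				      const struct page_model *page)
{
	assert(vma && page);
	for (const struct avc_model *avc = vma->chain; avc; avc = avc->next_same_vma) {
		if (avc->anon_vma == page->mapping)
			return true;
	}
	return false;
}
```

The warning in the log corresponds to this function returning false for a page that is being unmapped through the vma.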

And the spot we oops on is in page_referenced_anon():

list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

which is actually where we iterate over all vmas associated with this
anon_vma.
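
As a side note on why a bogus list pointer turns into an address just below zero: list_for_each_entry() converts each list node back into its containing structure by subtracting the member offset. The sketch below redoes that arithmetic in user space with integers, so that a NULL node can be shown safely. The struct layout is an assumption of this sketch (two pointers followed by two list heads), chosen to match the 32-byte offset seen in the disassembly on a 64-bit build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Assumed layout for illustration: two pointers, then two list heads. */
struct avc_model {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * Address that a container_of()-style conversion would compute for a
 * given same_anon_vma node, done in integer arithmetic.
 */
static uintptr_t entry_address(const struct list_head *node)
{
	return (uintptr_t)node - offsetof(struct avc_model, same_anon_vma);
}
```

With 8-byte pointers the offset is 32, so a NULL next pointer yields an entry address of 0xffffffffffffffe0, which matches the %r13 value in the register dump.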

So if that previous anon_vma pointed to by page->mapping has been
falsely unlinked at some point, no wonder we boom on that later.

By the way, I completely understand when you say that your head hurts
from looking at this :).


[ 486.580872] Restarting tasks ... done.
[ 494.167242] [drm] Resetting GPU
[ 495.422354] ------------[ cut here ]------------
[ 495.422407] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[ 495.422442] Hardware name: System Product Name
[ 495.422474] page->mapping does not exist in vma chain
[ 495.422504] Modules linked in:
[ 495.422545] ------------[ cut here ]------------
[ 495.422555] ------------[ cut here ]------------
[ 495.422565] powernow_k8
[ 495.422583] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[ 495.422591] cpufreq_ondemand
[ 495.422597] Hardware name: System Product Name
[ 495.422602] page->mapping does not exist in vma chain cpufreq_powersave
[ 495.422612] Modules linked in: cpufreq_userspace powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table freq_table cpufreq_conservative cpufreq_conservative binfmt_misc binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt kvm_amd dm_mod 8250_pnp kvm 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 495.422676] ipv6Pid: 2919, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1
[ 495.422689] Call Trace:
[ 495.422694] vfat
[ 495.422700] ------------[ cut here ]------------
[ 495.422721] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[ 495.422729] fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[ 495.422746] dm_crypt
[ 495.422751] Hardware name: System Product Name
[ 495.422758] dm_modpage->mapping does not exist in vma chain
[ 495.422767] Modules linked in: 8250_pnp [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[ 495.422784] powernow_k8 cpufreq_ondemand 8250 cpufreq_powersave [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[ 495.422807] serial_core cpufreq_userspace [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[ 495.422828] edac_core freq_table pcspkr cpufreq_conservative [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[ 495.422851] binfmt_misc [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[ 495.422863] k10temp [<ffffffff810368bc>] mmput+0x48/0xb9
[ 495.422876] kvm_amd [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 495.422889] ohci_hcd kvm
[ 495.422903] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 495.422909] ipv6Pid: 2916, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1
[ 495.422927] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 495.422934] Call Trace:
[ 495.422940] vfat [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[ 495.422956] fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[ 495.422972] dm_crypt dm_mod 8250_pnp [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[ 495.422989] 8250 serial_core [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[ 495.423013] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 495.423019] edac_core
[ 495.423025] ---[ end trace d9664ac54d1edb0e ]---
[ 495.423031] pcspkr k10temp ohci_hcd
[ 495.423043] Pid: 2914, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1
[ 495.423055] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[ 495.423063] Call Trace:
[ 495.423073] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[ 495.423087] [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[ 495.423100] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[ 495.423111] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[ 495.423123] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[ 495.423134] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[ 495.423147] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[ 495.423159] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[ 495.423172] [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[ 495.423184] [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[ 495.423194] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 495.423204] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 495.423214] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 495.423225] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 495.423236] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 495.423246] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 495.423266] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 495.423277] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 495.423292] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[ 495.423303] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[ 495.423315] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[ 495.423325] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[ 495.423334] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[ 495.423346] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 495.423357] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[ 495.423365] ---[ end trace d9664ac54d1edb0f ]---
[ 495.423386] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 495.423402] ---[ end trace d9664ac54d1edb10 ]---
[ 495.424191] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[ 495.424215] Hardware name: System Product Name
[ 495.424238] page->mapping does not exist in vma chain
[ 495.424259] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[ 495.424693] Pid: 1923, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1
[ 495.424723] Call Trace:
[ 495.424758] [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[ 495.424788] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[ 495.424816] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[ 495.424843] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[ 495.424875] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[ 495.424901] [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[ 495.424926] [<ffffffff810368bc>] mmput+0x48/0xb9
[ 495.424954] [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[ 495.424981] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[ 495.425008] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 495.425038] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[ 495.425065] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[ 495.425091] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[ 495.425119] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[ 495.425156] ---[ end trace d9664ac54d1edb11 ]---


--
Regards/Gruss,
Boris.

2010-04-12 14:43:57

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
>

> Ho humm.
>
> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".
>
> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
>       tlb = tlb_gather_mmu(mm,..)
>       ..
>       unmap_vmas(&tlb, vma ..
>       ..
>       free_pgtables()
>       ..
>       tlb_finish_mmu(tlb, start, end);
>
> and we actually unmap all the pages in "unmap_vmas()", and then _after_
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
> "free_pgtables()". Fine so far - the anon_vma stays around until after the
> page has been happily unmapped.
>
> But "unmapped all the pages" is _not_ actually the same as "free'd all the
> pages". The actual _freeing_ of the page happens generally in
> tlb_finish_mmu(), because we can free the page only after we've flushed
> any TLB entries.
>
> So what we have in that tlb_gather structure is a list of _pending_ pages
> to be freed, while we already actually free'd the anon_vmas earlier!
>
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
> we use a per-cpu variable), but as far as I can tell it is _not_ an
> RCU-safe region.
>
> So I think we might actually get a real RCU freeing event while this all
> happens. So now the 'anon_vma' that 'page->mapping' points to has not just
> been released back to the SLUB caches, the page itself might have been
> released too.
>
> I dunno. Does the above sound at all sane? Or am I just raving?
>
> Something hacky like the above might fix it if I'm not just raving. I
> really might be missing something here.

Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable
== RCU read lock assumption does hold.

But even with your patch it doesn't close all holes, because while
zap_pte_range() can remove the last mapcount of the page,
tlb_remove_page() et al. don't need to drop the last use count of the
page.

Concurrent reclaim/gup/whatever could still have a count out on the page
delaying the actual free beyond the tlb gather RCU section.
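
The distinction between the map count and the use count can be sketched with two plain counters. This is a simplified user-space model: the field names are hypothetical, the counters start at zero (the kernel's _mapcount starts at -1), and nothing here is atomic.

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
	int mapcount;	/* number of page-table mappings */
	int refcount;	/* all references, including one per mapping */
	bool freed;
};

/* Map the page: takes a mapping and the reference that backs it. */
static void page_map(struct page_model *page)
{
	page->mapcount++;
	page->refcount++;
}

/* Any other user (reclaim, get_user_pages, ...) pins the page. */
static void page_get(struct page_model *page)
{
	assert(!page->freed);
	page->refcount++;
}

/* Drop one reference; the page is freed only when the last one goes. */
static void page_put(struct page_model *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		page->freed = true;
}

/* Unmap: drops the mapping, then the reference that backed it. */
static void page_unmap(struct page_model *page)
{
	assert(page->mapcount > 0);
	page->mapcount--;
	page_put(page);
}
```

In this model a page with an extra pin survives the unmap with a map count of zero, and is only freed by the later page_put(), which is the window being discussed.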

So the reason page->mapping isn't cleared in page_remove_rmap() isn't
detailed beyond a (possible) race with page_add_anon_rmap() (which I
guess would be reclaim trying to unmap the page and a fault re-instating
it).

This also complicates the whole page_lock_anon_vma() thing, so it would
be nice to be able to remove this race and clear page->mapping in
page_remove_rmap().

2010-04-12 15:17:58

by Minchan Kim

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Mon, Apr 12, 2010 at 11:40 PM, Peter Zijlstra wrote:
> On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
>>
>
>> Ho humm.
>>
>> Maybe I'm crazy, but something started bothering me. And I started
>> wondering: when is the 'page->mapping' of an anonymous page actually
>> cleared?
>>
>> The thing is, the mapping of an anonymous page is actually cleared only
>> when the page is _freed_, in "free_hot_cold_page()".
>>
>> Now, let's think about that. And in particular, let's think about how that
>> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>>
>> The way the anon_vma is freed is when the mapping is torn down, and we do
>> roughly:
>>
>>       tlb = tlb_gather_mmu(mm,..)
>>       ..
>>       unmap_vmas(&tlb, vma ..
>>       ..
>>       free_pgtables()
>>       ..
>>       tlb_finish_mmu(tlb, start, end);
>>
>> and we actually unmap all the pages in "unmap_vmas()", and then _after_
>> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
>> "free_pgtables()". Fine so far - the anon_vma stays around until after the
>> page has been happily unmapped.
>>
>> But "unmapped all the pages" is _not_ actually the same as "free'd all the
>> pages". The actual _freeing_ of the page happens generally in
>> tlb_finish_mmu(), because we can free the page only after we've flushed
>> any TLB entries.
>>
>> So what we have in that tlb_gather structure is a list of _pending_ pages
>> to be freed, while we already actually free'd the anon_vmas earlier!
>>
>> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
>> we use a per-cpu variable), but as far as I can tell it is _not_ an
>> RCU-safe region.
>>
>> So I think we might actually get a real RCU freeing event while this all
>> happens. So now the 'anon_vma' that 'page->mapping' points to has not just
>> been released back to the SLUB caches, the page itself might have been
>> released too.
>>
>> I dunno. Does the above sound at all sane? Or am I just raving?
>>
>> Something hacky like the above might fix it if I'm not just raving. I
>> really might be missing something here.
>
> Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable
> == RCU read lock assumption does hold.

Indeed.

>
> But even with your patch it doesn't close all holes, because while
> zap_pte_range() can remove the last mapcount of the page,
> tlb_remove_page() et al. don't need to drop the last use count of the
> page.
>
> Concurrent reclaim/gup/whatever could still have a count out on the page
> delaying the actual free beyond the tlb gather RCU section.

The anon_vma lock is only valid while the page is mapped.
If reclaim/gup/whatever wants to use the anon_vma, it should check
page_mapped() first. And the last put_page() doesn't touch the anon_vma
when freeing the page, so I think it's not a problem. Am I missing something?

>
> This also complicates the whole page_lock_anon_vma() thing, so it would
> be nice to be able to remove this race and clear page->mapping in
> page_remove_rmap().
>

BTW, I totally agree with you.
anon_vma has become very complicated: SLAB_DESTROY_BY_RCU, vma merging,
the question of when page->mapping is cleared, anon_vma_chain, and so on... :(


--
Kind regards,
Minchan Kim

2010-04-12 15:20:52

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/12/2010 10:40 AM, Peter Zijlstra wrote:

> So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> detailed beyond a (possible) race with page_add_anon_rmap() (which I
> guess would be reclaim trying to unmap the page and a fault re-instating
> it).
>
> This also complicates the whole page_lock_anon_vma() thing, so it would
> be nice to be able to remove this race and clear page->mapping in
> page_remove_rmap().

For anonymous pages, I don't see where the race comes from.

Both do_swap_page and the reclaim code hold the page lock
across the entire operation, so they are already excluding
each other.

Hugh, do you remember what the race between page_remove_rmap
and page_add_anon_rmap is/was all about?

I don't see a race in the current code...

2010-04-12 15:33:33

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Tue, 2010-04-13 at 00:17 +0900, Minchan Kim wrote:
> > Concurrent reclaim/gup/whatever could still have a count out on the page
> > delaying the actual free beyond the tlb gather RCU section.
>
> The anon_vma lock is only valid while the page is mapped.
> If reclaim/gup/whatever wants to use the anon_vma, it should check
> page_mapped() first. And the last put_page() doesn't touch the anon_vma
> when freeing the page, so I think it's not a problem. Am I missing something?

Hmm, I think you're right. The race I was thinking of makes the
page_lock_anon_vma() RCU section overlap with that of the mmu_gather,
which ensures the thing is long enough, or hits the !_mapcount case.

I'm not sure there are other page->mapping users that are interesting.

2010-04-12 15:49:00

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Rik van Riel wrote:
>
> Looking around the code some more, zap_pte_range()
> calls page_remove_rmap(), which leaves the
> page->mapping in place and has this comment:

See my earlier email about this exact issue. It's well-known that there
are stale page->mapping pointers. The "page_mapped()" check _should_ have
meant that in that case we never follow them, though.

> I wonder if we can clear page->mapping here, if
> list_is_singular(anon_vma->head). That way we
> will not leave stale pointers behind.

What does that help? What if list _isn't_ singular?

Linus

2010-04-12 15:53:44

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/12/2010 11:44 AM, Linus Torvalds wrote:
> On Sun, 11 Apr 2010, Rik van Riel wrote:
>>
>> Looking around the code some more, zap_pte_range()
>> calls page_remove_rmap(), which leaves the
>> page->mapping in place and has this comment:
>
> See my earlier email about this exact issue. It's well-known that there
> are stale page->mapping pointers. The "page_mapped()" check _should_ have
> meant that in that case we never follow them, though.

Good point. I wonder if we have some SMP reordering
issue then?

>> I wonder if we can clear page->mapping here, if
>> list_is_singular(anon_vma->head). That way we
>> will not leave stale pointers behind.
>
> What does that help? What if list _isn't_ singular?

Yeah, that was a bad idea. Looking at the same code for
11 days straight seems to have put some knots in my brain :)

2010-04-12 15:56:11

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Sun, 11 Apr 2010, Rik van Riel wrote:
>
> Another thing I just thought of.
>
> The anon_vma struct will not be reused for something completely
> different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
> is created with.

Rik, we _know_ it got re-used by something totally different. That's
clearly the problem. The page->mapping pointer does _not_ point to an
anon_vma any more. That's the problem here.

What we need to figure out is how we have a page on the LRU list that is
still marked as 'mapped' that has that stale mapping pointer.

I can easily see how the stale mapping pointer happens for a non-mapped
page. That part is trivial. Here's a simple case:

- vmscan does that whole "isolate LRU pages", and one of them is a (at
that time mapped) anonymous page. It's now not on any LRU lists at all.

- vmscan ends up waiting for pageout and/or writeback while holding that
list of pages.

- in the meantime, the process that had the page exits or unmaps,
unmapping the page and freeing the vma and the anon_vma.

- vmscan eventually gets to the page, and does that page_referenced()
dance. page->mapping points to something that is long long gone (as in
"IO access lifetimes", so we're talking something that has been freed
literally milliseconds ago, rather than any RCU delays)

So I can see the stale page->mapping pointer happening. That part is even
trivial. What I don't see is how the page would be still marked 'mapped'.
Everything that actually free's the vma/anon_vmas should also have
unmapped the page before that - even if it didn't _free_ the page.
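
The rule this scenario relies on, namely that page->mapping may only be followed while the page is still mapped, can be sketched in user space as follows. The types and the usable_anon_vma() helper are hypothetical names for this illustration; locking and memory ordering are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct anon_vma_model { int nr_vmas; };

struct page_model {
	int mapcount;
	struct anon_vma_model *mapping;	/* may be stale once mapcount is 0 */
};

/* Unmapping leaves page->mapping in place, as page_remove_rmap() does. */
static void page_unmap(struct page_model *page)
{
	assert(page->mapcount > 0);
	page->mapcount--;
}

/* Only trust page->mapping while the page is still mapped. */
static struct anon_vma_model *usable_anon_vma(const struct page_model *page)
{
	if (page->mapcount == 0)
		return NULL;
	return page->mapping;
}
```

The open question in the thread is the case this model cannot produce: a page whose map count is still nonzero while its mapping pointer is already stale.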

Linus

2010-04-12 16:01:50

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Mon, 2010-04-12 at 11:19 -0400, Rik van Riel wrote:
> On 04/12/2010 10:40 AM, Peter Zijlstra wrote:
>
> > So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> > detailed beyond a (possible) race with page_add_anon_rmap() (which I
> > guess would be reclaim trying to unmap the page and a fault re-instating
> > it).
> >
> > This also complicates the whole page_lock_anon_vma() thing, so it would
> > be nice to be able to remove this race and clear page->mapping in
> > page_remove_rmap().
>
> For anonymous pages, I don't see where the race comes from.
>
> Both do_swap_page and the reclaim code hold the page lock
> across the entire operation, so they are already excluding
> each other.
>
> Hugh, do you remember what the race between page_remove_rmap
> and page_add_anon_rmap is/was all about?
>
> I don't see a race in the current code...


Something like the below would be nice if possible.


---
mm/rmap.c | 44 +++++++++++++++++++++++++++++++-------------
1 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..241f75d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -286,7 +286,22 @@ void __init anon_vma_init(void)

 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky:
+ *
+ *   page_add_anon_rmap()
+ *     atomic_add_negative(page->_mapcount);
+ *     page->mapping = anon_vma;
+ *
+ *
+ *   page_remove_rmap()
+ *     atomic_add_negative();
+ *     page->mapping = NULL;
+ *
+ * So we have to first read page->mapping, and then verify
+ * _mapcount, and make sure we order them correctly.
+ *
+ * We take anon_vma->lock in between so that if we see the anon_vma
+ * with a mapcount we know it won't go away on us.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
@@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;

 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
-	if (!page_mapped(page))
-		goto out;

 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Order the reading of page->mapping and page->_mapcount against the
+	 * mb() implied by the atomic_add_negative() in page_remove_rmap().
+	 */
+	smp_rmb();
+	if (!page_mapped(page)) {
+		spin_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+		goto out;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
 	}
-	/*
-	 * It would be tidy to reset the PageAnon mapping here,
-	 * but that might overwrite a racing page_add_anon_rmap
-	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_hot_cold_page,
-	 * and remember that it's only reliable while mapped.
-	 * Leaving it set also helps swapoff to reinstate ptes
-	 * faster for those pages still in swapcache.
-	 */
+
+	page->mapping = NULL;
 }

/*

2010-04-12 16:06:39

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Mon, 12 Apr 2010, Borislav Petkov wrote:
> >
> > If the warnings do happen, they are not going to be printing out any
> > hugely informative data apart from the fact that the bad case happened at
> > all. But if they do trigger, I can try to improve on them - it's just not
> > worth trying to make them any more interesting if they never trigger.
>
> Haa, I think you're gonna want to improve them :)
>
> WARN_ONCE(1, "page->mapping does not exist in vma chain");
>
> triggered on the first resume, showing a rather messy 4 WARN_ONCEs. Had I
> more cores, there might have been more of them :) The WARN path may need
> locking if clean output is of interest (see below).

Goodie.

I can't trigger this on my machine (not that I tried very hard - but I did
do some swapping loads etc by limiting my memory to just 1GB etc). So I'm
pretty sure my verification code is "correct", and verifies things that
should be right.

And the fact that it triggers under the exact load that you use to then
trigger the bug is a damn good thing. That means that we are finally on
the right track, and we have something that correlates well with the
actual bug.

> So, anyway, if I can read this correctly, there is a page->mapping
> anon_vma which is _not_ in the anon_vmas chain of the vma
> (avc->same_vma).

Yes, and that is supposed to be a no-no. The page is clearly associated
with the vma in question (since we are unmapping it through that vma), but
the vma list of 'anon_vma's doesn't actually have the one that
'page->mapping' points to.

And that, in turn, means that we've lost sight of the 'page->mapping'
anon_vma, and THAT in turn means that it could well have been free'd as
being no longer referenced.

And if it was free'd, it could be re-allocated as something else (after
the RCU grace period), and that directly explains your oops.

> By the way, I completely understand when you say that your head hurts
> from looking at this :).

Well, I have to say that I'm happy I've spent the time on it, because this
way I got to learn all the new rules. It's just that I really wish I
wouldn't have _had_ to.

Anyway, I'll have to think way more about this to see if I can come up
with a debugging patch that shows more details about what actually caused
this to happen in the first place. But we definitely have a smoking gun.

Linus

2010-04-12 16:09:19

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/12/2010 12:01 PM, Peter Zijlstra wrote:

> @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> __dec_zone_page_state(page, NR_FILE_MAPPED);
> mem_cgroup_update_file_mapped(page, -1);
> }
> - /*
> - * It would be tidy to reset the PageAnon mapping here,
> - * but that might overwrite a racing page_add_anon_rmap
> - * which increments mapcount after us but sets mapping
> - * before us: so leave the reset to free_hot_cold_page,
> - * and remember that it's only reliable while mapped.
> - * Leaving it set also helps swapoff to reinstate ptes
> - * faster for those pages still in swapcache.
> - */
> +
> + page->mapping = NULL;
> }

That would be a bug for file pages :)

I could see how it could work for anonymous memory, though.

2010-04-12 16:32:21

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Mon, 12 Apr 2010, Linus Torvalds wrote:
>
> Yes, and that is supposed to be a no-no. The page is clearly associated
> with the vma in question (since we are unmapping it through that vma), but
> the vma list of 'anon_vma's doesn't actually have the one that
> 'page->mapping' points to.
>
> And that, in turn, means that we've lost sight of the 'page->mapping'
> anon_vma, and THAT in turn means that it could well have been free'd as
> being no longer referenced.
>
> And if it was free'd, it could be re-allocated as something else (after
> the RCU grace period), and that directly explains your oops.

I have a new theory. And this new theory is completely different from all
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the
wrong mapping.

Now, there is one case where we reset page->mapping _intentionally_,
namely in the COW-breaking case of having the last user
("page_move_anon_rmap"). And that looks fine, and happens under normal
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involved swapping.

So guys, here's my fairly simple theory on what happens:

- page gets allocated/mapped by process A. Let's call the anon_vma we
associate the page with 'A' to keep it easy to track.

- Process A forks, creating process B. The anon_vma in B is 'B', and has
a chain that looks like 'B' -> 'A'. Everything is fine.

- Swapping happens. The page (with mapping pointing to 'A') gets swapped
out (perhaps not to disk - it's enough to assume that it's just not
mapped any more, and lives entirely in the swap-cache)

- Process B pages it in, which goes like this:

do_swap_page ->
page = lookup_swap_cache(entry);
...
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);

And think about what happens here!

In particular, what happens is that this will now be the "first" mapping
of that page, so page_add_anon_rmap() will do

if (first)
__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for process
B!

So now page->mapping actually points to anon_vma 'B', not 'A' like it used
to.

What happens then? Trivial: process 'A' also pages it in (nothing happens,
it's not the first mapping), and then process 'B' execve's or exits or
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B
does not exist any more. This can go on forever. Forget about RCU grace
periods, forget about locking, forget anything like that. The bug is
simply that page->mapping points to an anon_vma that was correct at one
point, but was _not_ the one that was shared by all users of that possible
mapping.
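
The sequence above can be replayed in a small userspace model. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins, not kernel code, and the model ignores locking, RCU and the swap cache itself. It only tracks who sets the mapping and when.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; names and fields are illustrative. */
struct toy_anon_vma {
    bool live;                      /* false once its owner has exited */
};

struct toy_vma {
    struct toy_anon_vma *own;       /* plays the role of vma->anon_vma */
    struct toy_anon_vma *oldest;    /* deepest ancestor on the vma's chain */
};

struct toy_page {
    struct toy_anon_vma *mapping;   /* plays the role of page->mapping */
    int mapcount;
};

/* Only the first mapper sets the mapping, as described for page_add_anon_rmap(). */
static void toy_add_anon_rmap(struct toy_page *page, struct toy_vma *vma,
                              bool use_oldest)
{
    assert(vma->own != NULL);
    if (page->mapcount++ == 0)
        page->mapping = use_oldest ? vma->oldest : vma->own;
}

static void toy_remove_rmap(struct toy_page *page)
{
    page->mapcount--;
}

/*
 * Replays the sequence from the mail. Returns true if the page ends up
 * mapped while its mapping refers to an anon_vma that no longer exists.
 */
bool toy_replay_scenario(bool use_oldest)
{
    struct toy_anon_vma A = { true }, B = { true };
    struct toy_vma vma_a = { &A, &A };      /* chain: 'A' */
    struct toy_vma vma_b = { &B, &A };      /* chain after fork: 'B' -> 'A' */
    struct toy_page page = { NULL, 0 };

    toy_add_anon_rmap(&page, &vma_a, use_oldest); /* A faults the page in */
    toy_remove_rmap(&page);                       /* swapped out, unmapped */
    toy_add_anon_rmap(&page, &vma_b, use_oldest); /* B pages it in first */
    toy_add_anon_rmap(&page, &vma_a, use_oldest); /* A pages it in, not first */
    toy_remove_rmap(&page);                       /* B exits ... */
    B.live = false;                               /* ... anon_vma B goes away */

    return page.mapcount > 0 && !page.mapping->live;
}
```

In this model toy_replay_scenario(false) returns true (the page is still mapped by A but points at the dead 'B'), while toy_replay_scenario(true), which always records the oldest anon_vma, returns false.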

The patch below is my largely mindless try at fixing this. It's untested.
I'm not entirely sure that it actually works. But it makes some amount of
conceptual sense. No?

Linus

---
mm/rmap.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
static void __page_set_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
+
+ BUG_ON(!vma->anon_vma);
+
+ /*
+ * We must use the _oldest_ possible anon_vma for the page mapping!
+ *
+ * So take the last AVC chain entry in the vma, which is the deepest
+ * ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;

- BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
page->index = linear_page_index(vma, address);

2010-04-12 16:51:18

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Mon, 12 Apr 2010, Rik van Riel wrote:

> On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
>
> > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > __dec_zone_page_state(page, NR_FILE_MAPPED);
> > mem_cgroup_update_file_mapped(page, -1);
> > }
> > - /*
> > - * It would be tidy to reset the PageAnon mapping here,
> > - * but that might overwrite a racing page_add_anon_rmap
> > - * which increments mapcount after us but sets mapping
> > - * before us: so leave the reset to free_hot_cold_page,
> > - * and remember that it's only reliable while mapped.
> > - * Leaving it set also helps swapoff to reinstate ptes
> > - * faster for those pages still in swapcache.
> > - */
> > +
> > + page->mapping = NULL;
> > }
>
> That would be a bug for file pages :)
>
> I could see how it could work for anonymous memory, though.

I think it's scary for anonymous pages too. The _common_ case of
page_remove_rmap() is from unmap/exit, which holds no locks on the page
what-so-ever. So assuming the page could be reachable some other way (swap
cache etc), I think the above is pretty scary.

Also do note that the bug we've been chasing has _always_ had that test
for "page_mapped(page)". See my other email about why the unmapped case
isn't even interesting, because it's so easy to see how page->mapping can
be stale for unmapped pages.

It's the _mapped_ case that is interesting, not the unmapped one. So
setting page->mapping to NULL when unmapping is perhaps a nice consistency
issue ("never have stale pointers"), but it's missing the fact that it's
not really the case we care about.

Linus

2010-04-12 18:42:06

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Apr 2010, Rik van Riel wrote:
>
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> >
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > > __dec_zone_page_state(page, NR_FILE_MAPPED);
> > > mem_cgroup_update_file_mapped(page, -1);
> > > }
> > > - /*
> > > - * It would be tidy to reset the PageAnon mapping here,
> > > - * but that might overwrite a racing page_add_anon_rmap
> > > - * which increments mapcount after us but sets mapping
> > > - * before us: so leave the reset to free_hot_cold_page,
> > > - * and remember that it's only reliable while mapped.
> > > - * Leaving it set also helps swapoff to reinstate ptes
> > > - * faster for those pages still in swapcache.
> > > - */
> > > +
> > > + page->mapping = NULL;
> > > }
> >
> > That would be a bug for file pages :)
> >
> > I could see how it could work for anonymous memory, though.
>
> I think it's scary for anonymous pages too. The _common_ case of
> page_remove_rmap() is from unmap/exit, which holds no locks on the page
> what-so-ever. So assuming the page could be reachable some other way (swap
> cache etc), I think the above is pretty scary.

Fully agreed.

> Also do note that the bug we've been chasing has _always_ had that test
> for "page_mapped(page)". See my other email about why the unmapped case
> isn't even interesting, because it's so easy to see how page->mapping can
> be stale for unmapped pages.
>
> It's the _mapped_ case that is interesting, not the unmapped one. So
> setting page->mapping to NULL when unmapping is perhaps a nice consistency
> issue ("never have stale pointers"), but it's missing the fact that it's
> not really the case we care about.

Yes, I don't think this is the problem that has been plaguing us for
over a week now.

But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

- is missing the smp_read_barrier_depends() after the ACCESS_ONCE
- isn't properly ordered wrt page->mapping and page->_mapcount.
- doesn't appear to guarantee much at all when returning an anon_vma
since it locks after checking page->_mapcount so:
* it can return !NULL for an unmapped page (your patch cures that)
* it can return !NULL but for a different anon_vma
(my earlier patch checking page_rmapping() after the spin_lock
cures that, but doesn't cure the above):

[ highly unlikely but not impossible race ]

 CPU0                         CPU1                   CPU2               CPU3
 page_referenced(page_A)      try_to_unmap(page_A)   unrelated fault    fault page_A

 rcu_read_lock()
 anon_vma = page->mapping;
 if (!(anon_vma & ANON_BIT))
   goto out
 if (!page_mapped(page))
   goto out

                              page_remove_rmap()
                              ...
                              anon_vma_free() -------\
                                                     v
                                                     anon_vma_alloc()

                                                                        anon_vma_alloc()
                                                                        page_add_anon_rmap()
                                                     ^
 spin_lock(anon_vma->lock) --------------------------/


Now I don't think the above can happen due to how our slab
allocators work, they won't share a slab page between cpus like
that, but once we make the whole thing preemptible this race
becomes a lot more likely.


So a page_lock_anon_vma(), that looks a little like the below should
(I think) cure all our problems with it.


struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;

rcu_read_lock();
again:
anon_mapping = (unsigned long)rcu_dereference(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
goto out;
anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

/*
* The RCU read lock ensures we can safely dereference anon_vma
* since it ensures the backing slab won't go away. It will however
* not guarantee it's the right object.
*
* First take the anon_vma->lock, this will, per anon_vma_unlink()
* avoid this anon_vma from being freed if it is a valid object.
*/
spin_lock(&anon_vma->lock);

/*
* Secondly, we have to re-read page->mapping, so ensure it
* has not changed, rely on spin_lock() being at least a
* compiler barrier to force the re-read.
*/
if (unlikely(page_rmapping(page) != anon_vma)) {
spin_unlock(&anon_vma->lock);
goto again;
}

/*
* Ensure we read page->mapping before page->_mapcount,
* orders against atomic_add_negative() in page_remove_rmap().
*/
smp_rmb();

/*
* Finally check that the page is still mapped,
* if not, this can't possibly be the right anon_vma.
*/
if (!page_mapped(page))
goto unlock;

return anon_vma;

unlock:
spin_unlock(&anon_vma->lock);
out:
rcu_read_unlock();
return NULL;
}


With this, I think we can actually drop the RCU read lock when returning
since if this is indeed a valid anon_vma for this page, then the page is
still mapped, and hence the anon_vma was not deleted, and a possible
future delete will be held back by us holding the anon_vma->lock.

Now I could be totally wrong and have confused myself thoroughly, but
how does this look?
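
The first step of the function above decodes a tagged pointer: page->mapping carries flag bits in its low bits, and only a value whose flag bits equal PAGE_MAPPING_ANON is treated as an anon_vma. A minimal userspace sketch of that encoding follows. The TOY_* constants (1 for the anon bit, 3 for the flag mask) are assumptions made for illustration and should be checked against include/linux/mm.h; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel's own are in include/linux/mm.h. */
#define TOY_MAPPING_ANON   ((uintptr_t)1)
#define TOY_MAPPING_FLAGS  ((uintptr_t)3)

struct toy_anon_vma {
    int dummy;                      /* an int member gives >= 4-byte alignment */
};

/* Store an anon_vma pointer with the "anon" tag in the low bits. */
uintptr_t toy_encode_anon(struct toy_anon_vma *av)
{
    /* The low two bits must be free, which the alignment guarantees. */
    assert(((uintptr_t)av & TOY_MAPPING_FLAGS) == 0);
    return (uintptr_t)av + TOY_MAPPING_ANON;
}

/* Return the anon_vma if the value is tagged as anonymous, else NULL. */
struct toy_anon_vma *toy_decode_anon(uintptr_t mapping)
{
    if ((mapping & TOY_MAPPING_FLAGS) != TOY_MAPPING_ANON)
        return NULL;
    return (struct toy_anon_vma *)(mapping - TOY_MAPPING_ANON);
}
```

An untagged pointer (as a file mapping would be) and a NULL mapping both decode to NULL here, which corresponds to the "goto out" path in the function above.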

2010-04-12 18:42:16

by Rik van Riel

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On 04/12/2010 12:26 PM, Linus Torvalds wrote:

> But there is a _much_ more subtle case that involved swapping.
>
> So guys, here's my fairly simple theory on what happens:

That bug looks entirely possible. Given that Borislav
has heavy swapping going on, it is quite possible that
this is the bug he has been triggering.

> The patch below is my largely mindless try at fixing this. It's untested.
> I'm not entirely sure that it actually works. But it makes some amount of
> conceptual sense. No?

The patch would help avoid the bug you described.

It does have the drawback of moving all the pages of
child processes back into the anon_vma of the parent
process after swapin, even if they are privately owned
pages by the child process.

I am guessing it may need a check to see whether the
page and swap slot are exclusively owned by the current
process.

Page or swap slot shared? => oldest anon_vma
Page and swap slot exclusive? => newest anon_vma

I suspect the easiest way to achieve this would be to
pass a flag in from do_swap_page, where we already
check this, a few lines above calling page_add_anon_rmap:

if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
flags &= ~FAULT_FLAG_WRITE;
}
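
The shared/exclusive rule above can be written down as a tiny selection function. This is a sketch only: toy_pick_anon_vma() and its array representation of the chain are hypothetical, and how the "exclusive" flag would actually be derived and passed down from do_swap_page is left open, as in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_anon_vma {
    int id;
};

/*
 * chain[0] is the newest anon_vma (the vma's own), chain[n - 1] the
 * oldest ancestor. Exclusive pages may use the newest one; anything
 * that might be shared must use the oldest.
 */
struct toy_anon_vma *toy_pick_anon_vma(struct toy_anon_vma *const chain[],
                                       size_t n, bool exclusive)
{
    assert(n > 0);
    return exclusive ? chain[0] : chain[n - 1];
}
```

For a two-entry chain 'B' -> 'A', an exclusive page gets 'B' and a shared page or swap slot gets 'A'.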


2010-04-12 19:08:36

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Rik van Riel <[email protected]>
Date: Mon, Apr 12, 2010 at 02:40:22PM -0400

> On 04/12/2010 12:26 PM, Linus Torvalds wrote:
>
> >But there is a _much_ more subtle case that involved swapping.
> >
> >So guys, here's my fairly simple theory on what happens:
>
> That bug looks entirely possible. Given that Borislav
> has heavy swapping going on, it is quite possible that
> this is the bug he has been triggering.

Yeah, about that. I dunno whether you guys saw that but the machine has
8GB of RAM and shouldn't be swapping, AFAIK. The largest mem usage I
saw was 5GB used, most of which was pagecache. So I was kinda doubtful when
Linus came up with the swapping theory earlier. I'll pay closer attention
to SwapCached in /proc/meminfo to see whether we do any swapping.
It could be that there is a small amount which is swapped out for
whatever reason... Maybe that's the bug...

But I'll give the patch a run in an hour or so anyway.

--
Regards/Gruss,
Boris.

2010-04-12 19:21:54

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Mon, 12 Apr 2010, Borislav Petkov wrote:
>
> But I'll give the patch a run in an hour or so anyway.

Thanks. I suspect you will find that even if there is no actual disk IO
swapping going on during any of the normal loads, the shrink_all_memory()
thing in your hibernation event will cause swap to happen. Or at least
swap-cache entries to be done.

Oh, and I've decided that my rcu_read_lock() patch for the tlb_gather()
thing for unmapping is bogus. Exactly because the critical issue isn't
when the page is free'd (and page->mapping is cleared), but when the page
is unmapped (and page_mapped() clears).

And that is done correctly even with the delayed frees in tlb_gather. So
adding the rcu_read_lock/rcu_read_unlock around it all doesn't actually
matter or help.

So the patches that I think fix real bugs are

- the anon_vma_prepare() fix to only share anon_vma's if they are
singletons.

- the vma_adjust() fix to copy the right anon_vma chains

- the anon_vma_clone() fix to traverse the avc's in reverse order, so
that the resulting cloned chain is the same as the original chain

You got this patch as part of the "verify_vma()" patch, but the only
part of that patch that matters is the one-liner that changes a
"list_for_each_entry" to use the "_reverse()" version.

- and that last patch to pick the right anon_vma when mapping a page
(which could still be improved: the "insert new page" case does _not_
have to take the oldest anon_vma, and Rik is correct that if we have an
exclusive swap cache entry we could also take the top one)

I think I'll re-post all four patches with real commit messages, to get
ack's for them. I'd like to finally get the much delayed -rc4 out the
door.

Oh, and if that "pick the right anon_vma" patch doesn't fix it, I suspect
we'll have to revert the whole anon_vma changes for 2.6.34. It's getting
pretty late in the -rc series to fix this bug. I'm _hoping_ that I really
nailed it this time, and that we're ok, but if Borislav reports it still
happening, and people not having any other ideas, I think I'll just have
to do an -rc4 with it all reverted, and then we can try again for 35 if
somebody figures out the bug.

Hmm? I'd hate to revert it all now because of the hours I've put in
looking at the code (to the point that I feel I understand it), but at the
same time, if it was somebody else who was chasing this bug and not being
able to fix it, I'd tell them "revert it, it's too late". Amount of effort
spent doesn't matter if the bug still happens ;^(

Linus

2010-04-12 19:30:27

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote:

Hmm, if interleaved like so

> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> struct anon_vma *anon_vma;
> unsigned long anon_mapping;

page_remove_rmap()
anon_vma_unlink()
anon_vma_free()

So that the below will all observe the old page->mapping:

> rcu_read_lock();
> again:
> anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
> anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
>
> /*
> * The RCU read lock ensures we can safely dereference anon_vma
> * since it ensures the backing slab won't go away. It will however
> * not guarantee it's the right object.
> *
> * First take the anon_vma->lock, this will, per anon_vma_unlink()
> * avoid this anon_vma from being freed if it is a valid object.
> */
> spin_lock(&anon_vma->lock);
>
> /*
> * Secondly, we have to re-read page->mapping, so ensure it
> * has not changed, rely on spin_lock() being at least a
> * compiler barrier to force the re-read.
> */
> if (unlikely(page_rmapping(page) != anon_vma)) {
> spin_unlock(&anon_vma->lock);
> goto again;
> }

page_add_anon_rmap(), so that the page_mapped() test below would be
positive,

> /*
> * Ensure we read page->mapping before page->_mapcount,
> * orders against atomic_add_negative() in page_remove_rmap().
> */
> smp_rmb();
>
> /*
> * Finally check that the page is still mapped,
> * if not, this can't possibly be the right anon_vma.
> */
> if (!page_mapped(page))
> goto unlock;

We could here return a non-valid and already freed anon_vma.

> return anon_vma;
>
> unlock:
> spin_unlock(&anon_vma->lock);
> out:
> rcu_read_unlock();
> return NULL;
> }
>
>

2010-04-12 19:49:55

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Mon, 2010-04-12 at 21:30 +0200, Peter Zijlstra wrote:
>
> We could here return a non-valid and already freed anon_vma.
>
OK, so none of the users of page_lock_anon_vma() with exception of the
memory-failure.c one could really care. And all of them seem to be safe
enough wrt dealing with a dead one.

So unless people care, I'm going to not spend more time on trying to
make page_lock_anon_vma() behave.

Instead I'll try and see wth it is that migrate.c and rmap_walk_anon are
doing.

2010-04-12 20:28:03

by Linus Torvalds

Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains


From: Linus Torvalds <[email protected]>
Date: Sat, 10 Apr 2010 15:22:30 -0700
Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

When we move the boundaries between two vma's due to things like
mprotect, we need to make sure that the anon_vma of the pages that got
moved from one vma to another gets properly copied around. And that was
not always the case, in this rather hard-to-follow code sequence.

Clarify the code, and fix it so that it copies the anon_vma from the
right source.

Signed-off-by: Linus Torvalds <[email protected]>
---
mm/mmap.c | 24 ++++++++----------------
1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct address_space *mapping = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
- struct anon_vma *anon_vma = NULL;
long adjust_next = 0;
int remove_next = 0;

if (next && !insert) {
+ struct vm_area_struct *exporter = NULL;
+
if (end >= next->vm_end) {
/*
* vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end > next->vm_start) {
/*
@@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 5 shifting the boundary up.
*/
adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end < vma->vm_end) {
/*
@@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 4 shifting the boundary down.
*/
adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
- anon_vma = next->anon_vma;
+ exporter = vma;
importer = next;
}
- }

- /*
- * When changing only vma->vm_end, we don't really need anon_vma lock.
- */
- if (vma->anon_vma && (insert || importer || start != vma->vm_start))
- anon_vma = vma->anon_vma;
- if (anon_vma) {
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
* shrinking vma had, to cover any anon pages imported.
*/
- if (importer && !importer->anon_vma) {
- /* Block reverse map lookups until things are set up. */
- if (anon_vma_clone(importer, vma)) {
+ if (exporter && exporter->anon_vma && !importer->anon_vma) {
+ if (anon_vma_clone(importer, exporter))
return -ENOMEM;
- }
- importer->anon_vma = anon_vma;
+ importer->anon_vma = exporter->anon_vma;
}
}

--
1.7.1.rc1.dirty

2010-04-12 20:28:23

by Linus Torvalds

Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()


From: Linus Torvalds <[email protected]>
Date: Sat, 10 Apr 2010 10:36:19 -0700
Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

This changes the anon_vma reuse case to require that we only reuse
simple anon_vma's - ie the case when the vma only has a single anon_vma
associated with it.

This means that a reuse of an anon_vma from an adjacent vma will always
guarantee that both vma's are associated not only with the same
anon_vma, they will also have the same anon_vma chain (of just a single
entry in this case).

And since anon_vma re-use was the only case where the same anon_vma
might be associated with different chains of anon_vma's, we now have the
case that every vma that shares the same anon_vma will always also have the
same chain. That makes it much easier to think about merging vma's that
share the same anon_vma's: you can always just drop the other anon_vma
chain in anon_vma_merge() since you know that they are always identical.

This also splits up the function to validate the anon_vma re-use, and
adds a lot of commentary about the possible races.

Signed-off-by: Linus Torvalds <[email protected]>
---

Ok, so I'm sending out this series of four patches, in the (perhaps
futile) hope that they will finally fix the problem that Borislav has been
so great at reporting.

I'd like to gather ack's, nak's and perhaps changelog improvement
suggestions while doing this.

mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
}

/*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ return a->vm_end == b->vm_start &&
+ mpol_equal(vma_policy(a), vma_policy(b)) &&
+ a->vm_file == b->vm_file &&
+ !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+ b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mmap_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mmap_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+ if (anon_vma_compatible(a, b)) {
+ struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+ if (anon_vma && list_is_singular(&old->anon_vma_chain))
+ return anon_vma;
+ }
+ return NULL;
+}
+
+/*
* find_mergeable_anon_vma is used by anon_vma_prepare, to check
* neighbouring vmas for a suitable anon_vma, before it goes off
* to allocate a new anon_vma. It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
*/
struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
+ struct anon_vma *anon_vma;
struct vm_area_struct *near;
- unsigned long vm_flags;

near = vma->vm_next;
if (!near)
goto try_prev;

- /*
- * Since only mprotect tries to remerge vmas, match flags
- * which might be mprotected into each other later on.
- * Neither mlock nor madvise tries to remerge at present,
- * so leave their flags as obstructing a merge.
- */
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && vma->vm_end == near->vm_start &&
- mpol_equal(vma_policy(vma), vma_policy(near)) &&
- can_vma_merge_before(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff +
- ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, vma, near);
+ if (anon_vma)
+ return anon_vma;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
if (!near)
goto none;

- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
- if (near->anon_vma && near->vm_end == vma->vm_start &&
- mpol_equal(vma_policy(near), vma_policy(vma)) &&
- can_vma_merge_after(near, vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ anon_vma = reusable_anon_vma(near, near, vma);
+ if (anon_vma)
+ return anon_vma;
none:
/*
* There's no absolute need to look only at touching neighbours:
--
1.7.1.rc1.dirty
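
The flags test in anon_vma_compatible() above is compact enough to be worth spelling out: XOR the two flag words to find the bits that differ, then mask off the protection bits that mprotect may change. Anything left over makes the pair incompatible. A minimal sketch, assuming illustrative TOY_VM_* bit values (the kernel's real VM_* constants live in include/linux/mm.h) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, not the kernel's definitions. */
#define TOY_VM_READ    0x1UL
#define TOY_VM_WRITE   0x2UL
#define TOY_VM_EXEC    0x4UL
#define TOY_VM_SHARED  0x8UL

/*
 * Two flag words are compatible if they differ only in bits that
 * mprotect may change (read/write/exec).
 */
bool toy_flags_compatible(unsigned long a, unsigned long b)
{
    unsigned long differing = a ^ b;
    unsigned long mprotect_bits = TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC;

    return (differing & ~mprotect_bits) == 0;
}
```

A read-only vma and a read-write vma pass this check; a private vma and a shared one do not, because the differing bit lies outside the mprotect mask.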

2010-04-12 20:28:41

by Linus Torvalds

Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma


From: Linus Torvalds <[email protected]>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.

Signed-off-by: Linus Torvalds <[email protected]>
---
mm/rmap.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
static void __page_set_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
+
+ BUG_ON(!vma->anon_vma);
+
+ /*
+ * We must use the _oldest_ possible anon_vma for the page mapping!
+ *
+ * So take the last AVC chain entry in the vma, which is the deepest
+ * ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;

- BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
page->index = linear_page_index(vma, address);
--
1.7.1.rc1.dirty

2010-04-12 20:28:36

by Linus Torvalds

Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order


From: Linus Torvalds <[email protected]>
Date: Sun, 11 Apr 2010 17:15:03 -0700
Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

We want to walk the chain in reverse order when cloning it, so that the
order of the result chain will be the same as the order in the source
chain. When we add entries to the chain, they go at the head of the
chain, so we want to add the source head last.

Signed-off-by: Linus Torvalds <[email protected]>
---
mm/rmap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;

- list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+ list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;
--
1.7.1.rc1.dirty
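
Why the walk direction matters can be seen with a toy chain in which every new entry is added at the head, as the changelog above describes. The sketch below is a userspace illustration only: it models the chain as a small array rather than a kernel list_head, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

/* A chain modelled as an array; index 0 is the list head. */
struct toy_chain {
    int id[TOY_MAX];
    size_t len;
};

/* New entries go at the head, shifting the existing ones down. */
static void toy_add_head(struct toy_chain *c, int id)
{
    assert(c->len < TOY_MAX);
    memmove(&c->id[1], &c->id[0], c->len * sizeof c->id[0]);
    c->id[0] = id;
    c->len++;
}

/* Clone src into dst, walking src forwards or in reverse. */
void toy_clone(struct toy_chain *dst, const struct toy_chain *src, int reverse)
{
    dst->len = 0;
    for (size_t i = 0; i < src->len; i++) {
        size_t pos = reverse ? src->len - 1 - i : i;

        toy_add_head(dst, src->id[pos]);
    }
}
```

Cloning the chain 1, 2, 3 with a forward walk yields 3, 2, 1, while the reverse walk yields 1, 2, 3, so only the reverse walk keeps the source head at the head of the copy.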

2010-04-12 20:56:00

by Rik van Riel

Subject: Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

On 04/12/2010 04:22 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<[email protected]>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

> Signed-off-by: Linus Torvalds<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2010-04-12 20:56:26

by Rik van Riel

Subject: Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<[email protected]>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
>
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around. And that was
> not always the case, in this rather hard-to-follow code sequence.
>
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
>
> Signed-off-by: Linus Torvalds<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2010-04-12 20:58:24

by Rik van Riel

Subject: Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<[email protected]>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
>
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain. When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
>
> Signed-off-by: Linus Torvalds<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2010-04-12 21:04:30

by Rik van Riel

Subject: Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<[email protected]>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
>
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
>
> But that's a future optimization. Make it _work_ reliably first.

Agreed. I'll send an incremental for that later, you
can judge whether or not it's something you'll want to
merge before or after 2.6.34

> Signed-off-by: Linus Torvalds<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2010-04-12 21:50:34

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Mon, Apr 12, 2010 at 09:26:57AM -0700

> I have a new theory. And this new theory is completely different from all
> the other things we've been looking at.

Yeah, because all starts with "I have a new theory..." :o)

> The patch below is my largely mindless try at fixing this. It's untested.
> I'm not entirely sure that it actually works. But it makes some amount of
> conceptual sense. No?

Linus, are you trying to give me a heart-attack? This sh*t just survived
20(!) hibernation runs without a problem (well, there is this nagging
/sysfs lockdep warning) but apart from that, it survived! I even did my
all time best when hitting on it. Normally, it used to crap up on the
6th cycle at the latest. Now we're rock solid. And yes, there was something
like ~64MB in the swap cache.

Also, I have your verification stuff in addition to the 4 patches you
sent before. Not a single WARN_ONCE got triggered. So I have a gut
feeling that it is fixed but you never know with these beasts.

As before, I'll rebuild and reapply everything in the morning and retest
just in case. And I guess I'll have to test all following -rc's so that
we can be absolutely sure.

So cheers!

--
Regards/Gruss,
Boris.

2010-04-12 22:16:31

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



On Mon, 12 Apr 2010, Borislav Petkov wrote:
>
> > I have a new theory. And this new theory is completely different from all
> > the other things we've been looking at.
>
> Yeah, because all starts with "I have a new theory..." :o)

Hey, all my other theories made sense too.. They just didn't work.

But as Edison said: I didn't fail, I just found three other ways to not
fix your bug.

> > The patch below is my largely mindless try at fixing this. It's untested.
> > I'm not entirely sure that it actually works. But it makes some amount of
> > conceptual sense. No?
>
> Linus, are you trying to give me a heart-attack? This sh*t just survived
> 20(!) hibernation runs without a problem (well, there is this nagging
> /sysfs lockdep warning) but apart from that, it survived! I even did my
> all time best when hitting on it. Normally, it used to crap up on the
> 6th cycle at the latest. Now we're rock solid. And yes, there was something
> like ~64MB in the swap cache.
>
> Also, I have your verification stuff in addition to the 4 patches you
> sent before. Not a single WARN_ONCE got triggered. So I have a gut
> feeling that it is fixed but you never know with these beasts.

Ok. That does sound very positive. Of course, last time you sounded
positive, I had an email from you half an hour later that said "oh no, it
oopsed again". So I'll take it with a bit of salt, but on the whole I'll
be optimistic about it.

Linus

2010-04-12 22:23:08

by Linus Torvalds

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA



Oh, btw, I like your email gateway. Only noticed now:

mail.skyhub.de (SuperMail on ZX Spectrum 128k)

that's a tough little machine.

Linus

2010-04-12 22:29:52

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Mon, Apr 12, 2010 at 03:18:20PM -0700

> Oh, btw, I like your email gateway. Only noticed now:
>
> mail.skyhub.de (SuperMail on ZX Spectrum 128k)
>
> that's a tough little machine.

Yeah, and it can handle all that mail traffic just fine :)

--
Regards/Gruss,
Boris.

2010-04-12 23:55:00

by Johannes Weiner

Subject: Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

Hi Linus,

On Mon, Apr 12, 2010 at 01:22:33PM -0700, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
>
> This changes the anon_vma reuse case to require that we only reuse
> simple anon_vma's - ie the case when the vma only has a single anon_vma
> associated with it.
>
> This means that a reuse of an anon_vma from an adjacent vma will always
> guarantee that both vma's are associated not only with the same
> anon_vma, they will also have the same anon_vma chain (of just a single
> entry in this case).
>
> And since anon_vma re-use was the only case where the same anon_vma
> might be associated with different chains of anon_vma's, we now have the
> case that every vma that shares the same vma will always also have the

^^^ That should be anon_vma?

> same chain. That makes it much easier to think about merging vma's that
> share the same anon_vma's: you can always just drop the other anon_vma
> chain in anon_vma_merge() since you know that they are always identical.

I like to think of 'incomplete' and 'complete' versions of the same
chain and that this new rule of yours simplifies things by limiting
reuse to the cases where the incomplete and the complete version
end up identical. I can live with your wording, though :)

> This also splits up the function to validate the anon_vma re-use, and
> adds a lot of commentary about the possible races.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

That said, I still don't like that the vma comparisons differ depending
on whether we reuse an anon_vma or merge vmas. In my happy-place, the
same vma comparison function is the predicate for both cases, so I actually
liked that aspect of the old code, but I also see that code reuse is a
PITA in that file... Ah well, that can still be cleaned up later.

2010-04-12 23:59:26

by Johannes Weiner

Subject: Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

On Mon, Apr 12, 2010 at 01:23:04PM -0700, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
>
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around. And that was
> not always the case, in this rather hard-to-follow code sequence.
>
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2010-04-13 00:18:41

by Johannes Weiner

Subject: Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

On Mon, Apr 12, 2010 at 01:23:24PM -0700, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
>
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain. When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2010-04-13 00:41:34

by Johannes Weiner

Subject: Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

On Mon, Apr 12, 2010 at 01:23:50PM -0700, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
>
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
>
> But that's a future optimization. Make it _work_ reliably first.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

Would you mind pasting that nice description of the error case from your
other email into that changelog? I skimmed over the description but when
I read this patch several hours later, I had to go back to that previous
email to fully make sense of it.

> ---
> mm/rmap.c | 15 +++++++++++++--
> 1 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ee97d38..4bad326 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
> static void __page_set_anon_rmap(struct page *page,
> struct vm_area_struct *vma, unsigned long address)
> {
> - struct anon_vma *anon_vma = vma->anon_vma;
> + struct anon_vma_chain *avc;
> + struct anon_vma *anon_vma;
> +
> + BUG_ON(!vma->anon_vma);
> +
> + /*
> + * We must use the _oldest_ possible anon_vma for the page mapping!

I think the key here is not only that it's the oldest (past) but also the one with
the longest extent (future), so that it's bound to stay until the last possible
mapping for this page vanishes.

Maybe it's just me, but I doubt the comment as it is would help me understand
that code if I didn't already.

> + *
> + * So take the last AVC chain entry in the vma, which is the deepest
> + * ancestor, and use the anon_vma from that.
> + */
> + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> + anon_vma = avc->anon_vma;
>
> - BUG_ON(!anon_vma);
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> page->mapping = (struct address_space *) anon_vma;
> page->index = linear_page_index(vma, address);

Hannes

2010-04-13 01:12:39

by Linus Torvalds

Subject: Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma



On Tue, 13 Apr 2010, Johannes Weiner wrote:
>
> Would you mind pasting that nice description of the error case from your
> other email into that changelog? I skimmed over the description but when
> I read this patch several hours later, I had to go back to that previous
> email to fully make sense of it.

It now looks like this..

Linus
---
From: Linus Torvalds <[email protected]>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

Here's the scenario:

- page gets allocated/mapped by process A. Let's call the anon_vma we
associate the page with 'A' to keep it easy to track.

- Process A forks, creating process B. The anon_vma in B is 'B', and has
a chain that looks like 'B' -> 'A'. Everything is fine.

- Swapping happens. The page (with mapping pointing to 'A') gets swapped
out (perhaps not to disk - it's enough to assume that it's just not
mapped any more, and lives entirely in the swap-cache)

- Process B pages it in, which goes like this:

do_swap_page ->
page = lookup_swap_cache(entry);
...
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);

And think about what happens here!

In particular, what happens is that this will now be the "first"
mapping of that page, so page_add_anon_rmap() used to do

if (first)
__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for
process B!

What happens then? Trivial: process 'A' also pages it in (nothing
happens, it's not the first mapping), and then process 'B' execve's
or exits or unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but
anon_vma B does not exist any more. This can go on forever. Forget
about RCU grace periods, forget about locking, forget anything like
that. The bug is simply that page->mapping points to an anon_vma
that was correct at one point, but was _not_ the one that was shared
by all users of that possible mapping.

Changing it to always use the deepest anon_vma in the anonvma chain gets
us to the safest model.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.

Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Tested-by: Borislav Petkov <[email protected]> [ "What do you know, I think you fixed it!" ]
Signed-off-by: Linus Torvalds <[email protected]>
---
mm/rmap.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
static void __page_set_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
+
+ BUG_ON(!vma->anon_vma);
+
+ /*
+ * We must use the _oldest_ possible anon_vma for the page mapping!
+ *
+ * So take the last AVC chain entry in the vma, which is the deepest
+ * ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;

- BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
page->index = linear_page_index(vma, address);
--
1.7.1.rc1.dirty
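
The scenario in this changelog can be replayed with a toy userspace model. This is only a sketch under simplifying assumptions: the `toy_*` types, the `alive` flag and both `set_rmap_*()` helpers are invented for illustration, and the anon_vma chain is reduced to a small array in which the last entry is the oldest ancestor. Nothing here is kernel API.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel objects; none of this is kernel API. */
struct toy_anon_vma {
	char name;
	int alive;	/* cleared when the owning process goes away */
};

#define TOY_CHAIN_MAX 4

struct toy_vma {
	/* chain[0] is the vma's own anon_vma, chain[len - 1] the oldest one */
	struct toy_anon_vma *chain[TOY_CHAIN_MAX];
	size_t len;
};

struct toy_page {
	struct toy_anon_vma *mapping;
};

/* Old behaviour: use the anon_vma of whichever vma maps the page first. */
static void set_rmap_first_mapper(struct toy_page *page, struct toy_vma *vma)
{
	page->mapping = vma->chain[0];
}

/* Patched behaviour: use the last chain entry, the deepest ancestor. */
static void set_rmap_oldest(struct toy_page *page, struct toy_vma *vma)
{
	page->mapping = vma->chain[vma->len - 1];
}

/*
 * Replay the scenario: A forks B, the page ends up unmapped in the swap
 * cache, B pages it in first, A pages it in second, then B exits.
 * Returns 1 if page->mapping still points at a live anon_vma, 0 otherwise.
 */
static int mapping_survives_child_exit(int use_oldest)
{
	struct toy_anon_vma a = { 'A', 1 }, b = { 'B', 1 };
	struct toy_vma vma_a = { { &a }, 1 };
	struct toy_vma vma_b = { { &b, &a }, 2 };	/* 'B' -> 'A' */
	struct toy_page page = { NULL };

	/* B is the "first" mapper after swap-in, so it sets page->mapping. */
	if (use_oldest)
		set_rmap_oldest(&page, &vma_b);
	else
		set_rmap_first_mapper(&page, &vma_b);

	/* A maps the page too; it is not the first mapping, so nothing changes. */
	(void)vma_a;

	/* B exits: its anon_vma goes away while A still maps the page. */
	b.alive = 0;

	return page.mapping->alive;
}
```

In this model `mapping_survives_child_exit(0)` reports a dead mapping, because the page was tagged with 'B', while `mapping_survives_child_exit(1)` reports a live one, because 'A' is shared by every process that can still map the page.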

2010-04-13 04:05:07

by Minchan Kim

Subject: Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

On Tue, Apr 13, 2010 at 5:22 AM, Linus Torvalds
<[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
>
> This changes the anon_vma reuse case to require that we only reuse
> simple anon_vma's - ie the case when the vma only has a single anon_vma
> associated with it.
>
> This means that a reuse of an anon_vma from an adjacent vma will always
> guarantee that both vma's are associated not only with the same
> anon_vma, they will also have the same anon_vma chain (of just a single
> entry in this case).
>
> And since anon_vma re-use was the only case where the same anon_vma
> might be associated with different chains of anon_vma's, we now have the
> case that every vma that shares the same vma will always also have the

same vma => same anon_vma.

> same chain.  That makes it much easier to think about merging vma's that
> share the same anon_vma's: you can always just drop the other anon_vma
> chain in anon_vma_merge() since you know that they are always identical.
>
> This also splits up the function to validate the anon_vma re-use, and
> adds a lot of commentary about the possible races.
>
> Signed-off-by: Linus Torvalds <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>



--
Kind regards,
Minchan Kim

2010-04-13 04:15:22

by Minchan Kim

Subject: Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
<[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
>
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around.  And that was
> not always the case, in this rather hard-to-follow code sequence.
>
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
>
> Signed-off-by: Linus Torvalds <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-04-13 04:16:22

by Minchan Kim

Subject: Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
<[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
>
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain.  When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
>
> Signed-off-by: Linus Torvalds <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-04-13 04:24:01

by Minchan Kim

Subject: Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Tue, 13 Apr 2010, Johannes Weiner wrote:
>>
>> Would you mind pasting that nice description of the error case from your
>> other email into that changelog?  I skimmed over the description but when
>> I read this patch several hours later, I had to go back to that previous
>> email to fully make sense of it.
>
> It now looks like this..
>
>                Linus
> ---
> From: Linus Torvalds <[email protected]>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
>
> Here's the scenario:
>
>  - page gets allocated/mapped by process A. Let's call the anon_vma we
>   associate the page with 'A' to keep it easy to track.
>
>  - Process A forks, creating process B. The anon_vma in B is 'B', and has
>   a chain that looks like 'B' -> 'A'. Everything is fine.
>
>  - Swapping happens. The page (with mapping pointing to 'A') gets swapped
>   out (perhaps not to disk - it's enough to assume that it's just not
>   mapped any more, and lives entirely in the swap-cache)
>
>  - Process B pages it in, which goes like this:
>
>        do_swap_page ->
>          page = lookup_swap_cache(entry);
>         ...
>          set_pte_at(mm, address, page_table, pte);
>          page_add_anon_rmap(page, vma, address);
>
>   And think about what happens here!
>
>   In particular, what happens is that this will now be the "first"
>   mapping of that page, so page_add_anon_rmap() used to do
>
>        if (first)
>                __page_set_anon_rmap(page, vma, address);
>
>   and notice what anon_vma it will use? It will use the anon_vma for
>   process B!
>
>   What happens then? Trivial: process 'A' also pages it in (nothing
>   happens, it's not the first mapping), and then process 'B' execve's
>   or exits or unmaps, making anon_vma B go away.
>
>   End result: process A has a page that points to anon_vma B, but
>   anon_vma B does not exist any more.  This can go on forever.  Forget
>   about RCU grace periods, forget about locking, forget anything like
>   that.  The bug is simply that page->mapping points to an anon_vma
>   that was correct at one point, but was _not_ the one that was shared
>   by all users of that possible mapping.
>
> Changing it to always use the deepest anon_vma in the anonvma chain gets
> us to the safest model.
>
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
>
> But that's a future optimization. Make it _work_ reliably first.
>
> Reviewed-by: Rik van Riel <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Tested-by: Borislav Petkov <[email protected]> [ "What do you know, I think you fixed it!" ]
> Signed-off-by: Linus Torvalds <[email protected]>
Reviewed-by: Minchan Kim <minchan.kim>

It was a great hunt and a chance to learn many things
from the smart people on LKML.
Once again I feel the power of OSS and the great process of Linux evolution.

Thanks, everybody.

--
Kind regards,
Minchan Kim

2010-04-13 04:26:11

by Minchan Kim

Subject: Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

On Tue, Apr 13, 2010 at 1:23 PM, Minchan Kim <[email protected]> wrote:
> On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>>
>> On Tue, 13 Apr 2010, Johannes Weiner wrote:
>>>
>>> Would you mind pasting that nice description of the error case from your
>>> other email into that changelog?  I skimmed over the description but when
>>> I read this patch several hours later, I had to go back to that previous
>>> email to fully make sense of it.
>>
>> It now looks like this..
>>
>>                Linus
>> ---
>> From: Linus Torvalds <[email protected]>
>> Date: Mon, 12 Apr 2010 12:44:29 -0700
>> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>>
>> Otherwise we might be mapping in a page in a new mapping, but that page
>> (through the swapcache) would later be mapped into an old mapping too.
>> The page->mapping must be the case that works for everybody, not just
>> the mapping that happened to page it in first.
>>
>> Here's the scenario:
>>
>>  - page gets allocated/mapped by process A. Let's call the anon_vma we
>>   associate the page with 'A' to keep it easy to track.
>>
>>  - Process A forks, creating process B. The anon_vma in B is 'B', and has
>>   a chain that looks like 'B' -> 'A'. Everything is fine.
>>
>>  - Swapping happens. The page (with mapping pointing to 'A') gets swapped
>>   out (perhaps not to disk - it's enough to assume that it's just not
>>   mapped any more, and lives entirely in the swap-cache)
>>
>>  - Process B pages it in, which goes like this:
>>
>>        do_swap_page ->
>>          page = lookup_swap_cache(entry);
>>         ...
>>          set_pte_at(mm, address, page_table, pte);
>>          page_add_anon_rmap(page, vma, address);
>>
>>   And think about what happens here!
>>
>>   In particular, what happens is that this will now be the "first"
>>   mapping of that page, so page_add_anon_rmap() used to do
>>
>>        if (first)
>>                __page_set_anon_rmap(page, vma, address);
>>
>>   and notice what anon_vma it will use? It will use the anon_vma for
>>   process B!
>>
>>   What happens then? Trivial: process 'A' also pages it in (nothing
>>   happens, it's not the first mapping), and then process 'B' execve's
>>   or exits or unmaps, making anon_vma B go away.
>>
>>   End result: process A has a page that points to anon_vma B, but
>>   anon_vma B does not exist any more.  This can go on forever.  Forget
>>   about RCU grace periods, forget about locking, forget anything like
>>   that.  The bug is simply that page->mapping points to an anon_vma
>>   that was correct at one point, but was _not_ the one that was shared
>>   by all users of that possible mapping.
>>
>> Changing it to always use the deepest anon_vma in the anonvma chain gets
>> us to the safest model.
>>
>> This can be improved in certain cases: if we know the page is private to
>> just this particular mapping (for example, it's a new page, or it is the
>> only swapcache entry), we could pick the top (most specific) anon_vma.
>>
>> But that's a future optimization. Make it _work_ reliably first.
>>
>> Reviewed-by: Rik van Riel <[email protected]>
>> Acked-by: Johannes Weiner <[email protected]>
>> Tested-by: Borislav Petkov <[email protected]> [ "What do you know, I think you fixed it!" ]
>> Signed-off-by: Linus Torvalds <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Sorry for the mistake.
I was extremely excited. :)

--
Kind regards,
Minchan Kim

2010-04-13 09:38:34

by Borislav Petkov

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

From: Linus Torvalds <[email protected]>
Date: Mon, Apr 12, 2010 at 03:11:53PM -0700

> Ok. That does sound very positive. Of course, last time you sounded
> positive, I had an email from you half an hour later that said "oh no, it
> oopsed again". So I'll take it with a bit of salt, but on the whole I'll
> be optimistic about it.

Ok, just finished testing -rc4 - no problems so far. Let's just go out
on a limb here and say with a greater certainty that this really got
fixed but be smart about it and keep an eye open if it happens again -
you never know.

Where is the champagne?

--
Regards/Gruss,
Boris.

2010-04-13 09:52:25

by Peter Zijlstra

Subject: Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

On Mon, 2010-04-12 at 13:22 -0700, Linus Torvalds wrote:
> +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
> +{
> + return a->vm_end == b->vm_start &&
> + mpol_equal(vma_policy(a), vma_policy(b)) &&
> + a->vm_file == b->vm_file &&
> + !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
> + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
> +}

Maybe write that as:

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
	if (a->vm_end != b->vm_start)
		return 0;

	if (!mpol_equal(vma_policy(a), vma_policy(b)))
		return 0;

	if (a->vm_file != b->vm_file)
		return 0;

	if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC))
		return 0;

	if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) != b->vm_pgoff)
		return 0;

	return 1;
}
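
For readers unfamiliar with the XOR-and-mask idiom and the page-offset check in that function, here is a standalone userspace sketch. The `TOY_*` constants, `struct toy_range` and both helpers are assumptions made for illustration; the mempolicy and file comparisons are left out because they need kernel types.

```c
/* Illustrative values only; the kernel's VM_* flags live in its own headers. */
#define TOY_VM_READ	0x1UL
#define TOY_VM_WRITE	0x2UL
#define TOY_VM_EXEC	0x4UL
#define TOY_VM_SHARED	0x8UL
#define TOY_PAGE_SHIFT	12

/* Hypothetical, stripped-down stand-in for the vma fields being compared. */
struct toy_range {
	unsigned long start, end;	/* byte addresses, end exclusive */
	unsigned long pgoff;		/* offset of 'start' in pages */
	unsigned long flags;
};

/*
 * XOR leaves exactly the bits that differ between a and b. Masking with
 * the complement of the protection bits ignores differences in
 * read/write/exec and rejects a difference in any other flag.
 */
static int flags_compatible(unsigned long a, unsigned long b)
{
	return !((a ^ b) & ~(TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC));
}

/* 'b' must start where 'a' ends and continue a's page-offset sequence. */
static int ranges_compatible(const struct toy_range *a,
			     const struct toy_range *b)
{
	if (a->end != b->start)
		return 0;

	if (!flags_compatible(a->flags, b->flags))
		return 0;

	if (a->pgoff + ((b->start - a->start) >> TOY_PAGE_SHIFT) != b->pgoff)
		return 0;

	return 1;
}
```

Two adjacent ranges that differ only in their protection bits pass the check; a mismatch in any other flag, or a page offset that does not continue where the first range leaves off, fails it.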

2010-04-13 10:36:57

by KOSAKI Motohiro

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

Hi Linus,

> On Sun, 11 Apr 2010, Rik van Riel wrote:
> >
> > Another thing I just thought of.
> >
> > The anon_vma struct will not be reused for something completely
> > different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
> > is created with.
>
> Rik, we _know_ it got re-used by something totally different. That's
> clearly the problem. The page->mapping pointer does _not_ point to an
> anon_vma any more. That's the problem here.
>
> What we need to figure out is how we have a page on the LRU list that is
> still marked as 'mapped' that has that stale mapping pointer.
>
> I can easily see how the stale mapping pointer happens for a non-mapped
> page. That part is trivial. Here's a simple case:
>
> - vmscan does that whole "isolate LRU pages", and one of them is a (at
> that time mapped) anonymous page. It's now not on any LRU lists at all.
>
> - vmscan ends up waiting for pageout and/or writeback while holding that
> list of pages.
>
> - in the meantime, the process that had the page exits or unmaps,
> unmapping the page and freeing the vma and the anon_vma.
>
> - vmscan eventually gets to the page, and does that page_referenced()
> dance. page->mapping points to something that is long long gone (as in
> "IO access lifetimes", so we're talking something that has been freed
> literally milliseconds ago, rather than any RCU delays)
>
> So I can see the stale page->mapping pointer happening. That part is even
> trivial. What I don't see is how the page would be still marked 'mapped'.
> Everything that actually frees the vma/anon_vmas should also have
> unmapped the page before that - even if it didn't _free_ the page.

Sorry, now I'm lost as to what is being discussed in this crazy long thread.
IIUC, if the page->mapping was freed milliseconds ago, the following check (1)
returns false and we never actually touch page->mapping.

Am I missing something?


===================================================================
struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;

rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
goto out;
if (!page_mapped(page)) /* (1) here */
goto out;

anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
return anon_vma;
out:
rcu_read_unlock();
return NULL;
}
=================================================


And I think your following patch is incorrect.
The added page_mapped() is called after spin_lock(&anon_vma->lock),
which means check-after-dereference. Such a check doesn't prevent an invalid
pointer dereference, I think.

Perhaps I'm missing something. I have to reread this whole thread from
the start.

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)

anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
- return anon_vma;
+
+ if (page_mapped(page))
+ return anon_vma;
+
+ spin_unlock(&anon_vma->lock);
out:
rcu_read_unlock();
return NULL;






2010-04-13 10:53:41

by KOSAKI Motohiro

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
> unsigned long anon_mapping;
>
> rcu_read_lock();
> - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> + anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
> - if (!page_mapped(page))
> - goto out;
>
> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> spin_lock(&anon_vma->lock);

Is the anon_vma->lock dereference guaranteed to be safe if
page->_mapcount == -1? The anon_vma may have been freed milliseconds ago;
rcu_read_lock() doesn't provide such a guarantee.

Perhaps I'm missing your point.

2010-04-13 11:31:12

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote:
> > struct anon_vma *page_lock_anon_vma(struct page *page)
> > {
> > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
> > unsigned long anon_mapping;
> >
> > rcu_read_lock();
> > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > + anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > goto out;
> > - if (!page_mapped(page))
> > - goto out;
> >
> > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > spin_lock(&anon_vma->lock);
>
> Is the anon_vma->lock dereference guaranteed to be safe if
> page->_mapcount == -1? The anon_vma may have been freed milliseconds ago;
> rcu_read_lock() doesn't provide such a guarantee.
>
> Perhaps I'm missing your point.

No you're right, I got my head hopelessly twisted up trying to make
page_lock_anon_vma() do something reliable, but there really isn't much
that can be done.

Luckily most users (with exception of the memory-failure.c one) don't
really care and all take steps to verify the page is indeed in any of
the vmas it might find.

So I've given up on this and will only submit a patch like the below,
which hopefully does still make sense...

I do think there's a missing barrier in there as well, but I've made
enough of a fool of myself.

[ with the preemptible mmu_gather patches I introduce a refcount to
the anon_vma, and then with atomic_inc_not_zero() we can add a
guarantee that the returned anon_vma is alive ]
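
For readers unfamiliar with the inc-not-zero idiom mentioned in the note
above, here is a minimal userland sketch using C11 atomics. It is not the
kernel's atomic_inc_not_zero() and not the refcount patch itself; struct
obj, get_unless_zero() and put_ref() are hypothetical names. The sketch
also assumes the memory holding the counter stays valid while it is being
inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for an anon_vma. */
struct obj {
    atomic_int refcount;   /* 0 means the object is already dead */
};

/* Take a reference only if the object is still alive (count != 0). */
static bool get_unless_zero(struct obj *o)
{
    int old = atomic_load(&o->refcount);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;          /* lost the race: the last reference is gone */
}

/* Drop a reference; returns true if this was the last one. */
static bool put_ref(struct obj *o)
{
    return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

A caller that gets false from get_unless_zero() must treat the object as
gone and must not use it, which is the "alive" guarantee described above.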

---
mm/rmap.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..49a2533 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -285,8 +285,22 @@ void __init anon_vma_init(void)
}

/*
- * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
+ *
+ * Since there is no serialization what so ever against page_remove_rmap()
+ * the best this function can do is return a locked anon_vma that might
+ * have been relevant to this page.
+ *
+ * The page might have been remapped to a different anon_vma or the anon_vma
+ * returned may already be freed (and even reused).
+ *
+ * All users of this function must be very careful when walking the anon_vma
+ * chain and verify that the page in question is indeed mapped in it
+ * [ something equivalent to page_mapped_in_vma() ].
+ *
+ * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
+ * that the anon_vma pointer from page->mapping is valid if there is a
+ * mapcount, we can dereference the anon_vma after observing those.
*/
struct anon_vma *page_lock_anon_vma(struct page *page)
{

2010-04-13 12:00:45

by KOSAKI Motohiro

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

> On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote:
> > > struct anon_vma *page_lock_anon_vma(struct page *page)
> > > {
> > > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
> > > unsigned long anon_mapping;
> > >
> > > rcu_read_lock();
> > > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > + anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> > > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > > goto out;
> > > - if (!page_mapped(page))
> > > - goto out;
> > >
> > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > > spin_lock(&anon_vma->lock);
> >
> > Is the anon_vma->lock dereference guaranteed to be safe if
> > page->_mapcount == -1? The anon_vma may have been freed milliseconds ago;
> > rcu_read_lock() doesn't provide such a guarantee.
> >
> > Perhaps I'm missing your point.
>
> No you're right, I got my head hopelessly twisted up trying to make
> page_lock_anon_vma() do something reliable, but there really isn't much
> that can be done.
>
> Luckily most users (with exception of the memory-failure.c one) don't
> really care and all take steps to verify the page is indeed in any of
> the vmas it might find.
>
> So I've given up on this and will only submit a patch like the below,
> which hopefully does still make sense...
>
> I do think there's a missing barrier in there as well, but I've made
> enough of a fool of myself.
>
> [ with the preemptible mmu_gather patches I introduce a refcount to
> the anon_vma, and then with atomic_inc_not_zero() we can add a
> guarantee that the returned anon_vma is alive ]

Indeed. A refcount is the best way. The anon_vma DESTROY_BY_RCU stuff seems
over-engineered, I think. It is the fastest, but anon_vma allocation is not
(and was not) a fork/exit bottleneck. So I guess the simplest way is
best.


The following patch also looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>

Thanks for that. I've thought this was really necessary, but my (very) poor
English skills made me hesitate to do it. Sorry for my laziness ;)



>
> ---
> mm/rmap.c | 18 ++++++++++++++++--
> 1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eaa7a09..49a2533 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -285,8 +285,22 @@ void __init anon_vma_init(void)
> }
>
> /*
> - * Getting a lock on a stable anon_vma from a page off the LRU is
> - * tricky: page_lock_anon_vma rely on RCU to guard against the races.
> + * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
> + *
> + * Since there is no serialization what so ever against page_remove_rmap()
> + * the best this function can do is return a locked anon_vma that might
> + * have been relevant to this page.
> + *
> + * The page might have been remapped to a different anon_vma or the anon_vma
> + * returned may already be freed (and even reused).
> + *
> + * All users of this function must be very careful when walking the anon_vma
> + * chain and verify that the page in question is indeed mapped in it
> + * [ something equivalent to page_mapped_in_vma() ].
> + *
> + * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
> + * that the anon_vma pointer from page->mapping is valid if there is a
> + * mapcount, we can dereference the anon_vma after observing those.
> */
> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
>


2010-04-14 22:00:54

by Rik van Riel

Subject: [PATCH] rmap: add exclusively owned pages to the newest anon_vma

The recent anon_vma fixes cause many anonymous pages to end up
in the parent process anon_vma, even when the page is exclusively
owned by the current process.

Adding exclusively owned anonymous pages to the top anon_vma
reduces rmap scanning overhead, especially in workloads with
forking servers.

This patch adds a parameter to __page_set_anon_rmap that can
be used to indicate whether or not the added page is exclusively
owned by the current process.

Pages added through page_add_new_anon_rmap are exclusively
owned by the current process, and can be added to the top
anon_vma.

Pages added through page_add_anon_rmap can be either shared
or exclusively owned, so we do the conservative thing and
add it to the oldest anon_vma.

A next step would be to add the exclusive parameter to
page_add_anon_rmap, to be used from functions where we do
know for sure whether a page is exclusively owned.

Signed-off-by: Rik van Riel <[email protected]>
---
Borislav, I audited the code before making this change, but would
still appreciate your testing of this patch :)

Linus, once this patch survives Borislav's testing, I'll start
looking at the next step. I'd like to do things one step at a
time so I won't cause another regression...

mm/rmap.c | 30 +++++++++++++++++++-----------
1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..12ac0f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -730,23 +730,31 @@ void page_move_anon_rmap(struct page *page,
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
+ * @exclusive: the page is exclusively owned by the current process
*/
static void __page_set_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, int exclusive)
{
struct anon_vma_chain *avc;
struct anon_vma *anon_vma;

BUG_ON(!vma->anon_vma);

- /*
- * We must use the _oldest_ possible anon_vma for the page mapping!
- *
- * So take the last AVC chain entry in the vma, which is the deepest
- * ancestor, and use the anon_vma from that.
- */
- avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
- anon_vma = avc->anon_vma;
+ if (exclusive)
+ anon_vma = vma->anon_vma;
+ else {
+ /*
+ * The page may be shared between multiple processes.
+ * We must use the _oldest_ possible anon_vma for the
+ * page mapping! That anon_vma is guaranteed to be
+ * present in all processes that could share this page.
+ *
+ * So take the last AVC chain entry in the vma, which is the
+ * deepest ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;
+ }

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
@@ -802,7 +810,7 @@ void page_add_anon_rmap(struct page *page,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
if (first)
- __page_set_anon_rmap(page, vma, address);
+ __page_set_anon_rmap(page, vma, address, 0);
else
__page_check_anon_rmap(page, vma, address);
}
@@ -824,7 +832,7 @@ void page_add_new_anon_rmap(struct page *page,
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
__inc_zone_page_state(page, NR_ANON_PAGES);
- __page_set_anon_rmap(page, vma, address);
+ __page_set_anon_rmap(page, vma, address, 1);
if (page_evictable(page, vma))
lru_cache_add_lru(page, LRU_ACTIVE_ANON);
else

2010-04-14 23:21:01

by Johannes Weiner

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma

On Wed, Apr 14, 2010 at 05:59:28PM -0400, Rik van Riel wrote:
> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
>
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
>
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
>
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
>
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
>
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
>
> Signed-off-by: Rik van Riel <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2010-04-15 07:30:19

by Peter Zijlstra

Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA

On Tue, 2010-04-13 at 21:00 +0900, KOSAKI Motohiro wrote:
> > [ with the preemptible mmu_gather patches I introduce a refcount to
> > the anon_vma, and then with atomic_inc_not_zero() we can add a
> > guarantee that the returned anon_vma is alive ]
>
> Indeed. A refcount is the best way. The anon_vma DESTROY_BY_RCU stuff seems
> over-engineered, I think. It is the fastest, but anon_vma allocation is not
> (and was not) a fork/exit bottleneck. So I guess the simplest way is
> best.

Well, that refcount stuff still relies on DESTROY_BY_RCU :-)

Anyway, it also looks like a lot of races are avoided by ordering the
rmap_add/remove calls wrt adding/removing the page to/from the LRU.

Rmap calls come from LRU pages, and it looks like rmap state is only
changed for pages that are not on the LRU.

I still have to go through all that code again to make sure, but I
couldn't find a race between page_add_anon_rmap() and
page_lock_anon_vma() due to that.

If there is, we need to look at page_mapped() before page->mapping
because page_add_anon_rmap() first increments the mapcount and only then
adjusts the mapping, so the existing order in page_lock_anon_vma() can
end up dereferencing a long-dead anon_vma.





2010-04-15 08:34:42

by Borislav Petkov

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma

From: Rik van Riel <[email protected]>
Date: Wed, Apr 14, 2010 at 05:59:28PM -0400

> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
>
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
>
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
>
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
>
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
>
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> Borislav, I audited the code before making this change, but would
> still appreciate your testing of this patch :)

Just did some light hammering and it looks ok so far. I'll keep watching
out for oopsies/issues.

Lightly-tested-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

2010-04-15 16:02:11

by Minchan Kim

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma

On Thu, Apr 15, 2010 at 6:59 AM, Rik van Riel <[email protected]> wrote:
> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
>
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
>
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
>
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
>
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
>
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
>
> Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-04-15 20:06:04

by Linus Torvalds

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma



On Wed, 14 Apr 2010, Rik van Riel wrote:
> - /*
> - * We must use the _oldest_ possible anon_vma for the page mapping!
> - *
> - * So take the last AVC chain entry in the vma, which is the deepest
> - * ancestor, and use the anon_vma from that.
> - */
> - avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> - anon_vma = avc->anon_vma;
> + if (exclusive)
> + anon_vma = vma->anon_vma;
> + else {
> + /*
> + * The page may be shared between multiple processes.
> + * We must use the _oldest_ possible anon_vma for the
> + * page mapping! That anon_vma is guaranteed to be
> + * present in all processes that could share this page.
> + *
> + * So take the last AVC chain entry in the vma, which is the
> + * deepest ancestor, and use the anon_vma from that.
> + */
> + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> + anon_vma = avc->anon_vma;
> + }

I really dislike your coding style.

If we do this conditionally, we're _much_ better off declaring the
variables we only use inside that conditional block inside the block
itself. And since we access "vma->anon_vma" in either case, just move that
case outside the conditional statement, and avoid a pointless
if/then/else.

IOW, something like this. Totally untested.

Linus

---
mm/rmap.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..78d4730 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -732,21 +732,25 @@ void page_move_anon_rmap(struct page *page,
* @address: the user virtual address mapped
*/
static void __page_set_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, int exclusive)
{
- struct anon_vma_chain *avc;
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma = vma->anon_vma;

- BUG_ON(!vma->anon_vma);
+ BUG_ON(!anon_vma);

/*
- * We must use the _oldest_ possible anon_vma for the page mapping!
+ * If the page isn't exclusively mapped into this vma,
+ * we must use the _oldest_ possible anon_vma for the
+ * page mapping!
*
- * So take the last AVC chain entry in the vma, which is the deepest
- * ancestor, and use the anon_vma from that.
+ * So take the last AVC chain entry in the vma, which is
+ * the deepest ancestor, and use the anon_vma from that.
*/
- avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
- anon_vma = avc->anon_vma;
+ if (!exclusive) {
+ struct anon_vma_chain *avc;
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;
+ }

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
@@ -802,7 +806,7 @@ void page_add_anon_rmap(struct page *page,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
if (first)
- __page_set_anon_rmap(page, vma, address);
+ __page_set_anon_rmap(page, vma, address, 0);
else
__page_check_anon_rmap(page, vma, address);
}
@@ -824,7 +828,7 @@ void page_add_new_anon_rmap(struct page *page,
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
__inc_zone_page_state(page, NR_ANON_PAGES);
- __page_set_anon_rmap(page, vma, address);
+ __page_set_anon_rmap(page, vma, address, 1);
if (page_evictable(page, vma))
lru_cache_add_lru(page, LRU_ACTIVE_ANON);
else
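
The selection logic in the patch above can be tried outside the kernel.
The following is a userland sketch under stated assumptions: the structs
are cut down to the fields the logic needs, the list helpers are local
re-implementations, and pick_anon_vma() is a hypothetical name for the
part of __page_set_anon_rmap() that chooses the anon_vma. It assumes, as
the patch comment says, that the last AVC entry on the vma's chain
belongs to the deepest ancestor.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Simplified stand-ins for the kernel structures. */
struct anon_vma { int id; };

struct anon_vma_chain {
    struct anon_vma *anon_vma;
    struct list_head same_vma;
};

struct vm_area_struct {
    struct anon_vma *anon_vma;        /* this process's own anon_vma */
    struct list_head anon_vma_chain;  /* own entry first, oldest last */
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Hypothetical helper: choose the anon_vma a new page should point at. */
static struct anon_vma *pick_anon_vma(struct vm_area_struct *vma, int exclusive)
{
    struct anon_vma *anon_vma = vma->anon_vma;

    /* A possibly shared page must use the oldest anon_vma on the chain. */
    if (!exclusive) {
        struct anon_vma_chain *avc;

        avc = list_entry(vma->anon_vma_chain.prev,
                         struct anon_vma_chain, same_vma);
        anon_vma = avc->anon_vma;
    }
    return anon_vma;
}
```

With a chain holding the child's entry followed by the parent's,
pick_anon_vma(vma, 1) returns the child's anon_vma and
pick_anon_vma(vma, 0) returns the parent's.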

2010-04-16 06:16:55

by Felipe Balbi

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma

Hi,

On Thu, Apr 15, 2010 at 10:01:11PM +0200, ext Linus Torvalds wrote:
>+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);

While at it, would it make sense to first provide list_last_entry(),
since we already have list_first_entry()?

Totally unrelated to this patch, sorry.

--
balbi

2010-04-16 14:53:18

by Linus Torvalds

Subject: Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma



On Fri, 16 Apr 2010, Felipe Balbi wrote:
>
> While at it, would it make sense to first provide list_last_entry(), since we
> already have list_first_entry()?

Yeah, it probably would make sense. Especially as doing a simple grep for
'list_entry.*prev' does seem to imply that there might be quite a few
places that would be able to use it. Although some of them do seem to be
about finding the previous entry rather than the last in a list.

That said, doing the same grep for 'next' shows that a lot of places don't
use the list_first_entry() that we _do_ have, so..

Linus
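
A sketch of what the list_last_entry() suggested above could look like,
written as a userland program so it can be compiled on its own. The
macro is defined here purely for illustration, mirroring
list_first_entry() with ->prev in place of ->next; struct item and the
list helpers are hypothetical, local re-implementations.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* The existing helper: first element after the list head. */
#define list_first_entry(ptr, type, member) \
    list_entry((ptr)->next, type, member)

/* The proposed counterpart: last element before the list head. */
#define list_last_entry(ptr, type, member) \
    list_entry((ptr)->prev, type, member)

/* Hypothetical payload type for the example. */
struct item {
    int value;
    struct list_head node;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}
```

As with list_first_entry(), the macro is only meaningful on a non-empty
list, and it applies only where ->prev is taken from the list head; call
sites that use ->prev on an element to find the previous entry are a
different case, as noted above.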