2011-05-29 07:23:12

by Ingo Molnar

Subject: [PATCH] mm: Fix boot crash in mm_alloc()


Would be nice to get the fix below into -rc1 as well: the crash triggers
rather easily on bootup when CONFIG_CPUMASK_OFFSTACK is turned on.

Ingo

---------------------->
From 59b28833ae328e2206865fb25e61917e738d9696 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <[email protected]>
Date: Sat, 28 May 2011 08:22:15 +0200
Subject: [PATCH] mm: Fix boot crash in mm_alloc()

Fix CONFIG_CPUMASK_OFFSTACK=y boot crash:

[ 12.598405] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 12.600012] IP: [<c11ae035>] find_next_bit+0x55/0xb0
[ 12.600012] *pdpt = 0000000000000000 *pde = f000e81af000e81a
[ 12.600012] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 12.600012] Modules linked in:
[ 12.600012]
[ 12.600012] Pid: 1, comm: swapper Not tainted 2.6.39-05707-gde03c72-dirty #130523 System manufacturer System Product Name/A8N-E
[ 12.600012] EIP: 0060:[<c11ae035>] EFLAGS: 00010202 CPU: 0
[ 12.600012] EIP is at find_next_bit+0x55/0xb0
[ 12.600012] EAX: 00000000 EBX: 00000002 ECX: 00000000 EDX: 00000000
[ 12.600012] ESI: 00000000 EDI: f59a4000 EBP: f6479e78 ESP: f6479e70
[ 12.600012] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 12.600012] Process swapper (pid: 1, ti=f6478000 task=f6470000 task.ti=f6478000)
[ 12.600012] Stack:
[ 12.600012] 00000000 00000000 f6479e8c c11addda 00000000 f59a4000 f5939000 f6479e98
[ 12.600012] c102396b 35937001 f6479eac c1022705 00000001 f5939008 f59a4000 f6479ed8
[ 12.600012] c10227ba f5939000 f59a4000 f5939000 f5937000 f5938000 f593c000 f59a4000
[ 12.600012] Call Trace:
[ 12.600012] [<c11addda>] cpumask_any_but+0x2a/0x70
[ 12.600012] [<c102396b>] flush_tlb_mm+0x2b/0x80
[ 12.600012] [<c1022705>] pud_populate+0x35/0x50
[ 12.600012] [<c10227ba>] pgd_alloc+0x9a/0xf0
[ 12.600012] [<c103a3fc>] mm_init+0xec/0x120
[ 12.600012] [<c103a7a3>] mm_alloc+0x53/0xd0
[ 12.600012] [<c10f9220>] bprm_mm_init+0x20/0x1b0
[ 12.600012] [<c10370bf>] ? sched_exec+0x7f/0xb0
[ 12.600012] [<c10f96b9>] do_execve+0xb9/0x270
[ 12.600012] [<c100aec7>] sys_execve+0x37/0x70
[ 12.600012] [<c13d60a2>] ptregs_execve+0x12/0x18
[ 12.600012] [<c13d5299>] ? syscall_call+0x7/0xb
[ 12.600012] [<c1006840>] ? kernel_execve+0x20/0x30
[ 12.600012] [<c16086af>] ? start_kernel+0x2de/0x2de
[ 12.600012] [<c13c9ea2>] ? run_init_process+0x1c/0x1e
[ 12.600012] [<c13c9f2d>] ? init_post+0x89/0xb3
[ 12.600012] [<c16087d1>] ? kernel_init+0x122/0x122
[ 12.600012] [<c13d657a>] ? kernel_thread_helper+0x6/0x10

Caused by:

de03c72: mm: convert mm->cpu_vm_cpumask into cpumask_var_t

Cc: KOSAKI Motohiro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
 kernel/fork.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ca406d9..7b0669f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -538,17 +538,13 @@ struct mm_struct * mm_alloc(void)
 		return NULL;

 	memset(mm, 0, sizeof(*mm));
-	mm = mm_init(mm, current);
-	if (!mm)
-		return NULL;

 	if (mm_init_cpumask(mm, NULL)) {
-		mm_free_pgd(mm);
 		free_mm(mm);
 		return NULL;
 	}

-	return mm;
+	return mm_init(mm, current);
 }

/*
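
The ordering problem this patch fixes can be modelled in userspace. The following is a minimal sketch, not kernel code: every `sketch_` name is hypothetical, and `sketch_mm_init()` merely stands in for the mm_init() -> pgd_alloc() -> flush_tlb_mm() path from the call trace, returning an error where the kernel dereferenced the still-NULL mask.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are the kernel's definitions. */
#define SKETCH_NR_CPUS 64

struct sketch_cpumask {
	unsigned long bits[SKETCH_NR_CPUS / (8 * sizeof(unsigned long))];
};

struct sketch_mm {
	/* With CONFIG_CPUMASK_OFFSTACK=y the mask is a pointer, NULL until allocated. */
	struct sketch_cpumask *cpu_vm_mask_var;
	int pgd_ready;
};

/* Models mm_init_cpumask(): allocate the off-stack mask. Returns 0 on success. */
static int sketch_init_cpumask(struct sketch_mm *mm)
{
	mm->cpu_vm_mask_var = calloc(1, sizeof(*mm->cpu_vm_mask_var));
	return mm->cpu_vm_mask_var ? 0 : -1;
}

/*
 * Models mm_init(), whose page-table setup ends up reading the mask.
 * The kernel crashed at this point; the sketch returns an error instead.
 */
static int sketch_mm_init(struct sketch_mm *mm)
{
	if (!mm->cpu_vm_mask_var)
		return -1;	/* this is where the NULL dereference happened */
	mm->pgd_ready = 1;
	return 0;
}

/* Pre-fix ordering: mm_init() runs before the mask exists. */
static int sketch_alloc_old_order(struct sketch_mm *mm)
{
	assert(mm != NULL);
	memset(mm, 0, sizeof(*mm));
	if (sketch_mm_init(mm))
		return -1;
	return sketch_init_cpumask(mm);
}

/* Post-fix ordering: allocate the mask first, then call mm_init(). */
static int sketch_alloc_new_order(struct sketch_mm *mm)
{
	assert(mm != NULL);
	memset(mm, 0, sizeof(*mm));
	if (sketch_init_cpumask(mm))
		return -1;
	return sketch_mm_init(mm);
}
```

Called on a fresh structure, sketch_alloc_old_order() fails before the mask exists, while sketch_alloc_new_order() succeeds; the caller then releases the mask with free().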


2011-05-29 16:23:18

by Linus Torvalds

Subject: Re: [PATCH] mm: Fix boot crash in mm_alloc()

On Sun, May 29, 2011 at 12:22 AM, Ingo Molnar <[email protected]> wrote:
>
> Would be nice to get the fix below into -rc1 as well: the crash triggers
> rather easily on bootup when CONFIG_CPUMASK_OFFSTACK is turned on.

Looking at that commit de03c72cfce5, it looks odd in other ways too.

For example, it looks like mm_cpumask is always initialized to zero.
That's a bit odd, isn't it, since it *used* to be initialized
statically with this:

- .cpu_vm_mask = CPU_MASK_ALL,

which is rather different from zero.

Now, I'm sure the init mm_cpumask doesn't really matter, but I'd have
expected a commentary about it.

I also wonder if that whole conversion to cpumask_var_t was worth it,
since clearly it wasn't very well tested. It results in an extra
allocation at fork() time for the many-cpu case, and I do get the
feeling that we would have been better off keeping the cpumask inside
the mm_struct. Moving it to the end of mm_struct makes sense for the
many-cpu case, but at the same time I end up wondering what it does to
the switch_mm() cache behavior. (And perhaps the TLB flush IPI cache
activity).

Ho humm. I have this suspicion that that whole patch wasn't fully
thought out, and that I should revert it rather than fix the oops.

Or, in fact, we could just do something like the attached (UNTESTED!)
which does the whole "move big allocation to end, but keep the
cpumask_var_t at the beginning, and don't do any extra allocations"
thing.

NOTE NOTE NOTE! Not only is the attached patch untested, but please
see the added FIXME comment about the whole mm_struct
kmem_cache_create(). Right now we always allocate the whole
maximum-sized bitmap.
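
A rough userspace sketch of that layout, assuming the pointer is simply aimed at storage embedded at the end of the structure. This is not the attached patch, which is not reproduced in this archive; `cpumask_allocation` is the name used later in the thread, while the `sketch_` names, the placeholder field, and the 4096-CPU limit are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NR_CPUS		4096
#define SKETCH_BITS_PER_LONG	(8 * sizeof(unsigned long))
#define SKETCH_LONGS \
	((SKETCH_NR_CPUS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

struct sketch_cpumask {
	unsigned long bits[SKETCH_LONGS];
};

/* Hypothetical cut-down mm_struct: pointer up front, big storage last. */
struct sketch_mm {
	struct sketch_cpumask *cpu_vm_mask_var;	/* what callers dereference */
	int other_fields[16];			/* stands in for everything else */
	struct sketch_cpumask cpumask_allocation; /* always maximum-sized here */
};

/*
 * Point the mask at the embedded storage and clear it. Nothing is
 * allocated, so this cannot fail and there is nothing to free later.
 */
static void sketch_mm_init_cpumask(struct sketch_mm *mm)
{
	assert(mm != NULL);
	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
	memset(mm->cpu_vm_mask_var, 0, sizeof(*mm->cpu_vm_mask_var));
}
```

Because the storage is the last member, a later change could shrink the structure's allocation to the number of CPUs actually possible without moving any other field, which is the point of the FIXME.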

Comments?

Linus


Attachments:
patch.diff (4.58 kB)

2011-05-29 17:20:36

by Linus Torvalds

Subject: Re: [PATCH] mm: Fix boot crash in mm_alloc()

On Sun, May 29, 2011 at 9:22 AM, Linus Torvalds
<[email protected]> wrote:
>
> Or, in fact, we could just do something like the attached (UNTESTED!)

So I did warn you that it was untested.

It still is, but I walked through it a bit more, and I realized that
while I had gotten rid of the extra allocations of the
cpu_vm_mask_var, I hadn't gotten rid of the freeing.

So that patch would definitely not have worked very well with
CONFIG_CPUMASK_OFFSTACK.

And I noticed that I moved the cpu_vm_mask back to the wrong place: it
should likely be as close as possible to the mm_context_t, since the
main user is likely the task switching code that touches that anyway.

So here's a slightly updated patch.

STILL TOTALLY UNTESTED! The fixes were just from eyeballing it a bit
more, not from any actual testing.

Linus


Attachments:
patch.diff (4.71 kB)

2011-05-29 18:44:08

by Linus Torvalds

Subject: Re: [PATCH] mm: Fix boot crash in mm_alloc()

On Sun, May 29, 2011 at 10:19 AM, Linus Torvalds
<[email protected]> wrote:
>
> STILL TOTALLY UNTESTED! The fixes were just from eyeballing it a bit
> more, not from any actual testing.

Ok, I eyeballed it some more, and tested both the OFFSTACK and ONSTACK
case, and decided that I had better commit it now rather than wait any
longer, since I'll do the -rc1 later today, and will be on an airplane
most of tomorrow.

The exact placement of the cpu_vm_mask_var is up for grabs. For
example, I started thinking that it might be better to put it *after*
the mm_context_t, since for the non-OFFSTACK case it's generally
touched at the beginning rather than the end.

And the actual change to make the mm_cachep kmem_cache_create() use a
variable-sized allocation for the OFFSTACK case is similarly left as
an exercise for the reader. So effectively, this reverts a lot of
de03c72cfce5, but does so in a way that should make it very easy to
get back to where KOSAKI was aiming for.

Whatever. I was hoping to get comments on it, but I think I'd
rather push it out to get it tested and public than wait any longer. The
patch *looks* fine, tests ok on my machine, and removes more lines
than it adds despite the new big comment.

Linus

2011-05-30 01:12:38

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: Fix boot crash in mm_alloc()

(2011/05/30 3:43), Linus Torvalds wrote:
> On Sun, May 29, 2011 at 10:19 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> STILL TOTALLY UNTESTED! The fixes were just from eyeballing it a bit
>> more, not from any actual testing.
>
> Ok, I eyeballed it some more, and tested both the OFFSTACK and ONSTACK
> case, and decided that I had better commit it now rather than wait any
> longer, since I'll do the -rc1 later today, and will be on an airplane
> most of tomorrow.
>
> The exact placement of the cpu_vm_mask_var is up for grabs. For
> example, I started thinking that it might be better to put it *after*
> the mm_context_t, since for the non-OFFSTACK case it's generally
> touched at the beginning rather than the end.
>
> And the actual change to make the mm_cachep kmem_cache_create() use a
> variable-sized allocation for the OFFSTACK case is similarly left as
> an exercise for the reader. So effectively, this reverts a lot of
> de03c72cfce5, but does so in a way that should make it very easy to
> get back to where KOSAKI was aiming for.
>
> Whatever. I was hoping to get comments on it, but I think I'd
> rather push it out to get it tested and public than wait any longer. The
> patch *looks* fine, tests ok on my machine, and removes more lines
> than it adds despite the new big comment.

Hi

Thank you, Linus, and I'm sorry for bothering you and everyone else. So, if I
understand this thread correctly, my remaining homework is 1) make
cpumask_allocation variable-sized and 2) remove the NR_CPUS-bit fill/copy from
the fork/exec path. Right?

I think (2) matters more than (1). Copying NR_CPUS (=4096) bits can easily hurt
cache behavior. Anyway, will do. Thank you!
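
The cost behind (2) is simple arithmetic. A small sketch, assuming 64-byte cache lines (an assumption, not something stated in this thread) and hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Bytes needed for a bitmap of nbits bits, rounded up to whole longs. */
static size_t sketch_bitmap_bytes(size_t nbits)
{
	size_t longs = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	return longs * sizeof(unsigned long);
}

/* Cache lines touched when copying that many bytes from an aligned start. */
static size_t sketch_cache_lines(size_t bytes, size_t line_size)
{
	assert(line_size != 0);
	return (bytes + line_size - 1) / line_size;
}
```

With these helpers, sketch_bitmap_bytes(4096) is 512 bytes, or eight 64-byte lines per fill or copy, whereas a bitmap sized for 8 possible CPUs fits in a single unsigned long.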


2011-05-30 08:14:12

by Ingo Molnar

Subject: Re: [PATCH] mm: Fix boot crash in mm_alloc()


* KOSAKI Motohiro <[email protected]> wrote:

> (2011/05/30 3:43), Linus Torvalds wrote:
> > On Sun, May 29, 2011 at 10:19 AM, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> STILL TOTALLY UNTESTED! The fixes were just from eyeballing it a bit
> >> more, not from any actual testing.
> >
> > Ok, I eyeballed it some more, and tested both the OFFSTACK and ONSTACK
> > case, and decided that I had better commit it now rather than wait any
> > longer, since I'll do the -rc1 later today, and will be on an airplane
> > most of tomorrow.
> >
> > The exact placement of the cpu_vm_mask_var is up for grabs. For
> > example, I started thinking that it might be better to put it *after*
> > the mm_context_t, since for the non-OFFSTACK case it's generally
> > touched at the beginning rather than the end.
> >
> > And the actual change to make the mm_cachep kmem_cache_create() use a
> > variable-sized allocation for the OFFSTACK case is similarly left as
> > an exercise for the reader. So effectively, this reverts a lot of
> > de03c72cfce5, but does so in a way that should make it very easy to
> > get back to where KOSAKI was aiming for.
> >
> > Whatever. I was hoping to get comments on it, but I think I'd
> > rather push it out to get it tested and public than wait any longer. The
> > patch *looks* fine, tests ok on my machine, and removes more lines
> > than it adds despite the new big comment.
>
> Hi
>
> Thank you, Linus, and I'm sorry for bothering you and everyone else.
> So, if I understand this thread correctly, my remaining homework is
> 1) make cpumask_allocation variable-sized and 2) remove the
> NR_CPUS-bit fill/copy from the fork/exec path. Right?
>
> I think (2) matters more than (1). Copying NR_CPUS (=4096) bits can
> easily hurt cache behavior. Anyway, will do. Thank you!

I think the first task would be to double check that the code in
3.0-rc1 is indeed correct! :-)

Thanks,

Ingo