2004-04-14 08:57:34

by Alexander Y. Fomichev

[permalink] [raw]
Subject: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!

G'Day,

I've got a bug on 2.6.5-aa3 today

System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
heavy loaded at most time, but i've got it approximately
at 02:00 (MSD) when no serious load should be.
System remained accessible all the time but operations
with proclist from userspace (i.e. ps, w) appears to be locked.

Here is a trace:

kernel BUG at mm/objrmap.c:137!
invalid operand: 0000 [#1]
PREEMPT SMP
CPU: 1
EIP: 0060:[<c0148ebb>] Not tainted
EFLAGS: 00010282 (2.6.5-aa3)
EIP is at page_referenced_one+0x8c/0x96
eax: dafaaae0 ebx: dafaaae0 ecx: 00000000 edx: c19ef4a0
esi: 00000000 edi: dade48c4 ebp: c19ef4a0 esp: e9049c0c
ds: 007b es: 007b ss: 0068
Process apache2 (pid: 29070, threadinfo=e9048000 task=cf0e8630)
Stack: dafaaae0 f4531e30 00000000 f4531e2c c0149073 c19ef4a0 00000001 c02d2d00
e9048000 00000166 c0149103 c19ef4a0 c013ff58 00000000 00000000 c02d2e04
00000001 00000000 00000000 e9049d30 00000166 c1a59890 c1841cd0 e9049c68
Call Trace:
[<c0149073>] page_referenced_anon+0x45/0x94
[<c0149103>] page_referenced+0x41/0x4d
[<c013ff58>] refill_inactive_zone+0x56c/0x678
[<c0140123>] shrink_zone+0xbf/0xc1
[<c014019d>] shrink_caches+0x78/0x7c
[<c014023b>] try_to_free_pages+0x9a/0x141
[<c0138f68>] __alloc_pages+0x1cb/0x30a
[<c0144594>] do_anonymous_page+0x82/0x246
[<c027fbe2>] inet_recvmsg+0x4a/0x69
[<c01447b5>] do_no_page+0x5d/0x58c
[<c0144ed8>] handle_mm_fault+0xe2/0x17b
[<c0115752>] do_page_fault+0x2eb/0x4d2
[<c015384d>] vfs_read+0xbc/0xf5
[<c0115467>] do_page_fault+0x0/0x4d2
[<c0107789>] error_code+0x2d/0x38

Code: 0f 0b 89 00 a9 20 2a c0 eb e1 55 31 ed 57 89 c7 56 53 83 ec
<6>note: apache2[29070] exited with preempt_count 2
bad: scheduling while atomic!
Call Trace:
[<c01182a4>] schedule+0x6a1/0x6a6
[<c012c2be>] rcu_process_callbacks+0x13a/0x169
[<c012c2be>] rcu_process_callbacks+0x13a/0x169
[<c01208f8>] tasklet_action+0x5f/0xa5
[<c01c9261>] rwsem_down_read_failed+0xa5/0x155
[<c012f5d4>] .text.lock.futex+0x7/0xb3
[<c0111b86>] delay_tsc+0xb/0x15
[<c012f4a5>] do_futex+0x6b/0x6d
[<c012f5a1>] sys_futex+0xfa/0x112
[<c011cc90>] call_console_drivers+0x77/0xfc
[<c011a872>] mm_release+0x96/0xa6
[<c011ea17>] do_exit+0x84/0x4d9
[<c029007b>] packet_ioctl+0x1f/0x192
[<c0107de2>] do_divide_error+0x0/0xd3
[<c010805c>] do_invalid_op+0x0/0xaa
[<c0108104>] do_invalid_op+0xa8/0xaa
[<c0148ebb>] page_referenced_one+0x8c/0x96
[<c0117642>] scheduler_tick+0x3b/0x5f7
[<c0116451>] recalc_task_prio+0x8c/0x1a6
[<c0107789>] error_code+0x2d/0x38
[<c014007b>] shrink_zone+0x17/0xc1
[<c0148ebb>] page_referenced_one+0x8c/0x96
[<c0149073>] page_referenced_anon+0x45/0x94
[<c0149103>] page_referenced+0x41/0x4d
[<c013ff58>] refill_inactive_zone+0x56c/0x678
[<c0140123>] shrink_zone+0xbf/0xc1
[<c014019d>] shrink_caches+0x78/0x7c
[<c014023b>] try_to_free_pages+0x9a/0x141
[<c0138f68>] __alloc_pages+0x1cb/0x30a
[<c0144594>] do_anonymous_page+0x82/0x246
[<c027fbe2>] inet_recvmsg+0x4a/0x69
[<c01447b5>] do_no_page+0x5d/0x58c
[<c0144ed8>] handle_mm_fault+0xe2/0x17b
[<c0115752>] do_page_fault+0x2eb/0x4d2
[<c015384d>] vfs_read+0xbc/0xf5
[<c0115467>] do_page_fault+0x0/0x4d2
[<c0107789>] error_code+0x2d/0x38

& some (probably) useful info:

cpuinfo: http://sysadminday.org.ru/objrmap_bug/cpuinfo
lspci: http://sysadminday.org.ru/objrmap_bug/lspci
lspci -vvn http://sysadminday.org.ru/objrmap_bug/lspci-vvn
full_trace (including initial boot,SysRq-[M,P,T])
http://sysadminday.org.ru/objrmap_bug/full_trace

--
Best regards.
Alexander Y. Fomichev <[email protected]>
Public PGP key: http://sysadminday.org.ru/gluk.asc


2004-04-14 09:54:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!

On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote:
> G'Day,
>
> I've got a bug on 2.6.5-aa3 today
>
> System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
> heavy loaded at most time, but i've got it approximately
> at 02:00 (MSD) when no serious load should be.
> System remained accessible all the time but operations
> with proclist from userspace (i.e. ps, w) appears to be locked.

I don't think apache2 uses nonlinear. there was an smp race fix in
2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in
case this was mm corruption triggered by the race.

Are you using threading with apache2? Such a race could trigger only
with threads.

2004-04-14 11:40:02

by Alexander Y. Fomichev

[permalink] [raw]
Subject: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!

On Wednesday 14 April 2004 13:54, Andrea Arcangeli wrote:
> On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote:
> > G'Day,
> >
> > I've got a bug on 2.6.5-aa3 today
> >
> > System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
btw, i was wrong, this box has only 2G ECC RAM not 4.
> > heavy loaded at most time, but i've got it approximately
> > at 02:00 (MSD) when no serious load should be.
> > System remained accessible all the time but operations
> > with proclist from userspace (i.e. ps, w) appears to be locked.
>
> I don't think apache2 uses nonlinear. there was an smp race fix in
> 2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in
> case this was mm corruption triggered by the race.
tnx, i'll try it.
btw, system running on 2.6.5-aa3 almost for a month (at 2004/04/18) and
this is first time i catch this, so 2.6.5-aa3 pretty stable for me. (except
for trouble mentioned by Sergey Kostyliov as "2.6.X kernel memory
leak?"/"2.6.1 IO lockup on SMP systems")

> Are you using threading with apache2? Such a race could trigger only
> with threads.

exactly right, apache2 compiled --with-mpm=worker.

--
Best regards.
Alexander Y. Fomichev <[email protected]>
Public PGP key: http://sysadminday.org.ru/gluk.asc

2004-04-14 11:47:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!

On Wed, Apr 14, 2004 at 03:39:49PM +0400, Alexander Y. Fomichev wrote:
> exactly right, apache2 compiled --with-mpm=worker.

ok so there are good chances that 2.6.5-aa5 will fix it, if not then
please notify me again, thanks.

2004-04-19 12:33:10

by Alexander Y. Fomichev

[permalink] [raw]
Subject: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)

On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote:
> ok so there are good chances that 2.6.5-aa5 will fix it, if not then
> please notify me again, thanks.

I've noticed a behaviour today very similar to mentioned in previous report
(on 14 Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e.
system remained accessible but procps(ps,pkill) appears to be locked and
Sysrq-T is similar to previous one - some of apache2 & all of procps
have been blocked.

Here is a traces:

http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M
http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace

--
Best regards.
Alexander Y. Fomichev <[email protected]>
Public PGP key: http://sysadminday.org.ru/gluk.asc

2004-04-23 02:27:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)

On Mon, Apr 19, 2004 at 04:32:53PM +0400, Alexander Y. Fomichev wrote:
> On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote:
> > ok so there are good chances that 2.6.5-aa5 will fix it, if not then
> > please notify me again, thanks.
>
> I've noticed a behaviour today very similar to mentioned in previous report
> (on 14 Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e.
> system remained accessible but procps(ps,pkill) appears to be locked and
> Sysrq-T is similar to previous one - some of apache2 & all of procps
> have been blocked.
>
> Here is a traces:
>
> http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M
> http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace

following your trace I got to some futex issue (likely invoked by NPTL),
and following the futex code I noticed a race with threads in
expand_stack and likely the futex is effectively calling expand_stack
too. Then I found a race in expand_stack with threads.

here the details:

in 2.6-aa expand_stack is moving down vm_start and vm_end with only the
read semaphore held. this means this thing can even run in parallel in
both cpus, the latter will corrupt vm_pgoff which generate malfunction
with anon-vma:

vma->vm_start = address;
vma->vm_pgoff -= grow;

this isn't an issue for the file mappings because only anon vmas can be
growsdown (the filemappings could never work right in filemap_nopage if
expand_stack would mess with pgoff and vm_start without holding the
mmap_sem in write mode).

serializing this with anon-vma is easy and scalable with the
anon_vma_lock (not an mm-wide lock like the page_table_lock).
This will also serialize perfectly with the objrmap. objrmap should be
the only thing that cares about vm_pgoff and vm_start being coherent at
the same time for anon-mappings in anon-vma, and the anon_vma_lock
provides perfect locking for that.

One of the scalability and simplicity locking beauty of anon-vma is the
total avoidance of page_table_lock for vma merging and all other vma
operations in general, and true usage of page_table_lock only for the
pagetables (and future usage of vma->page_table_lock only for pagetables
too). I wouldn't really like to giveup on this and to reintroduce the
whole mess of page_table_lock in the vma merging and in expand stack
that all other implementations like 2.6 mainline and anonmm are
suffering from, and I hope my fix will be enough. Could you test if this
patch helps?

(there was also an ancient page_table_lock in split_vma, I forgot to
delete that previously)

--- 2.6.5-aa3/mm/mmap.c.~1~ 2004-04-14 11:51:33.000000000 +0200
+++ 2.6.5-aa3/mm/mmap.c 2004-04-23 04:04:31.258066752 +0200
@@ -977,6 +977,15 @@ int expand_stack(struct vm_area_struct *
return -EFAULT;

/*
+ * We must make sure the anon-vma is allocated
+ * to make sure the anon-vma locking is not a noop.
+ */
+ if (unlikely(anon_vma_prepare(vma)))
+ return -ENOMEM;
+
+ anon_vma_lock(vma);
+
+ /*
* vma->vm_start/vm_end cannot change under us because the caller
* is required to hold the mmap_sem in read mode. We need to get
* the spinlock only before relocating the vma range ourself.
@@ -986,13 +995,15 @@ int expand_stack(struct vm_area_struct *
grow = (address - vma->vm_end) >> PAGE_SHIFT;

/* Overcommit.. */
- if (security_vm_enough_memory(grow)) {
+ if (unlikely(security_vm_enough_memory(grow))) {
+ anon_vma_unlock(vma);
return -ENOMEM;
}

- if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
- ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
- current->rlim[RLIMIT_AS].rlim_cur) {
+ if (unlikely(address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
+ ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
+ current->rlim[RLIMIT_AS].rlim_cur)) {
+ anon_vma_unlock(vma);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -1000,6 +1011,9 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
+
+ anon_vma_unlock(vma);
+
return 0;
}

@@ -1028,6 +1042,24 @@ int expand_stack(struct vm_area_struct *
unsigned long grow;

/*
+ * We must make sure the anon-vma is allocated
+ * to make sure the anon-vma locking is not a noop.
+ */
+ if (unlikely(anon_vma_prepare(vma)))
+ return -ENOMEM;
+
+ /*
+ * We must serialize against other thread and against
+ * objrmap while moving the vm_start/vm_pgoff of anon-vmas.
+ * The total_vm/locked_vm as well needs serialization
+ * against other threads, the serialization of
+ * locked_vm/total_vm against syscalls is provided by
+ * the mmap_sem that we hold in read mode here (all
+ * syscalls holds it in write mode).
+ */
+ anon_vma_lock(vma);
+
+ /*
* vma->vm_start/vm_end cannot change under us because the caller
* is required to hold the mmap_sem in read mode. We need to get
* the spinlock only before relocating the vma range ourself.
@@ -1036,13 +1068,15 @@ int expand_stack(struct vm_area_struct *
grow = (vma->vm_start - address) >> PAGE_SHIFT;

/* Overcommit.. */
- if (security_vm_enough_memory(grow)) {
+ if (unlikely(security_vm_enough_memory(grow))) {
+ anon_vma_unlock(vma);
return -ENOMEM;
}

- if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
- ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
- current->rlim[RLIMIT_AS].rlim_cur) {
+ if (unlikely(vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
+ ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
+ current->rlim[RLIMIT_AS].rlim_cur)) {
+ anon_vma_unlock(vma);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -1051,6 +1085,9 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
+
+ anon_vma_unlock(vma);
+
return 0;
}

@@ -1289,7 +1326,6 @@ int split_vma(struct mm_struct * mm, str

if (mapping)
down(&mapping->i_shared_sem);
- spin_lock(&mm->page_table_lock);
anon_vma_lock(vma);

if (new_below)
@@ -1301,7 +1337,6 @@ int split_vma(struct mm_struct * mm, str
__insert_vm_struct(mm, new);

anon_vma_unlock(vma);
- spin_unlock(&mm->page_table_lock);
if (mapping)
up(&mapping->i_shared_sem);


2004-04-23 09:24:56

by Alexander Y. Fomichev

[permalink] [raw]
Subject: Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)

On Friday 23 April 2004 06:27, Andrea Arcangeli wrote:
> following your trace I got to some futex issue (likely invoked by NPTL),
> and following the futex code I noticed a race with threads in
> expand_stack and likely the futex is effectively calling expand_stack
> too. Then I found a race in expand_stack with threads.
>
> here the details:
>
> in 2.6-aa expand_stack is moving down vm_start and vm_end with only the
> read semaphore held. this means this thing can even run in parallel in
> both cpus, the latter will corrupt vm_pgoff which generate malfunction
> with anon-vma:
>
> vma->vm_start = address;
> vma->vm_pgoff -= grow;
>
> this isn't an issue for the file mappings because only anon vmas can be
> growsdown (the filemappings could never work right in filemap_nopage if
> expand_stack would mess with pgoff and vm_start without holding the
> mmap_sem in write mode).
>
> serializing this with anon-vma is easy and scalable with the
> anon_vma_lock (not an mm-wide lock like the page_table_lock).
> This will also serialize perfectly with the objrmap. objrmap should be
> the only thing that cares about vm_pgoff and vm_start being coherent at
> the same time for anon-mappings in anon-vma, and the anon_vma_lock
> provides perfect locking for that.
>
> One of the scalability and simplicity locking beauty of anon-vma is the
> total avoidance of page_table_lock for vma merging and all other vma
> operations in general, and true usage of page_table_lock only for the
> pagetables (and future usage of vma->page_table_lock only for pagetables
> too). I wouldn't really like to giveup on this and to reintroduce the
> whole mess of page_table_lock in the vma merging and in expand stack
> that all other implementations like 2.6 mainline and anonmm are
> suffering from, and I hope my fix will be enough. Could you test if this
> patch helps?

Of course, but it will take some time,i've got two followed oops
on previous kernel after almost two weeks of stable work. I think week
or two would be enough "timeslice" to say something. Unfortunately i
can't try artificial tests on a production system but i guess ordinary
load would be enough.

--
Best regards.
Alexander Y. Fomichev <[email protected]>
Public PGP key: http://sysadminday.org.ru/gluk.asc