2004-06-04 14:09:52

by Miquel van Smoorenburg

[permalink] [raw]
Subject: msync() oops in 2.6.7-rc2-bk1

I'm running a news server. The innd process uses mmap()s for several
files and uses msync() to force synchronization to disk every so
often. Suddenly, an msync() causes an oops (and innd SEGVs). This
is after the box has been up and running for 3 days:

# uname -a
Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux

<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
c0149120
*pde = 00000000
Oops: 0002 [#5]
Modules linked in: e100 mii
CPU: 0
EIP: 0060:[<c0149120>] Not tainted
EFLAGS: 00010213 (2.6.7-rc2-bk1)
EIP is at __set_page_dirty_buffers+0x20/0xb0
eax: 00000000 ebx: f77a1e7c ecx: c15706e0 edx: eba5a83c
esi: 5ccfb000 edi: 00000000 ebp: 5d000000 esp: f44b5efc
ds: 007b es: 007b ss: 0068
Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
Call Trace:
[<c013143e>] set_page_dirty+0x3e/0x50
[<c013da1c>] filemap_sync_pte+0x5c/0x80
[<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
[<c013dbc3>] filemap_sync+0x93/0x100
[<c013dc96>] msync_interval+0x66/0xf0
[<c013de37>] sys_msync+0x117/0x123
[<c0103c7b>] syscall_call+0x7/0xb

Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75

Mike.


2004-06-05 04:34:07

by Andrew Morton

[permalink] [raw]
Subject: Re: msync() oops in 2.6.7-rc2-bk1

"Miquel van Smoorenburg" <[email protected]> wrote:
>
> I'm running a news server. The innd process uses mmap()s for several
> files and uses msync() to force synchronization to disk every so
> often. Suddenly, an msync() causes an oops (and innd SEGVs). This
> is after the box has been up and running for 3 days:
>
> # uname -a
> Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
>
> <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
> c0149120
> *pde = 00000000
> Oops: 0002 [#5]
> Modules linked in: e100 mii
> CPU: 0
> EIP: 0060:[<c0149120>] Not tainted
> EFLAGS: 00010213 (2.6.7-rc2-bk1)
> EIP is at __set_page_dirty_buffers+0x20/0xb0
> eax: 00000000 ebx: f77a1e7c ecx: c15706e0 edx: eba5a83c
> esi: 5ccfb000 edi: 00000000 ebp: 5d000000 esp: f44b5efc
> ds: 007b es: 007b ss: 0068
> Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
> Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
> c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
> 00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
> Call Trace:
> [<c013143e>] set_page_dirty+0x3e/0x50
> [<c013da1c>] filemap_sync_pte+0x5c/0x80
> [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
> [<c013dbc3>] filemap_sync+0x93/0x100
> [<c013dc96>] msync_interval+0x66/0xf0
> [<c013de37>] sys_msync+0x117/0x123
> [<c0103c7b>] syscall_call+0x7/0xb
>
> Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75

You have a page which has PG_private set, but page->private is NULL. And
the machine is non-SMP, non-preempt, yes?

I'd be wondering whether that machine has flipped a bit in page->flags,
frankly. How old is it?

Was the mmap of a regular file or of a block device?

2004-06-05 10:35:23

by Miquel van Smoorenburg

[permalink] [raw]
Subject: Re: msync() oops in 2.6.7-rc2-bk1

On Sat, 05 Jun 2004 06:33:15, Andrew Morton wrote:
> "Miquel van Smoorenburg" <[email protected]> wrote:
> >
> > I'm running a news server. The innd process uses mmap()s for several
> > files and uses msync() to force synchronization to disk every so
> > often. Suddenly, an msync() causes an oops (and innd SEGVs). This
> > is after the box has been up and running for 3 days:
> >
> > # uname -a
> > Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
> >
> > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
> > c0149120
> > *pde = 00000000
> > Oops: 0002 [#5]
> > Modules linked in: e100 mii
> > CPU: 0
> > EIP: 0060:[<c0149120>] Not tainted
> > EFLAGS: 00010213 (2.6.7-rc2-bk1)
> > EIP is at __set_page_dirty_buffers+0x20/0xb0
> > eax: 00000000 ebx: f77a1e7c ecx: c15706e0 edx: eba5a83c
> > esi: 5ccfb000 edi: 00000000 ebp: 5d000000 esp: f44b5efc
> > ds: 007b es: 007b ss: 0068
> > Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
> > Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
> > c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
> > 00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
> > Call Trace:
> > [<c013143e>] set_page_dirty+0x3e/0x50
> > [<c013da1c>] filemap_sync_pte+0x5c/0x80
> > [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
> > [<c013dbc3>] filemap_sync+0x93/0x100
> > [<c013dc96>] msync_interval+0x66/0xf0
> > [<c013de37>] sys_msync+0x117/0x123
> > [<c0103c7b>] syscall_call+0x7/0xb
> >
> > Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75
>
> You have a page which has PG_private set, but page->private is NULL. And
> the machine is non-SMP, non-preempt, yes?

Indeed.

> I'd be wondering whether that machine has flipped a bit in page->flags,
> frankly. How old is it?

It's an Athlon XP 2000+ from september 2002, 1 GB RAM.

> Was the mmap of a regular file or of a block device?

I'm not sure. My collegue was stracing the process when it happended
for the 3rd time but didn't save the output to a file.

I now see that just after 2.6.7-rc2-bk was booted it oopsed too-
a different oops. Perhaps that corrupted something.

Oops: 0002 [#1]
Modules linked in: e100 mii
CPU: 0
EIP: 0060:[drop_buffers+84/144] Not tainted
EFLAGS: 00010207 (2.6.7-rc2-bk1)
EIP is at drop_buffers+0x54/0x90
eax: b552e000 ebx: eba5a834 ecx: eba5a80c edx: 01000406
esi: eba5addc edi: eba5a26c ebp: c15af680 esp: c198cd18
ds: 007b es: 007b ss: 0068
Process kswapd0 (pid: 9, threadinfo=c198c000 task=c19a65d0)
Stack: 00000001 c15af680 c15af680 00000000 c198ce1c c014b917 c15af680 c198cd38
00000000 f7bff63c c15af680 c01361e3 c15af680 000000d0 c198c000 00000001
0000001b 00000000 c198cd60 c198cd60 00000000 00000000 00000000 00000020
Call Trace:
[try_to_free_buffers+55/144] try_to_free_buffers+0x37/0x90
[shrink_list+835/1072] shrink_list+0x343/0x430
[shrink_cache+331/768] shrink_cache+0x14b/0x300
[balance_pgdat+462/608] balance_pgdat+0x1ce/0x260
[kswapd+279/304] kswapd+0x117/0x130
[autoremove_wake_function+0/96] autoremove_wake_function+0x0/0x60
[autoremove_wake_function+0/96] autoremove_wake_function+0x0/0x60
[kswapd+0/304] kswapd+0x0/0x130
[kernel_thread_helper+5/20] kernel_thread_helper+0x5/0x14
Code: 89 42 04 89 10 89 5b 04 89 59 28 39 fe 89 f1 75 df 8b 44 24

But it might indeed be a case of flakey hardware. Re-examining the
history of this machine I see that it has spontaneously rebooted
before. What I'll do is this - it's now running 2.6.6 vanilla. I'll
see what it does over the weekend. After the weekend I'll boot
2.6.7-rcX-bkY and see what it does with that. If I can reproduce
this problem reliably or can find a pattern in it myself, I'll
let you know.

I think I'll boot 2.6.7-rcX-bkY at the same time on a known-good
machine that has a comparable load.

Mike.

2004-06-08 17:34:03

by Bill Davidsen

[permalink] [raw]
Subject: Re: msync() oops in 2.6.7-rc2-bk1

Andrew Morton wrote:
> "Miquel van Smoorenburg" <[email protected]> wrote:
>
>>I'm running a news server. The innd process uses mmap()s for several
>>files and uses msync() to force synchronization to disk every so
>>often. Suddenly, an msync() causes an oops (and innd SEGVs). This
>>is after the box has been up and running for 3 days:
>>
>># uname -a
>>Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
>>
>> <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
>>c0149120
>>*pde = 00000000
>>Oops: 0002 [#5]
>>Modules linked in: e100 mii
>>CPU: 0
>>EIP: 0060:[<c0149120>] Not tainted
>>EFLAGS: 00010213 (2.6.7-rc2-bk1)
>>EIP is at __set_page_dirty_buffers+0x20/0xb0
>>eax: 00000000 ebx: f77a1e7c ecx: c15706e0 edx: eba5a83c
>>esi: 5ccfb000 edi: 00000000 ebp: 5d000000 esp: f44b5efc
>>ds: 007b es: 007b ss: 0068
>>Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
>>Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
>> c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
>> 00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
>>Call Trace:
>> [<c013143e>] set_page_dirty+0x3e/0x50
>> [<c013da1c>] filemap_sync_pte+0x5c/0x80
>> [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
>> [<c013dbc3>] filemap_sync+0x93/0x100
>> [<c013dc96>] msync_interval+0x66/0xf0
>> [<c013de37>] sys_msync+0x117/0x123
>> [<c0103c7b>] syscall_call+0x7/0xb
>>
>>Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75
>
>
> You have a page which has PG_private set, but page->private is NULL. And
> the machine is non-SMP, non-preempt, yes?
>
> I'd be wondering whether that machine has flipped a bit in page->flags,
> frankly. How old is it?
>
> Was the mmap of a regular file or of a block device?

I don't think so... I have a number of machines, also news servers,
which are producing various errors in filemap. In the RH2.4.21-15 kernel
it shows up in filemap_sync_pte_range, which is now in msync. It appears
to be a bad pte coming in, I see the error at pmd_bad sometimes.

I see it with several applications, all news, all mmap() heavily.

Sorry I can't tell you more, but it happens on multiple brand-new Xeon
systems. If he is seeing similar on an Athlon I would assume that it's
some "less traveled way" in the kernel, because I have a limited ability
to believe in coincidence.


--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me