2005-01-11 09:07:15

by Roland Rosenfeld

Subject: Oops in kjournald

In the last week I got the following Oops five times on two machines.
It happens when they are under very heavy load (mail servers based on
Debian sarge with postfix 2.1.4), with the load above 4 for some hours.

---------------------- schnipp --------------------------------
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c01a65a5
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: e1000 tg3 bonding rtc unix
CPU: 0
EIP: 0060:[<c01a65a5>] Not tainted VLI
EFLAGS: 00010282 (2.6.10-686-nc-smp-1)
EIP is at journal_commit_transaction+0x315/0x12f0
eax: f157117c ebx: 00000000 ecx: 00000000 edx: f7cffb80
ds: 007b es: 007b ss: 0068
Process kjournald (pid: 171, threadinfo=f6f7e000 task=f6e92520)
Stack: e2efd5cc e2efd5cc 00000008 000011da 00000000 f6e97ac0 f6f7e000 f6f7e000
f6ef6bb8 f6e97a14 00000000 00000000 00000000 00000000 00000000 f0fa9dac
e8a39ecc 000011da 00000000 f6e92520 c012d580 f6f7fe44 f6f7fe44 00000000
Call Trace:
[<c012d580>] autoremove_wake_function+0x0/0x60
[<c012d580>] autoremove_wake_function+0x0/0x60
[<c01a9af5>] kjournald+0xe5/0x240
[<c011b837>] do_exit+0x2d7/0x480
[<c012d580>] autoremove_wake_function+0x0/0x60
[<c012d580>] autoremove_wake_function+0x0/0x60
[<c0102572>] ret_from_fork+0x6/0x14
[<c01a99f0>] commit_timeout+0x0/0x10
[<c01a9a10>] kjournald+0x0/0x240
[<c01007f5>] kernel_thread_helper+0x5/0x10
Code: 00 8b 45 20 85 c0 75 be 8b 44 24 38 85 c0 0f 85 16 0d 00 00 8b 45 18 85 c0 0f 84 83 00 00 00 be 00 e0 ff ff 21 e6 8b 78 20 8b 1f <f0> ff 43 0c 8b 03 a8 04 0f 85 b3 0c 00 00 89 5c 24 04 8b 94 24
<6>note: kjournald[171] exited with preempt_count 1
---------------------- schnipp --------------------------------
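
As a reading aid for the Oops above, the following is a minimal sketch that decodes two of its fields by hand: the page-fault error code ("Oops: 0002") and the instruction marked `<f0> ff 43 0c` in the Code line. The helper names are hypothetical, the register values are copied from the report, and the decoder handles only this one instruction form; it is not a general disassembler.

```python
# Sketch: decode two fields of the Oops by hand. Values are copied from the
# report; the helper names are hypothetical and this is not a disassembler.

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_page_fault_error(code):
    """Split the low three bits of an i386 page-fault error code."""
    return {
        "protection_fault": bool(code & 0x1),  # False: page was not present
        "write": bool(code & 0x2),             # True: the access was a write
        "user_mode": bool(code & 0x4),         # False: CPU was in kernel mode
    }


def decode_inc_disp8(insn):
    """Decode only the form seen at the fault: [lock] ff /0, mod=01 (base+disp8).

    Returns (assembly_text, base_register, displacement).
    """
    pos = 0
    lock = insn[pos] == 0xF0          # f0 is the LOCK prefix
    if lock:
        pos += 1
    if insn[pos] != 0xFF:
        raise ValueError("not an ff-group instruction")
    modrm = insn[pos + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or reg != 0 or rm == 4:
        raise ValueError("only 'inc disp8(reg)' without SIB is handled")
    disp = insn[pos + 2]
    if disp >= 0x80:                  # disp8 is sign-extended
        disp -= 0x100
    base = REG32[rm]
    text = f"{'lock ' if lock else ''}incl {disp:#x}(%{base})"
    return text, base, disp


if __name__ == "__main__":
    # Only the registers that survived in the retyped report.
    registers = {"eax": 0xF157117C, "ebx": 0x0, "ecx": 0x0, "edx": 0xF7CFFB80}
    print(decode_page_fault_error(int("0002", 16)))
    text, base, disp = decode_inc_disp8(bytes.fromhex("f0ff430c"))
    fault_address = (registers[base] + disp) & 0xFFFFFFFF
    print(text, "->", f"{fault_address:08x}")
```

The decoded instruction is `lock incl 0xc(%ebx)`, and with ebx = 00000000 the computed address is 0000000c, which matches the reported fault address. The error code says the fault was a kernel-mode write to a page that was not present. This is consistent with an atomic increment of a field at offset 0xc in a structure reached through a NULL pointer; which structure that is cannot be established from the report alone.
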

I run a 2.6.10 kernel (with aacraid 1.1.5 2372 driver from Adaptec,
everything else vanilla) on Dual Xeon machines. The kernel has ext3fs
and XFS compiled in but currently all filesystems are ext3.

I don't see anything in the logs, because the machines freeze with the
above message. I retyped the messages from the screen, so there may be
some typos; if necessary, I have a screenshot here to correct any
misspellings.

What can I do to fix this problem? Using Google I didn't find any hint
for further searching, but my kernel knowledge is very limited :-|
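
One possible starting point for further searching, sketched under assumptions: the "EIP is at" line carries a symbol, an offset into it, and the function's size, and these can be turned into a gdb `list *(symbol+offset)` query. This only helps if a vmlinux with debug information matching the running kernel is available, which the report does not say. The function names below are hypothetical helpers.

```python
import re

# symbol+0xOFFSET/0xSIZE, as printed in the EIP line and in call-trace entries
LOCATION_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][A-Za-z0-9_.]*)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_location(line):
    """Return (symbol, offset, size) from an Oops line, or None if absent."""
    match = LOCATION_RE.search(line)
    if match is None:
        return None
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


def gdb_list_command(symbol, offset):
    """Build the gdb command that lists source around symbol+offset."""
    return f"list *({symbol}+{offset:#x})"


if __name__ == "__main__":
    # Symbol name as spelled in the kernel source.
    eip_line = "EIP is at journal_commit_transaction+0x315/0x12f0"
    symbol, offset, size = parse_location(eip_line)
    print(gdb_list_command(symbol, offset))
```

The resulting command would be typed at the prompt of `gdb vmlinux`; without debug information gdb can still disassemble around the address, but it cannot map it to a source line.
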

Tschoeeee

Roland


2005-01-11 09:24:59

by Jeffrey Hundstad

Subject: Bad things happening to journaled filesystem machines; Was: Oops in kjournald

I've had 4 machines do similar things. It happens during backups
or during updatedb. This has happened on 2.6.9, 2.6.10, 2.6.10-ac7, and
2.6.10-ac8. I've seen several similar reports with journaled file
systems. I use XFS exclusively, but have seen reports on both XFS and EXT3.
I would report something more useful, but what I'm usually left with is
XFS unmounted and nothing useful on the screen. This has been on Xeon,
Pentium-3 and Athlon systems. I wish I could report more, but perhaps this
will add /part/ of a data point.

--
jeffrey hundstad


Roland Rosenfeld wrote:

>In the last week I got the following Oops five times on two machines,
>when they are under very heavy load (mailserver based on Debian sarge
>with postfix 2.1.4) when the load is >4 for some hours.
>
>---------------------- schnipp --------------------------------
>Unable to handle kernel NULL pointer dereference at virtual address 0000000c
> printing eip:
>c01a65a5
>*pde = 00000000
>Oops: 0002 [#1]
>PREEMPT SMP
>Modules linked in: e1000 tg3 bonding rtc unix
>CPU: 0
>EIP: 0060:[<c01a65a5>] Not tainted VLI
>EFLAGS: 00010282 (2.6.10-686-nc-smp-1)
>EIP is at journal_commit_transaction+0x315/0x12f0
>eax: f157117c ebx: 00000000 ecx: 00000000 edx: f7cffb80
>ds: 007b es: 007b ss: 0068
>Process kjournald (pid: 171, threadinfo=f6f7e000 task=f6e92520)
>Stack: e2efd5cc e2efd5cc 00000008 000011da 00000000 f6e97ac0 f6f7e000 f6f7e000
> f6ef6bb8 f6e97a14 00000000 00000000 00000000 00000000 00000000 f0fa9dac
> e8a39ecc 000011da 00000000 f6e92520 c012d580 f6f7fe44 f6f7fe44 00000000
>Call Trace:
> [<c012d580>] autoremove_wake_function+0x0/0x60
> [<c012d580>] autoremove_wake_function+0x0/0x60
> [<c01a9af5>] kjournald+0xe5/0x240
> [<c011b837>] do_exit+0x2d7/0x480
> [<c012d580>] autoremove_wake_function+0x0/0x60
> [<c012d580>] autoremove_wake_function+0x0/0x60
> [<c0102572>] ret_from_fork+0x6/0x14
> [<c01a99f0>] commit_timeout+0x0/0x10
> [<c01a9a10>] kjournald+0x0/0x240
> [<c01007f5>] kernel_thread_helper+0x5/0x10
>Code: 00 8b 45 20 85 c0 75 be 8b 44 24 38 85 c0 0f 85 16 0d 00 00 8b 45 18 85 c0 0f 84 83 00 00 00 be 00 e0 ff ff 21 e6 8b 78 20 8b 1f <f0> ff 43 0c 8b 03 a8 04 0f 85 b3 0c 00 00 89 5c 24 04 8b 94 24
> <6>note: kjournald[171] exited with preempt_count 1
>---------------------- schnipp --------------------------------
>
>I run a 2.6.10 kernel (with aacraid 1.1.5 2372 driver from Adaptec,
>everything else vanilla) on Dual Xeon machines. The kernel has ext3fs
>and XFS compiled in but currently all filesystems are ext3.
>
>In the logs I don't see anything, because the machines freeze with the
>above message (I retyped the messages from the screen, so there may be
>some typos, if necessary I have a screenshot here to correct some
>misspellings).
>
>What can I do to fix this problem? Using google I didn't find a hint
>for further search, but my kernel knowledge is very limited :-|
>
>Tschoeeee
>
> Roland

2005-01-11 16:10:32

by Anders Saaby

Subject: Re: Bad things happening to journaled filesystem machines; Was: Oops in kjournald

Hi,

Well, I just want to count myself in. I have had quite a few of these
errors, both on XFS and EXT3. I tried a lot of different 2.6.x kernels
without luck. (See other mails here on LKML from me and Jakob Oestergaard
et al.)

It seems something is very broken in VFS(?) on 2.6 (at least after 2.6.5,
which was the last kernel I didn't see this on, but it had other errors that
forced me away from it).

Sadly, it looks to me as if either no one cares (enough) about this, or no one
is capable of fixing it (myself included). :(

2.4.28 fixed it for me. I just have to live with the poor performance.

Jeffrey E. Hundstad wrote:

> I've had 4 machines do similar things. It happens during backups
> or during updatedb. This has happened on 2.6.9, 2.6.10, 2.6.10-ac7, and
> 2.6.10-ac8. I've seen several similar reports with journaled file
> systems. I use XFS exclusively; but have seen reports on XFS and EXT3.
> I would report something more useful but what I'm usually left with is
> XFS unmounted and nothing useful on the screen. This has been on Xeon,
> Pentium-3 and Athlon systems. ...wish I could report more but perhaps it
> will add /part/ of a data point.
>

--
Med venlig hilsen - Best regards - Meilleures salutations

Anders Saaby
Systems Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
Mail: [email protected] - http://www.cohaesio.com
------------------------------------------------