2006-08-04 08:22:23

by Jesper Juhl

Subject: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

I just hit a BUG that looks XFS related.

The machine is running 2.6.18-rc3-git3

(more info below the BUG messages)


BUG: unable to handle kernel NULL pointer dereference at virtual
address 00000078
printing eip:
c01e64d7
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: sky2 piix ide_core eeprom
CPU: 0
EIP: 0060:[<c01e64d7>] Not tainted VLI
EFLAGS: 00010293 (2.6.18-rc3-git3 #1)
EIP is at xfs_btree_init_cursor+0x12e/0x16b
eax: eb6e6690 ebx: 00000000 ecx: eb6e6690 edx: 0000008c
esi: 00000000 edi: 00000000 ebp: f76d0800 esp: d5ffad20
ds: 007b es: 007b ss: 0068
Process rm (pid: 24769, ti=d5ffa000 task=f3ef7ab0 task.ti=d5ffa000)
Stack: cd88a150 00000000 00000019 0000001f 00000000 c01ca335 00000007 00000000
00000000 00000000 c6d49e10 f76d0800 00000013 00000007 00000000 cd88a150
ecc90bb4 00000060 00000006 eebd9a80 0000008c 00000009 c87faf1c 00000000
Call Trace:
[<c01ca335>] xfs_free_ag_extent+0x44/0x667
[<c01cb775>] xfs_free_extent+0xda/0xf4
[<c01dce50>] xfs_bmap_finish+0x107/0x185
[<c021b0d9>] xfs_remove+0x2a8/0x456
[<c0226035>] xfs_vn_unlink+0x23/0x53
[<c016513b>] vfs_unlink+0x88/0x8c
[<c01651d4>] do_unlinkat+0x95/0xfb
[<c0102ae3>] syscall_call+0x7/0xb
[<b7eb705d>] 0xb7eb705d
[<c01ca335>] xfs_free_ag_extent+0x44/0x667
[<c01cb775>] xfs_free_extent+0xda/0xf4
[<c021f362>] kmem_zone_alloc+0x50/0xb9
[<c0205526>] xfs_log_reserve+0x6c/0xc7
[<c021f3f1>] kmem_zone_zalloc+0x26/0x51
[<c01dce50>] xfs_bmap_finish+0x107/0x185
[<c021b0d9>] xfs_remove+0x2a8/0x456
[<c0226035>] xfs_vn_unlink+0x23/0x53
[<c0214fe5>] xfs_dir_lookup_int+0xa3/0xfa
[<c01fb1ac>] xfs_iunlock+0x7b/0x84
[<c0372666>] __mutex_lock_slowpath+0x159/0x1d7
[<c016c22d>] d_splice_alias+0xb5/0xc2
[<c016513b>] vfs_unlink+0x88/0x8c
[<c01651d4>] do_unlinkat+0x95/0xfb
[<c0111296>] do_page_fault+0x12f/0x4e1
[<c0102ae3>] syscall_call+0x7/0xb
Code: 83 86 01 00 00 84 c0 74 98 8b 53 14 0f b6 c0 c1 e0 03 0f b7 92
7c 02 00 00 66 29 c2 eb 83 8b 47 78 8b 58 18 0f cb e9 0f ff ff ff <8b>
47 78 8b 5c b0 1c eb f0 8b 44 24 24 85 c0 75 23 8b 44 24 20
EIP: [<c01e64d7>] xfs_btree_init_cursor+0x12e/0x16b SS:ESP 0068:d5ffad20
<1>BUG: unable to handle kernel NULL pointer dereference at virtual
address 00000078
printing eip:
c01e64d7
*pde = 00000000
Oops: 0000 [#2]
SMP
Modules linked in: sky2 piix ide_core eeprom
CPU: 0
EIP: 0060:[<c01e64d7>] Not tainted VLI
EFLAGS: 00010293 (2.6.18-rc3-git3 #1)
EIP is at xfs_btree_init_cursor+0x12e/0x16b
eax: eb6e6118 ebx: 00000000 ecx: eb6e6118 edx: 0000008c
esi: 00000000 edi: 00000000 ebp: f76d0800 esp: d1e25d20
ds: 007b es: 007b ss: 0068
Process rm (pid: 24954, ti=d1e25000 task=f33b7030 task.ti=d1e25000)
Stack: cd49c980 00000000 00000019 00000005 00000000 c01ca335 00000001 00000000
00000000 00000000 c7ffde10 f76d0800 00000013 00000001 00000000 cd49c980
e085e4b4 00000060 00000006 ed757dc0 0000008c 00000009 c87fabd4 00000000
Call Trace:
[<c01ca335>] xfs_free_ag_extent+0x44/0x667
[<c01cb775>] xfs_free_extent+0xda/0xf4
[<c01dce50>] xfs_bmap_finish+0x107/0x185
[<c021b0d9>] xfs_remove+0x2a8/0x456
[<c0226035>] xfs_vn_unlink+0x23/0x53
[<c016513b>] vfs_unlink+0x88/0x8c
[<c01651d4>] do_unlinkat+0x95/0xfb
[<c0102ae3>] syscall_call+0x7/0xb
[<b7f2405d>] 0xb7f2405d
[<c01ca335>] xfs_free_ag_extent+0x44/0x667
[<c01cb775>] xfs_free_extent+0xda/0xf4
[<c021f362>] kmem_zone_alloc+0x50/0xb9
[<c0205526>] xfs_log_reserve+0x6c/0xc7
[<c021f3f1>] kmem_zone_zalloc+0x26/0x51
[<c01dce50>] xfs_bmap_finish+0x107/0x185
[<c021b0d9>] xfs_remove+0x2a8/0x456
[<c0226035>] xfs_vn_unlink+0x23/0x53
[<c0214fe5>] xfs_dir_lookup_int+0xa3/0xfa
[<c01fb1ac>] xfs_iunlock+0x7b/0x84
[<c0372666>] __mutex_lock_slowpath+0x159/0x1d7
[<c016c22d>] d_splice_alias+0xb5/0xc2
[<c016513b>] vfs_unlink+0x88/0x8c
[<c01651d4>] do_unlinkat+0x95/0xfb
[<c0111296>] do_page_fault+0x12f/0x4e1
[<c0102ae3>] syscall_call+0x7/0xb
Code: 83 86 01 00 00 84 c0 74 98 8b 53 14 0f b6 c0 c1 e0 03 0f b7 92
7c 02 00 00 66 29 c2 eb 83 8b 47 78 8b 58 18 0f cb e9 0f ff ff ff <8b>
47 78 8b 5c b0 1c eb f0 8b 44 24 24 85 c0 75 23 8b 44 24 20
EIP: [<c01e64d7>] xfs_btree_init_cursor+0x12e/0x16b SS:ESP 0068:d1e25d20


# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 3
cpu MHz : 3192.358
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips : 6388.63

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 3
cpu MHz : 3192.358
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips : 6384.44


# scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux server 2.6.18-rc3-git3 #1 SMP Thu Aug 3 13:28:08 CEST 2006 i686 GNU/Linux

Gnu C 3.3.5
Gnu make 3.80
binutils 2.15
util-linux 2.12p
mount 2.12p
module-init-tools 3.2-pre1
e2fsprogs 1.37
xfsprogs 2.6.20
nfs-utils 1.0.6
Linux C Library 2.3.2
Dynamic linker (ldd) 2.3.2
Procps 3.2.1
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 5.2.1
udev 056
Modules Loaded sky2 piix ide_core eeprom


The box has 28 XFS filesystems of various sizes (each between 50GB and
3.5TB) mounted. These filesystems are on LVM2 and each physical volume
that LVM2 sees is made up of a RAID1 mirror of two disks created by a
3ware ATA-RAID controller.


Any hints as to how I can resolve this problem would be very welcome.
It would also be nice to know if I should expect filesystem corruption
from this.


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html


2006-08-04 10:06:06

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Fri, Aug 04, 2006 at 10:22:21AM +0200, Jesper Juhl wrote:
> I just hit a BUG that looks XFS related.
>
> The machine is running 2.6.18-rc3-git3
>
> (more info below the BUG messages)
>

Thanks for reporting, Jesper - is it reproducible? Could you try this
patch for me? We had a couple of other reports of this, but the earlier
reporters have vanished ... could you let me know if this helps?

cheers.

--
Nathan

--- fs/xfs/xfs_alloc.c.orig    2006-08-04 20:00:34.333456250 +1000
+++ fs/xfs/xfs_alloc.c  2006-08-04 20:00:50.586472000 +1000
@@ -1949,14 +1949,8 @@ xfs_alloc_fix_freelist(
 		 * the restrictions correctly. Can happen for free calls
 		 * on a completely full ag.
 		 */
-		if (targs.agbno == NULLAGBLOCK) {
-			if (!(flags & XFS_ALLOC_FLAG_FREEING)) {
-				xfs_trans_brelse(tp, agflbp);
-				args->agbp = NULL;
-				return 0;
-			}
+		if (targs.agbno == NULLAGBLOCK)
 			break;
-		}
 		/*
 		 * Put each allocated block on the list.
 		 */

2006-08-04 10:43:58

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 04/08/06, Nathan Scott <[email protected]> wrote:
> On Fri, Aug 04, 2006 at 10:22:21AM +0200, Jesper Juhl wrote:
> > I just hit a BUG that looks XFS related.
> >
> > The machine is running 2.6.18-rc3-git3
> >
> > (more info below the BUG messages)
> >
>
> Thanks for reporting, Jesper - is it reproducible?

I don't know. I only tried that kernel once, and when it broke on me I
went back to the previous 2.6.11 kernel the machine was running before.


> Could you try this
> patch for me?

Sure.
The machine is semi-production, so there are limits to how much and
when I can test on it.
Roughly Wednesday and Thursday each week I should be able to run
experimental kernels on it; the rest of the week the box needs to be
stable.

> We had a couple of other reports of this, but the earlier
> reporters have vanished ... could you let me know if this helps?
>

What I'll do is apply that patch to the 2.6.18-rc3-git3 kernel that
BUG'ed on me. Then, on Wednesday next week, I'll boot the machine with
the patched kernel, let it run for ~24 hours, and report back to you
whether or not it crashed.

Or is there some other way you'd rather have me do it (subject to the
constraint that I can only do experiments one and a half to two days a
week)?


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-06 03:07:53

by Tony.Ho

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

I've tested this patch, but the XFS panic is still reproducible; the
error message is the same as before.


Nathan Scott wrote:
> On Fri, Aug 04, 2006 at 10:22:21AM +0200, Jesper Juhl wrote:
>
>> I just hit a BUG that looks XFS related.
>>
>> The machine is running 2.6.18-rc3-git3
>>
>> (more info below the BUG messages)
>>
>>
>
> Thanks for reporting, Jesper - is it reproducible? Could you try this
> patch for me? We had a couple of other reports of this, but the earlier
> reporters have vanished ... could you let me know if this helps?
>
> cheers.
>
>

2006-08-06 04:06:17

by Tony.Ho

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

I'm sorry about the previous mail - I tested on the wrong kernel.
The panic does not appear again, but random delete performance looks very bad.

Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
test             4G 50317  99 232444  54 109507  25 52287  98 329821  29  1169   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   104   1 +++++ +++    87   1   103   1 +++++ +++   100   1
test,4G,50317,99,232444,54,109507,25,52287,98,329821,29,1169.4,2,16,104,1,+++++,+++,87,1,103,1,+++++,+++,100,1


Tony.Ho wrote:
> I'vs tested this patch, but the XFS panic is also reproducible, error
> message is same as before.
>
>
> Nathan Scott wrote:
>> On Fri, Aug 04, 2006 at 10:22:21AM +0200, Jesper Juhl wrote:
>>
>>> I just hit a BUG that looks XFS related.
>>>
>>> The machine is running 2.6.18-rc3-git3
>>>
>>> (more info below the BUG messages)
>>>
>>>
>>
>> Thanks for reporting, Jesper - is it reproducible? Could you try this
>> patch for me? We had a couple of other reports of this, but the earlier
>> reporters have vanished ... could you let me know if this helps?
>>
>> cheers.
>>
>>

2006-08-07 04:34:42

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Sun, Aug 06, 2006 at 12:05:43PM +0800, Tony.Ho wrote:
> I'm sorry about prev mail. I test on a wrong kernel.
> The panic is not appear again,

Thanks for trying it, but I don't think that earlier patch is right.
I'll send out a new, improved patch to everyone who's reported this,
soon (tomorrow, hopefully).

> but random delete performance looks very bad.

This will be unrelated, and is probably due to the fact that we now
enable write barriers by default.
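
(If you want to confirm that, one quick test - and only a test, since
barriers are there to protect against data loss from volatile drive
write caches - would be to mount the filesystem with the "nobarrier"
option, e.g. "mount -o noatime,nobarrier <device> <mountpoint>", and
repeat the bonnie++ run.)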

cheers.

--
Nathan

2006-08-08 03:44:56

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Mon, Aug 07, 2006 at 08:39:49PM -0700, Avuton Olrich wrote:
> On 8/6/06, Nathan Scott <[email protected]> wrote:
> > On Sun, Aug 06, 2006 at 12:05:43PM +0800, Tony.Ho wrote:
> > > I'm sorry about prev mail. I test on a wrong kernel.
> > > The panic is not appear again,
>
> Using 2.6.18-rc4, is this the bug that this thread refers to?

Yes, try http://oss.sgi.com/archives/xfs/2006-08/msg00054.html
and lemme know what happens - thanks.

cheers.

--
Nathan

2006-08-08 08:54:25

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Tue, Aug 08, 2006 at 10:37:49AM +0200, Jesper Juhl wrote:
> On 08/08/06, Nathan Scott <[email protected]> wrote:
> > On Mon, Aug 07, 2006 at 08:39:49PM -0700, Avuton Olrich wrote:
> > > On 8/6/06, Nathan Scott <[email protected]> wrote:
> > > > On Sun, Aug 06, 2006 at 12:05:43PM +0800, Tony.Ho wrote:
> > > > > I'm sorry about prev mail. I test on a wrong kernel.
> > > > > The panic is not appear again,
> > >
> > > Using 2.6.18-rc4, is this the bug that this thread refers to?
> >
> > Yes, try http://oss.sgi.com/archives/xfs/2006-08/msg00054.html
> > and lemme know what happens - thanks.
> >
> Come wednesday would you like me to try rc4 + that patch instead of
> rc3-git3 with your previous patch?

Yep, ignore that first patch. Thanks!

--
Nathan

2006-08-10 12:25:49

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 10/08/06, Jesper Juhl <[email protected]> wrote:
> On 08/08/06, Nathan Scott <[email protected]> wrote:
> > On Tue, Aug 08, 2006 at 10:37:49AM +0200, Jesper Juhl wrote:
> > > On 08/08/06, Nathan Scott <[email protected]> wrote:
> > > > On Mon, Aug 07, 2006 at 08:39:49PM -0700, Avuton Olrich wrote:
> > > > > On 8/6/06, Nathan Scott <[email protected]> wrote:
> > > > > > On Sun, Aug 06, 2006 at 12:05:43PM +0800, Tony.Ho wrote:
> > > > > > > I'm sorry about prev mail. I test on a wrong kernel.
> > > > > > > The panic is not appear again,
> > > > >
> > > > > Using 2.6.18-rc4, is this the bug that this thread refers to?
> > > >
> > > > Yes, try http://oss.sgi.com/archives/xfs/2006-08/msg00054.html
> > > > and lemme know what happens - thanks.
> > > >
> > > Come wednesday would you like me to try rc4 + that patch instead of
> > > rc3-git3 with your previous patch?
> >
> > Yep, ignore that first patch. Thanks!
> >
>
> Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> for ~3 hours and then blew up - not in the same way though.
>
> The machine was under pretty heavy load recieving data via rsync when
> the following happened :
>
> Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> [<c0103a5e>] show_trace+0xf/0x13
> [<c0103b59>] dump_stack+0x15/0x19
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> [<b7e0a681>] 0xb7e0a681
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c01626bb>] exec_permission_lite+0x46/0xcd
> [<c0162acb>] __link_path_walk+0x4d/0xd0a
> [<c016f8df>] mntput_no_expire+0x1b/0x78
> [<c01637f3>] link_path_walk+0x6b/0xc4
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c022623e>] xfs_vn_rename+0x0/0x9f
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0154e6b>] sys_fchmodat+0xc2/0xef
> [<c015508f>] sys_lchown+0x50/0x52
> [<c016231b>] do_getname+0x4b/0x73
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> xfs_force_shutdown(dm-51,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0xc0229395
> Filesystem "dm-51": Corruption of in-memory data detected. Shutting
> down filesystem: dm-51
> Please umount the filesystem, and rectify the problem(s)
> xfs_force_shutdown(dm-51,0x1) called from line 424 of file
> fs/xfs/xfs_rw.c. Return address = 0xc0229395
>
> I was doing an lvmextend +xfs_resize of a different (XFS) filesystem
> on the same server at roughly the same time. But I'm not sure if
> that's related.
>
> I'm currently running xfs_repair on the fs that blew up.
>

I don't think lvextend and xfs_growfs are the cause of the problem;
I've extended 6 more filesystems since one of them blew up, and the
problem reported above has not manifested itself again (yet).

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-10 22:36:12

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Thu, Aug 10, 2006 at 01:31:35PM +0200, Jesper Juhl wrote:
> On 08/08/06, Nathan Scott <[email protected]> wrote:
> ...
> Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> for ~3 hours and then blew up - not in the same way though.
>
> The machine was under pretty heavy load recieving data via rsync when
> the following happened :
>
> Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> [<c0103a5e>] show_trace+0xf/0x13
> [<c0103b59>] dump_stack+0x15/0x19
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0165c45>] sys_renameat+0x47/0x73

Thanks Jesper. Hmm, lessee - this is a cancelled dirty rename
transaction ... could be ondisk dir2 corruption (any chance this
filesystem was affected by 2.6.17's endian bug?), or something
else entirely. No I/O errors in the system log earlier or anything
like that?

> I was doing an lvmextend +xfs_resize of a different (XFS) filesystem
> on the same server at roughly the same time. But I'm not sure if
> that's related.

That won't be related, no.

> I'm currently running xfs_repair on the fs that blew up.

OK, I'd be interested to see if that reported any directory (or
other) issues.

cheers.

--
Nathan

2006-08-10 22:52:28

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 11/08/06, Jesper Juhl <[email protected]> wrote:
> On 11/08/06, Nathan Scott <[email protected]> wrote:
> > On Thu, Aug 10, 2006 at 01:31:35PM +0200, Jesper Juhl wrote:
> > > On 08/08/06, Nathan Scott <[email protected]> wrote:
> > > ...
> > > Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> > > for ~3 hours and then blew up - not in the same way though.
> > >
> > > The machine was under pretty heavy load recieving data via rsync when
> > > the following happened :
> > >
> > > Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> > > of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> > > [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> > > [<c0103a5e>] show_trace+0xf/0x13
> > > [<c0103b59>] dump_stack+0x15/0x19
> > > [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> > > [<c0210e3f>] xfs_rename+0x64d/0x936
> > > [<c0226286>] xfs_vn_rename+0x48/0x9f
> > > [<c016584e>] vfs_rename_other+0x99/0xcb
> > > [<c0165a36>] vfs_rename+0x1b6/0x1eb
> > > [<c0165bda>] do_rename+0x16f/0x193
> > > [<c0165c45>] sys_renameat+0x47/0x73
> >
> > Thanks Jesper. Hmm, lessee - this is a cancelled dirty rename
> > transaction ... could be ondisk dir2 corruption (any chance this
> > filesystem was affected by 2.6.17's endian bug?)
>
> No. The machine in question never ran any 2.6.17.* kernels. Its old
> kernel was 2.6.11.11 (UP), then I tried 2.6.18-rc3-git3 (SMP) as
> previously reported, then I tried 2.6.18-rc4 + your XFS patch.
>
> >, or something
> > else entirely. No I/O errors in the system log earlier or anything
> > like that?
> >
> No I/O errors in the logs that I could find, no.
>
>
> > > I was doing an lvmextend +xfs_resize of a different (XFS) filesystem
> > > on the same server at roughly the same time. But I'm not sure if
> > > that's related.
> >
> > That wont be related, no.
> >
> > > I'm currently running xfs_repair on the fs that blew up.
> >
> > OK, I'd be interested to see if that reported any directory (or
> > other) issues.
> >
> It did not.
>
> What happened was this (didn't save the output sorry, so the below is
> from memory) ;
> When I ran xfs_repair it first asked me to mount the filesystem to
> replay the log, then unmount it again, then run xfs_repair again. I
> did that. No errors during that mount or umount.
> Then, when I ran xfs_repair again it ran through phases 1..n (spending
> aproximately 1hour on this) without any messages saying that something
> was wrong, so when it was done I tried mounting the fs again and it
> said it did a mount of a clean fs.
> It's been running fine since.
>
Btw, I have left the machine running with the 2.6.18-rc4 kernel and it
can keep running that for ~11hrs more (I'll of course let you know if
errors show up during the night); then I have to reboot it back to the
2.6.11.11 kernel that it is stable with, and it will need to run that
until ~Wednesday next week before I can do further experiments.
So, if you want me to test any patches I'll need to receive them
within the next 10 or so hours if I'm to have a chance to run with
them for a few hours tomorrow - otherwise testing will have to wait
until next week.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-10 23:01:27

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 11/08/06, Jesper Juhl <[email protected]> wrote:
> On 11/08/06, Nathan Scott <[email protected]> wrote:
> > On Thu, Aug 10, 2006 at 01:31:35PM +0200, Jesper Juhl wrote:
> > > On 08/08/06, Nathan Scott <[email protected]> wrote:
> > > ...
> > > Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> > > for ~3 hours and then blew up - not in the same way though.
> > >
> > > The machine was under pretty heavy load recieving data via rsync when
> > > the following happened :
> > >
> > > Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> > > of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> > > [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> > > [<c0103a5e>] show_trace+0xf/0x13
> > > [<c0103b59>] dump_stack+0x15/0x19
> > > [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> > > [<c0210e3f>] xfs_rename+0x64d/0x936
> > > [<c0226286>] xfs_vn_rename+0x48/0x9f
> > > [<c016584e>] vfs_rename_other+0x99/0xcb
> > > [<c0165a36>] vfs_rename+0x1b6/0x1eb
> > > [<c0165bda>] do_rename+0x16f/0x193
> > > [<c0165c45>] sys_renameat+0x47/0x73
> >
> > Thanks Jesper. Hmm, lessee - this is a cancelled dirty rename
> > transaction ... could be ondisk dir2 corruption (any chance this
> > filesystem was affected by 2.6.17's endian bug?)
>
> No. The machine in question never ran any 2.6.17.* kernels. Its old
> kernel was 2.6.11.11 (UP), then I tried 2.6.18-rc3-git3 (SMP) as
> previously reported, then I tried 2.6.18-rc4 + your XFS patch.
>
Small correction: the above is not 100% true. A single attempt was
made to boot the server with a 2.6.17.7 kernel, but the e1000 driver
blew up with that kernel version and hung the box before any of these
filesystems were mounted (at least that's how it appeared, since it
didn't *seem* to get any further than loading the e1000 driver - which
happens before these fs mounts). Then the machine was power-cycled and
went back to 2.6.11.11.
So I very much doubt that that single attempted 2.6.17.7 boot caused
any damage to this XFS filesystem.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-11 08:34:03

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 11/08/06, Jesper Juhl <[email protected]> wrote:
> On 11/08/06, Jesper Juhl <[email protected]> wrote:
> > On 11/08/06, Nathan Scott <[email protected]> wrote:
> > > On Thu, Aug 10, 2006 at 01:31:35PM +0200, Jesper Juhl wrote:
> > > > On 08/08/06, Nathan Scott <[email protected]> wrote:
> > > > ...
> > > > Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> > > > for ~3 hours and then blew up - not in the same way though.
> > > >
> > > > The machine was under pretty heavy load recieving data via rsync when
> > > > the following happened :
> > > >
> > > > Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> > > > of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> > > > [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> > > > [<c0103a5e>] show_trace+0xf/0x13
> > > > [<c0103b59>] dump_stack+0x15/0x19
> > > > [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> > > > [<c0210e3f>] xfs_rename+0x64d/0x936
> > > > [<c0226286>] xfs_vn_rename+0x48/0x9f
> > > > [<c016584e>] vfs_rename_other+0x99/0xcb
> > > > [<c0165a36>] vfs_rename+0x1b6/0x1eb
> > > > [<c0165bda>] do_rename+0x16f/0x193
> > > > [<c0165c45>] sys_renameat+0x47/0x73
> > >
> > > Thanks Jesper. Hmm, lessee - this is a cancelled dirty rename
> > > transaction ... could be ondisk dir2 corruption (any chance this
> > > filesystem was affected by 2.6.17's endian bug?)
> >
> > No. The machine in question never ran any 2.6.17.* kernels. Its old
> > kernel was 2.6.11.11 (UP), then I tried 2.6.18-rc3-git3 (SMP) as
> > previously reported, then I tried 2.6.18-rc4 + your XFS patch.
> >
> > >, or something
> > > else entirely. No I/O errors in the system log earlier or anything
> > > like that?
> > >
> > No I/O errors in the logs that I could find, no.
> >
> >
> > > > I was doing an lvmextend +xfs_resize of a different (XFS) filesystem
> > > > on the same server at roughly the same time. But I'm not sure if
> > > > that's related.
> > >
> > > That wont be related, no.
> > >
> > > > I'm currently running xfs_repair on the fs that blew up.
> > >
> > > OK, I'd be interested to see if that reported any directory (or
> > > other) issues.
> > >
> > It did not.
> >
> > What happened was this (didn't save the output sorry, so the below is
> > from memory) ;
> > When I ran xfs_repair it first asked me to mount the filesystem to
> > replay the log, then unmount it again, then run xfs_repair again. I
> > did that. No errors during that mount or umount.
> > Then, when I ran xfs_repair again it ran through phases 1..n (spending
> > aproximately 1hour on this) without any messages saying that something
> > was wrong, so when it was done I tried mounting the fs again and it
> > said it did a mount of a clean fs.
> > It's been running fine since.
> >
> Btw, I have left the machine running with the 2.6.18-rc4 kernel and it
> can keep running that for ~11hrs more (I'll ofcourse let you know if
> errors show up during the night)

It blew up again - same filesystem as before, plus a new one:

Filesystem "dm-31": XFS internal error xfs_trans_cancel at line 1138
of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
[<c0103a3c>] show_trace_log_lvl+0x152/0x165
[<c0103a5e>] show_trace+0xf/0x13
[<c0103b59>] dump_stack+0x15/0x19
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
[<b7e0a681>] 0xb7e0a681
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c01626bb>] exec_permission_lite+0x46/0xcd
[<c0162acb>] __link_path_walk+0x4d/0xd0a
[<c016f8df>] mntput_no_expire+0x1b/0x78
[<c01637f3>] link_path_walk+0x6b/0xc4
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c022623e>] xfs_vn_rename+0x0/0x9f
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0154e6b>] sys_fchmodat+0xc2/0xef
[<c015508f>] sys_lchown+0x50/0x52
[<c016231b>] do_getname+0x4b/0x73
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
xfs_force_shutdown(dm-31,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c. Return address = 0xc0229395
Filesystem "dm-31": Corruption of in-memory data detected. Shutting
down filesystem: dm-31
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-31,0x1) called from line 424 of file
fs/xfs/xfs_rw.c. Return address = 0xc0229395
Filesystem "dm-31": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-31
Starting XFS recovery on filesystem: dm-31 (logdev: /dev/Log/ws1_log)
Ending XFS recovery on filesystem: dm-31 (logdev: /dev/Log/ws1_log)
Filesystem "dm-31": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-31
Ending clean XFS mount for filesystem: dm-31
Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
[<c0103a3c>] show_trace_log_lvl+0x152/0x165
[<c0103a5e>] show_trace+0xf/0x13
[<c0103b59>] dump_stack+0x15/0x19
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
[<b7e0a681>] 0xb7e0a681
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c01626bb>] exec_permission_lite+0x46/0xcd
[<c0162acb>] __link_path_walk+0x4d/0xd0a
[<c016f8df>] mntput_no_expire+0x1b/0x78
[<c01637f3>] link_path_walk+0x6b/0xc4
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c022623e>] xfs_vn_rename+0x0/0x9f
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0154e6b>] sys_fchmodat+0xc2/0xef
[<c015508f>] sys_lchown+0x50/0x52
[<c016231b>] do_getname+0x4b/0x73
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
xfs_force_shutdown(dm-51,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c. Return address = 0xc0229395
Filesystem "dm-51": Corruption of in-memory data detected. Shutting
down filesystem: dm-51
Please umount the filesystem, and rectify the problem(s)
Filesystem "dm-31": XFS internal error xfs_trans_cancel at line 1138
of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
[<c0103a3c>] show_trace_log_lvl+0x152/0x165
[<c0103a5e>] show_trace+0xf/0x13
[<c0103b59>] dump_stack+0x15/0x19
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
[<b7e0a681>] 0xb7e0a681
[<c0213474>] xfs_trans_cancel+0xcf/0xf8
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0210e3f>] xfs_rename+0x64d/0x936
[<c0226286>] xfs_vn_rename+0x48/0x9f
[<c01626bb>] exec_permission_lite+0x46/0xcd
[<c0162acb>] __link_path_walk+0x4d/0xd0a
[<c016f8df>] mntput_no_expire+0x1b/0x78
[<c01637f3>] link_path_walk+0x6b/0xc4
[<c016584e>] vfs_rename_other+0x99/0xcb
[<c022623e>] xfs_vn_rename+0x0/0x9f
[<c0165a36>] vfs_rename+0x1b6/0x1eb
[<c0165bda>] do_rename+0x16f/0x193
[<c0154e6b>] sys_fchmodat+0xc2/0xef
[<c015508f>] sys_lchown+0x50/0x52
[<c016231b>] do_getname+0x4b/0x73
[<c0165c45>] sys_renameat+0x47/0x73
[<c0165c98>] sys_rename+0x27/0x2b
[<c0102ae3>] syscall_call+0x7/0xb
xfs_force_shutdown(dm-31,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c. Return address = 0xc0229395
Filesystem "dm-31": Corruption of in-memory data detected. Shutting
down filesystem: dm-31
Please umount the filesystem, and rectify the problem(s)


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-11 10:25:12

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 11/08/06, Jesper Juhl <[email protected]> wrote:
> On 11/08/06, Jesper Juhl <[email protected]> wrote:
> > On 11/08/06, Jesper Juhl <[email protected]> wrote:
> > > On 11/08/06, Nathan Scott <[email protected]> wrote:
> > > > On Thu, Aug 10, 2006 at 01:31:35PM +0200, Jesper Juhl wrote:
> > > > > On 08/08/06, Nathan Scott <[email protected]> wrote:
> > > > > ...
> > > > > Ok, I booted the server with 2.6.18-rc4 + your patch. Things went well
> > > > > for ~3 hours and then blew up - not in the same way though.
> > > > >
> > > > > The machine was under pretty heavy load recieving data via rsync when
> > > > > the following happened :
> > > > >
> > > > > Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> > > > > of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> > > > > [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> > > > > [<c0103a5e>] show_trace+0xf/0x13
> > > > > [<c0103b59>] dump_stack+0x15/0x19
> > > > > [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> > > > > [<c0210e3f>] xfs_rename+0x64d/0x936
> > > > > [<c0226286>] xfs_vn_rename+0x48/0x9f
> > > > > [<c016584e>] vfs_rename_other+0x99/0xcb
> > > > > [<c0165a36>] vfs_rename+0x1b6/0x1eb
> > > > > [<c0165bda>] do_rename+0x16f/0x193
> > > > > [<c0165c45>] sys_renameat+0x47/0x73
> > > >
> > > > Thanks Jesper. Hmm, lessee - this is a cancelled dirty rename
> > > > transaction ... could be ondisk dir2 corruption (any chance this
> > > > filesystem was affected by 2.6.17's endian bug?)
> > >
> > > No. The machine in question never ran any 2.6.17.* kernels. Its old
> > > kernel was 2.6.11.11 (UP), then I tried 2.6.18-rc3-git3 (SMP) as
> > > previously reported, then I tried 2.6.18-rc4 + your XFS patch.
> > >
> > > >, or something
> > > > else entirely. No I/O errors in the system log earlier or anything
> > > > like that?
> > > >
> > > No I/O errors in the logs that I could find, no.
> > >
> > >
> > > > > I was doing an lvmextend +xfs_resize of a different (XFS) filesystem
> > > > > on the same server at roughly the same time. But I'm not sure if
> > > > > that's related.
> > > >
> > > > That wont be related, no.
> > > >
> > > > > I'm currently running xfs_repair on the fs that blew up.
> > > >
> > > > OK, I'd be interested to see if that reported any directory (or
> > > > other) issues.
> > > >
> > > It did not.
> > >
> > > What happened was this (didn't save the output sorry, so the below is
> > > from memory) ;
> > > When I ran xfs_repair it first asked me to mount the filesystem to
> > > replay the log, then unmount it again, then run xfs_repair again. I
> > > did that. No errors during that mount or umount.
> > > Then, when I ran xfs_repair again it ran through phases 1..n (spending
> > > aproximately 1hour on this) without any messages saying that something
> > > was wrong, so when it was done I tried mounting the fs again and it
> > > said it did a mount of a clean fs.
> > > It's been running fine since.
> > >
> > Btw, I have left the machine running with the 2.6.18-rc4 kernel and it
> > can keep running that for ~11hrs more (I'll ofcourse let you know if
> > errors show up during the night)
>
> It blew up again. Same filesystem again + a new one :
>
> Filesystem "dm-31": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> [<c0103a5e>] show_trace+0xf/0x13
> [<c0103b59>] dump_stack+0x15/0x19
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> [<b7e0a681>] 0xb7e0a681
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c01626bb>] exec_permission_lite+0x46/0xcd
> [<c0162acb>] __link_path_walk+0x4d/0xd0a
> [<c016f8df>] mntput_no_expire+0x1b/0x78
> [<c01637f3>] link_path_walk+0x6b/0xc4
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c022623e>] xfs_vn_rename+0x0/0x9f
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0154e6b>] sys_fchmodat+0xc2/0xef
> [<c015508f>] sys_lchown+0x50/0x52
> [<c016231b>] do_getname+0x4b/0x73
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> xfs_force_shutdown(dm-31,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0xc0229395
> Filesystem "dm-31": Corruption of in-memory data detected. Shutting
> down filesystem: dm-31
> Please umount the filesystem, and rectify the problem(s)
> xfs_force_shutdown(dm-31,0x1) called from line 424 of file
> fs/xfs/xfs_rw.c. Return address = 0xc0229395
> Filesystem "dm-31": Disabling barriers, not supported with external log device
> XFS mounting filesystem dm-31
> Starting XFS recovery on filesystem: dm-31 (logdev: /dev/Log/ws1_log)
> Ending XFS recovery on filesystem: dm-31 (logdev: /dev/Log/ws1_log)
> Filesystem "dm-31": Disabling barriers, not supported with external log device
> XFS mounting filesystem dm-31
> Ending clean XFS mount for filesystem: dm-31
> Filesystem "dm-51": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> [<c0103a5e>] show_trace+0xf/0x13
> [<c0103b59>] dump_stack+0x15/0x19
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> [<b7e0a681>] 0xb7e0a681
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c01626bb>] exec_permission_lite+0x46/0xcd
> [<c0162acb>] __link_path_walk+0x4d/0xd0a
> [<c016f8df>] mntput_no_expire+0x1b/0x78
> [<c01637f3>] link_path_walk+0x6b/0xc4
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c022623e>] xfs_vn_rename+0x0/0x9f
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0154e6b>] sys_fchmodat+0xc2/0xef
> [<c015508f>] sys_lchown+0x50/0x52
> [<c016231b>] do_getname+0x4b/0x73
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> xfs_force_shutdown(dm-51,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0xc0229395
> Filesystem "dm-51": Corruption of in-memory data detected. Shutting
> down filesystem: dm-51
> Please umount the filesystem, and rectify the problem(s)
> Filesystem "dm-31": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xc0210e3f
> [<c0103a3c>] show_trace_log_lvl+0x152/0x165
> [<c0103a5e>] show_trace+0xf/0x13
> [<c0103b59>] dump_stack+0x15/0x19
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> [<b7e0a681>] 0xb7e0a681
> [<c0213474>] xfs_trans_cancel+0xcf/0xf8
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0210e3f>] xfs_rename+0x64d/0x936
> [<c0226286>] xfs_vn_rename+0x48/0x9f
> [<c01626bb>] exec_permission_lite+0x46/0xcd
> [<c0162acb>] __link_path_walk+0x4d/0xd0a
> [<c016f8df>] mntput_no_expire+0x1b/0x78
> [<c01637f3>] link_path_walk+0x6b/0xc4
> [<c016584e>] vfs_rename_other+0x99/0xcb
> [<c022623e>] xfs_vn_rename+0x0/0x9f
> [<c0165a36>] vfs_rename+0x1b6/0x1eb
> [<c0165bda>] do_rename+0x16f/0x193
> [<c0154e6b>] sys_fchmodat+0xc2/0xef
> [<c015508f>] sys_lchown+0x50/0x52
> [<c016231b>] do_getname+0x4b/0x73
> [<c0165c45>] sys_renameat+0x47/0x73
> [<c0165c98>] sys_rename+0x27/0x2b
> [<c0102ae3>] syscall_call+0x7/0xb
> xfs_force_shutdown(dm-31,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0xc0229395
> Filesystem "dm-31": Corruption of in-memory data detected. Shutting
> down filesystem: dm-31
> Please umount the filesystem, and rectify the problem(s)
>

I didn't capture all of the xfs_repair output, but I did get this :

- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- clear lost+found (if it exists) ...
- clearing existing "lost+found" inode
- deleting existing "lost+found" entry
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
LEAFN node level is 1 inode 412035424 bno = 8388608
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-14 02:01:20

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
> I didn't capture all of the xfs_repair output, but I did get this :
> ...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - clear lost+found (if it exists) ...
> - clearing existing "lost+found" inode
> - deleting existing "lost+found" entry
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - agno = 5
> - agno = 6
> LEAFN node level is 1 inode 412035424 bno = 8388608

Ooh. Can you describe this test case you're using? Something with
a bunch of renames in it, obviously, but I'd also like to be able to
reproduce locally with the exact data set (file names in particular),
if at all possible.

thanks!

--
Nathan

2006-08-14 07:49:12

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 14/08/06, Nathan Scott <[email protected]> wrote:
> On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
> > I didn't capture all of the xfs_repair output, but I did get this :
> > ...
> > Phase 4 - check for duplicate blocks...
> > - setting up duplicate extent list...
> > - clear lost+found (if it exists) ...
> > - clearing existing "lost+found" inode
> > - deleting existing "lost+found" entry
> > - check for inodes claiming duplicate blocks...
> > - agno = 0
> > - agno = 1
> > - agno = 2
> > - agno = 3
> > - agno = 4
> > - agno = 5
> > - agno = 6
> > LEAFN node level is 1 inode 412035424 bno = 8388608
>
> Ooh. Can you describe this test case you're using?

Sure.

The server has a bunch of XFS filesystems :

# mount
/dev/md1 on / type xfs (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/md0 on /boot type ext3 (rw)
/dev/mapper/Archive-Backup on /mnt/backup type xfs (rw)
/dev/mapper/Mirror-ws1 on /mnt/rsync/ws1 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws1_log)
/dev/mapper/Mirror-ws2 on /mnt/rsync/ws2 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws2_log)
/dev/mapper/Mirror-ws3 on /mnt/rsync/ws3 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws3_log)
/dev/mapper/Mirror-ws4 on /mnt/rsync/ws4 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws4_log)
/dev/mapper/Mirror-ws5 on /mnt/rsync/ws5 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws5_log)
/dev/mapper/Mirror-ws6 on /mnt/rsync/ws6 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws6_log)
/dev/mapper/Mirror-ws7 on /mnt/rsync/ws7 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws7_log)
/dev/mapper/Mirror-ws8 on /mnt/rsync/ws8 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws8_log)
/dev/mapper/Mirror-ws9 on /mnt/rsync/ws9 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws9_log)
/dev/mapper/Mirror-ws10 on /mnt/rsync/ws10 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws10_log)
/dev/mapper/Mirror-ws11 on /mnt/rsync/ws11 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws11_log)
/dev/mapper/Mirror-ws12 on /mnt/rsync/ws12 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws12_log)
/dev/mapper/Mirror-ws13 on /mnt/rsync/ws13 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws13_log)
/dev/mapper/Mirror-ws14 on /mnt/rsync/ws14 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws14_log)
/dev/mapper/Mirror-ws15 on /mnt/rsync/ws15 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws15_log)
/dev/mapper/Mirror-ws16 on /mnt/rsync/ws16 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws16_log)
/dev/mapper/Mirror-ws17 on /mnt/rsync/ws17 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws17_log)
/dev/mapper/Mirror-ws18 on /mnt/rsync/ws18 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws18_log)
/dev/mapper/Mirror-ws19 on /mnt/rsync/ws19 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws19_log)
/dev/mapper/Mirror-ws20 on /mnt/rsync/ws20 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws20_log)
/dev/mapper/Mirror-ws21 on /mnt/rsync/ws21 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/ws21_log)
/dev/mapper/Mirror-wsb1 on /mnt/rsync/wsb1 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsb1_log)
/dev/mapper/Mirror-wsb2 on /mnt/rsync/wsb2 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsb2_log)
/dev/mapper/Mirror-wsb3 on /mnt/rsync/wsb3 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsb3_log)
/dev/mapper/Mirror-wsb4 on /mnt/rsync/wsb4 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsb4_log)
/dev/mapper/Mirror-wsp1 on /mnt/rsync/wsp1 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsp1_log)
/dev/mapper/Mirror-wsp2 on /mnt/rsync/wsp2 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsp2_log)
/dev/mapper/Mirror-wsp3 on /mnt/rsync/wsp3 type xfs
(rw,noatime,ihashsize=64433,logdev=/dev/Log/wsp3_log)
/dev/mapper/Mirror-obr1 on /mnt/ob/obr1 type xfs (rw,noatime,ihashsize=64433)
tmpfs on /dev type tmpfs (rw,size=10M,mode=0755)

These filesystems vary in size from 50G to 3.5T

The XFS filesystems contain rsync copies of filesystems on as many servers.
The workload that triggers the problem is all of those servers starting
to update their copies via rsync - within a few hours the problem
triggers.

So to recreate the same scenario you'll want 28 servers doing rsync of
filesystems of various sizes between 50G & 3.5T to a central server
running 2.6.18-rc4 with 28 XFS filesystems.
The XFS filesystems are on LVM and each physical volume is made up of
two disks in a RAID1.


> Something with
> a bunch of renames in it, obviously, but I'd also like to be able to
> reproduce locally with the exact data set (file names in particular),
> if at all possible.
>
There are millions of files. The data the server receives is copies of
websites. Each server that sends data to the server with the 28 XFS
filesystems hosts between 1800 and 2600 websites, so there are lots of
files and every conceivable strange filename.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-15 09:03:58

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Mon, Aug 14, 2006 at 09:49:10AM +0200, Jesper Juhl wrote:
> On 14/08/06, Nathan Scott <[email protected]> wrote:
> > > LEAFN node level is 1 inode 412035424 bno = 8388608
> >
> > Ooh. Can you describe this test case you're using?
> ...
> These filesystems vary in size from 50G to 3.5T
>
> The XFS filesystems contain rsync copies of filesystems on as many servers.
> The workload that triggers the problem is when all the servers start
> updating their copy via rsync - Then within a few hours the problem
> triggers.
>
> So to recreate the same scenario you'll want 28 servers doing rsync of
> filesystems of various sizes between 50G & 3.5T to a central server
> running 2.6.18-rc4 with 28 XFS filesystems.
> ...
> There are millions of files. The data the server recievs is copies of
> websites. Each server that sends data to the server with the 28 XFS
> filesystems hosts between 1800 and 2600 websites, so there are lots of
> files and every concievable strange filename.

Wow, a special kind of hell for a filesystem developer...! ;-)

It's not clear to me where the rename operation happens in all of
this - does rsync create a local, temporary copy of the file and
then rename it? Is it the central server that's going down, or one of
those 28 other server machines? (I assume it's the central one; I can't
see an opportunity for renaming out there...).

When you hit it again, could you grab the contents of the inode
(you'll get that from xfs_repair -n, e.g. 412035424 above) with
xfs_db (see last entry in the XFS FAQ which describes how to do
that), then mail that to me please? If you can get the source
and target names in the rename that'll help a lot too... I can
explain how to use KDB to get that, but maybe you have another
debugger handy already?
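
A rough sketch of the xfs_db side (read-only, with that filesystem
unmounted; the device name below is just a placeholder):

  # xfs_db -r /dev/mapper/Mirror-wsNN
  xfs_db> inode 412035424
  xfs_db> print

That prints the core fields of the inode xfs_repair complained about.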

thanks!

--
Nathan

2006-08-15 11:42:29

by Jesper Juhl

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 15/08/06, Nathan Scott <[email protected]> wrote:
> On Mon, Aug 14, 2006 at 09:49:10AM +0200, Jesper Juhl wrote:
> > On 14/08/06, Nathan Scott <[email protected]> wrote:
> > > > LEAFN node level is 1 inode 412035424 bno = 8388608
> > >
> > > Ooh. Can you describe this test case you're using?
> > ...
> > These filesystems vary in size from 50G to 3.5T
> >
> > The XFS filesystems contain rsync copies of filesystems on as many servers.
> > The workload that triggers the problem is when all the servers start
> > updating their copy via rsync - Then within a few hours the problem
> > triggers.
> >
> > So to recreate the same scenario you'll want 28 servers doing rsync of
> > filesystems of various sizes between 50G & 3.5T to a central server
> > running 2.6.18-rc4 with 28 XFS filesystems.
> > ...
> > There are millions of files. The data the server recievs is copies of
> > websites. Each server that sends data to the server with the 28 XFS
> > filesystems hosts between 1800 and 2600 websites, so there are lots of
> > files and every concievable strange filename.
>
> Wow, a special kind of hell for a filesystem developer...! ;-)
>
> Its not clear to me where the rename operation happens in all of
> this - does rsync create a local, temporary copy of the file and
> then rename it?

I'm not sure. I'll investigate and see if I can work out exactly what
rsync does.

> Is it that central server going down or one of
> those 28 other server machines? (I assume it is, I can't see an
> opportunity for renaming out there...).
>
It's the central server with all the XFS filesystems that dies - the
one receiving all the rsync data.


> When you hit it again, could you grab the contents of the inode
> (you'll get that from xfs_repair -n, e.g. 412035424 above) with
> xfs_db (see last entry in the XFS FAQ which describes how to do
> that), then mail that to me please?

Sure, I'll read up on that and make sure to grab that info next time.


> If you can get the source
> and target names in the rename that'll help alot too... I can
> explain how to use KDB to get that, but maybe you have another
> debugger handy already?
>
An explanation of exactly how to do that would be greatly appreciated.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-15 14:37:14

by Chris Wedgwood

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Tue, Aug 15, 2006 at 07:03:43PM +1000, Nathan Scott wrote:

> Its not clear to me where the rename operation happens in all of
> this - does rsync create a local, temporary copy of the file and
> then rename it?

Yes, this is normally how rsync does it.

2006-08-15 14:51:11

by Jan Engelhardt

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078


>> Its not clear to me where the rename operation happens in all of
>> this - does rsync create a local, temporary copy of the file and
>> then rename it?
>
>Yes, this is normally how rsync does it.

If the file already exists {
    foreach block {
        copy the block, either from the existing file on disk or from
        the source operand, whichever is newer, to a temp file
    }
}

When rsync catches a signal {
    move the tempfile to the original name
}
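
For reference, a minimal C sketch of that per-file pattern (hypothetical
names and simplified error handling, not rsync's actual code); the final
rename() is the call that ends up in xfs_rename() on the receiving server:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the new contents to a temporary file in the same directory,
 * then rename it over the destination - roughly the "write to temp,
 * then rename" pattern described above.
 */
static int replace_file(const char *dest, const char *data, size_t len)
{
    char tmp[4096];
    int fd;

    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", dest);
    fd = mkstemp(tmp);          /* temp file lives next to the target */
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    if (rename(tmp, dest) != 0) {   /* the rename() seen in the traces */
        unlink(tmp);
        return -1;
    }
    return 0;
}

int main(void)
{
    const char *text = "new file contents\n";

    return replace_file("example.txt", text, strlen(text)) ? 1 : 0;
}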





Jan Engelhardt
--

2006-08-16 01:26:48

by Nathan Scott

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Tue, Aug 15, 2006 at 01:42:27PM +0200, Jesper Juhl wrote:
> On 15/08/06, Nathan Scott <[email protected]> wrote:
> > If you can get the source
> > and target names in the rename that'll help alot too... I can
> > explain how to use KDB to get that, but maybe you have another
> > debugger handy already?
> >
> An explanation of how exactely to do that would be greatly appreciated.

- patch in KDB
- echo 127 > /proc/sys/fs/xfs/panic_mask
[ filesystem shutdown now == panic ]
- kdb> bt
[ pick out parameters to rename from the backtrace ]
- kdb> md 0xXXX
[ gives a memory dump of the pointers to pathnames ]

cheers.

--
Nathan

2006-08-16 12:38:13

by Paul Slootman

Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

Nathan Scott <[email protected]> wrote:
>On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
>> I didn't capture all of the xfs_repair output, but I did get this :
>> ...
>> Phase 4 - check for duplicate blocks...
>> - setting up duplicate extent list...
>> - clear lost+found (if it exists) ...
>> - clearing existing "lost+found" inode
>> - deleting existing "lost+found" entry
>> - check for inodes claiming duplicate blocks...
>> - agno = 0
>> - agno = 1
>> - agno = 2
>> - agno = 3
>> - agno = 4
>> - agno = 5
>> - agno = 6
>> LEAFN node level is 1 inode 412035424 bno = 8388608
>
>Ooh. Can you describe this test case you're using? Something with
>a bunch of renames in it, obviously, but I'd also like to be able to
>reproduce locally with the exact data set (file names in particular),
>if at all possible.

From your reaction above I gather that "LEAFN node level is 1 inode ..."
is a bad thing?

My filesystem (that crashes under heavy load, while rsyncing to and from
it) has a lot of these messages when xfs_repair is run.

Note that I've now put an older kernel on the system (2.6.15.6) and it
seems to be surviving longer than it did before with 2.6.17.7. It would be
nice if it survived a day, as it's a backup server for a couple of
important things...

(See also my messages to the xfs list, subject "cache_purge: shake on
cache 0x5880a0 left 8 nodes!?" and "XFS internal error
XFS_WANT_CORRUPTED_GOTO".)


Paul Slootman

2006-08-16 22:48:07

by Nathan Scott

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Wed, Aug 16, 2006 at 12:38:10PM +0000, Paul Slootman wrote:
> Nathan Scott <[email protected]> wrote:
> >On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
> >> I didn't capture all of the xfs_repair output, but I did get this :
> >> ...
> >> Phase 4 - check for duplicate blocks...
> >> - setting up duplicate extent list...
> >> - clear lost+found (if it exists) ...
> >> - clearing existing "lost+found" inode
> >> - deleting existing "lost+found" entry
> >> - check for inodes claiming duplicate blocks...
> >> - agno = 0
> >> - agno = 1
> >> - agno = 2
> >> - agno = 3
> >> - agno = 4
> >> - agno = 5
> >> - agno = 6
> >> LEAFN node level is 1 inode 412035424 bno = 8388608
> >
> >Ooh. Can you describe this test case you're using? Something with
> >a bunch of renames in it, obviously, but I'd also like to be able to
> >reproduce locally with the exact data set (file names in particular),
> >if at all possible.
>
> From your reaction above I gather that "LEAFN node level is 1 inode ..."
> is a bad thing?
>
> My filesystem (that crashes under heavy load, while rsyncing to and from
> it) has a lot of these messages when xfs_repair is run.

Do you have a reproducible test case? Please send a go-to-woe recipe
so I can see the problem first hand... and preferably one that is, er,
slightly simpler than Jesper's case.

thanks.

--
Nathan

2006-08-17 09:02:14

by Paul Slootman

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Thu 17 Aug 2006, Nathan Scott wrote:
> On Wed, Aug 16, 2006 at 12:38:10PM +0000, Paul Slootman wrote:
> > Nathan Scott <[email protected]> wrote:
> > >On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
> > >> I didn't capture all of the xfs_repair output, but I did get this :
> > >> ...
> > >> Phase 4 - check for duplicate blocks...
> > >> - setting up duplicate extent list...
> > >> - clear lost+found (if it exists) ...
> > >> - clearing existing "lost+found" inode
> > >> - deleting existing "lost+found" entry
> > >> - check for inodes claiming duplicate blocks...
> > >> - agno = 0
> > >> - agno = 1
> > >> - agno = 2
> > >> - agno = 3
> > >> - agno = 4
> > >> - agno = 5
> > >> - agno = 6
> > >> LEAFN node level is 1 inode 412035424 bno = 8388608
> > >
> > >Ooh. Can you describe this test case you're using? Something with
> > >a bunch of renames in it, obviously, but I'd also like to be able to
> > >reproduce locally with the exact data set (file names in particular),
> > >if at all possible.
> >
> > From your reaction above I gather that "LEAFN node level is 1 inode ..."
> > is a bad thing?
> >
> > My filesystem (that crashes under heavy load, while rsyncing to and from
> > it) has a lot of these messages when xfs_repair is run.
>
> Do you have a reproducible test case? Please send a go-to-woe recipe
> so I can see the problem first hand... and preferably one that is, er,
> slightly simpler than Jesper's case.

Unfortunately no; this is a 1.1TB filesystem with 54% usage, and dozens
of large rsyncs to and from it. However, it's during this activity that
XFS panics.

That was with 2.6.17.7 (after 2.6.17.4 had buggered it with the endian
bug, but after numerous xfs_repairs). Interestingly I rebooted into an
old 2.6.15.6 kernel yesterday after the last XFS crash, and it survived
last night's activities perfectly well. After a couple of days I'm
willing to give the latest 2.6.18-rc or whatever a try (once I've a
complete set of backups again, and they've been passed on to the
long-term backup system).


Paul Slootman

2006-08-17 21:23:42

by Jesper Juhl

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On 16/08/06, Nathan Scott <[email protected]> wrote:
> On Tue, Aug 15, 2006 at 01:42:27PM +0200, Jesper Juhl wrote:
> > On 15/08/06, Nathan Scott <[email protected]> wrote:
> > > If you can get the source
> > > and target names in the rename that'll help a lot too... I can
> > > explain how to use KDB to get that, but maybe you have another
> > > debugger handy already?
> > >
> > > An explanation of how exactly to do that would be greatly appreciated.
>
> - patch in KDB
> - echo 127 > /proc/sys/fs/xfs/panic_mask
> [ filesystem shutdown now == panic ]
> - kdb> bt
> [ pick out parameters to rename from the backtrace ]
> - kdb> md 0xXXX
> [ gives a memory dump of the pointers to pathnames ]
>

Thanks a lot for the explanation.

Unfortunately I didn't get a chance to run new tests on the server
this week (always the big problem when it's a production machine).
I'm also going on a short vacation, so I won't have the opportunity to
try to recreate a simpler test case at home for the next few days.

When I get back (in some 4 days' time) I'll try to build a simpler
test case, and in about a week or so I will hopefully get another chance
to run tests on the server that has shown the problem so far.
If there are additional tests you want me to run or data you want me
to collect, then let me know and I'll do so the first chance I get.

I'll be back in touch in ~1 week's time.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-08-23 08:42:34

by Paul Slootman

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Thu 17 Aug 2006, Paul Slootman wrote:
> On Thu 17 Aug 2006, Nathan Scott wrote:
> > On Wed, Aug 16, 2006 at 12:38:10PM +0000, Paul Slootman wrote:
> > > Nathan Scott <[email protected]> wrote:
> > > >On Fri, Aug 11, 2006 at 12:25:03PM +0200, Jesper Juhl wrote:
> > > >> I didn't capture all of the xfs_repair output, but I did get this :
> > > >> ...
> > > >> Phase 4 - check for duplicate blocks...
> > > >> - setting up duplicate extent list...
> > > >> - clear lost+found (if it exists) ...
> > > >> - clearing existing "lost+found" inode
> > > >> - deleting existing "lost+found" entry
> > > >> - check for inodes claiming duplicate blocks...
> > > >> - agno = 0
> > > >> - agno = 1
> > > >> - agno = 2
> > > >> - agno = 3
> > > >> - agno = 4
> > > >> - agno = 5
> > > >> - agno = 6
> > > >> LEAFN node level is 1 inode 412035424 bno = 8388608
> > > >
> > > >Ooh. Can you describe this test case you're using? Something with
> > > >a bunch of renames in it, obviously, but I'd also like to be able to
> > > >reproduce locally with the exact data set (file names in particular),
> > > >if at all possible.
> > >
> > > From your reaction above I gather that "LEAFN node level is 1 inode ..."
> > > is a bad thing?
> > >
> > > My filesystem (that crashes under heavy load, while rsyncing to and from
> > > it) has a lot of these messages when xfs_repair is run.
> >
> > Do you have a reproducible test case? Please send a go-to-woe recipe
> > so I can see the problem first hand... and preferably one that is, er,
> > slightly simpler than Jesper's case.
>
> Unfortunately no; this is a 1.1TB filesystem with 54% usage, and dozens
> of large rsyncs to and from it. However, it's during this activity that
> XFS panics.
>
> That was with 2.6.17.7 (after 2.6.17.4 had buggered it with the endian
> bug, but after numerous xfs_repairs). Interestingly I rebooted into an
> old 2.6.15.6 kernel yesterday after the last XFS crash, and it survived
> last night's activities perfectly well. After a couple of days I'm
> willing to give the latest 2.6.18-rc or whatever a try (once I've a
> complete set of backups again, and they've been passed on to the
> long-term backup system).

I compiled 2.6.17.9 yesterday with gcc 4.1 (the previous kernel that
showed problems was 2.6.17.7 compiled with gcc 3.3.5), and the same
problem showed itself again, after 2.6.15.6 had run with no problems
whatsoever for 5 days.

I'll now give 2.6.16.1 a go (we have that kernel lying around :-)


BTW, what's the significance of the xfs_repair message
LEAFN node level is 1 inode 827198 bno = 8388608
(I see a lot more of these this time round).


Paul Slootman

2006-08-24 06:55:48

by Nathan Scott

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Wed, Aug 23, 2006 at 10:42:10AM +0200, Paul Slootman wrote:
>
> I compiled 2.6.17.9 yesterday with gcc 4.1 (the previous kernel that
> showed problems was 2.6.17.7 compiled with gcc 3.3.5), and the same
> problem showed itself again, after 2.6.15.6 had run with no problems
> whatsoever for 5 days.
>
> I'll now give 2.6.16.1 a go (we have that kernel lying around :-)

Hmm, if there's no reproducible case, next best thing is a git bisect
to try to identify potential commits which are causing the problem...
not easy on your production server, I know. I had believed this to be
a long-standing issue, though; I'm sure I've seen it reported before,
but we've never had any information to go on to diagnose it.
Jesper's rename hint is the most helpful information we've had so far.
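
If you do get a window to try it, the bisect itself is mechanical;
roughly, in a clone of Linus' tree (the -stable point releases aren't
tagged there, so the nearest mainline tags have to stand in for the
kernels you actually ran):

    git bisect start
    git bisect bad v2.6.17     # a kernel in the range that shows the problem
    git bisect good v2.6.15    # the newest kernel known to survive
    # build and boot the commit git suggests, run the rsync load on it,
    # then mark the result and repeat until the culprit is isolated:
    git bisect good            # or: git bisect bad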

> BTW, what's the significance of the xfs_repair message
> LEAFN node level is 1 inode 827198 bno = 8388608
> (I see a lot more of these this time round).

It basically means a directory inode's btree has got into an invalid
state, somehow.
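
If you want to poke at one of those directories yourself, xfs_db can
show the inode that xfs_repair complained about (the device name below
is only an example, and ideally the filesystem should be unmounted):

    # read-only look at the directory inode from the repair output
    xfs_db -r -c "inode 827198" -c "print" /dev/sdb1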

cheers.

--
Nathan

2006-08-24 09:07:27

by Paul Slootman

[permalink] [raw]
Subject: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078

On Thu 24 Aug 2006, Nathan Scott wrote:
> On Wed, Aug 23, 2006 at 10:42:10AM +0200, Paul Slootman wrote:
> >
> > I compiled 2.6.17.9 yesterday with gcc 4.1 (the previous kernel that
> > showed problems was 2.6.17.7 compiled with gcc 3.3.5), and the same
> > problem showed itself again, after 2.6.15.6 had run with no problems
> > whatsoever for 5 days.
> >
> > I'll now give 2.6.16.1 a go (we have that kernel lying around :-)

That also fell over.

> Hmm, if there's no reproducible case, next best thing is a git bisect
> to try to identify potential commits which are causing the problem...
> not easy on your production server, I know. I had believed this to be
> a long-standing issue, though; I'm sure I've seen it reported before,
> but we've never had any information to go on to diagnose it.
> Jesper's rename hint is the most helpful information we've had so far.

To me it seems to have happened somewhere between 2.6.15.6 and 2.6.16.1 :)
I'll have to boot with 2.6.15.6 again and run that for a couple of days;
after that I'll try some kernel 2.6.15.6 < x < 2.6.16.1 (any
suggestions?)


> > BTW, what's the significance of the xfs_repair message
> > LEAFN node level is 1 inode 827198 bno = 8388608
> > (I see a lot more of these this time round).
>
> It basically means a directory inode's btree has got into an invalid
> state, somehow.

Hmm, what version of xfs_repair is supposed to fix that? Neither
the 2.6.20 version that comes with Debian nor the CVS version of August
10, which calls itself 2.8.11, makes it go away (i.e. multiple runs
persistently show those lines).
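
(For reference, the check I'm doing is essentially the following; the
mount point and device name are just examples:)

    umount /backup             # mount point is only an example
    xfs_repair /dev/sdb1       # device name likewise
    xfs_repair -n /dev/sdb1    # a no-modify re-check still reports the
                               # same LEAFN lines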


thanks,
Paul Slootman