2010-07-08 14:33:05

by Johannes Hirte

[permalink] [raw]
Subject: kernel BUG at fs/btrfs/extent-tree.c:1353

When doing a 'rm -r /var/tmp/portage/sys-devel' I get the following Oops:

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent-tree.c:1353!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_full
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc
sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp
hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel
snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat
mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801
snd_page_alloc uhci_hcd battery [last unloaded: microcode]

Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912
EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1
EIP is at lookup_inline_extent_backref+0xf2/0x406
EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000
ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000)
Stack:
f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00
<0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004 00000000
<0> 00000850 040e9870 a8000000 00001000 00000000 00000007 00000000 0e987000
Call Trace:
[<c10ec226>] ? set_extent_dirty+0x19/0x1d
[<c10c5081>] ? __btrfs_free_extent+0xda/0x675
[<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
[<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
[<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
[<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
[<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
[<c10db480>] ? btrfs_delete_inode+0x166/0x17e
[<c102280d>] ? get_parent_ip+0x8/0x19
[<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
[<c108f5b3>] ? iput+0x46/0x48
[<c10893a8>] ? do_unlinkat+0xc7/0x109
[<c102280d>] ? get_parent_ip+0x8/0x19
[<c10822e3>] ? fput+0x12/0x15c
[<c10a2f30>] ? dnotify_flush+0x41/0xc2
[<c107fe85>] ? filp_close+0x4c/0x52
[<c107feed>] ? sys_close+0x62/0x9b
[<c1002550>] ? sysenter_do_call+0x12/0x26
Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54 24 38 e8
01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b 8b 04 24 8b 34 24
8b 00 8b 56 20 89 44 24 08 e8 2e fa ff
EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP 0068:f5aa5d08
---[ end trace d97601f0b455ca72 ]---
note: rm[2358] exited with preempt_count 2
BUG: scheduling while atomic: rm/2358/0x10000003
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc
sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp
hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel
snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat
mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801
snd_page_alloc uhci_hcd battery [last unloaded: microcode]
Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32
Call Trace:
[<c12de6b3>] ? schedule+0x88/0x332
[<c10237c1>] ? __cond_resched+0xf/0x19
[<c12de9e2>] ? _cond_resched+0x12/0x18
[<c106ceec>] ? unmap_vmas+0x4e7/0x534
[<c1070c8f>] ? exit_mmap+0x64/0xa4
[<c1026089>] ? mmput+0x21/0x96
[<c102938e>] ? exit_mm+0xe7/0xf0
[<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24
[<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a
[<c102a42e>] ? do_exit+0x17b/0x57d
[<c1028e78>] ? kmsg_dump+0x81/0xf9
[<c1002d06>] ? do_invalid_op+0x0/0x76
[<c1004fa0>] ? oops_end+0x72/0x75
[<c1002d6f>] ? do_invalid_op+0x69/0x76
[<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
[<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150
[<c10bcf30>] ? btrfs_cow_block+0x106/0x112
[<c10bdcdc>] ? bin_search+0x37/0x3d
[<c10bfe33>] ? btrfs_search_slot+0x405/0x477
[<c12e031a>] ? error_code+0x66/0x6c
[<c1002d06>] ? do_invalid_op+0x0/0x76
[<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
[<c10ec226>] ? set_extent_dirty+0x19/0x1d
[<c10c5081>] ? __btrfs_free_extent+0xda/0x675
[<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
[<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
[<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
[<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
[<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
[<c10db480>] ? btrfs_delete_inode+0x166/0x17e
[<c102280d>] ? get_parent_ip+0x8/0x19
[<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
[<c108f5b3>] ? iput+0x46/0x48
[<c10893a8>] ? do_unlinkat+0xc7/0x109
[<c102280d>] ? get_parent_ip+0x8/0x19
[<c10822e3>] ? fput+0x12/0x15c
[<c10a2f30>] ? dnotify_flush+0x41/0xc2
[<c107fe85>] ? filp_close+0x4c/0x52
[<c107feed>] ? sys_close+0x62/0x9b
[<c1002550>] ? sysenter_do_call+0x12/0x26


2010-07-08 15:01:30

by Chris Mason

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Neither Yan nor I have been able to reproduce this locally, but a few
people have now hit it. Johannes, are you available to try out a
debugging kernel to try and track this down?

-chris

On Thu, Jul 08, 2010 at 04:27:23PM +0200, Johannes Hirte wrote:
> When doing a 'rm -r /var/tmp/portage/sys-devel' I get the following Oops:
>
> ------------[ cut here ]------------
> kernel BUG at fs/btrfs/extent-tree.c:1353!
> invalid opcode: 0000 [#1] PREEMPT SMP
> last sysfs file:
> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_full
> Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc
> sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp
> hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel
> snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat
> mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801
> snd_page_alloc uhci_hcd battery [last unloaded: microcode]
>
> Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912
> EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1
> EIP is at lookup_inline_extent_backref+0xf2/0x406
> EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000
> ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000)
> Stack:
> f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00
> <0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004 00000000
> <0> 00000850 040e9870 a8000000 00001000 00000000 00000007 00000000 0e987000
> Call Trace:
> [<c10ec226>] ? set_extent_dirty+0x19/0x1d
> [<c10c5081>] ? __btrfs_free_extent+0xda/0x675
> [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
> [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
> [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
> [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
> [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
> [<c10db480>] ? btrfs_delete_inode+0x166/0x17e
> [<c102280d>] ? get_parent_ip+0x8/0x19
> [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
> [<c108f5b3>] ? iput+0x46/0x48
> [<c10893a8>] ? do_unlinkat+0xc7/0x109
> [<c102280d>] ? get_parent_ip+0x8/0x19
> [<c10822e3>] ? fput+0x12/0x15c
> [<c10a2f30>] ? dnotify_flush+0x41/0xc2
> [<c107fe85>] ? filp_close+0x4c/0x52
> [<c107feed>] ? sys_close+0x62/0x9b
> [<c1002550>] ? sysenter_do_call+0x12/0x26
> Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54 24 38 e8
> 01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b 8b 04 24 8b 34 24
> 8b 00 8b 56 20 89 44 24 08 e8 2e fa ff
> EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP 0068:f5aa5d08
> ---[ end trace d97601f0b455ca72 ]---
> note: rm[2358] exited with preempt_count 2
> BUG: scheduling while atomic: rm/2358/0x10000003
> Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc
> sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp
> hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel
> snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat
> mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801
> snd_page_alloc uhci_hcd battery [last unloaded: microcode]
> Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32
> Call Trace:
> [<c12de6b3>] ? schedule+0x88/0x332
> [<c10237c1>] ? __cond_resched+0xf/0x19
> [<c12de9e2>] ? _cond_resched+0x12/0x18
> [<c106ceec>] ? unmap_vmas+0x4e7/0x534
> [<c1070c8f>] ? exit_mmap+0x64/0xa4
> [<c1026089>] ? mmput+0x21/0x96
> [<c102938e>] ? exit_mm+0xe7/0xf0
> [<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24
> [<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a
> [<c102a42e>] ? do_exit+0x17b/0x57d
> [<c1028e78>] ? kmsg_dump+0x81/0xf9
> [<c1002d06>] ? do_invalid_op+0x0/0x76
> [<c1004fa0>] ? oops_end+0x72/0x75
> [<c1002d6f>] ? do_invalid_op+0x69/0x76
> [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
> [<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150
> [<c10bcf30>] ? btrfs_cow_block+0x106/0x112
> [<c10bdcdc>] ? bin_search+0x37/0x3d
> [<c10bfe33>] ? btrfs_search_slot+0x405/0x477
> [<c12e031a>] ? error_code+0x66/0x6c
> [<c1002d06>] ? do_invalid_op+0x0/0x76
> [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
> [<c10ec226>] ? set_extent_dirty+0x19/0x1d
> [<c10c5081>] ? __btrfs_free_extent+0xda/0x675
> [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
> [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
> [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
> [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
> [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
> [<c10db480>] ? btrfs_delete_inode+0x166/0x17e
> [<c102280d>] ? get_parent_ip+0x8/0x19
> [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
> [<c108f5b3>] ? iput+0x46/0x48
> [<c10893a8>] ? do_unlinkat+0xc7/0x109
> [<c102280d>] ? get_parent_ip+0x8/0x19
> [<c10822e3>] ? fput+0x12/0x15c
> [<c10a2f30>] ? dnotify_flush+0x41/0xc2
> [<c107fe85>] ? filp_close+0x4c/0x52
> [<c107feed>] ? sys_close+0x62/0x9b
> [<c1002550>] ? sysenter_do_call+0x12/0x26

2010-07-08 15:40:15

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> Neither Yan nor I have been able to reproduce this locally, but a few
> people have now hit it. Johannes, are you available to try out a
> debugging kernel to try and track this down?

Sure, just tell me what to do. Is it enough to recompile the kernel with debug
options enabled or are special debug patches necessary?

regards,
Johannes

2010-07-11 12:28:19

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

It's getting worse. The /home partition is now affected too. I get the Oops on
simple unmounting the fs. btrfsck gives me this output on this fs:

btrfsck /dev/mapper/sdb3
leaf 123780497408 items 49 free space 271 generation 62207 owner 2
fs uuid 7f013285-88d8-452f-a139-7d44bffd14b6
chunk uuid 365526c9-e209-46a1-8963-3157306d9e05
item 0 key (123780108288 EXTENT_ITEM 4096) itemoff 3944 itemsize 51
extent refs 1 gen 45309 flags 2
tree block key (836797 6c 0) level 1
tree block backref root 5
item 1 key (123780112384 EXTENT_ITEM 4096) itemoff 3893 itemsize 51
extent refs 1 gen 45309 flags 2
tree block key (836955 c 416051) level 0
tree block backref root 5
item 2 key (123780116480 EXTENT_ITEM 4096) itemoff 3842 itemsize 51
extent refs 1 gen 24599 flags 2
tree block key (18446744073709551606 80 122869407744) level 0
tree block backref root 7
item 3 key (123780120576 EXTENT_ITEM 4096) itemoff 3791 itemsize 51
extent refs 1 gen 49958 flags 2
tree block key (979249 1 0) level 0
tree block backref root 5
item 4 key (123780124672 EXTENT_ITEM 4096) itemoff 3740 itemsize 51
extent refs 1 gen 62191 flags 2
tree block key (1220 1 0) level 0
tree block backref root 5
item 5 key (123780128768 EXTENT_ITEM 4096) itemoff 3689 itemsize 51
extent refs 1 gen 54817 flags 2
tree block key (1001168 c 455590) level 0
tree block backref root 5
item 6 key (123780132864 EXTENT_ITEM 4096) itemoff 3638 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (28712 1 0) level 1
tree block backref root 5
item 7 key (123780136960 EXTENT_ITEM 4096) itemoff 3587 itemsize 51
extent refs 1 gen 62191 flags 2
tree block key (34645 c 33037) level 0
tree block backref root 5
item 8 key (123780141056 EXTENT_ITEM 4096) itemoff 3536 itemsize 51
extent refs 1 gen 50007 flags 2
tree block key (31007 60 4044) level 0
tree block backref root 5
item 9 key (123780145152 EXTENT_ITEM 4096) itemoff 3485 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123644329984 a8 4096) level 0
tree block backref root 2
item 10 key (123780149248 EXTENT_ITEM 4096) itemoff 3434 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123644854272 a8 4096) level 0
tree block backref root 2
item 11 key (123780153344 EXTENT_ITEM 4096) itemoff 3383 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123645849600 a8 4096) level 0
tree block backref root 2
item 12 key (123780157440 EXTENT_ITEM 4096) itemoff 3332 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123411308544 a8 4096) level 2
tree block backref root 2
item 13 key (123780161536 EXTENT_ITEM 4096) itemoff 3281 itemsize 51
extent refs 1 gen 62200 flags 2
tree block key (1325101 c 1264) level 0
tree block backref root 5
item 14 key (123780165632 EXTENT_ITEM 4096) itemoff 3230 itemsize 51
extent refs 1 gen 56401 flags 2
tree block key (59621 1 0) level 1
tree block backref root 5
item 15 key (123780169728 EXTENT_ITEM 4096) itemoff 3179 itemsize 51
extent refs 1 gen 24613 flags 2
tree block key (18446744073709551606 80 125996056576) level 0
tree block backref root 7
item 16 key (123780173824 EXTENT_ITEM 4096) itemoff 3128 itemsize 51
extent refs 1 gen 62189 flags 2
tree block key (1324334 1 0) level 0
tree block backref root 5
item 17 key (123780177920 EXTENT_ITEM 4095) itemoff 3077 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123682791424 a8 4096) level 1
tree block backref root 2
item 18 key (123780182016 EXTENT_ITEM 4096) itemoff 3026 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123648741376 a8 4096) level 0
tree block backref root 2
item 19 key (123780186112 EXTENT_ITEM 4096) itemoff 2975 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (28854 54 1781866506) level 0
tree block backref root 5
item 20 key (123780190208 EXTENT_ITEM 4096) itemoff 2924 itemsize 51
extent refs 1 gen 62192 flags 2
tree block key (1456 6c 3231744) level 0
tree block backref root 5
item 21 key (123780194304 EXTENT_ITEM 4096) itemoff 2873 itemsize 51
extent refs 1 gen 62189 flags 2
tree block key (123646390272 a8 4096) level 0
tree block backref root 2
item 22 key (123780198400 EXTENT_ITEM 4096) itemoff 2822 itemsize 51
extent refs 1 gen 58732 flags 2
tree block key (145006825472 a8 8192) level 0
tree block backref root 2
item 23 key (123780202496 EXTENT_ITEM 4096) itemoff 2771 itemsize 51
extent refs 1 gen 62191 flags 2
tree block key (123656515584 a8 4096) level 0
tree block backref root 2
item 24 key (123780206592 EXTENT_ITEM 4096) itemoff 2720 itemsize 51
extent refs 1 gen 59061 flags 2
tree block key (34873 1 0) level 0
tree block backref root 5
item 25 key (123780210688 EXTENT_ITEM 4096) itemoff 2669 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123778146304 a8 4096) level 0
tree block backref root 2
item 26 key (123780214784 EXTENT_ITEM 4096) itemoff 2618 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (28848 c 2786) level 0
tree block backref root 5
item 27 key (123780218880 EXTENT_ITEM 4096) itemoff 2567 itemsize 51
extent refs 1 gen 62189 flags 2
tree block key (123657785344 a8 4096) level 0
tree block backref root 2
item 28 key (123780222976 EXTENT_ITEM 4096) itemoff 2516 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (133719777280 a8 4096) level 1
tree block backref root 2
item 29 key (123780227072 EXTENT_ITEM 4096) itemoff 2465 itemsize 51
extent refs 1 gen 62189 flags 2
tree block key (123659431936 a8 4096) level 0
tree block backref root 2
item 30 key (123780231168 EXTENT_ITEM 4096) itemoff 2414 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123658563584 a8 4096) level 1
tree block backref root 2
item 31 key (123780235264 EXTENT_ITEM 4096) itemoff 2363 itemsize 51
extent refs 1 gen 46988 flags 2
tree block key (34911 6c 282624) level 0
tree block backref root 5
item 32 key (123780239360 EXTENT_ITEM 4096) itemoff 2312 itemsize 51
extent refs 1 gen 58714 flags 2
tree block key (63797821440 a8 4096) level 0
tree block backref root 2
item 33 key (123780243456 EXTENT_ITEM 4096) itemoff 2261 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (133728186368 a8 4096) level 0
tree block backref root 2
item 34 key (123780247552 EXTENT_ITEM 4096) itemoff 2210 itemsize 51
extent refs 1 gen 56401 flags 2
tree block key (59654 60 7) level 0
tree block backref root 5
item 35 key (123780251648 EXTENT_ITEM 4096) itemoff 2159 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123778404352 a8 4096) level 0
tree block backref root 2
item 36 key (123780255744 EXTENT_ITEM 4096) itemoff 2108 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123659759616 a8 4096) level 0
tree block backref root 2
item 37 key (123780259840 EXTENT_ITEM 4096) itemoff 2057 itemsize 51
extent refs 1 gen 62204 flags 2
tree block key (145014075392 a8 12288) level 0
tree block backref root 2
item 38 key (123780263936 EXTENT_ITEM 4096) itemoff 2006 itemsize 51
extent refs 1 gen 62193 flags 2
tree block key (145016651776 a8 8192) level 0
tree block backref root 2
item 39 key (123780268032 EXTENT_ITEM 4096) itemoff 1955 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123778617344 a8 4096) level 0
tree block backref root 2
item 40 key (123780272128 EXTENT_ITEM 4096) itemoff 1904 itemsize 51
extent refs 1 gen 59985 flags 2
tree block key (18446744073709551606 80 62746009600) level 0
tree block backref root 7
item 41 key (123780276224 EXTENT_ITEM 4096) itemoff 1853 itemsize 51
extent refs 1 gen 62202 flags 2
tree block key (123660300288 a8 4096) level 0
tree block backref root 2
item 42 key (123780280320 EXTENT_ITEM 4096) itemoff 1802 itemsize 51
extent refs 1 gen 22901 flags 2
tree block key (66305400832 a8 4096) level 0
tree block backref root 2
item 43 key (123780284416 EXTENT_ITEM 4096) itemoff 1751 itemsize 51
extent refs 1 gen 62207 flags 2
tree block key (123778818048 a8 4096) level 0
tree block backref root 2
item 44 key (123780288512 EXTENT_ITEM 4096) itemoff 1700 itemsize 51
extent refs 1 gen 51582 flags 2
tree block key (455590 54 3842052003) level 0
tree block backref root 5
item 45 key (123780292608 EXTENT_ITEM 4096) itemoff 1649 itemsize 51
extent refs 1 gen 62200 flags 2
tree block key (1283 54 2984231802) level 0
tree block backref root 5
item 46 key (123780296704 EXTENT_ITEM 4096) itemoff 1598 itemsize 51
extent refs 1 gen 56398 flags 2
tree block key (54879 c 54735) level 0
tree block backref root 5
item 47 key (123780300800 EXTENT_ITEM 4096) itemoff 1547 itemsize 51
extent refs 1 gen 62201 flags 2
tree block key (2786 54 1989163538) level 1
tree block backref root 5
item 48 key (123780304896 EXTENT_ITEM 4096) itemoff 1496 itemsize 51
extent refs 1 gen 59040 flags 2
tree block key (123812081664 a8 4096) level 0
tree block backref root 2
failed to find block number 123780177920
Aborted

Mounting the fs read-only an reading the content works fine so far.

I've three systems running with btrfs, a dual Opteron (252), a Pentium 4
system and a netbook with N270 Atom. The netbook is the only one that shows
the errors. It's also the only system where I'm using gcc-4.5. Perhaps it's
related, but I doubt it's the only reason as I'm using gcc-4.5 since May.


regards,
Johannes

2010-07-13 12:23:33

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Sonntag 11 Juli 2010, 14:28:09 schrieb Johannes Hirte:
...
> I've three systems running with btrfs, a dual Opteron (252), a Pentium 4
> system and a netbook with N270 Atom. The netbook is the only one that shows
> the errors. It's also the only system where I'm using gcc-4.5. Perhaps it's
> related, but I doubt it's the only reason as I'm using gcc-4.5 since May.

On the Opteron system I got now csum errors. I've synced some data from the
netbook to the Opteron yesteray. After hitting ENOSPC with 4GB free, I've run
'btrfs-vol -b' on this fs in hope to get some more free space. It worked but
the command failed and I found in dmesg:

btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575

So I've tested the new synced data by syncing them to another disk on the
Optoern system (XFS). As I've expected (or better feared), some data wasn't
readable and I found more csum errors in dmesg:

btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883

I suspect something goes horribly wrong on writing to disc within btrfs. On
the netbook I got missing blocks, on the Opteron System bad csums. Both
systems are running linux-2.6.35-rc4, the netbook with gcc-4.5.0 the Opteron
system with gcc-4.4.4. I'll test the P4 system later, if there are similar
errors too.

regards,
Johannes

2010-07-14 15:24:56

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> Neither Yan nor I have been able to reproduce this locally, but a few
> people have now hit it. Johannes, are you available to try out a
> debugging kernel to try and track this down?
>
> -chris
>
> On Thu, Jul 08, 2010 at 04:27:23PM +0200, Johannes Hirte wrote:
> > When doing a 'rm -r /var/tmp/portage/sys-devel' I get the following Oops:
> >
> > ------------[ cut here ]------------
> > kernel BUG at fs/btrfs/extent-tree.c:1353!
> > invalid opcode: 0000 [#1] PREEMPT SMP
> > last sysfs file:
> > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_
> > full Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event
> > snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl
> > auth_rpcgss sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage
> > dm_crypt dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek
> > uvcvideo iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169
> > snd_hwdep btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii
> > cfg80211 soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery
> > [last unloaded: microcode]
> >
> > Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912
> > EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1
> > EIP is at lookup_inline_extent_backref+0xf2/0x406
> > EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000
> > ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08
> >
> > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> >
> > Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000)
> >
> > Stack:
> > f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00
> >
> > <0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004
> > 00000000 <0> 00000850 040e9870 a8000000 00001000 00000000 00000007
> > 00000000 0e987000
> >
> > Call Trace:
> > [<c10ec226>] ? set_extent_dirty+0x19/0x1d
> > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675
> > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
> > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
> > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
> > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
> > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
> > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e
> > [<c102280d>] ? get_parent_ip+0x8/0x19
> > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
> > [<c108f5b3>] ? iput+0x46/0x48
> > [<c10893a8>] ? do_unlinkat+0xc7/0x109
> > [<c102280d>] ? get_parent_ip+0x8/0x19
> > [<c10822e3>] ? fput+0x12/0x15c
> > [<c10a2f30>] ? dnotify_flush+0x41/0xc2
> > [<c107fe85>] ? filp_close+0x4c/0x52
> > [<c107feed>] ? sys_close+0x62/0x9b
> > [<c1002550>] ? sysenter_do_call+0x12/0x26
> >
> > Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54
> > 24 38 e8 01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b
> > 8b 04 24 8b 34 24 8b 00 8b 56 20 89 44 24 08 e8 2e fa ff
> > EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP
> > 0068:f5aa5d08 ---[ end trace d97601f0b455ca72 ]---
> > note: rm[2358] exited with preempt_count 2
> > BUG: scheduling while atomic: rm/2358/0x10000003
> > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> > snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss
> > sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt
> > dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo
> > iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169 snd_hwdep
> > btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii cfg80211
> > soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery [last
> > unloaded: microcode]
> > Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32
> >
> > Call Trace:
> > [<c12de6b3>] ? schedule+0x88/0x332
> > [<c10237c1>] ? __cond_resched+0xf/0x19
> > [<c12de9e2>] ? _cond_resched+0x12/0x18
> > [<c106ceec>] ? unmap_vmas+0x4e7/0x534
> > [<c1070c8f>] ? exit_mmap+0x64/0xa4
> > [<c1026089>] ? mmput+0x21/0x96
> > [<c102938e>] ? exit_mm+0xe7/0xf0
> > [<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24
> > [<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a
> > [<c102a42e>] ? do_exit+0x17b/0x57d
> > [<c1028e78>] ? kmsg_dump+0x81/0xf9
> > [<c1002d06>] ? do_invalid_op+0x0/0x76
> > [<c1004fa0>] ? oops_end+0x72/0x75
> > [<c1002d6f>] ? do_invalid_op+0x69/0x76
> > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
> > [<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150
> > [<c10bcf30>] ? btrfs_cow_block+0x106/0x112
> > [<c10bdcdc>] ? bin_search+0x37/0x3d
> > [<c10bfe33>] ? btrfs_search_slot+0x405/0x477
> > [<c12e031a>] ? error_code+0x66/0x6c
> > [<c1002d06>] ? do_invalid_op+0x0/0x76
> > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406
> > [<c10ec226>] ? set_extent_dirty+0x19/0x1d
> > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675
> > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7
> > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef
> > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a
> > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155
> > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c
> > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e
> > [<c102280d>] ? get_parent_ip+0x8/0x19
> > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd
> > [<c108f5b3>] ? iput+0x46/0x48
> > [<c10893a8>] ? do_unlinkat+0xc7/0x109
> > [<c102280d>] ? get_parent_ip+0x8/0x19
> > [<c10822e3>] ? fput+0x12/0x15c
> > [<c10a2f30>] ? dnotify_flush+0x41/0xc2
> > [<c107fe85>] ? filp_close+0x4c/0x52
> > [<c107feed>] ? sys_close+0x62/0x9b
> > [<c1002550>] ? sysenter_do_call+0x12/0x26

I'm not sure if btrfs is to blame for this error. After the errors I switched
to XFS on this system and got now this error:

ls -l .kde4/share/apps/akregator/data/
ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure needs
cleaning
total 4
?????????? ? ? ? ? ? feeds.opml

xfs_check is showing this:

xfs_check /dev/sda3
link count mismatch for inode 219998792 (name ?), nlink 0, counted 1
disconnected inode 220064328, nlink 1

So this is the second FS I've got suddenly errors, so I think the problem lies
deeper. Adding some CCs for this.


regards,
Johannes

2010-07-15 00:11:11

by Dave Chinner

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> I'm not sure if btrfs is to blame for this error. After the errors I switched
> to XFS on this system and got now this error:
>
> ls -l .kde4/share/apps/akregator/data/
> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure needs
> cleaning
> total 4
> ?????????? ? ? ? ? ? feeds.opml

What is the error reported in dmesg when the XFS filesytem shuts down?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-15 18:14:23

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
> > Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> > I'm not sure if btrfs is to blame for this error. After the errors I
> > switched to XFS on this system and got now this error:
> >
> > ls -l .kde4/share/apps/akregator/data/
> > ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure
> > needs cleaning
> > total 4
> > ?????????? ? ? ? ? ? feeds.opml
>
> What is the error reported in dmesg when the XFS filesytem shuts down?

Nothing. I double checked the logs. There are only the messages when mounting
the filesystem. No other errors are reported than the inaccessible file and the
output from xfs_check.

regards,
Johannes

2010-07-15 18:29:43

by Johannes Hirte

[permalink] [raw]
Subject: csum errors

Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
> On the Opteron system I got now csum errors. I've synced some data from the
> netbook to the Opteron yesteray. After hitting ENOSPC with 4GB free, I've
> run 'btrfs-vol -b' on this fs in hope to get some more free space. It
> worked but the command failed and I found in dmesg:
>
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
>
> So I've tested the new synced data by syncing them to another disk on the
> Optoern system (XFS). As I've expected (or better feared), some data wasn't
> readable and I found more csum errors in dmesg:
>
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private
> 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346
> private 686735345 btrfs csum failed ino 1959333 off 252362752 csum
> 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752
> csum 686735346 private 686735345 btrfs csum failed ino 1959333 off
> 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333
> off 252362752 csum 686735346 private 686735345 btrfs csum failed ino
> 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed
> ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum
> failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs
> csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private
> 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977
> private 2851505976 btrfs csum failed ino 1959333 off 898342912 csum
> 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912
> csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off
> 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333
> off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino
> 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed
> ino 1959333 off 898342912 csum 4271223884 private 4271223883

I think, this is a different error. I've only seen them on filesystems from my
Opteron system. It seems that the recorded csums are wrong and it looks to me
like rounding errors. The data itself should be correct, as I've tested one
affected file via md5sum against the original on another filesystem.
Any ideas what is going wrong here?

regards,
Johannes

2010-07-15 19:04:25

by Chris Mason

[permalink] [raw]
Subject: Re: csum errors

On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
> > ino 1959333 off 898342912 csum 4271223884 private 4271223883
>
> I think, this is a different error. I've only seen them on filesystems from my
> Opteron system. It seems that the recorded csums are wrong and it looks to me
> like rounding errors. The data itself should be correct, as I've tested one
> affected file via md5sum against the original on another filesystem.
> Any ideas what is going wrong here?

Are you doing data mirroring?

We can map that block and do a raw read off the device to see what the
data blocks actually contain.

-chris

2010-07-15 19:31:40

by Johannes Hirte

[permalink] [raw]
Subject: Re: csum errors

Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
> > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> >
> > I think, this is a different error. I've only seen them on filesystems
> > from my Opteron system. It seems that the recorded csums are wrong and
> > it looks to me like rounding errors. The data itself should be correct,
> > as I've tested one affected file via md5sum against the original on
> > another filesystem. Any ideas what is going wrong here?
>
> Are you doing data mirroring?

No, I don't.

> We can map that block and do a raw read off the device to see what the
> data blocks actually contain.

I've modified the btrfs-source a little to get the data. In inode.c I've
changed the code to:


csum = btrfs_csum_data(root, kaddr + offset, csum, end - start + 1);
btrfs_csum_final(csum, (char *)&csum);
if (csum != private)
if (printk_ratelimit()) {
printk(KERN_INFO "csum != private; ino %lu off %llu "
"csum %u private %llu\n", page->mapping->host->i_ino,
(unsigned long long)start, csum,
(unsigned long long)private);
}
// goto zeroit;

kunmap_atomic(kaddr, KM_USER0);

This way I could read the files with wrong csum too. As I wrote, I've compared
the md5sum from one file with a copy on an other filesystem. As they are the
same, at least for this file the data should be correct. The big question is,
why do the csums differ?

regards,
Johannes

2010-07-15 19:36:12

by Chris Mason

[permalink] [raw]
Subject: Re: csum errors

On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
> > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
> > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> > >
> > > I think, this is a different error. I've only seen them on filesystems
> > > from my Opteron system. It seems that the recorded csums are wrong and
> > > it looks to me like rounding errors. The data itself should be correct,
> > > as I've tested one affected file via md5sum against the original on
> > > another filesystem. Any ideas what is going wrong here?
> >
> > Are you doing data mirroring?
>
> No, I don't.
>
> > We can map that block and do a raw read off the device to see what the
> > data blocks actually contain.
>
> I've modified the btrfs-source a little to get the data. In inode.c I've
> changed the code to:

Great. The bad csums are all just one bit off, that can't be an
accident. When were they written (which kernel?). Did you boot a 32
bit kernel on there at any time?

-chris

2010-07-15 19:59:49

by Johannes Hirte

[permalink] [raw]
Subject: Re: csum errors

Am Donnerstag 15 Juli 2010, 21:35:51 schrieb Chris Mason:
> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
> > Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
> > > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > > > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
> > > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> > > >
> > > > I think, this is a different error. I've only seen them on
> > > > filesystems from my Opteron system. It seems that the recorded csums
> > > > are wrong and it looks to me like rounding errors. The data itself
> > > > should be correct, as I've tested one affected file via md5sum
> > > > against the original on another filesystem. Any ideas what is going
> > > > wrong here?
> > >
> > > Are you doing data mirroring?
> >
> > No, I don't.
> >
> > > We can map that block and do a raw read off the device to see what the
> > > data blocks actually contain.
> >
> > I've modified the btrfs-source a little to get the data. In inode.c I've
>
> > changed the code to:
> Great. The bad csums are all just one bit off, that can't be an
> accident. When were they written (which kernel?). Did you boot a 32
> bit kernel on there at any time?

No, I don't have a bootable 32bit installation on this system. I've tested it
now with a 32bit system by dumping the whole filesystem to an external drive
and mounting this to a 32bit system. The result was the same.

The affected files were written by different kernels. I think at least 2.6.34,
2.6.35-rc3 and 2.6.35-rc4 should be involved, perhaps 2.6.33 too. I'll try to
figure it out more exactly.

regards,
Johannes

2010-07-16 14:59:21

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Donnerstag 15 Juli 2010, 20:14:51 schrieb Johannes Hirte:
> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
> > On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
> > > Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> > > I'm not sure if btrfs is to blame for this error. After the errors I
> > > switched to XFS on this system and got now this error:
> > >
> > > ls -l .kde4/share/apps/akregator/data/
> > > ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure
> > > needs cleaning
> > > total 4
> > > ?????????? ? ? ? ? ? feeds.opml
> >
> > What is the error reported in dmesg when the XFS filesytem shuts down?
>
> Nothing. I double checked the logs. There are only the messages when
> mounting the filesystem. No other errors are reported than the
> inaccessible file and the output from xfs_check.

I'm running now a kernel with more debug options enabled and got this:

[ 6794.810935]
[ 6794.810941] =================================
[ 6794.810955] [ INFO: inconsistent lock state ]
[ 6794.810966] 2.6.35-rc4-btrfs-debug #7
[ 6794.810975] ---------------------------------
[ 6794.810984] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 6794.810996] kswapd0/361 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 6794.811006] (&(&ip->i_iolock)->mr_lock#2){++++?+}, at: [<c10fa82d>]
xfs_ilock+0x22/0x67
[ 6794.811039] {RECLAIM_FS-ON-W} state was registered at:
[ 6794.811046] [<c104ebc1>] mark_held_locks+0x42/0x5e
[ 6794.811046] [<c104f1f7>] lockdep_trace_alloc+0x99/0xb0
[ 6794.811046] [<c10740b8>] __alloc_pages_nodemask+0x6a/0x4a1
[ 6794.811046] [<c106edc2>] __page_cache_alloc+0x11/0x13
[ 6794.811046] [<c106fb43>] grab_cache_page_write_begin+0x47/0x81
[ 6794.811046] [<c10b2050>] block_write_begin_newtrunc+0x2e/0x9c
[ 6794.811046] [<c10b233a>] block_write_begin+0x23/0x5d
[ 6794.811046] [<c1114a9d>] xfs_vm_write_begin+0x26/0x28
[ 6794.811046] [<c106f15d>] generic_file_buffered_write+0xb5/0x1bd
[ 6794.811046] [<c1117e31>] xfs_file_aio_write+0x40e/0x66d
[ 6794.811046] [<c10950b4>] do_sync_write+0x8b/0xc6
[ 6794.811046] [<c109568b>] vfs_write+0x77/0xa4
[ 6794.811046] [<c10957f3>] sys_write+0x3c/0x5e
[ 6794.811046] [<c1002690>] sysenter_do_call+0x12/0x36
[ 6794.811046] irq event stamp: 141369
[ 6794.811046] hardirqs last enabled at (141369): [<c13639d2>]
_raw_spin_unlock_irqrestore+0x36/0x5b
[ 6794.811046] hardirqs last disabled at (141368): [<c13634c5>]
_raw_spin_lock_irqsave+0x14/0x68
[ 6794.811046] softirqs last enabled at (141300): [<c1032d69>]
__do_softirq+0xfe/0x10d
[ 6794.811046] softirqs last disabled at (141295): [<c1032da7>]
do_softirq+0x2f/0x47
[ 6794.811046]
[ 6794.811046] other info that might help us debug this:
[ 6794.811046] 2 locks held by kswapd0/361:
[ 6794.811046] #0: (shrinker_rwsem){++++..}, at: [<c10774db>]
shrink_slab+0x25/0x13f
[ 6794.811046] #1: (&xfs_mount_list_lock){++++.-}, at: [<c111cc78>]
xfs_reclaim_inode_shrink+0x2a/0xe8
[ 6794.811046]
[ 6794.811046] stack backtrace:
[ 6794.811046] Pid: 361, comm: kswapd0 Not tainted 2.6.35-rc4-btrfs-debug #7
[ 6794.811046] Call Trace:
[ 6794.811046] [<c13616c0>] ? printk+0xf/0x17
[ 6794.811046] [<c104e988>] valid_state+0x134/0x142
[ 6794.811046] [<c104ea66>] mark_lock+0xd0/0x1e9
[ 6794.811046] [<c104e2a7>] ? check_usage_forwards+0x0/0x5f
[ 6794.811046] [<c105003d>] __lock_acquire+0x374/0xc80
[ 6794.811046] [<c1044942>] ? sched_clock_local+0x12/0x121
[ 6794.811046] [<c1044c0b>] ? sched_clock_cpu+0x122/0x133
[ 6794.811046] [<c1050d4d>] lock_acquire+0x5f/0x76
[ 6794.811046] [<c10fa82d>] ? xfs_ilock+0x22/0x67
[ 6794.811046] [<c1043974>] down_write_nested+0x32/0x63
[ 6794.811046] [<c10fa82d>] ? xfs_ilock+0x22/0x67
[ 6794.811046] [<c10fa82d>] xfs_ilock+0x22/0x67
[ 6794.811046] [<c10faa48>] xfs_ireclaim+0x98/0xbb
[ 6794.811046] [<c1043a1e>] ? up_write+0x16/0x2b
[ 6794.811046] [<c111c78c>] xfs_reclaim_inode+0x1a7/0x1b1
[ 6794.811046] [<c111cafe>] xfs_inode_ag_walk+0x77/0xbc
[ 6794.811046] [<c111c5e5>] ? xfs_reclaim_inode+0x0/0x1b1
[ 6794.811046] [<c111cc07>] xfs_inode_ag_iterator+0x52/0x99
[ 6794.811046] [<c111cc78>] ? xfs_reclaim_inode_shrink+0x2a/0xe8
[ 6794.811046] [<c111c5e5>] ? xfs_reclaim_inode+0x0/0x1b1
[ 6794.811046] [<c111cc99>] xfs_reclaim_inode_shrink+0x4b/0xe8
[ 6794.811046] [<c1077588>] shrink_slab+0xd2/0x13f
[ 6794.811046] [<c1078cef>] kswapd+0x37d/0x4e9
[ 6794.811046] [<c104028f>] ? autoremove_wake_function+0x0/0x2f
[ 6794.811046] [<c1078972>] ? kswapd+0x0/0x4e9
[ 6794.811046] [<c103ffbc>] kthread+0x60/0x65
[ 6794.811046] [<c103ff5c>] ? kthread+0x0/0x65
[ 6794.811046] [<c1002bba>] kernel_thread_helper+0x6/0x10

Don't know if this is related to the problem.


regards,
Johannes

2010-07-17 04:55:57

by Brian Rogers

[permalink] [raw]
Subject: Re: csum errors

On 07/15/2010 12:35 PM, Chris Mason wrote:
> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
>
>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
>>
>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
>>>
>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
>>>>
>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883
> Great. The bad csums are all just one bit off, that can't be an
> accident. When were they written (which kernel?). Did you boot a 32
> bit kernel on there at any time?
>

I've seen this as well, with three files. In all instances, csum ==
*private + 1. Here are the unique lines from dmesg:

[32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266
private 2415136265
[32735.751112] btrfs csum failed ino 1731630 off 24776704 csum
1385284137 private 1385284136
[32738.777624] btrfs csum failed ino 2495707 off 171790336 csum
1385781806 private 1385781805

All three files are from when I first transitioned to btrfs (or more
accurately, they are clones of those files I made to hold onto a copy of
the corrupted version). Since the vast majority of my disk usage comes
from the transition anyway, I can't be sure this is due to a problem
only present at that time. I believe I was running 2.6.34 when I copied
my files over to my new btrfs partition, but I'm going from memory here.

My btrfs partition has never been touched by a 32-bit kernel.

2010-07-19 08:01:42

by Miao Xie

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
>> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
>>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
>>> I'm not sure if btrfs is to blame for this error. After the errors I
>>> switched to XFS on this system and got now this error:
>>>
>>> ls -l .kde4/share/apps/akregator/data/
>>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure
>>> needs cleaning
>>> total 4
>>> ?????????? ? ? ? ? ? feeds.opml
>>
>> What is the error reported in dmesg when the XFS filesytem shuts down?
>
> Nothing. I double checked the logs. There are only the messages when mounting
> the filesystem. No other errors are reported than the inaccessible file and the
> output from xfs_check.

Is there anything wrong with your disks or memory?
Sometimes the bad memory can break the filesystem. I have met this kind of problem
some time ago.

If there is no problem with your disk and memory, Could you tell us the parameter of
mkfs.btrfs and mount?

Thanks
Miao Xie

2010-07-22 18:06:27

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie:
> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
> > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
> >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
> >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> >>> I'm not sure if btrfs is to blame for this error. After the errors I
> >>> switched to XFS on this system and got now this error:
> >>>
> >>> ls -l .kde4/share/apps/akregator/data/
> >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure
> >>> needs cleaning
> >>> total 4
> >>> ?????????? ? ? ? ? ? feeds.opml
> >>
> >> What is the error reported in dmesg when the XFS filesytem shuts down?
> >
> > Nothing. I double checked the logs. There are only the messages when
> > mounting the filesystem. No other errors are reported than the
> > inaccessible file and the output from xfs_check.
>
> Is there anything wrong with your disks or memory?
> Sometimes the bad memory can break the filesystem. I have met this kind of
> problem some time ago.

I don't think that's the case. I've checked the RAM with memtest86+ and got no
errors. I got the errors with two different disks, the first one with btrfs the
second one now with XFS. Before changing to the second disk, I've run
badblocks on it to be sure it has no errors.

>
> If there is no problem with your disk and memory, Could you tell us the
> parameter of mkfs.btrfs and mount?

I'm not sure what parameters I've used for mkbtrfs. It was either none ore '-m
single'. mount parameters are only noatime. Some time ago I've played a little
with max_inline.

On the actual disk with XFS I got now some more errors on my root-fs. Similar
error on one file:

ls: cannot access /var/tmp/portage/app-
office/krita-2.2.1/work/krita-2.2.1/krita/image/tiles3/tests/dm_consistancy_test/dm_consistancy_test.pr:
Invalid argument

xfs_check shows on this fs:

localhost ~ # xfs_check /dev/sda1
agi unlinked bucket 10 is 7279754 in ag 0 (inode=7279754)
agi unlinked bucket 11 is 7279755 in ag 0 (inode=7279755)
dir 91466358 entry dm_consistancy_test.pr bad inode number 1862628266
dir 91466358 size is 36, should be 35
agi unlinked bucket 48 is 11677104 in ag 2 (inode=78785968)
agi unlinked bucket 49 is 11677105 in ag 2 (inode=78785969)
agi unlinked bucket 50 is 11677106 in ag 2 (inode=78785970)
agi unlinked bucket 51 is 11677107 in ag 2 (inode=78785971)
agi unlinked bucket 52 is 11677108 in ag 2 (inode=78785972)
agi unlinked bucket 53 is 11677109 in ag 2 (inode=78785973)
agi unlinked bucket 54 is 11677110 in ag 2 (inode=78785974)
agi unlinked bucket 55 is 11677111 in ag 2 (inode=78785975)
agi unlinked bucket 58 is 11677114 in ag 2 (inode=78785978)
agi unlinked bucket 59 is 11677115 in ag 2 (inode=78785979)
agi unlinked bucket 60 is 11677116 in ag 2 (inode=78785980)
agi unlinked bucket 61 is 11677117 in ag 2 (inode=78785981)
allocated inode 7279754 has 0 link count
allocated inode 7279755 has 0 link count
disconnected inode 91466360, nlink 1
allocated inode 78785968 has 0 link count
allocated inode 78785969 has 0 link count
allocated inode 78785970 has 0 link count
allocated inode 78785971 has 0 link count
allocated inode 78785972 has 0 link count
allocated inode 78785973 has 0 link count
allocated inode 78785974 has 0 link count
allocated inode 78785975 has 0 link count
allocated inode 78785978 has 0 link count
allocated inode 78785979 has 0 link count
allocated inode 78785980 has 0 link count
allocated inode 78785981 has 0 link count

And again I don't find any related message in dmesg.

regards,
Johannes

2010-07-23 11:02:24

by Daniel J Blueman

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On 22 July 2010 19:07, Johannes Hirte <[email protected]> wrote:
> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie:
>> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
>> > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
>> >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
>> >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
>> >>> I'm not sure if btrfs is to blame for this error. After the errors I
>> >>> switched to XFS on this system and got now this error:
>> >>>
>> >>> ls -l .kde4/share/apps/akregator/data/
>> >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure
>> >>> needs cleaning
>> >>> total 4
>> >>> ?????????? ? ? ? ?? ? ? ? ?? ? ? ? ? ? ?? feeds.opml
>> >>
>> >> What is the error reported in dmesg when the XFS filesytem shuts down?
>> >
>> > Nothing. I double checked the logs. There are only the messages when
>> > mounting the filesystem. No other errors are reported than the
>> > inaccessible file and the output from xfs_check.
>>
>> Is there anything wrong with your disks or memory?
>> Sometimes the bad memory can break the filesystem. I have met this kind of
>> problem some time ago.
>
> I don't think that's the case. I've checked the RAM with memtest86+ and got no
> errors. I got the errors with two different disks, the first one with btrfs the
> second one now with XFS. Before changing to the second disk, I've run
> badblocks on it to be sure it has no errors.

There are some known-buggy chipsets also. One still around is the
Nvidia CK804/MCP55, under certain patterns of spatially-local pending
reads and writes to the memory controller, a 64-byte request would
occasionally be returned with the wrong offset. I was hitting it with
some 27-Gbit adapters and managed to capture it on a PCI-e protocol
analyser. Rsync between network and local disk would hit sometimes
too.
--
Daniel J Blueman

2010-07-23 11:14:59

by Bob Copeland

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On Thu, Jul 22, 2010 at 2:07 PM, Johannes Hirte
<[email protected]> wrote:
>> Is there anything wrong with your disks or memory?
>> Sometimes the bad memory can break the filesystem. I have met this kind of
>> problem some time ago.
>
> I don't think that's the case. I've checked the RAM with memtest86+ and got no
> errors. I got the errors with two different disks, the first one with btrfs the
> second one now with XFS. Before changing to the second disk, I've run
> badblocks on it to be sure it has no errors.

You might also try kmemcheck. There's a good chance that a bug
that scribbles random memory shows up as FS corruption. I have had
ext4 corruption due to an inotify bug, which kmemcheck found on the
first try.

--
Bob Copeland %% http://www.bobcopeland.com

2010-07-29 17:08:40

by Johannes Hirte

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

Am Donnerstag 22 Juli 2010, 20:07:23 schrieb Johannes Hirte:
> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie:
> > On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
> > > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
> > >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
> > >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
> > >>> I'm not sure if btrfs is to blame for this error. After the errors I
> > >>> switched to XFS on this system and got now this error:
> > >>>
> > >>> ls -l .kde4/share/apps/akregator/data/
> > >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml:
> > >>> Structure needs cleaning
> > >>> total 4
> > >>> ?????????? ? ? ? ? ? feeds.opml
> > >>
> > >> What is the error reported in dmesg when the XFS filesytem shuts down?
> > >
> > > Nothing. I double checked the logs. There are only the messages when
> > > mounting the filesystem. No other errors are reported than the
> > > inaccessible file and the output from xfs_check.
> >
> > Is there anything wrong with your disks or memory?
> > Sometimes the bad memory can break the filesystem. I have met this kind
> > of problem some time ago.
>
> I don't think that's the case. I've checked the RAM with memtest86+ and got
> no errors. I got the errors with two different disks, the first one with
> btrfs the second one now with XFS. Before changing to the second disk,
> I've run badblocks on it to be sure it has no errors.

I think I've found it. The bug was introduced by

commit 7f0e7bed936a0c422641a046551829a01341dd80
Author: Christoph Hellwig <[email protected]>
Date: Tue Jun 8 18:14:34 2010 +0200

writeback: fix writeback completion notifications

The code dealing with bdi_work->state and completion of a bdi_work is a
major mess currently. This patch makes sure we directly use one set of
flags to deal with it, and use it consistently, which means:

- always notify about completion from the rcu callback. We only ever
wait for it from on-stack callers, so this simplification does not
even cause a theoretical slowdown currently. It also makes sure we
don't miss out on the notification if we ever add other callers to
wait for it.
- make earlier completion notification depending on the on-stack
allocation, not the sync mode. If we introduce new callers that
want to do WB_SYNC_NONE writeback from on-stack callers this will
be nessecary.

Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline
a few small functions into their only caller to make the code
understandable.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

and seems to be fixed by

commit 83ba7b071f30f7c01f72518ad72d5cd203c27502
Author: Christoph Hellwig <[email protected]>
Date: Tue Jul 6 08:59:53 2010 +0200

writeback: simplify the write back thread queue

First remove items from work_list as soon as we start working on them.This
means we don't have to track any pending or visited state and can get
rid of all the RCU magic freeing the work items - we can simply free
them once the operation has finished. Second use a real completion for
tracking synchronous requests - if the caller sets the completion pointer
we complete it, otherwise use it as a boolean indicator that we can free
the work item directly. Third unify struct wb_writeback_args and struct
bdi_work into a single data structure, wb_writeback_work. Previous we
set all parameters into a struct wb_writeback_args, copied it into
struct bdi_work, copied it again on the stack to use it there. Instead
of just allocate one structure dynamically or on the stack and use it
all the way through the stack.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

I was able to reproduce the bug by unpacking a big tar-file and deleting this files multiple times. Normally with btrfs the kernel crashed within 20 runs. After commit 83ba7b071f30f7c01f72518ad72d5cd203c27502 it survived more than 500 runs.


regards,
Johannes

2010-07-29 18:55:18

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On 07/29/2010 07:09 PM, Johannes Hirte wrote:
> Am Donnerstag 22 Juli 2010, 20:07:23 schrieb Johannes Hirte:
>> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie:
>>> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
>>>> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:
>>>>> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
>>>>>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:
>>>>>> I'm not sure if btrfs is to blame for this error. After the errors I
>>>>>> switched to XFS on this system and got now this error:
>>>>>>
>>>>>> ls -l .kde4/share/apps/akregator/data/
>>>>>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml:
>>>>>> Structure needs cleaning
>>>>>> total 4
>>>>>> ?????????? ? ? ? ? ? feeds.opml
>>>>>
>>>>> What is the error reported in dmesg when the XFS filesytem shuts down?
>>>>
>>>> Nothing. I double checked the logs. There are only the messages when
>>>> mounting the filesystem. No other errors are reported than the
>>>> inaccessible file and the output from xfs_check.
>>>
>>> Is there anything wrong with your disks or memory?
>>> Sometimes the bad memory can break the filesystem. I have met this kind
>>> of problem some time ago.
>>
>> I don't think that's the case. I've checked the RAM with memtest86+ and got
>> no errors. I got the errors with two different disks, the first one with
>> btrfs the second one now with XFS. Before changing to the second disk,
>> I've run badblocks on it to be sure it has no errors.
>
> I think I've found it. The bug was introduced by
>
> commit 7f0e7bed936a0c422641a046551829a01341dd80
> Author: Christoph Hellwig <[email protected]>
> Date: Tue Jun 8 18:14:34 2010 +0200
>
> writeback: fix writeback completion notifications
>
> The code dealing with bdi_work->state and completion of a bdi_work is a
> major mess currently. This patch makes sure we directly use one set of
> flags to deal with it, and use it consistently, which means:
>
> - always notify about completion from the rcu callback. We only ever
> wait for it from on-stack callers, so this simplification does not
> even cause a theoretical slowdown currently. It also makes sure we
> don't miss out on the notification if we ever add other callers to
> wait for it.
> - make earlier completion notification depending on the on-stack
> allocation, not the sync mode. If we introduce new callers that
> want to do WB_SYNC_NONE writeback from on-stack callers this will
> be nessecary.
>
> Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline
> a few small functions into their only caller to make the code
> understandable.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
>
> and seems to be fixed by
>
> commit 83ba7b071f30f7c01f72518ad72d5cd203c27502
> Author: Christoph Hellwig <[email protected]>
> Date: Tue Jul 6 08:59:53 2010 +0200
>
> writeback: simplify the write back thread queue
>
> First remove items from work_list as soon as we start working on them.This
> means we don't have to track any pending or visited state and can get
> rid of all the RCU magic freeing the work items - we can simply free
> them once the operation has finished. Second use a real completion for
> tracking synchronous requests - if the caller sets the completion pointer
> we complete it, otherwise use it as a boolean indicator that we can free
> the work item directly. Third unify struct wb_writeback_args and struct
> bdi_work into a single data structure, wb_writeback_work. Previous we
> set all parameters into a struct wb_writeback_args, copied it into
> struct bdi_work, copied it again on the stack to use it there. Instead
> of just allocate one structure dynamically or on the stack and use it
> all the way through the stack.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
>
> I was able to reproduce the bug by unpacking a big tar-file and
> deleting this files multiple times. Normally with btrfs the kernel
> crashed within 20 runs. After commit
> 83ba7b071f30f7c01f72518ad72d5cd203c27502 it survived more than 500
> runs.

Makes sense, that first commit would potentially pass in stack cruft as
the wbc arg. So I think we can safely consider it fixed now.

--
Jens Axboe

2010-08-10 21:06:39

by Sebastian 'gonX' Jensen

[permalink] [raw]
Subject: Re: csum errors

On 17 July 2010 06:55, Brian Rogers <[email protected]> wrote:
> On 07/15/2010 12:35 PM, Chris Mason wrote:
>>
>> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
>>
>>>
>>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
>>>
>>>>
>>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
>>>>
>>>>>
>>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
>>>>>
>>>>>>
>>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883
>>
>> Great.   The bad csums are all just one bit off, that can't be an
>> accident.  When were they written (which kernel?).  Did you boot a 32
>> bit kernel on there at any time?
>>
>
> I've seen this as well, with three files. In all instances, csum == *private
> + 1. Here are the unique lines from dmesg:
>
> [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266
> private 2415136265
> [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137
> private 1385284136
> [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum 1385781806
> private 1385781805
>
> All three files are from when I first transitioned to btrfs (or more
> accurately, they are clones of those files I made to hold onto a copy of the
> corrupted version). Since the vast majority of my disk usage comes from the
> transition anyway, I can't be sure this is due to a problem only present at
> that time. I believe I was running 2.6.34 when I copied my files over to my
> new btrfs partition, but I'm going from memory here.
>
> My btrfs partition has never been touched by a 32-bit kernel.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

I am also getting this now:

btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498

A bit unrelated, but I was doing this while doing a rebalance across
my drives. RAID-0. I think it's a different issue though, but I'll go
ahead and post my dmesg output anyway:

------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:1980!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
CPU 1
Modules linked in: crc32c_intel ipv6 microcode cifs cpufreq_ondemand
hwmon_vid coretemp fuse ext2 mbcache raid1 md_mod usbhid hid
rndis_wlan cfg80211 rfkill rndis_host cdc_ether usbnet
snd_hda_codec_realtek snd_usb_audio snd_usbmidi_lib snd_hda_intel
snd_rawmidi snd_seq_dummy snd_seq_oss snd_hda_codec snd_seq_midi_event
snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nvidia(P) snd_hwdep
snd_pcm snd_timer snd uhci_hcd i2c_i801 acpi_cpufreq soundcore
freq_table ehci_hcd tpm_tis i2c_core tpm i7core_edac r8169
snd_page_alloc mperf iTCO_wdt mii iTCO_vendor_support wmi tpm_bios
usbcore edac_core processor fan thermal pcspkr button evdev rtc_cmos
rtc_core rtc_lib btrfs zlib_deflate crc32c libcrc32c sr_mod sg cdrom
sd_mod ata_piix pata_acpi ata_generic libata scsi_mod

Pid: 3480, comm: btrfs Tainted: P 2.6.35-ARCH #1 121-BL-E756/OEM
RIP: 0010:[<ffffffffa018158a>] [<ffffffffa018158a>]
btrfs_balance+0x24a/0x260 [btrfs]
RSP: 0018:ffff8801d5675d48 EFLAGS: 00010282
RAX: 00000000fffffffb RBX: ffff8801d8b60ab0 RCX: ffffea00054814a8
RDX: 0000000000000004 RSI: ffffffff81530df0 RDI: 0000000000000282
RBP: ffff8801d5675dc8 R08: 0000000000000002 R09: 0000000000000002
R10: ffff88000001d240 R11: 0000000000000000 R12: ffff8801d9eec800
R13: 0000000000000000 R14: ffff8801d5675d78 R15: 00000659f9210000
FS: 00007f7c3aef8740(0000) GS:ffff880001820000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff55a388000 CR3: 00000001d8d26000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process btrfs (pid: 3480, threadinfo ffff8801d5674000, task ffff8801d77ce120)
Stack:
ffff8801d9eed000 0000000000000100 0000000000000100 000659f9210000e4
<0> 00007f7c3a84a200 0000000000000000 0000000000000100 00065a7920ffffe4
<0> ffff8801d5675e00 ffffffff810f8ea3 0000000000000000 ffff8801bebd6e40
Call Trace:
[<ffffffff810f8ea3>] ? handle_mm_fault+0x1c3/0xab0
[<ffffffffa0188520>] btrfs_ioctl+0x2c0/0x940 [btrfs]
[<ffffffff811334ac>] vfs_ioctl+0x3c/0xd0
[<ffffffff81133aac>] do_vfs_ioctl+0x7c/0x520
[<ffffffff81133fd1>] sys_ioctl+0x81/0xa0
[<ffffffff81009e82>] system_call_fastpath+0x16/0x1b
Code: 6d 62 fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 80 1c 00
00 e8 a6 fd 1e e1 e9 fe fd ff ff 45 31 ed eb d7 0f 0b 85 c0 74 a7 <0f>
0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00
RIP [<ffffffffa018158a>] btrfs_balance+0x24a/0x260 [btrfs]
RSP <ffff8801d5675d48>
---[ end trace 7501e28b9295d333 ]---

# btrfs-show
Label: none uuid: 405f0d0b-ee4d-4426-9826-d2580d0c8d6c
Total devices 2 FS bytes used 178.06GB
devid 5 size 232.55GB used 91.63GB path /dev/sde3
devid 7 size 189.82GB used 91.63GB path /dev/sda2

Btrfs Btrfs v0.19

Regards,
Sebastian J.

2010-08-13 12:19:18

by David Woodhouse

[permalink] [raw]
Subject: Re: kernel BUG at fs/btrfs/extent-tree.c:1353

On Thu, 2010-07-08 at 10:31 -0400, Chris Mason wrote:
> Neither Yan nor I have been able to reproduce this locally, but a few
> people have now hit it. Johannes, are you available to try out a
> debugging kernel to try and track this down?

I have a machine you can log into, with a spare file system that causes
this. It's 200GiB so non-trivial to upload, unless you have a tool like
the xfs one which extracts just the important data structures and leaves
all the data extents and free space as zero.

It was created with the MeeGo 2.6.33.6 kernel -- which I don't think
contains commit 7f0e7bed which is later alleged to be the cause.

I've also tried booting this machine with a current kernel from git,
which includes commit 83ba7b07 which is alleged to be a fix -- and btrfs
still oopses just the same.

btrfsck and btrfs-dump-tree both abort with:
extent-tree.c:1755: update_space_info: Assertion `!(found->total_bytes < found->bytes_used)' failed.

I can mount it read-only though and read certain things out of it. But
when I boot from it, I hit the BUG().

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2010-08-14 07:11:29

by Brian Rogers

[permalink] [raw]
Subject: Re: csum errors

On 08/10/2010 02:06 PM, Sebastian 'gonX' Jensen wrote:
> On 17 July 2010 06:55, Brian Rogers<[email protected]> wrote:
>> On 07/15/2010 12:35 PM, Chris Mason wrote:
>>> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
>>>
>>>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
>>>>
>>>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
>>>>>
>>>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
>>>>>>
>>>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883
>>> Great. The bad csums are all just one bit off, that can't be an
>>> accident. When were they written (which kernel?). Did you boot a 32
>>> bit kernel on there at any time?
>>>
>> I've seen this as well, with three files. In all instances, csum == *private
>> + 1. Here are the unique lines from dmesg:
>>
>> [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266
>> private 2415136265
>> [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137
>> private 1385284136
>> [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum 1385781806
>> private 1385781805
>>
>> All three files are from when I first transitioned to btrfs (or more
>> accurately, they are clones of those files I made to hold onto a copy of the
>> corrupted version). Since the vast majority of my disk usage comes from the
>> transition anyway, I can't be sure this is due to a problem only present at
>> that time. I believe I was running 2.6.34 when I copied my files over to my
>> new btrfs partition, but I'm going from memory here.
>>
>> My btrfs partition has never been touched by a 32-bit kernel.
> I am also getting this now:
>
> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
>
> A bit unrelated, but I was doing this while doing a rebalance across
> my drives. RAID-0.

I get this as well on single-drive btrfs. I cleaned out all the files
that produce a csum error when read normally, but I still get the error
during a rebalance. I can read all the files on any subvolume with the
matching inode number just fine. If I delete the mentioned files or
replace them with new copies and do a rebalance again, I'll get the same
error again on a different inode number.

I did two rebalance runs in a row (with a reboot between each) without
deleting the problem inode to see if it would fail in the same place
each time. The inode number varied, but the block group, offset, and
checksums were the same:

Run 1:
[63978.519791] btrfs: relocating block group 511130468352 flags 1
[63980.401249] btrfs csum failed ino 418 off 9949184 csum 1385781806
private 1385781805
[63980.499024] btrfs csum failed ino 418 off 9949184 csum 1385781806
private 1385781805
[63980.535384] btrfs csum failed ino 418 off 9949184 csum 1385781806
private 1385781805
[63980.570196] btrfs csum failed ino 418 off 9949184 csum 1385781806
private 1385781805

Run 2:
[51317.967011] btrfs: relocating block group 511130468352 flags 1
[51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805

These files should have different contents (unfortunately I already
deleted them by now), so I don't know what they're doing at the same
offset, sharing the same checksum... Could these files both be inlined
in the same chunk of metadata, or does this mean something else?

Also, I wonder if the miscalculated checksum is something that happens
non-deterministically, or if it's just that the inodes were processed in
a different order the second time...

It certainly seems significant that the inode number is always low. The
balance always runs for quite a while before hitting a problem, and
since it appears to start from the end of the disk, it seems that only
the earliest and lowest-numbered inodes at the beginning of the disk can
cause this problem.

Complete crash from dmesg:

[51317.967011] btrfs: relocating block group 511130468352 flags 1
[51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806
private 1385781805
[51327.954315] ------------[ cut here ]------------
[51327.954322] kernel BUG at
/build/buildd/linux-2.6.35/fs/btrfs/volumes.c:1980!
[51327.954326] invalid opcode: 0000 [#1] SMP
[51327.954330] last sysfs file:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1f/PNP0C0A:00/power_supply/BAT1/charge_full
[51327.954334] CPU 0
[51327.954336] Modules linked in: ip6table_filter ip6_tables hidp hid
binfmt_misc rfcomm parport_pc ppdev sco bnep l2cap ipt_MASQUERADE
iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables
bridge stp microcode joydev i915 snd_hda_codec_si3054
snd_hda_codec_realtek drm_kms_helper drm i2c_algo_bit snd_hda_intel
snd_hda_codec arc4 snd_hwdep uinput snd_pcm iwl3945 video snd_seq_midi
snd_rawmidi snd_seq_midi_event iwlcore snd_seq snd_timer snd_seq_device
lp snd mac80211 soundcore output psmouse btusb intel_agp serio_raw
cfg80211 bluetooth snd_page_alloc parport btrfs zlib_deflate
firewire_ohci firewire_core ahci crc_itu_t sdhci_pci sdhci led_class tg3
crc32c libahci libcrc32c
[51327.954396]
[51327.954400] Pid: 15426, comm: btrfs Not tainted 2.6.35-15-generic
#21-Ubuntu IFT01 /N/A
[51327.954404] RIP: 0010:[<ffffffffa00cc25f>] [<ffffffffa00cc25f>]
btrfs_balance+0x24f/0x260 [btrfs]
[51327.954425] RSP: 0018:ffff88012eb95dc8 EFLAGS: 00010282
[51327.954428] RAX: 00000000fffffffb RBX: ffff880037c78480 RCX:
0200000000004081
[51327.954431] RDX: 0000000000000003 RSI: ffffea0003ea1640 RDI:
0000000000000282
[51327.954434] RBP: ffff88012eb95e48 R08: 0000000000000000 R09:
0000000000000000
[51327.954437] R10: 0000000000000069 R11: 0000000000000001 R12:
ffff880138da6800
[51327.954439] R13: 0000000000000000 R14: 0000007701c00000 R15:
ffff88012eb95df8
[51327.954443] FS: 00007fbea8710740(0000) GS:ffff880001e00000(0000)
knlGS:0000000000000000
[51327.954446] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[51327.954449] CR2: 00007f99c0088cc1 CR3: 0000000114bee000 CR4:
00000000000006f0
[51327.954452] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[51327.954455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[51327.954458] Process btrfs (pid: 15426, threadinfo ffff88012eb94000,
task ffff88003fb6adc0)
[51327.954460] Stack:
[51327.954462] ffff880138da7000 0000000000000100 0000000000000100
00007701c00000e4
[51327.954467] <0> ffff880100001c00 ffff88013fc31400 0000000000000100
0000e15b3fffffe4
[51327.954473] <0> ffff88012eb95e00 ffffffff811280f5 ffff8801315f5038
ffff880115d35600
[51327.954478] Call Trace:
[51327.954486] [<ffffffff811280f5>] ? page_add_new_anon_rmap+0x95/0xa0
[51327.954500] [<ffffffffa00d44b0>] btrfs_ioctl+0x2c0/0x4c0 [btrfs]
[51327.954505] [<ffffffff811615ad>] vfs_ioctl+0x3d/0xd0
[51327.954509] [<ffffffff81161e81>] do_vfs_ioctl+0x81/0x340
[51327.954514] [<ffffffff8158c8ae>] ? do_page_fault+0x15e/0x350
[51327.954517] [<ffffffff811621c1>] sys_ioctl+0x81/0xa0
[51327.954523] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[51327.954525] Code: fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 20
1c 00 00 e8 e3 b0 4b e1 e9 00 fe ff ff 45 31 ed eb d7 0f 0b eb fe 85 c0
74 a5 <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 90 55 48 89 e5
[51327.954567] RIP [<ffffffffa00cc25f>] btrfs_balance+0x24f/0x260 [btrfs]
[51327.954580] RSP <ffff88012eb95dc8>
[51327.954583] ---[ end trace 0bf81e832fde7349 ]---

2010-08-14 11:10:47

by Sebastian 'gonX' Jensen

[permalink] [raw]
Subject: Re: csum errors

On 14 August 2010 09:05, Brian Rogers <[email protected]> wrote:
>  On 08/10/2010 02:06 PM, Sebastian 'gonX' Jensen wrote:
>>
>> On 17 July 2010 06:55, Brian Rogers<[email protected]>  wrote:
>>>
>>> On 07/15/2010 12:35 PM, Chris Mason wrote:
>>>>
>>>> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
>>>>
>>>>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:
>>>>>
>>>>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
>>>>>>
>>>>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:
>>>>>>>
>>>>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883
>>>>
>>>> Great.   The bad csums are all just one bit off, that can't be an
>>>> accident.  When were they written (which kernel?).  Did you boot a 32
>>>> bit kernel on there at any time?
>>>>
>>> I've seen this as well, with three files. In all instances, csum ==
>>> *private
>>> + 1. Here are the unique lines from dmesg:
>>>
>>> [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266
>>> private 2415136265
>>> [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137
>>> private 1385284136
>>> [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum
>>> 1385781806
>>> private 1385781805
>>>
>>> All three files are from when I first transitioned to btrfs (or more
>>> accurately, they are clones of those files I made to hold onto a copy of
>>> the
>>> corrupted version). Since the vast majority of my disk usage comes from
>>> the
>>> transition anyway, I can't be sure this is due to a problem only present
>>> at
>>> that time. I believe I was running 2.6.34 when I copied my files over to
>>> my
>>> new btrfs partition, but I'm going from memory here.
>>>
>>> My btrfs partition has never been touched by a 32-bit kernel.
>>
>> I am also getting this now:
>>
>> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
>> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
>> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
>> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498
>>
>> A bit unrelated, but I was doing this while doing a rebalance across
>> my drives. RAID-0.
>
> I get this as well on single-drive btrfs. I cleaned out all the files that
> produce a csum error when read normally, but I still get the error during a
> rebalance. I can read all the files on any subvolume with the matching inode
> number just fine. If I delete the mentioned files or replace them with new
> copies and do a rebalance again, I'll get the same error again on a
> different inode number.
>
> I did two rebalance runs in a row (with a reboot between each) without
> deleting the problem inode to see if it would fail in the same place each
> time. The inode number varied, but the block group, offset, and checksums
> were the same:
>
> Run 1:
> [63978.519791] btrfs: relocating block group 511130468352 flags 1
> [63980.401249] btrfs csum failed ino 418 off 9949184 csum 1385781806 private
> 1385781805
> [63980.499024] btrfs csum failed ino 418 off 9949184 csum 1385781806 private
> 1385781805
> [63980.535384] btrfs csum failed ino 418 off 9949184 csum 1385781806 private
> 1385781805
> [63980.570196] btrfs csum failed ino 418 off 9949184 csum 1385781806 private
> 1385781805
>
> Run 2:
> [51317.967011] btrfs: relocating block group 511130468352 flags 1
> [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
>
> These files should have different contents (unfortunately I already deleted
> them by now), so I don't know what they're doing at the same offset, sharing
> the same checksum... Could these files both be inlined in the same chunk of
> metadata, or does this mean something else?
>
> Also, I wonder if the miscalculated checksum is something that happens
> non-deterministically, or if it's just that the inodes were processed in a
> different order the second time...
>
> It certainly seems significant that the inode number is always low. The
> balance always runs for quite a while before hitting a problem, and since it
> appears to start from the end of the disk, it seems that only the earliest
> and lowest-numbered inodes at the beginning of the disk can cause this
> problem.
>
> Complete crash from dmesg:
>
> [51317.967011] btrfs: relocating block group 511130468352 flags 1
> [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private
> 1385781805
> [51327.954315] ------------[ cut here ]------------
> [51327.954322] kernel BUG at
> /build/buildd/linux-2.6.35/fs/btrfs/volumes.c:1980!
> [51327.954326] invalid opcode: 0000 [#1] SMP
> [51327.954330] last sysfs file:
> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1f/PNP0C0A:00/power_supply/BAT1/charge_full
> [51327.954334] CPU 0
> [51327.954336] Modules linked in: ip6table_filter ip6_tables hidp hid
> binfmt_misc rfcomm parport_pc ppdev sco bnep l2cap ipt_MASQUERADE
> iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack
> ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp microcode
> joydev i915 snd_hda_codec_si3054 snd_hda_codec_realtek drm_kms_helper drm
> i2c_algo_bit snd_hda_intel snd_hda_codec arc4 snd_hwdep uinput snd_pcm
> iwl3945 video snd_seq_midi snd_rawmidi snd_seq_midi_event iwlcore snd_seq
> snd_timer snd_seq_device lp snd mac80211 soundcore output psmouse btusb
> intel_agp serio_raw cfg80211 bluetooth snd_page_alloc parport btrfs
> zlib_deflate firewire_ohci firewire_core ahci crc_itu_t sdhci_pci sdhci
> led_class tg3 crc32c libahci libcrc32c
> [51327.954396]
> [51327.954400] Pid: 15426, comm: btrfs Not tainted 2.6.35-15-generic
> #21-Ubuntu IFT01         /N/A
> [51327.954404] RIP: 0010:[<ffffffffa00cc25f>]  [<ffffffffa00cc25f>]
> btrfs_balance+0x24f/0x260 [btrfs]
> [51327.954425] RSP: 0018:ffff88012eb95dc8  EFLAGS: 00010282
> [51327.954428] RAX: 00000000fffffffb RBX: ffff880037c78480 RCX:
> 0200000000004081
> [51327.954431] RDX: 0000000000000003 RSI: ffffea0003ea1640 RDI:
> 0000000000000282
> [51327.954434] RBP: ffff88012eb95e48 R08: 0000000000000000 R09:
> 0000000000000000
> [51327.954437] R10: 0000000000000069 R11: 0000000000000001 R12:
> ffff880138da6800
> [51327.954439] R13: 0000000000000000 R14: 0000007701c00000 R15:
> ffff88012eb95df8
> [51327.954443] FS:  00007fbea8710740(0000) GS:ffff880001e00000(0000)
> knlGS:0000000000000000
> [51327.954446] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [51327.954449] CR2: 00007f99c0088cc1 CR3: 0000000114bee000 CR4:
> 00000000000006f0
> [51327.954452] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [51327.954455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [51327.954458] Process btrfs (pid: 15426, threadinfo ffff88012eb94000, task
> ffff88003fb6adc0)
> [51327.954460] Stack:
> [51327.954462]  ffff880138da7000 0000000000000100 0000000000000100
> 00007701c00000e4
> [51327.954467] <0> ffff880100001c00 ffff88013fc31400 0000000000000100
> 0000e15b3fffffe4
> [51327.954473] <0> ffff88012eb95e00 ffffffff811280f5 ffff8801315f5038
> ffff880115d35600
> [51327.954478] Call Trace:
> [51327.954486]  [<ffffffff811280f5>] ? page_add_new_anon_rmap+0x95/0xa0
> [51327.954500]  [<ffffffffa00d44b0>] btrfs_ioctl+0x2c0/0x4c0 [btrfs]
> [51327.954505]  [<ffffffff811615ad>] vfs_ioctl+0x3d/0xd0
> [51327.954509]  [<ffffffff81161e81>] do_vfs_ioctl+0x81/0x340
> [51327.954514]  [<ffffffff8158c8ae>] ? do_page_fault+0x15e/0x350
> [51327.954517]  [<ffffffff811621c1>] sys_ioctl+0x81/0xa0
> [51327.954523]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
> [51327.954525] Code: fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 20 1c
> 00 00 e8 e3 b0 4b e1 e9 00 fe ff ff 45 31 ed eb d7 0f 0b eb fe 85 c0 74 a5
> <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 90 55 48 89 e5
> [51327.954567] RIP  [<ffffffffa00cc25f>] btrfs_balance+0x24f/0x260 [btrfs]
> [51327.954580]  RSP <ffff88012eb95dc8>
> [51327.954583] ---[ end trace 0bf81e832fde7349 ]---
>
>

Sorry to burst your bubble, but it's definitely not only on the
low-numbered inodes. It just segfaults when it comes across the first
checksum error on the inodes while balancing, and this is something
copying files won't do. I'm in the process of moving most of my
storage stored on my 2-drive RAID-0 btrfs array, and I get
Input/Output errors from time to time. It's a bit annoying, but I can
live with it since I've kept backup of most of those files. The backup
drive isn't on btrfs, so the errors aren't from that drive. This is
what my dmesg looks like after having it all copied over:

btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217
btrfs csum failed ino 388823 off 563412992 csum 3815793104 private 771608454
btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447
btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447
btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447
btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035
btrfs csum failed ino 389807 off 97808384 csum 558603966 private 313438526
btrfs csum failed ino 389807 off 97812480 csum 934061497 private 2103939083
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854
btrfs csum failed ino 4290280 off 389795840 csum 257184388 private 2989030598
btrfs csum failed ino 4290334 off 22917120 csum 1416109476 private 4240914906
btrfs csum failed ino 4290415 off 171986944 csum 2046579276 private 2822934100
btrfs csum failed ino 409660 extent 40120546816 csum 4161269104 wanted
4161269103 mirror 0
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 0
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1
btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted
4161269103 mirror 1