2009-10-17 22:34:55

by Justin Piszcz

Subject: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)

Hello,

I have a system I recently upgraded from 2.6.30.x, and after approximately
24-48 hours--sometimes longer--the system cannot write any more files to
disk (luckily I can still write to /dev/shm), which is where I saved the
sysrq-t and sysrq-w output:

http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
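
For reference, dumps like these can be captured with the standard
/proc/sysrq-trigger interface; a rough sketch (the exact capture steps may
have differed):

echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions
echo w > /proc/sysrq-trigger         # dump blocked (uninterruptible) tasks
dmesg > /dev/shm/sysrq-w.txt         # /dev/shm is still writable during the hang
echo t > /proc/sysrq-trigger         # dump all task states and stacks
dmesg > /dev/shm/sysrq-t.txt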

Configuration:

$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
129596288 blocks [2/2] [UU]

md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

md0 : active raid1 sdb1[1] sda1[0]
16787776 blocks [2/2] [UU]

$ mount
/dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw,noatime)
/dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

Distribution: Debian Testing
Arch: x86_64

The problem occurs with 2.6.31; I upgraded to 2.6.31.4 and the problem
persists.

Here is a snippet of two processes in D-state; the first was not doing
anything, the second was mrtg.

[121444.684000] pickup D 0000000000000003 0 18407 4521 0x00000000
[121444.684000] ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
[121444.684000] 000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
[121444.684000] 000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
[121444.684000] Call Trace:
[121444.684000] [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
[121444.684000] [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
[121444.684000] [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
[121444.684000] [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
[121444.684000] [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
[121444.684000] [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
[121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

[121444.684000] rateup D 0000000000000000 0 18538 18465 0x00000000
[121444.684000] ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
[121444.684000] 000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
[121444.684000] 0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
[121444.684000] Call Trace:
[121444.684000] [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
[121444.684000] [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
[121444.684000] [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
[121444.684000] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[121444.684000] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[121444.684000] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[121444.684000] [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
[121444.684000] [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
[121444.684000] [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
[121444.684000] [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
[121444.684000] [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
[121444.684000] [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
[121444.684000] [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
[121444.684000] [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
[121444.684000] [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
[121444.684000] [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
[121444.684000] [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
[121444.684000] [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
[121444.684000] [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
[121444.684000] [<ffffffff8108b21a>] ? __writepage+0xa/0x40
[121444.684000] [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
[121444.684000] [<ffffffff8108b210>] ? __writepage+0x0/0x40
[121444.684000] [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
[121444.684000] [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
[121444.684000] [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
[121444.684000] [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
[121444.684000] [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
[121444.684000] [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
[121444.684000] [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
[121444.684000] [<ffffffff810af556>] ? filp_close+0x56/0x90
[121444.684000] [<ffffffff810af636>] ? sys_close+0xa6/0x100
[121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

Anyone know what is going on here?

Justin.


2009-10-18 20:17:41

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,

It has happened again, all sysrq-X output was saved this time.

wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

Kernel configuration:

wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.30.9.txt
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.31.4.txt

Diff of the two configs:

$ diff config-2.6.30.9.txt config-2.6.31.4.txt |grep -v "#"|grep "_"
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_CONSTRUCTORS=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_HAVE_DMA_ATTRS=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_X86_NEW_MCE=y
> CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_UNEVICTABLE_LRU=y
< CONFIG_PHYSICAL_START=0x200000
> CONFIG_PHYSICAL_START=0x1000000
< CONFIG_PHYSICAL_ALIGN=0x200000
> CONFIG_PHYSICAL_ALIGN=0x1000000
< CONFIG_COMPAT_NET_DEV_OPS=y
< CONFIG_SND_JACK=y
> CONFIG_HID_DRAGONRISE=y
> CONFIG_HID_GREENASIA=y
> CONFIG_HID_SMARTJOYPLUS=y
> CONFIG_HID_THRUSTMASTER=y
> CONFIG_HID_ZEROPLUS=y
> CONFIG_FSNOTIFY=y
> CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
> CONFIG_HAVE_ARCH_KMEMCHECK=y

I have reverted to 2.6.30.9 to see whether the problem recurs with this
kernel version.

I do not recall seeing this on the older 2.6.30.x kernels:

[ 9.276427] md3: detected capacity change from 0 to 5251073572864
[ 9.277411] md2: detected capacity change from 0 to 132706598912
[ 9.278305] md1: detected capacity change from 0 to 139722752
[ 9.278921] md0: detected capacity change from 0 to 17190682624

Again, some more D-state processes:

[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608087] Call Trace:
[76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40

[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608214] Call Trace:
[76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
[76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
[76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
[76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
[76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
[76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
[76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
[76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0

All of the D-state processes:

$ cat sysrq-w.txt |grep ' D'
[76307.285125] alpine D 0000000000000000 0 7659 29120 0x00000000
[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608257] syslogd D 0000000000000002 0 2438 1 0x00000000
[76325.608318] freshclam D 0000000000000000 0 2877 1 0x00000000
[76325.608428] asterisk D 0000000000000001 0 3278 1 0x00000000
[76325.608492] console-kit-d D 0000000000000000 0 3299 1 0x00000000
[76325.608562] dhcpd3 D 0000000000000000 0 3554 1 0x00000000
[76325.608621] plasma-deskto D 0000000000000002 0 32482 1 0x00000000
[76325.608713] kaccess D 0000000000000001 0 32488 1 0x00000000
[76325.608752] mail D 0000000000000000 0 7397 7386 0x00000000
[76325.608830] hal-acl-tool D 0000000000000000 0 7430 3399 0x00000004
[76325.608888] mrtg D 0000000000000000 0 7444 7433 0x00000000
[76325.608981] cron D 0000000000000000 0 7500 3630 0x00000000
[76325.609000] alpine D 0000000000000000 0 7659 29120 0x00000000

List of functions underneath the D-state processes (sorted/uniqued)--
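
A summary like this can be generated from the sysrq output with a pipeline
along these lines (a sketch; the exact command may have differed):

$ grep -o '\[<ffffffff[0-9a-f]*>\] ? [^ ]*' sysrq-t.txt | sort | uniq -c | sort -rn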

121 [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
77 [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
62 [<ffffffff814543a5>] ? schedule_timeout+0x165/0x1a0
60 [<ffffffff813bc1f6>] ? __alloc_skb+0x66/0x170
60 [<ffffffff813b3e59>] ? sys_sendto+0x119/0x180
59 [<ffffffff81428397>] ? unix_dgram_sendmsg+0x467/0x5c0
59 [<ffffffff81427ce6>] ? unix_wait_for_peer+0x86/0xd0
59 [<ffffffff813bd497>] ? memcpy_fromiovec+0x57/0x80
59 [<ffffffff813b6c29>] ? sock_alloc_send_pskb+0x1d9/0x2f0
59 [<ffffffff813b3a4b>] ? sock_sendmsg+0xcb/0x100
59 [<ffffffff813b3062>] ? sockfd_lookup_light+0x22/0x80
58 [<ffffffff814287ed>] ? unix_dgram_connect+0xad/0x270
58 [<ffffffff813b3336>] ? sys_connect+0x86/0xe0
57 [<ffffffff81427ed5>] ? unix_find_other+0x1a5/0x200
57 [<ffffffff810c9d13>] ? mntput_no_expire+0x23/0xf0
57 [<ffffffff810a3e74>] ? page_add_new_anon_rmap+0x54/0x90
57 [<ffffffff8105947e>] ? current_fs_time+0x1e/0x30
55 [<ffffffff81085445>] ? filemap_fault+0x95/0x3e0
8 [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
7 [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
7 [<ffffffff810af727>] ? do_sys_open+0x97/0x150
6 [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
5 [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
4 [<ffffffff811f5284>] ? xfs_file_fsync+0x54/0x70
4 [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
4 [<ffffffff811f0242>] ? kmem_zone_zalloc+0x32/0x50
4 [<ffffffff811f01d3>] ? kmem_zone_alloc+0x83/0xc0
4 [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
4 [<ffffffff810d3a4b>] ? sys_fsync+0xb/0x20
4 [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
4 [<ffffffff810d394e>] ? vfs_fsync+0x9e/0x110
4 [<ffffffff810bbcde>] ? __link_path_walk+0x7e/0x1000
3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
2 [<ffffffff81455151>] ? __down+0x61/0xa0
2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
2 [<ffffffff810c31dc>] ? dput+0xac/0x160
2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
2 [<ffffffff8106e016>] ? down+0x46/0x50
2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
1 [<ffffffff81455857>] ? __down_read+0x17/0xae
1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
1 [<ffffffff810af556>] ? filp_close+0x56/0x90
1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20

2009-10-19 03:05:07

by Dave Chinner

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)

On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log IO not completing.

That is, all of the D state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?
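
For the next capture, something along these lines (just a sketch) would pull
out the stacks of those threads if they are present in the dump:

$ grep -A 10 -E 'xfslogd|xfsdatad' sysrq-t.txt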

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-10-19 10:18:56

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Mon, 19 Oct 2009, Dave Chinner wrote:

> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

Hello,

So far I do not have a reproducible test case. The only other thing not posted
was the output of ps auxww at the time of the lockup; not sure if it will
help, but here it is:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
root 2 0.0 0.0 0 0 ? S< Oct16 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/1]
root 6 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/1]
root 7 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/2]
root 8 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/2]
root 9 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/3]
root 10 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/3]
root 11 0.0 0.0 0 0 ? R< Oct16 0:00 [events/0]
root 12 0.0 0.0 0 0 ? S< Oct16 0:00 [events/1]
root 13 0.0 0.0 0 0 ? S< Oct16 0:00 [events/2]
root 14 0.0 0.0 0 0 ? S< Oct16 0:00 [events/3]
root 15 0.0 0.0 0 0 ? S< Oct16 0:00 [khelper]
root 20 0.0 0.0 0 0 ? S< Oct16 0:00 [async/mgr]
root 180 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/0]
root 181 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/1]
root 182 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/2]
root 183 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/3]
root 185 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpid]
root 186 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_notify]
root 187 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_hotplug]
root 271 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/0]
root 272 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/1]
root 273 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/2]
root 274 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/3]
root 275 0.0 0.0 0 0 ? S< Oct16 0:00 [ata_aux]
root 276 0.0 0.0 0 0 ? S< Oct16 0:00 [ksuspend_usbd]
root 280 0.0 0.0 0 0 ? S< Oct16 0:00 [khubd]
root 283 0.0 0.0 0 0 ? S< Oct16 0:00 [kseriod]
root 318 0.0 0.0 0 0 ? S< Oct16 0:00 [khpsbpkt]
root 361 0.0 0.0 0 0 ? S Oct16 0:00 [pdflush]
root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush]
root 363 0.0 0.0 0 0 ? S< Oct16 0:21 [kswapd0]
root 364 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/0]
root 365 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/1]
root 366 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/2]
root 367 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/3]
root 368 0.0 0.0 0 0 ? S< Oct16 0:00 [nfsiod]
root 369 0.0 0.0 0 0 ? S< Oct16 0:00 [cifsoplockd]
root 370 0.0 0.0 0 0 ? S< Oct16 0:00 [xfs_mru_cache]
root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
root 518 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_0]
root 521 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_1]
root 524 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_2]
root 527 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_3]
root 530 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_4]
root 533 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_5]
root 542 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_6]
root 545 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_7]
root 551 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_8]
root 554 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_9]
root 558 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_10]
root 562 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_11]
root 568 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_12]
root 571 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_13]
root 584 0.0 0.0 0 0 ? S< Oct16 0:00 [knodemgrd_0]
root 616 0.0 0.0 0 0 ? S< Oct16 0:00 [kpsmoused]
root 666 0.0 0.0 0 0 ? S< Oct16 0:00 [usbhid_resumer]
root 683 0.0 0.0 0 0 ? S< Oct16 0:00 [hd-audio0]
root 703 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/0]
root 704 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/1]
root 705 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/2]
root 706 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/3]
root 811 4.0 0.0 0 0 ? S< Oct16 81:54 [md3_raid5]
root 817 0.3 0.0 0 0 ? S< Oct16 6:30 [md2_raid1]
root 823 0.0 0.0 0 0 ? S< Oct16 0:00 [md1_raid1]
root 827 0.0 0.0 0 0 ? S< Oct16 0:50 [md0_raid1]
root 829 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 830 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd]
root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon
postfix 1649 0.0 0.0 39124 2468 ? S 05:00 0:00 qmgr -l -t fifo -u -c
www-data 1877 0.0 0.0 146612 5248 ? S 05:01 0:00 /usr/sbin/apache2 -k start
root 3182 0.0 0.0 0 0 ? S< Oct16 0:00 [kjournald]
root 3183 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 3184 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 3185 0.0 0.0 0 0 ? S< Oct16 0:00 [xfssyncd]
root 3337 0.0 0.0 8072 132 ? Ss Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root 3339 0.0 0.0 8072 424 ? Ds Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon 3526 0.0 0.0 8016 564 ? Ss Oct16 0:00 /sbin/portmap
statd 3538 0.0 0.0 10148 776 ? Ss Oct16 0:00 /sbin/rpc.statd
root 3547 0.0 0.0 26956 572 ? Ss Oct16 0:00 /usr/sbin/rpc.idmapd
root 3732 0.0 0.0 5900 676 ? Ds Oct16 0:00 /sbin/syslogd
root 3741 0.0 0.0 3796 452 ? Ss Oct16 0:00 /sbin/klogd -x
root 3750 0.0 0.0 3804 640 ? Ss Oct16 0:00 /usr/sbin/acpid
110 3760 0.0 0.0 23560 1476 ? Ss Oct16 0:02 /usr/bin/dbus-daemon --system
bind 3773 0.0 0.8 264364 70004 ? Ssl Oct16 0:05 /usr/sbin/named -u bind -S 1024
root 3796 0.0 0.0 49028 1260 ? Ss Oct16 0:00 /usr/sbin/sshd
root 3827 0.0 0.0 104804 7452 ? Ssl Oct16 0:08 /usr/sbin/console-kit-daemon
amavis 3909 0.0 1.1 217948 91004 ? Ss Oct16 0:00 amavisd (master)
nobody 3913 0.0 0.2 118612 21632 ? Sl Oct16 0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw 3932 0.0 0.1 51868 12084 ? Ss Oct16 0:00 policyd-weight (master)
polw 3933 0.0 0.1 51868 11744 ? Ss Oct16 0:00 policyd-weight (cache)
postfw 3937 0.0 0.1 55248 14064 ? Ss Oct16 0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey 3940 0.0 0.1 57900 13232 ? Ss Oct16 0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421
clamav 4166 0.0 1.3 215576 112772 ? Ssl Oct16 0:04 /usr/sbin/clamd
clamav 4262 0.0 0.0 25776 1328 ? Ss Oct16 0:00 /usr/bin/freshclam -d --quiet
amavis 4307 0.0 1.1 224072 96732 ? S Oct16 0:02 amavisd (ch14-avail)
amavis 4308 0.0 1.2 228168 100820 ? S Oct16 0:05 amavisd (ch14-avail)
root 4353 0.0 0.0 7224 3944 ? S Oct16 0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root 4362 0.0 0.0 8864 520 ? Ss Oct16 0:00 /usr/sbin/irqbalance
daemon 4377 0.0 0.0 38488 840 ? Ss Oct16 0:00 lpd Waiting
root 4412 0.0 0.0 0 0 ? S< Oct16 0:00 [lockd]
root 4413 0.0 0.0 0 0 ? S< Oct16 0:10 [nfsd]
root 4414 0.0 0.0 0 0 ? D< Oct16 0:03 [nfsd]
root 4415 0.0 0.0 0 0 ? D< Oct16 0:13 [nfsd]
root 4416 0.0 0.0 0 0 ? S< Oct16 0:07 [nfsd]
root 4417 0.0 0.0 0 0 ? D< Oct16 0:11 [nfsd]
root 4418 0.0 0.0 0 0 ? D< Oct16 0:01 [nfsd]
root 4419 0.0 0.0 0 0 ? D< Oct16 0:04 [nfsd]
root 4420 0.0 0.0 0 0 ? D< Oct16 0:02 [nfsd]
root 4424 0.0 0.0 18812 1208 ? Ss Oct16 0:00 /usr/sbin/rpc.mountd --manage-gids
oident 4432 0.0 0.0 12232 576 ? Ss Oct16 0:00 /usr/sbin/oidentd -m -u oident -g oident
root 4444 0.0 0.0 10124 660 ? Ss Oct16 0:00 /usr/sbin/inetd
root 4521 0.0 0.0 37020 2388 ? Ss Oct16 0:00 /usr/lib/postfix/master
root 4545 0.0 0.0 58084 1796 ? Ds Oct16 0:00 /usr/sbin/nmbd -D
root 4547 0.0 0.0 93724 3044 ? Ss Oct16 0:00 /usr/sbin/smbd -D
root 4568 0.0 0.0 93752 1776 ? S Oct16 0:00 /usr/sbin/smbd -D
root 4700 0.0 0.0 18768 1348 ? S Oct16 0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk 4714 1.3 0.2 666972 24460 ? Ssl Oct16 26:36 /usr/sbin/asterisk -p -U asterisk
asterisk 4715 0.0 0.0 13656 880 ? S Oct16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp 4734 0.0 0.0 23428 1400 ? Ss Oct16 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root 4789 0.0 0.0 16644 780 ? Ss Oct16 0:00 /usr/sbin/dovecot
root 4790 0.0 0.0 74664 3196 ? S Oct16 0:00 dovecot-auth
root 4792 0.0 0.0 74664 3240 ? S Oct16 0:00 dovecot-auth -w
111 4797 0.0 0.0 38084 4616 ? Ss Oct16 0:07 /usr/sbin/hald
root 4798 0.0 0.0 20040 1432 ? S Oct16 0:01 hald-runner
root 4895 0.0 0.0 22028 1240 ? S Oct16 0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root 4899 0.0 0.0 22032 1232 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root 4900 0.0 0.0 22028 1228 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111 4902 0.0 0.0 25920 1204 ? S Oct16 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root 4911 0.0 0.0 12484 656 ? Ss Oct16 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root 4929 0.0 0.0 24020 584 ? Ss Oct16 0:00 /usr/sbin/squid -D -YC
proxy 4931 0.0 0.0 26900 5136 ? S Oct16 0:00 (squid) -D -YC
root 4942 0.0 0.0 28924 3548 ? Ss Oct16 0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root 4943 0.0 0.0 17392 1380 ? S Oct16 0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon 4945 0.3 0.0 16596 3172 ? S Oct16 7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root 4947 0.0 0.0 93604 5832 ? SNs Oct16 0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root 4951 0.0 0.0 9296 3716 ? Ss Oct16 0:00 /usr/sbin/dhcpd3 -q eth0
nut 4964 0.0 0.0 14384 744 ? Ss Oct16 0:15 /lib/nut/usbhid-ups -a apc
nut 4966 0.0 0.0 14316 600 ? Ss Oct16 0:01 /sbin/upsd
root 4968 0.0 0.0 14284 728 ? Ss Oct16 0:00 /sbin/upsmon
nut 4970 0.0 0.0 14284 712 ? S Oct16 0:01 /sbin/upsmon
daemon 5007 0.0 0.0 16356 416 ? Ss Oct16 0:00 /usr/sbin/atd
root 5027 0.0 0.0 20988 1060 ? Ss Oct16 0:00 /usr/sbin/cron
dovecot 5887 0.0 0.0 18504 2108 ? S 15:31 0:00 imap-login
root 6184 0.0 0.0 73496 6380 ? Sl Oct16 0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root 6205 0.0 0.0 26292 720 ? Ss Oct16 0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root 6213 0.1 1.3 192020 112852 tty7 Ss+ Oct16 2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root 6284 0.0 0.0 5852 600 tty1 Ss+ Oct16 0:00 /sbin/getty 38400 tty1
root 6285 0.0 0.0 5852 596 tty2 Ss+ Oct16 0:00 /sbin/getty 38400 tty2
root 6286 0.0 0.0 5852 596 tty3 Ss+ Oct16 0:00 /sbin/getty 38400 tty3
root 6287 0.0 0.0 5852 600 tty4 Ss+ Oct16 0:00 /sbin/getty 38400 tty4
root 6288 0.0 0.0 5852 596 tty5 Ss+ Oct16 0:00 /sbin/getty 38400 tty5
root 6289 0.0 0.0 5852 600 tty6 Ss+ Oct16 0:00 /sbin/getty 38400 tty6
root 6297 0.0 0.0 55716 1864 ? S Oct16 0:00 -:0
postfix 6362 0.0 0.0 39076 2308 ? S 15:36 0:00 anvil -l -t unix -u -o max_idle=3600s
root 7003 0.0 0.0 80904 3412 ? Ss Oct16 0:00 sshd: ap [priv]
polw 7443 0.0 0.1 52000 12516 ? S Oct16 0:00 policyd-weight (child)
dovecot 8922 0.0 0.0 18504 2108 ? S 16:01 0:00 imap-login
dovecot 8923 0.0 0.0 18504 2112 ? S 16:01 0:00 imap-login
postfix 18407 0.0 0.0 39076 2352 ? D 17:50 0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424 0.0 0.0 36996 2260 ? D 17:54 0:00 /usr/sbin/postdrop -r
root 18425 0.0 0.0 46372 1440 ? S 17:55 0:00 /USR/SBIN/CRON
root 18459 0.0 0.0 11404 1328 ? Ss 17:55 0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root 18460 0.0 0.0 21888 1432 ? D 17:55 0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root 18465 0.0 0.2 43416 16892 ? S 17:55 0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root 18479 0.0 0.0 4140 612 ? S 17:55 0:00 tee -a /var/log/mrtg/mrtg.log
root 18538 0.0 0.0 38476 1976 ? D 17:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix 18539 0.0 0.1 50740 9836 ? S 17:55 0:00 cleanup -z -t unix -u -c
root 18555 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18556 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18557 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18558 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18559 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18560 0.0 0.0 46368 1172 ? S 18:02 0:00 /USR/SBIN/CRON
root 18561 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18562 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18563 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18564 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18565 0.0 0.0 46368 1172 ? S 18:09 0:00 /USR/SBIN/CRON
root 18566 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18567 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18568 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18578 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18579 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18580 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18581 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18582 0.0 0.0 46368 1172 ? S 18:17 0:00 /USR/SBIN/CRON
root 18583 0.0 0.0 46368 1172 ? S 18:18 0:00 /USR/SBIN/CRON
postfix 18584 0.0 0.0 71452 8104 ? S 18:18 0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix 18585 0.0 0.0 39076 2300 ? S 18:18 0:00 proxymap -t unix -u
sshd 18597 0.0 0.0 0 0 ? Z 18:19 0:00 [sshd] <defunct>
root 18601 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18602 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18603 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 30866 0.0 0.0 54392 944 ? Ss 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30867 0.0 0.0 54392 652 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30868 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30869 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30870 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 31018 0.0 0.0 146612 7856 ? Ss 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31025 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31044 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
root 31076 0.0 0.0 23128 756 ? Ss 04:31 0:00 pure-ftpd (SERVER)

2009-10-20 00:34:14

by Dave Chinner

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)

On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
> On Mon, 19 Oct 2009, Dave Chinner wrote:
>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>> It has happened again, all sysrq-X output was saved this time.
>> .....
>>
>> All pointing to log IO not completing.
>>
....
> So far I do not have a reproducible test case,

Ok. What sort of load is being placed on the machine?

> the only other thing not posted was the output of ps auxww during
> the time of the lockup, not sure if it will help, but here it is:
>
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
....
> root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
> root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
> root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
> root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
> root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
> root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
> root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
> root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
> root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
> root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
> root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
> root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
.....

It appears that both the xfslogd and the xfsdatad on CPU 0 are in
the running state but don't appear to be consuming any significant
CPU time. If they remain like this then I think that means they are
stuck waiting on the run queue. Do these XFS threads always appear
like this when the hang occurs? If so, is there something else that
is hogging CPU 0 preventing these threads from getting the CPU?
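
For example, something like this (a sketch) would show which tasks are
currently sitting on CPU 0:

$ ps -eo pid,psr,stat,pcpu,comm | awk '$2 == 0'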

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-10-20 08:33:35

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Tue, 20 Oct 2009, Dave Chinner wrote:

> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>> It has happened again, all sysrq-X output was saved this time.
>>> .....
>>>
>>> All pointing to log IO not completing.
>>>
> ....
>> So far I do not have a reproducible test case,
>
> Ok. What sort of load is being placed on the machine?
Hello, generally the load is low; it mainly serves out some Samba shares.

>
>> the only other thing not posted was the output of ps auxww during
>> the time of the lockup, not sure if it will help, but here it is:
>>
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
> ....
>> root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
>> root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
>> root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
>> root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
>> root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
>> root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
>> root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
>> root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
>> root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
>> root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
>> root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
>> root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
> .....
>
> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
> the running state but don't appear to be consuming any significant
> CPU time. If they remain like this then I think that means they are
> stuck waiting on the run queue. Do these XFS threads always appear
> like this when the hang occurs? If so, is there something else that
> is hogging CPU 0 preventing these threads from getting the CPU?
Yes, the XFS threads show up like this each time the kernel crashed. So far,
after ~48hrs+ with 2.6.30.9, it has not crashed, so it appears to be an issue
introduced between 2.6.30.9 and 2.6.31.x. Any recommendations on how to catch
this bug, e.g. with certain debug options enabled?


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2009-10-21 10:19:51

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Tue, 20 Oct 2009, Justin Piszcz wrote:


>
>
> On Tue, 20 Oct 2009, Dave Chinner wrote:
>
>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>> It has happened again, all sysrq-X output was saved this time.
>>>> .....
>>>>
>>>> All pointing to log IO not completing.
>>>>
>> ....
>>> So far I do not have a reproducible test case,
>>
>> Ok. What sort of load is being placed on the machine?
> Hello, generally the load is low, it mainly serves out some samba shares.
>
>>
>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>> the running state but don't appear to be consuming any significant
>> CPU time. If they remain like this then I think that means they are
>> stuck waiting on the run queue. Do these XFS threads always appear
>> like this when the hang occurs? If so, is there something else that
>> is hogging CPU 0 preventing these threads from getting the CPU?
> Yes, the XFS threads show up like this on each time the kernel crashed. So
> far
> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
> issue
> between 2.6.30.9 and 2.6.31.x when this began happening. Any recommendations
> on how to catch this bug w/certain options enabled/etc?
>
>
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]
>>
>

Uptime with 2.6.30.9:

06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07

No issues yet, so it first started happening in 2.6.(31).(x).

Any further recommendations on how to debug this issue? BTW: Do you view this
as an XFS bug or MD/VFS layer issue based on the logs/output thus far?

Justin.

2009-10-22 22:49:44

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>>
>>
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>>
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>>
>>>>> All pointing to log IO not completing.
>>>>>
>>> ....
>>>> So far I do not have a reproducible test case,
>>>
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low, it mainly serves out some samba shares.
>>
>>>
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue. Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this on each time the kernel crashed. So
>> far
>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
>> issue
>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>> recommendations
>> on how to catch this bug w/certain options enabled/etc?
>>
>>
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> [email protected]
>>>
>>
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.(31).(x).
>
> Any further recommendations on how to debug this issue? BTW: Do you view
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>

Any other ideas?

Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has
no load at all either.. Has anyone else reported similar problems?

Justin.

2009-10-22 23:00:20

by Dave Chinner

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)

On Thu, Oct 22, 2009 at 06:49:46PM -0400, Justin Piszcz wrote:
> On Wed, 21 Oct 2009, Justin Piszcz wrote:
>> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>>> the running state but don't appear to be consuming any significant
>>>> CPU time. If they remain like this then I think that means they are
>>>> stuck waiting on the run queue. Do these XFS threads always appear
>>>> like this when the hang occurs? If so, is there something else that
>>>> is hogging CPU 0 preventing these threads from getting the CPU?
>>> Yes, the XFS threads show up like this on each time the kernel
>>> crashed. So far
>>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be
>>> some issue
>>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>>> recommendations
>>> on how to catch this bug w/certain options enabled/etc?
>>
>> Uptime with 2.6.30.9:
>>
>> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>>
>> No issues yet, so it first started happening in 2.6.(31).(x).

Ok.

>> Any further recommendations on how to debug this issue? BTW: Do
>> you view this as an XFS bug or MD/VFS layer issue based on the
>> logs/output thus far?

Could be either. Nothing so far points at a cause.

> Any other ideas?

If it is relatively quick to reproduce, you could run a git bisect
to try to find the offending commit. Or, when it has locked up, run
oprofile with callgraph sampling so we can get an idea of what is
actually running when XFS appears to hang.
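
Roughly, something like this (a sketch only; adjust the good/bad endpoints
and the vmlinux path to your setup):

$ git bisect start
$ git bisect bad v2.6.31
$ git bisect good v2.6.30
# build and boot the kernel git checks out, run the usual workload, then:
$ git bisect good        # or 'git bisect bad' if it locked up again

# oprofile (0.9.x opcontrol interface) with callgraph sampling:
$ opcontrol --setup --vmlinux=/path/to/vmlinux --callgraph=8
$ opcontrol --start
# ... wait for the hang, then:
$ opreport --callgraph --symbols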

> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has
> no load at all either.. Has anyone else reported similar problems?

Not that I know of.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-10-26 11:24:39

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Thu, 22 Oct 2009, Justin Piszcz wrote:

>
> Any other ideas?
>
> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no
> load at all either.. Has anyone else reported similar problems?
>
> Justin.
>

--

Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2
later today and wait another 48 hours.

Justin.

2009-11-02 21:46:09

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)



On Mon, 26 Oct 2009, Justin Piszcz wrote:

>
>
> On Thu, 22 Oct 2009, Justin Piszcz wrote:
>
>>
>> Any other ideas?
>>
>> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no
>> load at all either.. Has anyone else reported similar problems?
>>
>> Justin.
>>
>
> --
>
> Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2
> later today and wait another 48 hours.
>
> Justin.
>
>

Kernel report:

2.6.31-rc1: no crash - uptime 2+ days
2.6.31-rc2: no crash - uptime 2+ days
2.6.31-rc3: no crash - uptime 2+ days
2.6.31-rc4: no crash but network kept dropping out
2.6.31-rc5: cannot test, service owner needs host available
2.6.31-rc6: cannot test, service owner needs host available
2.6.31-rc7: cannot test, service owner needs host available
2.6.31-rc8: cannot test, service owner needs host available
2.6.31-rc9: cannot test, service owner needs host available
2.6.31.x: locks up D-state

So the offending change would be somewhere between -rc4 and 2.6.31.x.

Justin.

2009-11-20 20:39:21

by Justin Piszcz

Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk

Package: asterisk
Version: 1.6.2.0~dfsg~rc1-1

See below for issue:

On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>>
>>
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>>
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>>
>>>>> All pointing to log IO not completing.
>>>>>
>>> ....
>>>> So far I do not have a reproducible test case,
>>>
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low, it mainly serves out some samba shares.
>>
>>>
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue. Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this on each time the kernel crashed. So
>> far
>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
>> issue
>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>> recommendations
>> on how to catch this bug w/certain options enabled/etc?
>>
>>
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> [email protected]
>>>
>>
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.(31).(x).
>
> Any further recommendations on how to debug this issue? BTW: Do you view
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>

Found the root cause-- the root cause is the asterisk PBX software. I use an
SPA3102. When someone called me, they accidentally dropped the connection, and
I called them back a short time later. It was during this window (and the last
time this happened) that the box froze, under multiple(!) kernels, always
while someone was calling.

I have removed asterisk but this is the version I was running:
~$ dpkg -l | grep -i asterisk
rc asterisk 1:1.6.2.0~dfsg~rc1-1 Open S

I don't know what asterisk is doing, but top did run before the crash:
asterisk was using 100% CPU and, as I noted before, all other processes
were in D-state.

When this bug occurs, it freezes I/O to all devices and the only way to recover
is to reboot the system.

Just FYI if anyone else out there has their system crash when running asterisk.

Just out of curiosity, has anyone else running asterisk had such an issue?
I was not running any special VoIP PCI cards/etc.

Justin.

2009-11-21 00:06:36

by Faidon Liambotis

Subject: Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk

Justin Piszcz wrote:
> Found root cause-- root cause is asterisk PBX software. I use an
SPA3102.
> When someone called me, they accidentally dropped the connection, I called
> them back in a short period. It is during this time (and the last time)
> this happened that the box froze under multiple(!) kernels, always when
> someone was calling.
<snip>
> I don't know what asterisk is doing but top did run before the crash
> and asterisk was using 100% CPU and as I noted before all other processes
> were in D-state.
>
> When this bug occurs, it freezes I/O to all devices and the only way to
> recover
> is to reboot the system.
That's obviously *not* the root cause.

It's not normal for an application that isn't even privileged to hang
all I/O and, subsequently, everything on the system.

This is almost certainly a kernel issue, and asterisk just does something
that triggers this bug.

Regards,
Faidon

2009-11-20 23:51:23

by Justin Piszcz

Subject: Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk



On Sat, 21 Nov 2009, Faidon Liambotis wrote:

> Justin Piszcz wrote:
> > Found root cause-- root cause is asterisk PBX software. I use an
> SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period. It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently everything on a system.
>
> This is almost probably a kernel issue and asterisk just does something
> that triggers this bug.
>
> Regards,
> Faidon
>

It is possible, although I tried with several kernels (2.6.30.[0-9] &
2.6.31+) and never had a crash with the earlier versions. I installed
asterisk long ago, but it was always 1.4.x until recently.. Nasty bug :\

Justin.

2009-11-21 14:29:19

by Roger Heflin

[permalink] [raw]
Subject: Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk

Faidon Liambotis wrote:
> Justin Piszcz wrote:
>> Found root cause-- root cause is asterisk PBX software. I use an SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period. It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently everything on a system.
>
> This is almost probably a kernel issue and asterisk just does something
> that triggers this bug.
>
> Regards,
> Faidon


I had an application on 2.6.5 (SLES9) that would hang XFS.

The application was multi-threaded, and both threads were doing full
disk syncs every so often. Sometimes, when doing the full disk sync,
the XFS subsystem would deadlock; it appeared to me that one sync held
a lock and was waiting for another, while the other process held the
second lock and was waiting for the first... We were able to disable
the full disk sync in the application and the deadlock went away. All
non-XFS filesystems still worked and could still be accessed. I did
report the bug with some traces, but I don't believe anyone ever
determined what the underlying issue was.
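
For what it is worth, the pattern was roughly equivalent to the following
crude shell approximation -- this is not the actual application, and /xfs is
a made-up mount point -- two writers on the same XFS filesystem, each issuing
periodic full syncs:

( while true; do dd if=/dev/zero of=/xfs/a bs=1M count=64 2>/dev/null; sync; done ) &
( while true; do dd if=/dev/zero of=/xfs/b bs=1M count=64 2>/dev/null; sync; done ) &
wait   # sync(1) flushes all filesystems, i.e. the "full disk sync" described above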

2009-11-24 13:08:04

by Justin Piszcz

[permalink] [raw]
Subject: Which kernel options should be enabled to find the root cause of this bug?



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,
>
> I have a system I recently upgraded from 2.6.30.x and after approximately
> 24-48 hours--sometimes longer, the system cannot write any more files to disk
> (luckily though I can still write to /dev/shm) -- to which I have
> saved the sysrq-t and sysrq-w output:
>
> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Configuration:
>
> $ cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md1 : active raid1 sdb2[1] sda2[0]
> 136448 blocks [2/2] [UU]
>
> md2 : active raid1 sdb3[1] sda3[0]
> 129596288 blocks [2/2] [UU]
>
> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1]
> sdc1[0]
> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> md0 : active raid1 sdb1[1] sda1[0]
> 16787776 blocks [2/2] [UU]
>
> $ mount
> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> udev on /dev type tmpfs (rw,mode=0755)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> /dev/md1 on /boot type ext3 (rw,noatime)
> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Distribution: Debian Testing
> Arch: x86_64
>
> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
> persists.
>
> Here is a snippet of two processes in D-state, the first was not doing
> anything, the second was mrtg.
>
> [121444.684000] pickup D 0000000000000003 0 18407 4521
> 0x00000000
> [121444.684000] ffff880231dd2290 0000000000000086 0000000000000000
> 0000000000000000
> [121444.684000] 000000000000ff40 000000000000c8c8 ffff880176794d10
> ffff880176794f90
> [121444.684000] 000000032266dd08 ffff8801407a87f0 ffff8800280878d8
> ffff880176794f90
> [121444.684000] Call Trace:
> [121444.684000] [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
> [121444.684000] [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
> [121444.684000] [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
> [121444.684000] [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
> [121444.684000] [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
> [121444.684000] [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
> [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> [121444.684000] rateup D 0000000000000000 0 18538 18465
> 0x00000000
> [121444.684000] ffff88023f8a8c10 0000000000000082 0000000000000000
> ffff88023ea09ec8
> [121444.684000] 000000000000ff40 000000000000c8c8 ffff88023faace50
> ffff88023faad0d0
> [121444.684000] 0000000300003e00 000000010720cc78 0000000000003e00
> ffff88023faad0d0
> [121444.684000] Call Trace:
> [121444.684000] [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
> [121444.684000] [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
> [121444.684000] [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
> [121444.684000] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [121444.684000] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [121444.684000] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [121444.684000] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
> [121444.684000] [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
> [121444.684000] [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
> [121444.684000] [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
> [121444.684000] [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
> [121444.684000] [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
> [121444.684000] [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
> [121444.684000] [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
> [121444.684000] [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
> [121444.684000] [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
> [121444.684000] [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
> [121444.684000] [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
> [121444.684000] [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
> [121444.684000] [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
> [121444.684000] [<ffffffff8108b21a>] ? __writepage+0xa/0x40
> [121444.684000] [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
> [121444.684000] [<ffffffff8108b210>] ? __writepage+0x0/0x40
> [121444.684000] [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
> [121444.684000] [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
> [121444.684000] [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
> [121444.684000] [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
> [121444.684000] [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
> [121444.684000] [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
> [121444.684000] [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
> [121444.684000] [<ffffffff810af556>] ? filp_close+0x56/0x90
> [121444.684000] [<ffffffff810af636>] ? sys_close+0xa6/0x100
> [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> Anyone know what is going on here?
>
> Justin.
>

In addition to using netconsole, which kernel options should be enabled
to better diagnose this issue?
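
(For completeness, netconsole here just means the module loaded along these
lines; the addresses, ports, interface, and MAC below are placeholders, not my
real setup:)

modprobe netconsole netconsole=6665@192.168.0.2/eth0,6666@192.168.0.10/00:11:22:33:44:55
# on the receiving host, something like: nc -u -l -p 6666 > netconsole.log
# (or "nc -u -l 6666" with the OpenBSD netcat)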

Should I enable these to help track down this bug?

[ ] XFS Debugging support (EXPERIMENTAL)
[ ] Compile the kernel with frame pointers
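
If it helps, I believe those two map to roughly the following .config symbols
(the hung-task detector is an extra guess on my part, not from the menu above):

CONFIG_XFS_DEBUG=y
CONFIG_FRAME_POINTER=y
# extra: periodic reports of tasks stuck in D-state for more than 120 seconds
CONFIG_DETECT_HUNG_TASK=y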

Are there any other recommended options that would help determine the root
cause of this bug?

Justin.

2009-11-24 15:46:53

by Eric Sandeen

[permalink] [raw]
Subject: Re: Which kernel options should be enabled to find the root cause of this bug?

Justin Piszcz wrote:
>
>
> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>
>> Hello,
>>
>> I have a system I recently upgraded from 2.6.30.x and after
>> approximately 24-48 hours--sometimes longer, the system cannot write
>> any more files to disk (luckily though I can still write to /dev/shm)
>> -- to which I have
>> saved the sysrq-t and sysrq-w output:
>>
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt

Unfortunately it looks like a lot of the sysrq-t, at least, was lost.

The sysrq-w trace has the "show blocked state" start a ways down the file,
for anyone playing along at home ;)

Another thing you might try is sysrq-m to get the memory state...

>> Configuration:
>>
>> $ cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md1 : active raid1 sdb2[1] sda2[0]
>> 136448 blocks [2/2] [UU]
>>
>> md2 : active raid1 sdb3[1] sda3[0]
>> 129596288 blocks [2/2] [UU]
>>
>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>> sdd1[1] sdc1[0]
>> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>
>> md0 : active raid1 sdb1[1] sda1[0]
>> 16787776 blocks [2/2] [UU]
>>
>> $ mount
>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> udev on /dev type tmpfs (rw,mode=0755)
>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>> /dev/md1 on /boot type ext3 (rw,noatime)
>> /dev/md3 on /r/1 type xfs
>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>> nfsd on /proc/fs/nfsd type nfsd (rw)

Do you get the same behavior if you don't add the log options at mount time?
Kind of grasping at straws here for now ...
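
(I.e., something like the following, letting XFS pick its own defaults --
device and mount point taken from your mount output above; the root fs would
need the same change in fstab plus a reboot:)

umount /r/1
mount -t xfs -o noatime /dev/md3 /r/1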

>> Distribution: Debian Testing
>> Arch: x86_64
>>
>> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
>> persists.
>>

...

> In addition to using netconsole, which kernel options should be enabled
> to better diagnose this issue?
>
> Should I enable these to help track down this bug?
>
> [ ] XFS Debugging support (EXPERIMENTAL)
> [ ] Compile the kernel with frame pointers

The former probably won't hurt; the latter might give us better backtraces.

> Are there any other options that will help determine the root cause of this
> bug that are recommended?

Not that I can think of off hand ...

-Eric

> Justin.

2009-11-24 16:19:55

by Justin Piszcz

[permalink] [raw]
Subject: Re: Which kernel options should be enabled to find the root cause of this bug?



On Tue, 24 Nov 2009, Eric Sandeen wrote:

> Justin Piszcz wrote:
>>
>>
>> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>>
>>> Hello,
>>>
>>> I have a system I recently upgraded from 2.6.30.x and after
>>> approximately 24-48 hours--sometimes longer, the system cannot write
>>> any more files to disk (luckily though I can still write to /dev/shm)
>>> -- to which I have
>>> saved the sysrq-t and sysrq-w output:
>>>
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Unfortunately it looks like a lot of the sysrq-t, at least, was lost.
Yes; when this occurred the first few times, I could only grab what was in dmesg
and save it to the ramdisk -- trying to access any filesystem other than the
tmpfs ramdisk (/dev/shm) causes the process to lock up.

>
> The sysrq-w trace has the "show blocked state" start a ways down the file,
> for anyone playing along at home ;)
>
> Other things you might try are a sysrq-m to get memory state...
I actually ran most of the useful sysrq commands; please see
the following:

wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

>
>>> Configuration:
>>>
>>> $ cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md1 : active raid1 sdb2[1] sda2[0]
>>> 136448 blocks [2/2] [UU]
>>>
>>> md2 : active raid1 sdb3[1] sda3[0]
>>> 129596288 blocks [2/2] [UU]
>>>
>>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>>> sdd1[1] sdc1[0]
>>> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>>
>>> md0 : active raid1 sdb1[1] sda1[0]
>>> 16787776 blocks [2/2] [UU]
>>>
>>> $ mount
>>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>> udev on /dev type tmpfs (rw,mode=0755)
>>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>> /dev/md1 on /boot type ext3 (rw,noatime)
>>> /dev/md3 on /r/1 type xfs
>>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>>> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Do you get the same behavior if you don't add the log options at mount time?
I have not tried disabling the log options, although they have been in effect
for a long time (logbufs and logbsize; the nobarrier option was added more
recently). Could there be an issue using -o nobarrier on raid1+xfs?
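
I suppose one way to check would be to re-enable barriers and watch the kernel
log; as far as I know XFS warns if the underlying device cannot handle barrier
writes. A rough sketch, using the root filesystem:

mount -o remount,barrier /
dmesg | tail    # look for any barrier-related warnings from XFS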

2009-11-24 16:23:27

by Eric Sandeen

[permalink] [raw]
Subject: Re: Which kernel options should be enabled to find the root cause of this bug?

Justin Piszcz wrote:
>
> On Tue, 24 Nov 2009, Eric Sandeen wrote:

...

>> Do you get the same behavior if you don't add the log options at mount time?
> I have not tried disabling the log options, although they have been in effect
> for a long time, (the logsbufs and bufsize and recently) the nobarrier
> support. Could there be an issue using -o nobarrier on a raid1+xfs?

nobarrier should not cause problems.

-Eric