2010-01-10 20:05:54

by Johannes Hirte

[permalink] [raw]
Subject: task imap:2958 blocked for more than 120 seconds

I've observed this hanging task now several times. Not sure when this started,
but 2.6.32 is affected too, IIRC. I don't have a test pattern for this. Dovecot
imap triggers this from time to time. I've enabled CONFIG_DETECT_HUNG_TASK now
and got this two tasks which hang:

INFO: task imap:2958 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
imap D 0000000000000000 0 2958 2653 0x00000000
ffff88008caf5a28 0000000000000046 0000000000000000 ffffffff810544cf
ffff88008caf5998 0000000000000001 ffff88008caf5fd8 ffff88008caf9530
000000000000de78 00000000001d2700 00000000001d2700 ffff88008caf9530
Call Trace:
[<ffffffff810544cf>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff8105613b>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff81071262>] ? sync_page+0x0/0x48
[<ffffffff8138b46a>] io_schedule+0x38/0x4d
[<ffffffff810712a6>] sync_page+0x44/0x48
[<ffffffff8138b99a>] __wait_on_bit_lock+0x41/0x8a
[<ffffffff8107124e>] __lock_page+0x61/0x68
[<ffffffff810475a3>] ? wake_bit_function+0x0/0x2e
[<ffffffff810718c6>] filemap_fault+0xea/0x345
[<ffffffff81084022>] __do_fault+0x50/0x3d3
[<ffffffff81086077>] handle_mm_fault+0x32f/0x65d
[<ffffffff8102039d>] ? do_page_fault+0xf4/0x26f
[<ffffffff811cacb4>] ? __down_read_trylock+0x46/0x4e
[<ffffffff8104b0f3>] ? down_read_trylock+0x3f/0x49
[<ffffffff8102039d>] ? do_page_fault+0xf4/0x26f
[<ffffffff81020500>] do_page_fault+0x257/0x26f
[<ffffffff8138e2ff>] page_fault+0x1f/0x30
[<ffffffff81083a13>] ? might_fault+0x57/0xa7
[<ffffffff811746e2>] ? btrfs_copy_from_user+0x4f/0x113
[<ffffffff81174771>] ? btrfs_copy_from_user+0xde/0x113
[<ffffffff81174cd0>] btrfs_file_write+0x439/0x6fe
[<ffffffff810a5c34>] vfs_write+0xad/0x14e
[<ffffffff8105613b>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff810a5d2a>] sys_pwrite64+0x55/0x74
[<ffffffff81001eeb>] system_call_fastpath+0x16/0x1b
2 locks held by imap/2958:
#0: (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<ffffffff81174a00>]
btrfs_file_write+0x169/0x6fe
#1: (&mm->mmap_sem){++++++}, at: [<ffffffff8102039d>] do_page_fault+0xf4/0x26f
INFO: task flush-btrfs-2:2783 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-btrfs-2 D 0000000000000000 0 2783 2 0x00000000
ffff88010cdcf9d0 0000000000000046 0000000000000000 ffffffff810544cf
ffff88010cdcf940 0000000000000000 ffff88010cdcffd8 ffff88010cd18290
000000000000de78 00000000001d2700 00000000001d2700 ffff88010cd18290
Call Trace:
[<ffffffff810544cf>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff8105613b>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff81071262>] ? sync_page+0x0/0x48
[<ffffffff8138b46a>] io_schedule+0x38/0x4d
[<ffffffff810712a6>] sync_page+0x44/0x48
[<ffffffff8138b99a>] __wait_on_bit_lock+0x41/0x8a
[<ffffffff81070ce1>] ? find_get_pages_tag+0x0/0x130
[<ffffffff8107124e>] __lock_page+0x61/0x68
[<ffffffff810475a3>] ? wake_bit_function+0x0/0x2e
[<ffffffff811871f6>] T.858+0xf1/0x2cd
[<ffffffff8104c4f9>] ? sched_clock_cpu+0xc6/0xd4
[<ffffffff8104c3cd>] ? sched_clock_local+0x1c/0x82
[<ffffffff8104c4f9>] ? sched_clock_cpu+0xc6/0xd4
[<ffffffff810544cf>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff811874d3>] extent_writepages+0x3f/0x54
[<ffffffff8116dbd5>] ? btrfs_get_extent+0x0/0x7ee
[<ffffffff8116da7e>] btrfs_writepages+0x22/0x24
[<ffffffff81077d12>] do_writepages+0x1f/0x28
[<ffffffff810c0233>] writeback_single_inode+0xf1/0x2f0
[<ffffffff810c0f57>] writeback_inodes_wb+0x3a9/0x4b2
[<ffffffff810c118b>] wb_writeback+0x12b/0x1af
[<ffffffff810c1436>] wb_do_writeback+0x17f/0x195
[<ffffffff810c1342>] ? wb_do_writeback+0x8b/0x195
[<ffffffff810c1477>] bdi_writeback_task+0x2b/0x84
[<ffffffff81082a45>] ? bdi_start_fn+0x0/0xcf
[<ffffffff81082ab6>] bdi_start_fn+0x71/0xcf
[<ffffffff81082a45>] ? bdi_start_fn+0x0/0xcf
[<ffffffff81047197>] kthread+0x7a/0x82
[<ffffffff81002c94>] kernel_thread_helper+0x4/0x10
[<ffffffff8138e0fc>] ? restore_args+0x0/0x30
[<ffffffff8104711d>] ? kthread+0x0/0x82
[<ffffffff81002c90>] ? kernel_thread_helper+0x0/0x10
1 lock held by flush-btrfs-2/2783:
#0: (&type->s_umount_key#20){++++..}, at: [<ffffffff810c0e82>]
writeback_inodes_wb+0x2d4/0x4b2

regards,
Johannes


2010-01-10 20:19:46

by Chris Mason

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

On Sun, Jan 10, 2010 at 09:05:46PM +0100, Johannes Hirte wrote:
> I've observed this hanging task now several times. Not sure when this started,
> but 2.6.32 is affected too, IIRC. I don't have a test pattern for this. Dovecot
> imap triggers this from time to time. I've enabled CONFIG_DETECT_HUNG_TASK now
> and got this two tasks which hang:

You're stuck on a read, could you please do a sysrq-w when this happens?
Also, do you eventually recover or are you stuck forever?

-chris

2010-01-11 08:45:42

by Adrian von Bidder

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

On Monday 11 January 2010 08.34:36 Adrian von Bidder wrote:
> "btrfs-vol -b" on an 2T btrfs fs (raid 1 mode over 4 disks) on an arm
> CPU has triggered it several times, so it seems a reliable way to
> reproduce this.
>

Found it (Debian kernel 2.6.32 on ARM):

[78260.386272] INFO: task btrfs-vol:10979 blocked for more than 120 seconds.
[78260.386306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78260.386331] btrfs-vol D c02b080c 0 10979 1 0x00000001
[78260.386373] [<c02b080c>] (schedule+0x424/0x488) from [<c02b0c9c>] (schedule_timeout+0x1c/0x244)
[78260.386408] [<c02b0c9c>] (schedule_timeout+0x1c/0x244) from [<c02b0b10>] (wait_for_common+0xdc/0x178)
[78260.386611] [<c02b0b10>] (wait_for_common+0xdc/0x178) from [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs])
[78260.386940] [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs]) from [<bf2a3fd8>] (relocate_block_group+0x548/0x5c8 [btrfs])
[78260.387258] [<bf2a3fd8>] (relocate_block_group+0x548/0x5c8 [btrfs]) from [<bf2a4434>] (btrfs_relocate_block_group+0x17c/0x3a4 [btrfs])
[78260.387564] [<bf2a4434>] (btrfs_relocate_block_group+0x17c/0x3a4 [btrfs]) from [<bf2868e0>] (btrfs_relocate_chunk+0x70/0x7c0 [btrfs])
[78260.387856] [<bf2868e0>] (btrfs_relocate_chunk+0x70/0x7c0 [btrfs]) from [<bf2879f4>] (btrfs_balance+0x370/0x424 [btrfs])
[78260.388148] [<bf2879f4>] (btrfs_balance+0x370/0x424 [btrfs]) from [<bf28d3a8>] (btrfs_ioctl+0x754/0x968 [btrfs])
[78260.388319] [<bf28d3a8>] (btrfs_ioctl+0x754/0x968 [btrfs]) from [<c00d8788>] (vfs_ioctl+0x2c/0x70)
[78260.388357] [<c00d8788>] (vfs_ioctl+0x2c/0x70) from [<c00d8e8c>] (do_vfs_ioctl+0x4f4/0x55c)
[78260.388390] [<c00d8e8c>] (do_vfs_ioctl+0x4f4/0x55c) from [<c00d8f44>] (sys_ioctl+0x50/0x74)
[78260.388423] [<c00d8f44>] (sys_ioctl+0x50/0x74) from [<c0026ea0>] (ret_fast_syscall+0x0/0x28)
[78380.381159] INFO: task btrfs-vol:10979 blocked for more than 120 seconds.
[78380.381194] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78380.381219] btrfs-vol D c02b080c 0 10979 1 0x00000001
[78380.381262] [<c02b080c>] (schedule+0x424/0x488) from [<c02b0c9c>] (schedule_timeout+0x1c/0x244)
[78380.381297] [<c02b0c9c>] (schedule_timeout+0x1c/0x244) from [<c02b0b10>] (wait_for_common+0xdc/0x178)
[78380.381501] [<c02b0b10>] (wait_for_common+0xdc/0x178) from [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs])
[78380.381830] [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs]) from [<bf2a3fd8>] (relocate_block_group+0x548/0x5c8 [btrfs])
[78380.382232] [<bf2a3fd8>] (relocate_block_group+0x548/0x5c8 [btrfs]) from [<bf2a4434>] (btrfs_relocate_block_group+0x17c/0x3a4 [btrfs])
[78380.382545] [<bf2a4434>] (btrfs_relocate_block_group+0x17c/0x3a4 [btrfs]) from [<bf2868e0>] (btrfs_relocate_chunk+0x70/0x7c0 [btrfs])
[78380.382839] [<bf2868e0>] (btrfs_relocate_chunk+0x70/0x7c0 [btrfs]) from [<bf2879f4>] (btrfs_balance+0x370/0x424 [btrfs])
[78380.383131] [<bf2879f4>] (btrfs_balance+0x370/0x424 [btrfs]) from [<bf28d3a8>] (btrfs_ioctl+0x754/0x968 [btrfs])
[78380.383302] [<bf28d3a8>] (btrfs_ioctl+0x754/0x968 [btrfs]) from [<c00d8788>] (vfs_ioctl+0x2c/0x70)
[78380.383341] [<c00d8788>] (vfs_ioctl+0x2c/0x70) from [<c00d8e8c>] (do_vfs_ioctl+0x4f4/0x55c)
[78380.383374] [<c00d8e8c>] (do_vfs_ioctl+0x4f4/0x55c) from [<c00d8f44>] (sys_ioctl+0x50/0x74)
[78380.383408] [<c00d8f44>] (sys_ioctl+0x50/0x74) from [<c0026ea0>] (ret_fast_syscall+0x0/0x28)

umount right after some big fs action (not sure, it was either lots of
file deletions, a big rsync of some tree, or right after the btrfs-vol
stuff) manages to trigger a btrfs related hang, too:

[97460.345446] INFO: task umount:12765 blocked for more than 120 seconds.
[97460.345481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[97460.345505] umount D c02b080c 0 12765 12681 0x00000000
[97460.345554] [<c02b080c>] (schedule+0x424/0x488) from [<c00e719c>] (bdi_sched_wait+0xc/0x18)
[97460.345592] [<c00e719c>] (bdi_sched_wait+0xc/0x18) from [<c02b0f68>] (__wait_on_bit+0x5c/0xa8)
[97460.345625] [<c02b0f68>] (__wait_on_bit+0x5c/0xa8) from [<c02b1060>] (out_of_line_wait_on_bit+0xac/0xc4)
[97460.345661] [<c02b1060>] (out_of_line_wait_on_bit+0xac/0xc4) from [<c00e7210>] (sync_inodes_sb+0x68/0x100)
[97460.345699] [<c00e7210>] (sync_inodes_sb+0x68/0x100) from [<c00eb340>] (__sync_filesystem+0x64/0x94)
[97460.345737] [<c00eb340>] (__sync_filesystem+0x64/0x94) from [<c00cdc74>] (generic_shutdown_super+0x28/0x110)
[97460.345776] [<c00cdc74>] (generic_shutdown_super+0x28/0x110) from [<c00cdda8>] (kill_anon_super+0x14/0x3c)
[97460.345813] [<c00cdda8>] (kill_anon_super+0x14/0x3c) from [<c00ce46c>] (deactivate_super+0x6c/0x90)
[97460.345849] [<c00ce46c>] (deactivate_super+0x6c/0x90) from [<c00e2310>] (sys_umount+0x2bc/0x2e8)
[97460.345883] [<c00e2310>] (sys_umount+0x2bc/0x2e8) from [<c0026ea0>] (ret_fast_syscall+0x0/0x28)
[97580.340641] INFO: task umount:12765 blocked for more than 120 seconds.
[97580.340674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[97580.340699] umount D c02b080c 0 12765 12681 0x00000000
[97580.340749] [<c02b080c>] (schedule+0x424/0x488) from [<c00e719c>] (bdi_sched_wait+0xc/0x18)
[97580.340787] [<c00e719c>] (bdi_sched_wait+0xc/0x18) from [<c02b0f68>] (__wait_on_bit+0x5c/0xa8)
[97580.340821] [<c02b0f68>] (__wait_on_bit+0x5c/0xa8) from [<c02b1060>] (out_of_line_wait_on_bit+0xac/0xc4)
[97580.340857] [<c02b1060>] (out_of_line_wait_on_bit+0xac/0xc4) from [<c00e7210>] (sync_inodes_sb+0x68/0x100)
[97580.340894] [<c00e7210>] (sync_inodes_sb+0x68/0x100) from [<c00eb340>] (__sync_filesystem+0x64/0x94)
[97580.340932] [<c00eb340>] (__sync_filesystem+0x64/0x94) from [<c00cdc74>] (generic_shutdown_super+0x28/0x110)
[97580.340970] [<c00cdc74>] (generic_shutdown_super+0x28/0x110) from [<c00cdda8>] (kill_anon_super+0x14/0x3c)
[97580.341008] [<c00cdda8>] (kill_anon_super+0x14/0x3c) from [<c00ce46c>] (deactivate_super+0x6c/0x90)
[97580.341044] [<c00ce46c>] (deactivate_super+0x6c/0x90) from [<c00e2310>] (sys_umount+0x2bc/0x2e8)
[97580.341079] [<c00e2310>] (sys_umount+0x2bc/0x2e8) from [<c0026ea0>] (ret_fast_syscall+0x0/0x28)


I've never had the system or even the affected processes die on me, the
end result was always ok. Just took ages. (Ok, btrfs-vol -b taking ages
on a big fs is ok. umount taking 10min is a bit over the top, especially
since the machine only has 1G ram, so there can't be that many dirty caches
in any case...

cheers
-- vbi



--
featured product: PostgreSQL - http://postgresql.org


Attachments:
signature.asc (389.00 B)
This is a digitally signed message part.

2010-01-11 08:51:13

by Adrian von Bidder

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

On Sunday 10 January 2010 21.05:46 Johannes Hirte wrote:
> I've observed this hanging task now several times. Not sure when this
> started, but 2.6.32 is affected too, IIRC. I don't have a test pattern
> for this. Dovecot imap triggers this from time to time. I've enabled
> CONFIG_DETECT_HUNG_TASK now and got this two tasks which hang:

I've had vey similar hangs occasionally; don't remember if they were the
exact same code path or not.

"btrfs-vol -b" on an 2T btrfs fs (raid 1 mode over 4 disks) on an arm CPU
has triggered it several times, so it seems a reliable way to reproduce
this.

Unfortunately log rotation has already killed the messages. Do you want me
to re-run? 2.6.32 kernel, btrfs-progs from git about 4 weeks ago.


cheers
-- vbi


--
Place stamp here.


Attachments:
signature.asc (389.00 B)
This is a digitally signed message part.

2010-01-13 22:37:46

by Johannes Hirte

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

Am Sonntag 10 Januar 2010 21:19:26 schrieb Chris Mason:
> On Sun, Jan 10, 2010 at 09:05:46PM +0100, Johannes Hirte wrote:
> > I've observed this hanging task now several times. Not sure when this
> > started, but 2.6.32 is affected too, IIRC. I don't have a test pattern
> > for this. Dovecot imap triggers this from time to time. I've enabled
> > CONFIG_DETECT_HUNG_TASK now
>
> > and got this two tasks which hang:
> You're stuck on a read, could you please do a sysrq-w when this happens?

Will do so when it happens again.

> Also, do you eventually recover or are you stuck forever?

I didn't wait too long when it happened, so I'm not sure. The longest time
I've waited was 20-30min, until reboot without recover. So either it's stuck
forever or it takes really long to recover.

And one question I have: How do you identify a read in this call trace?

regards,
Johannes

2010-01-14 01:07:44

by Chris Mason

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

On Wed, Jan 13, 2010 at 11:37:33PM +0100, Johannes Hirte wrote:
> Am Sonntag 10 Januar 2010 21:19:26 schrieb Chris Mason:
> > On Sun, Jan 10, 2010 at 09:05:46PM +0100, Johannes Hirte wrote:
> > > I've observed this hanging task now several times. Not sure when this
> > > started, but 2.6.32 is affected too, IIRC. I don't have a test pattern
> > > for this. Dovecot imap triggers this from time to time. I've enabled
> > > CONFIG_DETECT_HUNG_TASK now
> >
> > > and got this two tasks which hang:
> > You're stuck on a read, could you please do a sysrq-w when this happens?
>
> Will do so when it happens again.

No need, I reread the trace, you're stuck in copy_from_user and probably
hitting a deadlock in the btrfs page faulting path. This is pretty
rare, but I'll put it into the queue to fix up.
>
> > Also, do you eventually recover or are you stuck forever?
>
> I didn't wait too long when it happened, so I'm not sure. The longest time
> I've waited was 20-30min, until reboot without recover. So either it's stuck
> forever or it takes really long to recover.
>
> And one question I have: How do you identify a read in this call trace?

Lock page followed by sync page is usually for reading.

-chris

2010-01-14 01:09:54

by Chris Mason

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

On Mon, Jan 11, 2010 at 09:45:31AM +0100, Adrian von Bidder wrote:
> On Monday 11 January 2010 08.34:36 Adrian von Bidder wrote:
> > "btrfs-vol -b" on an 2T btrfs fs (raid 1 mode over 4 disks) on an arm
> > CPU has triggered it several times, so it seems a reliable way to
> > reproduce this.
> >
>
> Found it (Debian kernel 2.6.32 on ARM):
>
> [78260.386272] INFO: task btrfs-vol:10979 blocked for more than 120 seconds.
> [78260.386306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [78260.386331] btrfs-vol D c02b080c 0 10979 1 0x00000001
> [78260.386373] [<c02b080c>] (schedule+0x424/0x488) from [<c02b0c9c>] (schedule_timeout+0x1c/0x244)
> [78260.386408] [<c02b0c9c>] (schedule_timeout+0x1c/0x244) from [<c02b0b10>] (wait_for_common+0xdc/0x178)
> [78260.386611] [<c02b0b10>] (wait_for_common+0xdc/0x178) from [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs])
> [78260.386940] [<bf29b880>] (merge_reloc_roots+0x15c/0x1a4 [btrfs]) from [<bf2a3fd8>] (relocate_block_group+0x548/0x5c8 [btrfs])

Blocking here isn't a huge surprise, relocation can involve some long
held locks. Does this recover?

-chris

2010-01-14 07:03:56

by Adrian von Bidder

[permalink] [raw]
Subject: Re: task imap:2958 blocked for more than 120 seconds

Heyho!

On Thursday 14 January 2010 02.08:29 Chris Mason wrote:
> On Mon, Jan 11, 2010 at 09:45:31AM +0100, Adrian von Bidder wrote:
> > On Monday 11 January 2010 08.34:36 Adrian von Bidder wrote:
> > > "btrfs-vol -b" on an 2T btrfs fs (raid 1 mode over 4 disks) on an arm
> > > CPU has triggered it several times, so it seems a reliable way to
> > > reproduce this.
> >
> > Found it (Debian kernel 2.6.32 on ARM):
> >
> > [78260.386272] INFO: task btrfs-vol:10979 blocked for more than 120
> > seconds. [78260.386306] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [78260.386331] btrfs-vol D c02b080c 0 10979 1 0x00000001
> > [78260.386373] [<c02b080c>] (schedule+0x424/0x488) from [<c02b0c9c>]
> > (schedule_timeout+0x1c/0x244) [78260.386408] [<c02b0c9c>]
> > (schedule_timeout+0x1c/0x244) from [<c02b0b10>]
> > (wait_for_common+0xdc/0x178) [78260.386611] [<c02b0b10>]
> > (wait_for_common+0xdc/0x178) from [<bf29b880>]
> > (merge_reloc_roots+0x15c/0x1a4 [btrfs]) [78260.386940] [<bf29b880>]
> > (merge_reloc_roots+0x15c/0x1a4 [btrfs]) from [<bf2a3fd8>]
> > (relocate_block_group+0x548/0x5c8 [btrfs])
>
> Blocking here isn't a huge surprise, relocation can involve some long
> held locks. Does this recover?

Yes, always so far.

cheers
-- vbi

--
Tomorrow we have important delivery and this afternoon out Japanese
colleagues decided that 2000 pages describing our software solution is
not enough. They need at least 500 extra pages. Fortunately it could be
any existing documentation, so we randomly choose POSIX standard Part 1
(535 pages) and the Eclipse basic tutorial (592 pages).
-- Michal Čihař


Attachments:
signature.asc (389.00 B)
This is a digitally signed message part.