2013-06-19 16:45:51

by Dave Jones

Subject: frequent softlockups with 3.10rc6.

I've been hitting this a lot the last few days.
This is the same machine that I was also seeing lockups during sync()

Dave

BUG: soft lockup - CPU#1 stuck for 22s! [trinity-child9:6902]
Modules linked in: bridge snd_seq_dummy dlci bnep fuse 8021q garp stp hidp tun rfcomm can_raw ipt_ULOG nfnetlink rose scsi_transport_iscsi ipx p8023 p8022 phonet llc2 irda rds pppoe pppox caif_socket caif ppp_generic af_key slhc crc_ccitt bluetooth netrom can_bcm x25 appletalk psnap can llc af_rxrpc atm ax25 af_802154 nfc rfkill coretemp hwmon kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel snd_hda_intel microcode snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device usb_debug snd_pcm e1000e ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
irq event stamp: 2057909
hardirqs last enabled at (2057908): [<ffffffff816ed220>] restore_args+0x0/0x30
hardirqs last disabled at (2057909): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (1444600): [<ffffffff810542d4>] __do_softirq+0x194/0x440
softirqs last disabled at (1444851): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
CPU: 1 PID: 6902 Comm: trinity-child9 Not tainted 3.10.0-rc6+ #16
task: ffff880243212520 ti: ffff88015c1fa000 task.ti: ffff88015c1fa000
RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
RSP: 0000:ffff880244a03f08 EFLAGS: 00000202
RAX: ffff880243212520 RBX: ffffffff816ed220 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880243212520
RBP: ffff880244a03f70 R08: 0000000010000003 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244a03e78
R13: ffffffff816f5d2f R14: ffff880244a03f70 R15: 0000000000000000
FS: 00007f2e89fa1740(0000) GS:ffff880244a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000351a000 CR3: 00000001f6b7d000 CR4: 00000000001407e0
DR0: 00007f2e898d2000 DR1: 0000000001c59000 DR2: 0000000001c5c000
DR3: 0000000000008000 DR6: 00000000fffe0ff0 DR7: 00000000000b0602
Stack:
0000000a00406040 00000001008fead4 ffff88015c1fbfd8 ffff88015c1fbfd8
ffff88015c1fbfd8 0000000010010002 ffff88015c1fbfd8 ffffffff00000001
ffff880243212520 0000000000000000 ffff88023bbf0c80 0000000000000001
Call Trace:
<IRQ>

[<ffffffff8105473d>] irq_exit+0xcd/0xe0
[<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
<EOI>

[<ffffffff816ed220>] ? retint_restore_args+0xe/0xe
[<ffffffff816ec642>] ? _raw_spin_unlock_irq+0x32/0x60
[<ffffffff816ec63c>] ? _raw_spin_unlock_irq+0x2c/0x60
[<ffffffff81086c95>] finish_task_switch+0x85/0x130
[<ffffffff81086c57>] ? finish_task_switch+0x47/0x130
[<ffffffff816ea734>] __schedule+0x444/0x9c0
[<ffffffff816eb313>] preempt_schedule_irq+0x53/0x90
[<ffffffff816ed336>] retint_kernel+0x26/0x30
[<ffffffff81145297>] ? user_enter+0x87/0xd0
[<ffffffff8100f6f8>] syscall_trace_leave+0x78/0x140
[<ffffffff816f53af>] int_check_syscall_exit_work+0x34/0x3d
Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74


2013-06-19 17:54:13

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote:
> I've been hitting this a lot the last few days.
> This is the same machine that I was also seeing lockups during sync()

On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
(As I started seeing these just after that rcu merge).

It's only been 30 minutes, but it seems stable again. Normally I would
hit these within 5 minutes.

I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.

Paul ?

Dave



> BUG: soft lockup - CPU#1 stuck for 22s! [trinity-child9:6902]
> Modules linked in: bridge snd_seq_dummy dlci bnep fuse 8021q garp stp hidp tun rfcomm can_raw ipt_ULOG nfnetlink rose scsi_transport_iscsi ipx p8023 p8022 phonet llc2 irda rds pppoe pppox caif_socket caif ppp_generic af_key slhc crc_ccitt bluetooth netrom can_bcm x25 appletalk psnap can llc af_rxrpc atm ax25 af_802154 nfc rfkill coretemp hwmon kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel snd_hda_intel microcode snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device usb_debug snd_pcm e1000e ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> irq event stamp: 2057909
> hardirqs last enabled at (2057908): [<ffffffff816ed220>] restore_args+0x0/0x30
> hardirqs last disabled at (2057909): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> softirqs last enabled at (1444600): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> softirqs last disabled at (1444851): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> CPU: 1 PID: 6902 Comm: trinity-child9 Not tainted 3.10.0-rc6+ #16
> task: ffff880243212520 ti: ffff88015c1fa000 task.ti: ffff88015c1fa000
> RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
> RSP: 0000:ffff880244a03f08 EFLAGS: 00000202
> RAX: ffff880243212520 RBX: ffffffff816ed220 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880243212520
> RBP: ffff880244a03f70 R08: 0000000010000003 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244a03e78
> R13: ffffffff816f5d2f R14: ffff880244a03f70 R15: 0000000000000000
> FS: 00007f2e89fa1740(0000) GS:ffff880244a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000351a000 CR3: 00000001f6b7d000 CR4: 00000000001407e0
> DR0: 00007f2e898d2000 DR1: 0000000001c59000 DR2: 0000000001c5c000
> DR3: 0000000000008000 DR6: 00000000fffe0ff0 DR7: 00000000000b0602
> Stack:
> 0000000a00406040 00000001008fead4 ffff88015c1fbfd8 ffff88015c1fbfd8
> ffff88015c1fbfd8 0000000010010002 ffff88015c1fbfd8 ffffffff00000001
> ffff880243212520 0000000000000000 ffff88023bbf0c80 0000000000000001
> Call Trace:
> <IRQ>
>
> [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
> [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
> <EOI>
>
> [<ffffffff816ed220>] ? retint_restore_args+0xe/0xe
> [<ffffffff816ec642>] ? _raw_spin_unlock_irq+0x32/0x60
> [<ffffffff816ec63c>] ? _raw_spin_unlock_irq+0x2c/0x60
> [<ffffffff81086c95>] finish_task_switch+0x85/0x130
> [<ffffffff81086c57>] ? finish_task_switch+0x47/0x130
> [<ffffffff816ea734>] __schedule+0x444/0x9c0
> [<ffffffff816eb313>] preempt_schedule_irq+0x53/0x90
> [<ffffffff816ed336>] retint_kernel+0x26/0x30
> [<ffffffff81145297>] ? user_enter+0x87/0xd0
> [<ffffffff8100f6f8>] syscall_trace_leave+0x78/0x140
> [<ffffffff816f53af>] int_check_syscall_exit_work+0x34/0x3d
> Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
>
---end quoted text---

2013-06-19 18:13:24

by Paul E. McKenney

Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote:
> On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote:
> > I've been hitting this a lot the last few days.
> > This is the same machine that I was also seeing lockups during sync()
>
> On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> (As I started seeing these just after that rcu merge).
>
> It's only been 30 minutes, but it seems stable again. Normally I would
> hit these within 5 minutes.
>
> I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
>
> Paul ?

???

In both cases, I am guessing that you built with CONFIG_PROVE_RCU_DELAY=y.
Even then, this is very strange. I am at a loss as to why udelay(200)
would result in a hang. Or does your system turn udelay() into something
other than a pure spin?
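A minimal sketch of the kind of debug knob being discussed, assuming only what is stated here (CONFIG_PROVE_RCU_DELAY gating a udelay(200) somewhere in RCU's code paths; the helper name below is made up, this is not the actual kernel code). The point of the question: a pure spin of 200 microseconds only burns CPU time and never sleeps, so by itself it cannot account for a 22-second stall; at most it adds overhead that makes an existing problem easier to hit.

#include <linux/delay.h>

#ifdef CONFIG_PROVE_RCU_DELAY
/* hypothetical helper, illustrating a config-gated pure-spin delay */
static inline void prove_rcu_delay(void)
{
	udelay(200);	/* busy-waits for ~200us, never sleeps or schedules */
}
#else
static inline void prove_rcu_delay(void) { }
#endif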

Thanx, Paul

> Dave
>
>
>
> > BUG: soft lockup - CPU#1 stuck for 22s! [trinity-child9:6902]
> > Modules linked in: bridge snd_seq_dummy dlci bnep fuse 8021q garp stp hidp tun rfcomm can_raw ipt_ULOG nfnetlink rose scsi_transport_iscsi ipx p8023 p8022 phonet llc2 irda rds pppoe pppox caif_socket caif ppp_generic af_key slhc crc_ccitt bluetooth netrom can_bcm x25 appletalk psnap can llc af_rxrpc atm ax25 af_802154 nfc rfkill coretemp hwmon kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel snd_hda_intel microcode snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device usb_debug snd_pcm e1000e ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> > irq event stamp: 2057909
> > hardirqs last enabled at (2057908): [<ffffffff816ed220>] restore_args+0x0/0x30
> > hardirqs last disabled at (2057909): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> > softirqs last enabled at (1444600): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> > softirqs last disabled at (1444851): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > CPU: 1 PID: 6902 Comm: trinity-child9 Not tainted 3.10.0-rc6+ #16
> > task: ffff880243212520 ti: ffff88015c1fa000 task.ti: ffff88015c1fa000
> > RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
> > RSP: 0000:ffff880244a03f08 EFLAGS: 00000202
> > RAX: ffff880243212520 RBX: ffffffff816ed220 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880243212520
> > RBP: ffff880244a03f70 R08: 0000000010000003 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244a03e78
> > R13: ffffffff816f5d2f R14: ffff880244a03f70 R15: 0000000000000000
> > FS: 00007f2e89fa1740(0000) GS:ffff880244a00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 000000000351a000 CR3: 00000001f6b7d000 CR4: 00000000001407e0
> > DR0: 00007f2e898d2000 DR1: 0000000001c59000 DR2: 0000000001c5c000
> > DR3: 0000000000008000 DR6: 00000000fffe0ff0 DR7: 00000000000b0602
> > Stack:
> > 0000000a00406040 00000001008fead4 ffff88015c1fbfd8 ffff88015c1fbfd8
> > ffff88015c1fbfd8 0000000010010002 ffff88015c1fbfd8 ffffffff00000001
> > ffff880243212520 0000000000000000 ffff88023bbf0c80 0000000000000001
> > Call Trace:
> > <IRQ>
> >
> > [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
> > [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
> > <EOI>
> >
> > [<ffffffff816ed220>] ? retint_restore_args+0xe/0xe
> > [<ffffffff816ec642>] ? _raw_spin_unlock_irq+0x32/0x60
> > [<ffffffff816ec63c>] ? _raw_spin_unlock_irq+0x2c/0x60
> > [<ffffffff81086c95>] finish_task_switch+0x85/0x130
> > [<ffffffff81086c57>] ? finish_task_switch+0x47/0x130
> > [<ffffffff816ea734>] __schedule+0x444/0x9c0
> > [<ffffffff816eb313>] preempt_schedule_irq+0x53/0x90
> > [<ffffffff816ed336>] retint_kernel+0x26/0x30
> > [<ffffffff81145297>] ? user_enter+0x87/0xd0
> > [<ffffffff8100f6f8>] syscall_trace_leave+0x78/0x140
> > [<ffffffff816f53af>] int_check_syscall_exit_work+0x34/0x3d
> > Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
> >
> ---end quoted text---
>

2013-06-19 18:42:56

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote:

> > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > (As I started seeing these just after that rcu merge).
> >
> > It's only been 30 minutes, but it seems stable again. Normally I would
> > hit these within 5 minutes.
> >
> > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
>
> In both cases, I am guessing that you built with CONFIG_PROVE_RCU_DELAY=y.

Yes.

> Even then, this is very strange. I am at a loss as to why udelay(200)
> would result in a hang.

It may not be a real 'hang' per se, but might just be that that process isn't
scheduled within the time needed to appease the lockup detector ?
(20 seconds is a long time, but that box is under constant load when it's
running the fuzz tests, so.. ?)
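That reading matches how the soft-lockup detector works. A simplified model (loosely based on kernel/watchdog.c of this era; names and details abbreviated, not the actual code): a high-priority per-CPU watchdog thread stamps a timestamp whenever it gets to run, and a periodic per-CPU timer checks that stamp. If the thread has not run for 2 * watchdog_thresh seconds (2 * 10s by default, hence the "22s" in the report), the CPU is reported as soft-locked even though the kernel may still be making progress in softirq context.

/* now and touch_ts are seconds since boot; watchdog_thresh in seconds */
static unsigned long is_softlockup_sketch(unsigned long touch_ts,
					  unsigned long now,
					  unsigned int watchdog_thresh)
{
	/* the watchdog thread has not been scheduled for too long */
	if (now > touch_ts + 2 * watchdog_thresh)
		return now - touch_ts;	/* how long the CPU appears stuck */
	return 0;			/* still healthy */
}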

> Or does your system turn udelay() into something other than a pure spin?

I see no reason why it would. Am I missing something ?


I also don't know if it's related, but it would be real nice if someone
would push along that fix for rcu_preempt hogging the cpu when idle that's
been in timers/urgent for over a month.

Dave

2013-06-20 00:12:28

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote:
> > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote:
> > > I've been hitting this a lot the last few days.
> > > This is the same machine that I was also seeing lockups during sync()
> >
> > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > (As I started seeing these just after that rcu merge).
> >
> > It's only been 30 minutes, but it seems stable again. Normally I would
> > hit these within 5 minutes.
> >
> > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
> >
> > Paul ?
>
> ???
>
> In both cases, I am guessing that you built with CONFIG_PROVE_RCU_DELAY=y.
> Even then, this is very strange. I am at a loss as to why udelay(200)
> would result in a hang. Or does your system turn udelay() into something
> other than a pure spin?

Dammit. Paul, you're off the hook (for now).
It just took longer to hit.

Dave


[19886.451044] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:12994]
[19886.452659] Modules linked in: bridge stp snd_seq_dummy fuse tun hidp rfcomm ipt_ULOG bnep nfnetlink can_raw ipx p8023 p8022 pppoe pppox ppp_generic slhc scsi_transport_iscsi can_bcm rds bluetooth can nfc rfkill af_key af_802154 netrom irda af_rxrpc appletalk phonet psnap caif_socket caif x25 crc_ccitt llc2 llc rose ax25 atm coretemp hwmon kvm_intel kvm crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek microcode pcspkr snd_hda_codec_hdmi usb_debug snd_hda_intel snd_seq e1000e snd_hda_codec snd_hwdep ptp pps_core snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore xfs libcrc32c
[19886.464209] irq event stamp: 2380319
[19886.465510] hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
[19886.467446] hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
[19886.469464] softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
[19886.471395] softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
[19886.473238] CPU: 0 PID: 12994 Comm: trinity-child0 Not tainted 3.10.0-rc6+ #17
[19886.477881] task: ffff8801a8222520 ti: ffff880228d98000 task.ti: ffff880228d98000
[19886.479712] RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
[19886.481706] RSP: 0018:ffff880244803f08 EFLAGS: 00000202
[19886.483298] RAX: ffff8801a8222520 RBX: ffffffff816ed220 RCX: 0000000000000002
[19886.485094] RDX: 00000000000045b0 RSI: ffff8801a8222ca0 RDI: ffff8801a8222520
[19886.486896] RBP: ffff880244803f70 R08: 0000000000000001 R09: 0000000000000000
[19886.488687] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244803e78
[19886.490507] R13: ffffffff816f5d2f R14: ffff880244803f70 R15: 0000000000000000
[19886.492325] FS: 00007f0bcf727740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
[19886.494184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19886.495781] CR2: 0000000000000001 CR3: 00000001a56d0000 CR4: 00000000001407f0
[19886.497545] DR0: 00007f21e7713000 DR1: 0000000000000000 DR2: 0000000000000000
[19886.499304] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[19886.501070] Stack:
[19886.502247] 0000000a00406040 00000001001ddcf4 ffff880228d99fd8 ffff880228d99fd8
[19886.504104] ffff880228d99fd8 ffff8801a8222918 ffff880228d99fd8 ffff880200000000
[19886.505996] ffff8801a8222520 0000000000000000 0000000000002e82 00000000162dccb4
[19886.507859] Call Trace:
[19886.509095] <IRQ>
[19886.509334] [<ffffffff8105473d>] irq_exit+0xcd/0xe0
[19886.511879] [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
[19886.513586] [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
[19886.515243] <EOI>
[19886.515482] [<ffffffff812fe80d>] ? idr_find_slowpath+0x4d/0x150
[19886.518126] [<ffffffff812a2cb9>] ipcget+0x89/0x380
[19886.519656] [<ffffffff810b72dd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[19886.521409] [<ffffffff812a43b6>] SyS_msgget+0x56/0x60
[19886.522971] [<ffffffff812a39a0>] ? rcu_read_lock+0x80/0x80
[19886.524585] [<ffffffff812a37e0>] ? sysvipc_msg_proc_show+0xd0/0xd0
[19886.526274] [<ffffffff816f52d4>] tracesys+0xdd/0xe2
[19886.527804] [<ffffffff8100ffff>] ? enable_step+0x3f/0x1d0
[19886.529397] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74

2013-06-20 16:29:56

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 19, 2013 at 08:12:12PM -0400, Dave Jones wrote:
> > On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote:
> > > > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote:
> > > > > I've been hitting this a lot the last few days.
> > > > > This is the same machine that I was also seeing lockups during sync()
> > > >
> > > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > > > (As I started seeing these just after that rcu merge).
> > > >
> > > > It's only been 30 minutes, but it seems stable again. Normally I would
> > > > hit these within 5 minutes.
> > > >
> > > > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
> > > >
> > > > Paul ?
> > >
> > > ???
> > >
> > > In both cases, I am guessing that you built with CONFIG_PROVE_RCU_DELAY=y.
> > > Even then, this is very strange. I am at a loss as to why udelay(200)
> > > would result in a hang. Or does your system turn udelay() into something
> > > other than a pure spin?
> >
> > Dammit. Paul, you're off the hook (for now).
> > It just took longer to hit.
>
> Well, this commit could significantly increase CPU overhead, which might
> make the bug more likely to occur. (Hey, I can rationalize -anything-!!!)

bisecting it now. Hopefully by end of day I'll have it figured out.

Dave

2013-06-20 16:20:30

by Paul E. McKenney

Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 19, 2013 at 08:12:12PM -0400, Dave Jones wrote:
> On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote:
> > > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote:
> > > > I've been hitting this a lot the last few days.
> > > > This is the same machine that I was also seeing lockups during sync()
> > >
> > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > > (As I started seeing these just after that rcu merge).
> > >
> > > It's only been 30 minutes, but it seems stable again. Normally I would
> > > hit these within 5 minutes.
> > >
> > > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
> > >
> > > Paul ?
> >
> > ???
> >
> > In both cases, I am guessing that you built with CONFIG_PROVE_RCU_DELAY=y.
> > Even then, this is very strange. I am at a loss as to why udelay(200)
> > would result in a hang. Or does your system turn udelay() into something
> > other than a pure spin?
>
> Dammit. Paul, you're off the hook (for now).
> It just took longer to hit.

Well, this commit could significantly increase CPU overhead, which might
make the bug more likely to occur. (Hey, I can rationalize -anything-!!!)

Thanx, Paul

> Dave
>
>
> [19886.451044] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:12994]
> [19886.452659] Modules linked in: bridge stp snd_seq_dummy fuse tun hidp rfcomm ipt_ULOG bnep nfnetlink can_raw ipx p8023 p8022 pppoe pppox ppp_generic slhc scsi_transport_iscsi can_bcm rds bluetooth can nfc rfkill af_key af_802154 netrom irda af_rxrpc appletalk phonet psnap caif_socket caif x25 crc_ccitt llc2 llc rose ax25 atm coretemp hwmon kvm_intel kvm crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek microcode pcspkr snd_hda_codec_hdmi usb_debug snd_hda_intel snd_seq e1000e snd_hda_codec snd_hwdep ptp pps_core snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> [19886.464209] irq event stamp: 2380319
> [19886.465510] hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
> [19886.467446] hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> [19886.469464] softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> [19886.471395] softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> [19886.473238] CPU: 0 PID: 12994 Comm: trinity-child0 Not tainted 3.10.0-rc6+ #17
> [19886.477881] task: ffff8801a8222520 ti: ffff880228d98000 task.ti: ffff880228d98000
> [19886.479712] RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
> [19886.481706] RSP: 0018:ffff880244803f08 EFLAGS: 00000202
> [19886.483298] RAX: ffff8801a8222520 RBX: ffffffff816ed220 RCX: 0000000000000002
> [19886.485094] RDX: 00000000000045b0 RSI: ffff8801a8222ca0 RDI: ffff8801a8222520
> [19886.486896] RBP: ffff880244803f70 R08: 0000000000000001 R09: 0000000000000000
> [19886.488687] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244803e78
> [19886.490507] R13: ffffffff816f5d2f R14: ffff880244803f70 R15: 0000000000000000
> [19886.492325] FS: 00007f0bcf727740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
> [19886.494184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [19886.495781] CR2: 0000000000000001 CR3: 00000001a56d0000 CR4: 00000000001407f0
> [19886.497545] DR0: 00007f21e7713000 DR1: 0000000000000000 DR2: 0000000000000000
> [19886.499304] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> [19886.501070] Stack:
> [19886.502247] 0000000a00406040 00000001001ddcf4 ffff880228d99fd8 ffff880228d99fd8
> [19886.504104] ffff880228d99fd8 ffff8801a8222918 ffff880228d99fd8 ffff880200000000
> [19886.505996] ffff8801a8222520 0000000000000000 0000000000002e82 00000000162dccb4
> [19886.507859] Call Trace:
> [19886.509095] <IRQ>
> [19886.509334] [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> [19886.511879] [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
> [19886.513586] [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
> [19886.515243] <EOI>
> [19886.515482] [<ffffffff812fe80d>] ? idr_find_slowpath+0x4d/0x150
> [19886.518126] [<ffffffff812a2cb9>] ipcget+0x89/0x380
> [19886.519656] [<ffffffff810b72dd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [19886.521409] [<ffffffff812a43b6>] SyS_msgget+0x56/0x60
> [19886.522971] [<ffffffff812a39a0>] ? rcu_read_lock+0x80/0x80
> [19886.524585] [<ffffffff812a37e0>] ? sysvipc_msg_proc_show+0xd0/0xd0
> [19886.526274] [<ffffffff816f52d4>] tracesys+0xdd/0xe2
> [19886.527804] [<ffffffff8100ffff>] ? enable_step+0x3f/0x1d0
> [19886.529397] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
>

2013-06-21 15:11:54

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote:
> > > > > I've been hitting this a lot the last few days.
> > > > > This is the same machine that I was also seeing lockups during sync()
> > > >
> > > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > > > (As I started seeing these just after that rcu merge).
> > > >
> > > > It's only been 30 minutes, but it seems stable again. Normally I would
> > > > hit these within 5 minutes.
> > > >
> > > > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
> > >
> > Dammit. Paul, you're off the hook (for now).
> > It just took longer to hit.
>
> Well, this commit could significantly increase CPU overhead, which might
> make the bug more likely to occur. (Hey, I can rationalize -anything-!!!)

I spent yesterday bisecting this (twice, on two systems in parallel to be sure).
It came down to 8aac62706adaaf0fab02c4327761561c8bda9448
Before I turned in last night, I pulled Linus current, and reverted that,
and it's been fine during overnight stress testing.

Oleg ?

Dave



> > [19886.451044] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:12994]
> > [19886.452659] Modules linked in: bridge stp snd_seq_dummy fuse tun hidp rfcomm ipt_ULOG bnep nfnetlink can_raw ipx p8023 p8022 pppoe pppox ppp_generic slhc scsi_transport_iscsi can_bcm rds bluetooth can nfc rfkill af_key af_802154 netrom irda af_rxrpc appletalk phonet psnap caif_socket caif x25 crc_ccitt llc2 llc rose ax25 atm coretemp hwmon kvm_intel kvm crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek microcode pcspkr snd_hda_codec_hdmi usb_debug snd_hda_intel snd_seq e1000e snd_hda_codec snd_hwdep ptp pps_core snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> > [19886.464209] irq event stamp: 2380319
> > [19886.465510] hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
> > [19886.467446] hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> > [19886.469464] softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> > [19886.471395] softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > [19886.473238] CPU: 0 PID: 12994 Comm: trinity-child0 Not tainted 3.10.0-rc6+ #17
> > [19886.477881] task: ffff8801a8222520 ti: ffff880228d98000 task.ti: ffff880228d98000
> > [19886.479712] RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
> > [19886.481706] RSP: 0018:ffff880244803f08 EFLAGS: 00000202
> > [19886.483298] RAX: ffff8801a8222520 RBX: ffffffff816ed220 RCX: 0000000000000002
> > [19886.485094] RDX: 00000000000045b0 RSI: ffff8801a8222ca0 RDI: ffff8801a8222520
> > [19886.486896] RBP: ffff880244803f70 R08: 0000000000000001 R09: 0000000000000000
> > [19886.488687] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244803e78
> > [19886.490507] R13: ffffffff816f5d2f R14: ffff880244803f70 R15: 0000000000000000
> > [19886.492325] FS: 00007f0bcf727740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
> > [19886.494184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [19886.495781] CR2: 0000000000000001 CR3: 00000001a56d0000 CR4: 00000000001407f0
> > [19886.497545] DR0: 00007f21e7713000 DR1: 0000000000000000 DR2: 0000000000000000
> > [19886.499304] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> > [19886.501070] Stack:
> > [19886.502247] 0000000a00406040 00000001001ddcf4 ffff880228d99fd8 ffff880228d99fd8
> > [19886.504104] ffff880228d99fd8 ffff8801a8222918 ffff880228d99fd8 ffff880200000000
> > [19886.505996] ffff8801a8222520 0000000000000000 0000000000002e82 00000000162dccb4
> > [19886.507859] Call Trace:
> > [19886.509095] <IRQ>
> > [19886.509334] [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > [19886.511879] [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
> > [19886.513586] [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
> > [19886.515243] <EOI>
> > [19886.515482] [<ffffffff812fe80d>] ? idr_find_slowpath+0x4d/0x150
> > [19886.518126] [<ffffffff812a2cb9>] ipcget+0x89/0x380
> > [19886.519656] [<ffffffff810b72dd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> > [19886.521409] [<ffffffff812a43b6>] SyS_msgget+0x56/0x60
> > [19886.522971] [<ffffffff812a39a0>] ? rcu_read_lock+0x80/0x80
> > [19886.524585] [<ffffffff812a37e0>] ? sysvipc_msg_proc_show+0xd0/0xd0
> > [19886.526274] [<ffffffff816f52d4>] tracesys+0xdd/0xe2
> > [19886.527804] [<ffffffff8100ffff>] ? enable_step+0x3f/0x1d0
> > [19886.529397] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
> >
---end quoted text---

2013-06-21 20:04:13

by Oleg Nesterov

Subject: Re: frequent softlockups with 3.10rc6.

On 06/21, Dave Jones wrote:
>
> On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote:
> > > > > > I've been hitting this a lot the last few days.
> > > > > > This is the same machine that I was also seeing lockups during sync()
> > > > >
> > > > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3
> > > > > (As I started seeing these just after that rcu merge).
> > > > >
> > > > > It's only been 30 minutes, but it seems stable again. Normally I would
> > > > > hit these within 5 minutes.
> > > > >
> > > > > I think this may be the same root cause for http://www.spinics.net/lists/kernel/msg1551503.html too.
> > > >
> > > Dammit. Paul, you're off the hook (for now).
> > > It just took longer to hit.
> >
> > Well, this commit could significantly increase CPU overhead, which might
> > make the bug more likely to occur. (Hey, I can rationalize -anything-!!!)
>
> I spent yesterday bisecting this (twice, on two systems in parallel to be sure).
> It came down to 8aac62706adaaf0fab02c4327761561c8bda9448
> Before I turned in last night, I pulled Linus current, and reverted that,

I hope you didn't pull some unrelated fix in between ;)

> and it's been fine during overnight stress testing.
>
> Oleg ?

I am puzzled. And I do not really understand

hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0

below. how can they differ that much...

Dave, any chance you can reproduce the hang with the debugging patch at
the end? Just in case, the warnings themself do not mean a problem, just
to have a bit more info.

> > > [19886.451044] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:12994]
> > > [19886.452659] Modules linked in: bridge stp snd_seq_dummy fuse tun hidp rfcomm ipt_ULOG bnep nfnetlink can_raw ipx p8023 p8022 pppoe pppox ppp_generic slhc scsi_transport_iscsi can_bcm rds bluetooth can nfc rfkill af_key af_802154 netrom irda af_rxrpc appletalk phonet psnap caif_socket caif x25 crc_ccitt llc2 llc rose ax25 atm coretemp hwmon kvm_intel kvm crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek microcode pcspkr snd_hda_codec_hdmi usb_debug snd_hda_intel snd_seq e1000e snd_hda_codec snd_hwdep ptp pps_core snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> > > [19886.464209] irq event stamp: 2380319
> > > [19886.465510] hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
> > > [19886.467446] hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> > > [19886.469464] softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> > > [19886.471395] softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > > [19886.473238] CPU: 0 PID: 12994 Comm: trinity-child0 Not tainted 3.10.0-rc6+ #17
> > > [19886.477881] task: ffff8801a8222520 ti: ffff880228d98000 task.ti: ffff880228d98000
> > > [19886.479712] RIP: 0010:[<ffffffff810541f1>] [<ffffffff810541f1>] __do_softirq+0xb1/0x440
> > > [19886.481706] RSP: 0018:ffff880244803f08 EFLAGS: 00000202
> > > [19886.483298] RAX: ffff8801a8222520 RBX: ffffffff816ed220 RCX: 0000000000000002
> > > [19886.485094] RDX: 00000000000045b0 RSI: ffff8801a8222ca0 RDI: ffff8801a8222520
> > > [19886.486896] RBP: ffff880244803f70 R08: 0000000000000001 R09: 0000000000000000
> > > [19886.488687] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244803e78
> > > [19886.490507] R13: ffffffff816f5d2f R14: ffff880244803f70 R15: 0000000000000000
> > > [19886.492325] FS: 00007f0bcf727740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
> > > [19886.494184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [19886.495781] CR2: 0000000000000001 CR3: 00000001a56d0000 CR4: 00000000001407f0
> > > [19886.497545] DR0: 00007f21e7713000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [19886.499304] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> > > [19886.501070] Stack:
> > > [19886.502247] 0000000a00406040 00000001001ddcf4 ffff880228d99fd8 ffff880228d99fd8
> > > [19886.504104] ffff880228d99fd8 ffff8801a8222918 ffff880228d99fd8 ffff880200000000
> > > [19886.505996] ffff8801a8222520 0000000000000000 0000000000002e82 00000000162dccb4
> > > [19886.507859] Call Trace:
> > > [19886.509095] <IRQ>
> > > [19886.509334] [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> > > [19886.511879] [<ffffffff816f6bcb>] smp_apic_timer_interrupt+0x6b/0x9b
> > > [19886.513586] [<ffffffff816f5d2f>] apic_timer_interrupt+0x6f/0x80
> > > [19886.515243] <EOI>
> > > [19886.515482] [<ffffffff812fe80d>] ? idr_find_slowpath+0x4d/0x150
> > > [19886.518126] [<ffffffff812a2cb9>] ipcget+0x89/0x380
> > > [19886.519656] [<ffffffff810b72dd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> > > [19886.521409] [<ffffffff812a43b6>] SyS_msgget+0x56/0x60
> > > [19886.522971] [<ffffffff812a39a0>] ? rcu_read_lock+0x80/0x80
> > > [19886.524585] [<ffffffff812a37e0>] ? sysvipc_msg_proc_show+0xd0/0xd0
> > > [19886.526274] [<ffffffff816f52d4>] tracesys+0xdd/0xe2
> > > [19886.527804] [<ffffffff8100ffff>] ? enable_step+0x3f/0x1d0
> > > [19886.529397] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 b7 31 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
> > >
> ---end quoted text---


diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 10e5947..bbeb128 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -59,7 +59,9 @@ extern struct nsproxy init_nsproxy;

static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
{
- return rcu_dereference(tsk->nsproxy);
+ struct nsproxy *ret = rcu_dereference(tsk->nsproxy);
+ WARN_ON(!ret);
+ return ret;
}

int copy_namespaces(unsigned long flags, struct task_struct *tsk);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 364ceab..77a20e9 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -220,6 +220,7 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
rcu_assign_pointer(p->nsproxy, new);

if (ns && atomic_dec_and_test(&ns->count)) {
+ pr_info("YESTHISHAPPENS new=%p\n", new);
/*
* wait for others to get what they want from this nsproxy.
*
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 65bd3c9..917764e 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -9,6 +9,8 @@ task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
struct callback_head *head;

+ WARN_ON(task == current && !task->nsproxy);
+
do {
head = ACCESS_ONCE(task->task_works);
if (unlikely(head == &work_exited))

2013-06-22 01:37:49

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Fri, Jun 21, 2013 at 09:59:49PM +0200, Oleg Nesterov wrote:

> I am puzzled. And I do not really understand
>
> hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
> hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
>
> below. how can they differ that much...
>
> Dave, any chance you can reproduce the hang with the debugging patch at
> the end? Just in case, the warnings themself do not mean a problem, just
> to have a bit more info.

[ 7485.261299] WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
[ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
[ 7485.265434] CPU: 2 PID: 5623 Comm: trinity-child3 Not tainted 3.10.0-rc6+ #28
[ 7485.267158] ffffffff81a1529c ffff8801c8eafd30 ffffffff816e432d ffff8801c8eafd68
[ 7485.268045] ffffffff8104a0c1 0000000000000000 ffff880225e9bd18 ffff8801bc6e4de0
[ 7485.268932] 0000000000000000 00000000000000dd ffff8801c8eafd78 ffffffff8104a19a
[ 7485.270463] Call Trace:
[ 7485.271338] [<ffffffff816e432d>] dump_stack+0x19/0x1b
[ 7485.272207] [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
[ 7485.273092] [<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
[ 7485.273942] [<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
[ 7485.274793] [<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
[ 7485.275659] [<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
[ 7485.276531] [<ffffffff811b778d>] lookup_real+0x1d/0x50
[ 7485.277646] [<ffffffff811b7d83>] __lookup_hash+0x33/0x40
[ 7485.278477] [<ffffffff811bb143>] kern_path_create+0xb3/0x190
[ 7485.279345] [<ffffffff811b93d5>] ? getname_flags+0xb5/0x190
[ 7485.280292] [<ffffffff811bb261>] user_path_create+0x41/0x60
[ 7485.281233] [<ffffffff811be6bb>] SyS_symlinkat+0x4b/0xd0
[ 7485.282072] [<ffffffff816f5a54>] tracesys+0xdd/0xe2
[ 7485.282973] ---[ end trace 2204b7c65d6c5519 ]---


> + pr_info("YESTHISHAPPENS new=%p\n", new);

This didn't trigger. (yet?)

Dave

2013-06-22 17:35:55

by Oleg Nesterov

Subject: Re: frequent softlockups with 3.10rc6.

On 06/21, Dave Jones wrote:
>
> On Fri, Jun 21, 2013 at 09:59:49PM +0200, Oleg Nesterov wrote:
>
> > I am puzzled. And I do not really understand
> >
> > hardirqs last enabled at (2380318): [<ffffffff816ed220>] restore_args+0x0/0x30
> > hardirqs last disabled at (2380319): [<ffffffff816f5d2a>] apic_timer_interrupt+0x6a/0x80
> > softirqs last enabled at (196990): [<ffffffff810542d4>] __do_softirq+0x194/0x440
> > softirqs last disabled at (197479): [<ffffffff8105473d>] irq_exit+0xcd/0xe0
> >
> > below. how can they differ that much...

And I misread the original trace. Now that I read it again I am even
more puzzled.

So it actually blames __do_softirq(); I didn't notice the "RIP:" part.
And "softirqs last disabled" refers to irq_exit() because __do_softirq()
does __local_bh_disable(__builtin_return_address(0)). Just to add more
confusion, I guess ;)

This explains the "differ that much" above: __do_softirq() does cli/sti in
a loop without returning.
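A very rough sketch of the shape being described (illustrative only, not the real __do_softirq(); the pending check and handler below are made-up names). Bottom halves are disabled once on entry, attributed to the caller (irq_exit()), and then the loop keeps toggling hardirqs on and off on every pass. If that loop never terminates, the hardirq enable/disable stamps keep advancing with each iteration while the softirq stamps stay frozen at values recorded long ago, which is exactly the pattern in the report.

static bool softirq_is_pending(void);		/* made-up pending check */
static void run_pending_softirqs(void);		/* made-up handler loop  */

static void do_softirq_sketch(void)
{
	local_bh_disable();			/* softirq "last disabled" recorded once, here */

	while (softirq_is_pending()) {
		local_irq_enable();		/* hardirq "last enabled" stamp advances  */
		run_pending_softirqs();
		local_irq_disable();		/* hardirq "last disabled" stamp advances */
	}

	local_bh_enable();			/* (restart/irq-state details elided) */
}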

And how the poor 8aac6270 can trigger this ???

> > Dave, any chance you can reproduce the hang with the debugging patch at
> > the end? Just in case, the warnings themself do not mean a problem, just
> > to have a bit more info.
>
> [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
> [ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
> [ 7485.265434] CPU: 2 PID: 5623 Comm: trinity-child3 Not tainted 3.10.0-rc6+ #28
> [ 7485.267158] ffffffff81a1529c ffff8801c8eafd30 ffffffff816e432d ffff8801c8eafd68
> [ 7485.268045] ffffffff8104a0c1 0000000000000000 ffff880225e9bd18 ffff8801bc6e4de0
> [ 7485.268932] 0000000000000000 00000000000000dd ffff8801c8eafd78 ffffffff8104a19a
> [ 7485.270463] Call Trace:
> [ 7485.271338] [<ffffffff816e432d>] dump_stack+0x19/0x1b
> [ 7485.272207] [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> [ 7485.273092] [<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
> [ 7485.273942] [<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
> [ 7485.274793] [<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
> [ 7485.275659] [<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
> [ 7485.276531] [<ffffffff811b778d>] lookup_real+0x1d/0x50
> [ 7485.277646] [<ffffffff811b7d83>] __lookup_hash+0x33/0x40
> [ 7485.278477] [<ffffffff811bb143>] kern_path_create+0xb3/0x190
> [ 7485.279345] [<ffffffff811b93d5>] ? getname_flags+0xb5/0x190
> [ 7485.280292] [<ffffffff811bb261>] user_path_create+0x41/0x60
> [ 7485.281233] [<ffffffff811be6bb>] SyS_symlinkat+0x4b/0xd0
> [ 7485.282072] [<ffffffff816f5a54>] tracesys+0xdd/0xe2
> [ 7485.282973] ---[ end trace 2204b7c65d6c5519 ]---

Hmm. The test case tries to create the symlink in /proc/*/net/ ?

> > + pr_info("YESTHISHAPPENS new=%p\n", new);
>
> This didn't trigger. (yet?)

This should only trigger if the test-case plays with the namespaces...
But once again, the warnings are fine. I hoped that they can provide
more info when/if you reproduce the lockup.

But it seems you can't ?


Dave, I am sorry but all I can do is to ask you to do more testing.
Could you please reproduce the lockup again on the clean Linus's
current ? (and _without_ reverting 8aac6270, of course).

If the watchdog blames __do_softirq() again, I can try to make a
better debugging patch.

Perhaps it makes sense to decrease /proc/sys/kernel/watchdog_thresh
to detect possible lockups earlier. 2 * 10 seconds is probably too much.



And who knows, perhaps you pulled some fix (say 34376a50fb1 looks
promising) after you finished bisecting and then pulled Linus
current.

Thanks,

Oleg.

2013-06-22 21:59:29

by Dave Jones

Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote:

> > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
> > [ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
> > [ 7485.265434] CPU: 2 PID: 5623 Comm: trinity-child3 Not tainted 3.10.0-rc6+ #28
> > [ 7485.267158] ffffffff81a1529c ffff8801c8eafd30 ffffffff816e432d ffff8801c8eafd68
> > [ 7485.268045] ffffffff8104a0c1 0000000000000000 ffff880225e9bd18 ffff8801bc6e4de0
> > [ 7485.268932] 0000000000000000 00000000000000dd ffff8801c8eafd78 ffffffff8104a19a
> > [ 7485.270463] Call Trace:
> > [ 7485.271338] [<ffffffff816e432d>] dump_stack+0x19/0x1b
> > [ 7485.272207] [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > [ 7485.273092] [<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
> > [ 7485.273942] [<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
> > [ 7485.274793] [<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
> > [ 7485.275659] [<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
> > [ 7485.276531] [<ffffffff811b778d>] lookup_real+0x1d/0x50
> > [ 7485.277646] [<ffffffff811b7d83>] __lookup_hash+0x33/0x40
> > [ 7485.278477] [<ffffffff811bb143>] kern_path_create+0xb3/0x190
> > [ 7485.279345] [<ffffffff811b93d5>] ? getname_flags+0xb5/0x190
> > [ 7485.280292] [<ffffffff811bb261>] user_path_create+0x41/0x60
> > [ 7485.281233] [<ffffffff811be6bb>] SyS_symlinkat+0x4b/0xd0
> > [ 7485.282072] [<ffffffff816f5a54>] tracesys+0xdd/0xe2
> > [ 7485.282973] ---[ end trace 2204b7c65d6c5519 ]---
>
> Hmm. The test case tries to create the symlink in /proc/*/net/ ?

I hit it with symlink, but also with some other syscalls, e.g.:

WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 2 PID: 12821 Comm: trinity-child2 Tainted: G W 3.10.0-rc6+ #28
ffffffff81a1529c ffff8801bcbdbc70 ffffffff816e432d ffff8801bcbdbca8
ffffffff8104a0c1 0000000000000000 ffff880226704828 ffff8801b95a7090
00000000ffffff9c 0000000000000008 ffff8801bcbdbcb8 ffffffff8104a19a
Call Trace:
[<ffffffff816e432d>] dump_stack+0x19/0x1b
[<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
[<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
[<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
[<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
[<ffffffff811b778d>] lookup_real+0x1d/0x50
[<ffffffff811b7d83>] __lookup_hash+0x33/0x40
[<ffffffff816e3635>] lookup_slow+0x44/0xa9
[<ffffffff811bae88>] path_lookupat+0x7b8/0x810
[<ffffffff8119d3d2>] ? kmem_cache_alloc+0x142/0x320
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811baf0b>] filename_lookup+0x2b/0xc0
[<ffffffff811be144>] user_path_at_empty+0x54/0x90
[<ffffffff81145965>] ? user_exit+0x45/0x90
[<ffffffff810b773d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[<ffffffff811be191>] user_path_at+0x11/0x20
[<ffffffff811d5338>] SyS_lgetxattr+0x38/0x90
[<ffffffff816f5a54>] tracesys+0xdd/0xe2
---[ end trace 2204b7c65d6c551a ]---

WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 2 PID: 13142 Comm: trinity-child1 Tainted: G W 3.10.0-rc6+ #28
ffffffff81a1529c ffff880228f3dc30 ffffffff816e432d ffff880228f3dc68
ffffffff8104a0c1 0000000000000000 ffff8802239f3790 ffff880224021030
00000000ffffff9c 0000000000000001 ffff880228f3dc78 ffffffff8104a19a
Call Trace:
[<ffffffff816e432d>] dump_stack+0x19/0x1b
[<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
[<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
[<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
[<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
[<ffffffff811b778d>] lookup_real+0x1d/0x50
[<ffffffff811b7d83>] __lookup_hash+0x33/0x40
[<ffffffff816e3635>] lookup_slow+0x44/0xa9
[<ffffffff811bae88>] path_lookupat+0x7b8/0x810
[<ffffffff8119d3d2>] ? kmem_cache_alloc+0x142/0x320
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811baf0b>] filename_lookup+0x2b/0xc0
[<ffffffff811be144>] user_path_at_empty+0x54/0x90
[<ffffffff810ba1f8>] ? lock_release_non_nested+0x308/0x350
[<ffffffff810b4dae>] ? lock_release_holdtime.part.30+0xee/0x170
[<ffffffff811be191>] user_path_at+0x11/0x20
[<ffffffff811e1069>] do_utimes+0xa9/0x160
[<ffffffff811e118f>] SyS_utime+0x6f/0xa0
[<ffffffff816f5a54>] tracesys+0xdd/0xe2
---[ end trace 2204b7c65d6c551e ]---

WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 0 PID: 13692 Comm: trinity-child0 Tainted: G W 3.10.0-rc6+ #28
ffffffff81a1529c ffff88021dea7c90 ffffffff816e432d ffff88021dea7cc8
ffffffff8104a0c1 0000000000000000 ffff880226704828 ffff88022404ae40
00000000ffffff9c 0000000000000000 ffff88021dea7cd8 ffffffff8104a19a
Call Trace:
[<ffffffff816e432d>] dump_stack+0x19/0x1b
[<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
[<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
[<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
[<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
[<ffffffff811b778d>] lookup_real+0x1d/0x50
[<ffffffff811b7d83>] __lookup_hash+0x33/0x40
[<ffffffff816e3635>] lookup_slow+0x44/0xa9
[<ffffffff811bae88>] path_lookupat+0x7b8/0x810
[<ffffffff8119d3d2>] ? kmem_cache_alloc+0x142/0x320
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811b936f>] ? getname_flags+0x4f/0x190
[<ffffffff811baf0b>] filename_lookup+0x2b/0xc0
[<ffffffff811be144>] user_path_at_empty+0x54/0x90
[<ffffffff810b773d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[<ffffffff810b780d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff811be191>] user_path_at+0x11/0x20
[<ffffffff811abe7f>] SyS_chdir+0x2f/0xc0
[<ffffffff816f59f5>] ? tracesys+0x7e/0xe2
[<ffffffff816f5a54>] tracesys+0xdd/0xe2
---[ end trace 2204b7c65d6c5521 ]---

and a bunch of other similar VFS related calls.

> > > + pr_info("YESTHISHAPPENS new=%p\n", new);
> >
> > This didn't trigger. (yet?)
>
> This should only trigger if the test-case plays with the namespaces...
> But once again, the warnings are fine. I hoped that they can provide
> more info when/if you reproduce the lockup.
>
> But it seems you can't ?
>
> Dave, I am sorry but all I can do is to ask you to do more testing.
> Could you please reproduce the lockup again on the clean Linus's
> current ? (and _without_ reverting 8aac6270, of course).

I'll give it a shot. Just rebuilt clean tree, and restarted the tests.

> If watchdog will blame __do_softirq() again I can try to make a
> better debugging patch.
>
> Perhaps it makes sense to decrease /proc/sys/kernel/watchdog_thresh
> to detect the possible lockups earlier. 2 * 10 is probably too much.

I can try that too if it doesn't show up.

> And who knows, perhaps you pulled some fix (say 34376a50fb1 looks
> promising) after you finished bisecting and then pulled Linus
> current.

Maybe, though I'm doubtful. I'm sure I already saw it yesterday
on this tree without the revert, but I'll confirm for sure, of course.

Dave

2013-06-23 05:02:48

by Andrew Vagin

Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 22, 2013 at 05:59:05PM -0400, Dave Jones wrote:
> On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote:
>
> > > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
> > > [ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
> > > [ 7485.265434] CPU: 2 PID: 5623 Comm: trinity-child3 Not tainted 3.10.0-rc6+ #28
> > > [ 7485.267158] ffffffff81a1529c ffff8801c8eafd30 ffffffff816e432d ffff8801c8eafd68
> > > [ 7485.268045] ffffffff8104a0c1 0000000000000000 ffff880225e9bd18 ffff8801bc6e4de0
> > > [ 7485.268932] 0000000000000000 00000000000000dd ffff8801c8eafd78 ffffffff8104a19a
> > > [ 7485.270463] Call Trace:
> > > [ 7485.271338] [<ffffffff816e432d>] dump_stack+0x19/0x1b
> > > [ 7485.272207] [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > > [ 7485.273092] [<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
> > > [ 7485.273942] [<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
> > > [ 7485.274793] [<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
> > > [ 7485.275659] [<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
> > > [ 7485.276531] [<ffffffff811b778d>] lookup_real+0x1d/0x50
> > > [ 7485.277646] [<ffffffff811b7d83>] __lookup_hash+0x33/0x40
> > > [ 7485.278477] [<ffffffff811bb143>] kern_path_create+0xb3/0x190
> > > [ 7485.279345] [<ffffffff811b93d5>] ? getname_flags+0xb5/0x190
> > > [ 7485.280292] [<ffffffff811bb261>] user_path_create+0x41/0x60
> > > [ 7485.281233] [<ffffffff811be6bb>] SyS_symlinkat+0x4b/0xd0
> > > [ 7485.282072] [<ffffffff816f5a54>] tracesys+0xdd/0xe2
> > > [ 7485.282973] ---[ end trace 2204b7c65d6c5519 ]---
> >
> > Hmm. The test case tries to create the symlink in /proc/*/net/ ?
>
> hit it with symlink, but also some other syscalls. eg:
>
> WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
> Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp snd_seq_dummy nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can_raw rds af_802154 nfc can rose caif_socket caif llc2 af_rxrpc phonet ipx p8023 p8022 pppoe pppox ppp_generic netrom slhc ax25 x25 af_key appletalk atm psnap llc irda crc_ccitt bluetooth rfkill coretemp hwmon kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_page_alloc e1000e snd_timer ptp snd pps_core soundcore xfs libcrc32c
> CPU: 2 PID: 12821 Comm: trinity-child2 Tainted: G W 3.10.0-rc6+ #28
> ffffffff81a1529c ffff8801bcbdbc70 ffffffff816e432d ffff8801bcbdbca8
> ffffffff8104a0c1 0000000000000000 ffff880226704828 ffff8801b95a7090
> 00000000ffffff9c 0000000000000008 ffff8801bcbdbcb8 ffffffff8104a19a
> Call Trace:
> [<ffffffff816e432d>] dump_stack+0x19/0x1b
> [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> [<ffffffff8104a19a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff81229f58>] get_proc_task_net+0x1c8/0x1d0
> [<ffffffff81229d95>] ? get_proc_task_net+0x5/0x1d0
> [<ffffffff8122a0bd>] proc_tgid_net_lookup+0x1d/0x80
> [<ffffffff811b778d>] lookup_real+0x1d/0x50
> [<ffffffff811b7d83>] __lookup_hash+0x33/0x40
> [<ffffffff816e3635>] lookup_slow+0x44/0xa9
> [<ffffffff811bae88>] path_lookupat+0x7b8/0x810
> [<ffffffff8119d3d2>] ? kmem_cache_alloc+0x142/0x320
> [<ffffffff811b936f>] ? getname_flags+0x4f/0x190
> [<ffffffff811b936f>] ? getname_flags+0x4f/0x190
> [<ffffffff811baf0b>] filename_lookup+0x2b/0xc0
> [<ffffffff811be144>] user_path_at_empty+0x54/0x90
> [<ffffffff81145965>] ? user_exit+0x45/0x90
> [<ffffffff810b773d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [<ffffffff811be191>] user_path_at+0x11/0x20
> [<ffffffff811d5338>] SyS_lgetxattr+0x38/0x90
> [<ffffffff816f5a54>] tracesys+0xdd/0xe2
> ---[ end trace 2204b7c65d6c551a ]---
>
>
> and a bunch of other similar VFS related calls.

All these VFS related calls try to work with /proc/PID/ns/SMTH. If a
process is a zombie, it is detached from its namespaces (except the pid
namespace), so if we try to read /proc/PID/ns/SMTH, we get this warning.
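
For illustration, here is a rough sketch of the lookup path that trips
this warning (paraphrased from fs/proc/proc_net.c as of 3.10, not the
exact source; the WARN_ON() in task_nsproxy() is the one added by the
debugging patch discussed in this thread):

        /* paraphrased sketch of the 3.10 code, for illustration only */
        static struct net *get_proc_task_net(struct inode *dir)
        {
                struct task_struct *task;
                struct nsproxy *ns;
                struct net *net = NULL;

                rcu_read_lock();
                task = pid_task(proc_pid(dir), PIDTYPE_PID);
                if (task != NULL) {
                        /*
                         * For a zombie, exit_task_namespaces() has already
                         * cleared ->nsproxy, so task_nsproxy() returns NULL
                         * (and the debugging WARN_ON() fires); the lookup
                         * simply falls through and returns NULL here.
                         */
                        ns = task_nsproxy(task);
                        if (ns != NULL)
                                net = get_net(ns->net_ns);
                }
                rcu_read_unlock();

                return net;
        }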

>

2013-06-23 14:41:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/22, Dave Jones wrote:
>
> On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote:
>
> > > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 get_proc_task_net+0x1c8/0x1d0()
> >
> > Hmm. The test case tries to create the symlink in /proc/*/net/ ?
>
> hit it with symlink, but also some other syscalls. eg:

Yes, this is fine; the warnings without a lockup are not interesting.
I thought that, perhaps, an exiting task could trigger this WARN()
from task_work_run(). That would be fine too, but it could provide
more info if the lockup happens after that. So please ignore.

> > But it seems you can't ?
> >
> > Dave, I am sorry but all I can do is to ask you to do more testing.
> > Could you please reproduce the lockup again on the clean Linus's
> > current ? (and _without_ reverting 8aac6270, of course).
>
> I'll give it a shot. Just rebuilt clean tree, and restarted the tests.

Thanks a lot.

Oleg.

2013-06-23 15:06:26

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sun, Jun 23, 2013 at 04:36:34PM +0200, Oleg Nesterov wrote:

> > > Dave, I am sorry but all I can do is to ask you to do more testing.
> > > Could you please reproduce the lockup again on the clean Linus's
> > > current ? (and _without_ reverting 8aac6270, of course).
> >
> > I'll give it a shot. Just rebuilt clean tree, and restarted the tests.
>
> Thanks a lot.

ok, hit it on rc7 without the revert

[11018.927809] [sched_delayed] sched: RT throttling activated
[11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:14482]
[11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp bnep rfcomm can_raw ipt_ULOG can_bcm nfnetlink af_rxrpc llc2 rose caif_socket caif can netrom appletalk af_802154 scsi_transport_iscsi nfc pppoe pppox ppp_generic slhc ipx p8023 psnap p8022 llc ax25 irda crc_ccitt af_key bluetooth rfkill x25 rds atm phonet coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi microcode snd_hda_intel snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc ptp snd_timer pps_core snd soundcore xfs libcrc32c
[11054.905490] irq event stamp: 3857095
[11054.905926] hardirqs last enabled at (3857094): [<ffffffff816ed9a0>] restore_args+0x0/0x30
[11054.906945] hardirqs last disabled at (3857095): [<ffffffff816f64aa>] apic_timer_interrupt+0x6a/0x80
[11054.908054] softirqs last enabled at (3856322): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[11054.909102] softirqs last disabled at (3856325): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[11054.910088] CPU: 2 PID: 14482 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #31
[11054.912900] task: ffff8801ae44ca40 ti: ffff88021fe60000 task.ti: ffff88021fe60000
[11054.913800] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[11054.914786] RSP: 0018:ffff880244c03f08 EFLAGS: 00000206
[11054.915428] RAX: ffff8801ae44ca40 RBX: ffffffff816ed9a0 RCX: 0000000000000000
[11054.916286] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801ae44ca40
[11054.917143] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[11054.918002] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[11054.919799] R13: ffffffff816f64af R14: ffff880244c03f70 R15: 0000000000000000
[11054.921601] FS: 00007f36e952f740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[11054.923529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11054.925182] CR2: 0000003850a74cf0 CR3: 00000002081b9000 CR4: 00000000001407e0
[11054.927009] DR0: 0000000001550000 DR1: 0000000000000000 DR2: 0000000000000000
[11054.928830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[11054.930643] Stack:
[11054.931862] 0000000a00406040 0000000100106b66 ffff88021fe61fd8 ffff88021fe61fd8
[11054.933777] ffff88021fe61fd8 ffff8801ae44ce38 ffff88021fe61fd8 ffffffff00000002
[11054.935694] ffff8801ae44ca40 0000000000000000 ffffea0008f49280 ffff8802434cc240
[11054.937617] Call Trace:
[11054.938916] <IRQ>

[11054.940347] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[11054.941790] [<ffffffff816f734b>] smp_apic_timer_interrupt+0x6b/0x9b
[11054.943550] [<ffffffff816f64af>] apic_timer_interrupt+0x6f/0x80
[11054.945270] <EOI>

[11054.946705] [<ffffffff816ed9a0>] ? retint_restore_args+0xe/0xe
[11054.948267] [<ffffffff816ecd67>] ? _raw_spin_unlock_irqrestore+0x67/0x80
[11054.950090] [<ffffffff816e28ec>] __slab_free+0x5f/0x39f
[11054.951737] [<ffffffff816ecd75>] ? _raw_spin_unlock_irqrestore+0x75/0x80
[11054.953556] [<ffffffff8131506e>] ? debug_check_no_obj_freed+0x14e/0x250
[11054.955362] [<ffffffff81199335>] ? kmem_cache_free+0x95/0x300
[11054.957065] [<ffffffff8119958c>] kmem_cache_free+0x2ec/0x300
[11054.958759] [<ffffffff81047104>] ? __put_task_struct+0x64/0x140
[11054.960485] [<ffffffff81047104>] __put_task_struct+0x64/0x140
[11054.962184] [<ffffffff81086d4f>] finish_task_switch+0x11f/0x130
[11054.963899] [<ffffffff81086c77>] ? finish_task_switch+0x47/0x130
[11054.965632] [<ffffffff816eae24>] __schedule+0x444/0xa40
[11054.967271] [<ffffffff816eba83>] preempt_schedule_irq+0x53/0x90
[11054.968994] [<ffffffff816edab6>] retint_kernel+0x26/0x30
[11054.970656] [<ffffffff81145877>] ? user_enter+0x87/0xd0
[11054.972305] [<ffffffff8100f6a8>] syscall_trace_leave+0x78/0x140
[11054.974029] [<ffffffff816f5b2f>] int_check_syscall_exit_work+0x34/0x3d
[11054.975819] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 a7 35 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74

2013-06-23 16:09:18

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/23, Dave Jones wrote:
>
> On Sun, Jun 23, 2013 at 04:36:34PM +0200, Oleg Nesterov wrote:
>
> > > > Dave, I am sorry but all I can do is to ask you to do more testing.
> > > > Could you please reproduce the lockup again on the clean Linus's
> > > > current ? (and _without_ reverting 8aac6270, of course).
> > >
> > > I'll give it a shot. Just rebuilt clean tree, and restarted the tests.
> >
> > Thanks a lot.
>
> ok, hit it on rc7 without the revert

Great, thanks.

> [11018.927809] [sched_delayed] sched: RT throttling activated
> [11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:14482]
> [11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp bnep rfcomm can_raw ipt_ULOG can_bcm nfnetlink af_rxrpc llc2 rose caif_socket caif can netrom appletalk af_802154 scsi_transport_iscsi nfc pppoe pppox ppp_generic slhc ipx p8023 psnap p8022 llc ax25 irda crc_ccitt af_key bluetooth rfkill x25 rds atm phonet coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi microcode snd_hda_intel snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc ptp snd_timer pps_core snd soundcore xfs libcrc32c
> [11054.905490] irq event stamp: 3857095
> [11054.905926] hardirqs last enabled at (3857094): [<ffffffff816ed9a0>] restore_args+0x0/0x30
> [11054.906945] hardirqs last disabled at (3857095): [<ffffffff816f64aa>] apic_timer_interrupt+0x6a/0x80
> [11054.908054] softirqs last enabled at (3856322): [<ffffffff810542e4>] __do_softirq+0x194/0x440
> [11054.909102] softirqs last disabled at (3856325): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
> [11054.910088] CPU: 2 PID: 14482 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #31
> [11054.912900] task: ffff8801ae44ca40 ti: ffff88021fe60000 task.ti: ffff88021fe60000
> [11054.913800] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440

OK, __do_softirq() again. But this doesn't necessarily mean it
is the offender.

Just in case, did you change /proc/sys/kernel/watchdog_thresh ?
This time the numbers look different.

Could you please do the following:

1. # cd /sys/kernel/debug/tracing
# echo 0 >> options/function-trace
# echo preemptirqsoff >> current_tracer

2. reproduce the lockup again

3. show the result of
# cat trace

Oleg.

2013-06-24 00:21:50

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:

> > [11018.927809] [sched_delayed] sched: RT throttling activated
> > [11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:14482]
> > [11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp bnep rfcomm can_raw ipt_ULOG can_bcm nfnetlink af_rxrpc llc2 rose caif_socket caif can netrom appletalk af_802154 scsi_transport_iscsi nfc pppoe pppox ppp_generic slhc ipx p8023 psnap p8022 llc ax25 irda crc_ccitt af_key bluetooth rfkill x25 rds atm phonet coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi microcode snd_hda_intel snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc ptp snd_timer pps_core snd soundcore xfs libcrc32c
> > [11054.905490] irq event stamp: 3857095
> > [11054.905926] hardirqs last enabled at (3857094): [<ffffffff816ed9a0>] restore_args+0x0/0x30
> > [11054.906945] hardirqs last disabled at (3857095): [<ffffffff816f64aa>] apic_timer_interrupt+0x6a/0x80
> > [11054.908054] softirqs last enabled at (3856322): [<ffffffff810542e4>] __do_softirq+0x194/0x440
> > [11054.909102] softirqs last disabled at (3856325): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
> > [11054.910088] CPU: 2 PID: 14482 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #31
> > [11054.912900] task: ffff8801ae44ca40 ti: ffff88021fe60000 task.ti: ffff88021fe60000
> > [11054.913800] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
>
> OK, __do_softirq() again. But this doesn't necessarily mean it
> is the offender.
>
> Just in case, did you change /proc/sys/kernel/watchdog_thresh ?
> This time the numbers look different.

I hadn't. Also, before I left this morning, I left the test running on rc7 + your patch
(without that one WARN_ON that was too easily triggered in task_nsproxy).
Extra traces from that run are below. Still no sign of the printk.

> Could you please do the following:
>
> 1. # cd /sys/kernel/debug/tracing
> # echo 0 >> options/function-trace
> # echo preemptirqsoff >> current_tracer

rebuilding kernel with that now. I should have results by the morning.

bonus traces below.

Dave

[24966.306205] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:354]
[24966.307018] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[24966.314143] irq event stamp: 2212169
[24966.314580] hardirqs last enabled at (2212168): [<ffffffff816eda20>] restore_args+0x0/0x30
[24966.315599] hardirqs last disabled at (2212169): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[24966.316709] softirqs last enabled at (2211394): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[24966.317758] softirqs last disabled at (2211397): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[24966.318745] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[24966.321517] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[24966.322418] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[24966.323404] RSP: 0018:ffff880244c03f08 EFLAGS: 00000202
[24966.324050] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[24966.324914] RDX: 0000000000000450 RSI: ffff88020d7a07b8 RDI: ffff88020d7a0000
[24966.325777] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[24966.326641] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[24966.327505] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[24966.329319] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[24966.331246] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[24966.332900] CR2: 00007f9b1ec28070 CR3: 00000001b0481000 CR4: 00000000001407e0
[24966.334732] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[24966.336555] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[24966.338364] Stack:
[24966.339576] 0000000a00406040 000000010025a8e5 ffff880165dfffd8 ffff880165dfffd8
[24966.341487] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[24966.343399] ffff88020d7a0000 0000000000000000 ffff88023d3023a0 ffff880224986e00
[24966.345315] Call Trace:
[24966.346608] <IRQ>

[24966.348033] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[24966.349460] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[24966.351202] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[24966.352904] <EOI>

[24966.354328] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[24966.355870] [<ffffffff811dad2f>] ? sync_inodes_sb+0x19f/0x2a0
[24966.357559] [<ffffffff811dad28>] ? sync_inodes_sb+0x198/0x2a0
[24966.359235] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[24966.360959] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[24966.362594] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[24966.364296] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[24966.365970] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[24966.367623] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[24966.369199] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[24966.370770] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[24990.292787] BUG: soft lockup - CPU#2 stuck for 23s! [trinity-child2:354]
[24990.294617] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[24990.306017] irq event stamp: 4061851
[24990.307570] hardirqs last enabled at (4061850): [<ffffffff816eda20>] restore_args+0x0/0x30
[24990.309725] hardirqs last disabled at (4061851): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[24990.311968] softirqs last enabled at (4061076): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[24990.314152] softirqs last disabled at (4061079): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[24990.316268] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[24990.321369] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[24990.323458] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[24990.325634] RSP: 0018:ffff880244c03f08 EFLAGS: 00000202
[24990.327484] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[24990.329534] RDX: 0000000000000450 RSI: ffff88020d7a07b8 RDI: ffff88020d7a0000
[24990.331589] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[24990.333618] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[24990.335621] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[24990.337605] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[24990.339697] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[24990.341500] CR2: 0000000000000000 CR3: 00000001b0481000 CR4: 00000000001407e0
[24990.343461] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[24990.345412] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[24990.347350] Stack:
[24990.348696] 0000000a00406040 000000010025b245 ffff880165dfffd8 ffff880165dfffd8
[24990.350712] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[24990.352707] ffff88020d7a0000 0000000000000000 ffff8802361ac630 0000000000000000
[24990.354701] Call Trace:
[24990.356043] <IRQ>

[24990.357512] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[24990.358985] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[24990.360771] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[24990.362522] <EOI>

[24990.363990] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[24990.365587] [<ffffffff811dad4a>] ? sync_inodes_sb+0x1ba/0x2a0
[24990.367308] [<ffffffff811dad28>] ? sync_inodes_sb+0x198/0x2a0
[24990.369020] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[24990.370768] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[24990.372427] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[24990.374160] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[24990.375860] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[24990.377521] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[24990.379105] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[24990.380680] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[25014.249357] BUG: soft lockup - CPU#1 stuck for 22s! [trinity-main:13421]
[25014.249360] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-main:13522]
[25014.249382] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25014.249382] irq event stamp: 6118396
[25014.249387] hardirqs last enabled at (6118395): [<ffffffff816eda20>] restore_args+0x0/0x30
[25014.249389] hardirqs last disabled at (6118396): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25014.249391] softirqs last enabled at (6118394): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25014.249393] softirqs last disabled at (6118389): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25014.249396] CPU: 0 PID: 13522 Comm: trinity-main Not tainted 3.10.0-rc7+ #32
[25014.249397] task: ffff880229e70000 ti: ffff880229c5a000 task.ti: ffff880229c5a000
[25014.249401] RIP: 0010:[<ffffffff81312163>] [<ffffffff81312163>] do_raw_spin_lock+0xd3/0x130
[25014.249401] RSP: 0018:ffff880229c5bc80 EFLAGS: 00000202
[25014.249402] RAX: ffff880229c5bfd8 RBX: ffffffffffffff10 RCX: 000000000000b910
[25014.249402] RDX: 0000000000002726 RSI: 0000000000000001 RDI: 0000000000000001
[25014.249403] RBP: ffff880229c5bc98 R08: 0000000000000000 R09: 0000000000000000
[25014.249403] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[25014.249404] R13: 0000000000000015 R14: 000000000000b910 R15: ffff880229c5bfd8
[25014.249405] FS: 00007fe7a0216740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
[25014.249405] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25014.249406] CR2: 00007fcc44ab8070 CR3: 000000022f655000 CR4: 00000000001407f0
[25014.249406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[25014.249407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25014.249407] Stack:
[25014.249409] ffffffff81c04640 ffffffff81c04658 ffffffff8181d140 ffff880229c5bcc0
[25014.249410] ffffffff816ec980 ffffffff811c908b ffff88018ae6b208 ffff88018ae6b3d0
[25014.249412] ffff880229c5bce8 ffffffff811c908b ffff88018ae6b208 ffff88018ae6b290
[25014.249412] Call Trace:
[25014.249414] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25014.249416] [<ffffffff811c908b>] ? evict+0x6b/0x1a0
[25014.249417] [<ffffffff811c908b>] evict+0x6b/0x1a0
[25014.249419] [<ffffffff811c9a55>] iput+0xf5/0x190
[25014.249421] [<ffffffff811c50e8>] dput+0x208/0x2f0
[25014.249424] [<ffffffff81220e96>] proc_flush_task+0xc6/0x1b0
[25014.249425] [<ffffffff8104eace>] release_task+0xbe/0x690
[25014.249427] [<ffffffff8104ea29>] ? release_task+0x19/0x690
[25014.249428] [<ffffffff810508a8>] wait_consider_task+0xb18/0xee0
[25014.249430] [<ffffffff810503c0>] ? wait_consider_task+0x630/0xee0
[25014.249431] [<ffffffff81050d70>] do_wait+0x100/0x370
[25014.249433] [<ffffffff81051414>] SyS_wait4+0x64/0xe0
[25014.249435] [<ffffffff8104e5e0>] ? task_stopped_code+0x60/0x60
[25014.249436] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25014.249451] Code: 00 00 89 43 08 65 48 8b 04 25 00 ba 00 00 48 89 43 10 5b 41 5c 41 5d 5d c3 8d 8a 00 01 00 00 89 d0 f0 66 0f b1 0b 66 39 d0 74 cf <bf> 01 00 00 00 49 83 c4 01 e8 df 79 ff ff 4d 39 ec 0f 84 6e ff
[25014.309323] BUG: soft lockup - CPU#3 stuck for 22s! [trinity-child3:764]
[25014.309344] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25014.309345] irq event stamp: 39760
[25014.309348] hardirqs last enabled at (39759): [<ffffffff816eda20>] restore_args+0x0/0x30
[25014.309350] hardirqs last disabled at (39760): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25014.309352] softirqs last enabled at (39758): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25014.309353] softirqs last disabled at (39753): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25014.309356] CPU: 3 PID: 764 Comm: trinity-child3 Not tainted 3.10.0-rc7+ #32
[25014.309357] task: ffff8801a583a520 ti: ffff88022d0d6000 task.ti: ffff88022d0d6000
[25014.309361] RIP: 0010:[<ffffffff81309c2f>] [<ffffffff81309c2f>] delay_tsc+0x2f/0xe0
[25014.309361] RSP: 0018:ffff88022d0d7d50 EFLAGS: 00000202
[25014.309362] RAX: 00000000d879b906 RBX: ffffffff816eda20 RCX: 000000000000b910
[25014.309362] RDX: 0000000000003444 RSI: 0000000000000001 RDI: 0000000000000001
[25014.309363] RBP: ffff88022d0d7d78 R08: 0000000000000000 R09: 0000000000000000
[25014.309363] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022d0d7cc8
[25014.309364] R13: 0000000000000046 R14: ffff88022d0d6000 R15: ffff8801a583a520
[25014.309365] FS: 00007f6a251c7740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
[25014.309365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25014.309366] CR2: 0000000000000001 CR3: 00000001d16b4000 CR4: 00000000001407e0
[25014.309366] DR0: 0000000000ae4000 DR1: 0000000000000000 DR2: 0000000000000000
[25014.309367] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25014.309367] Stack:
[25014.309369] ffffffff81c04640 000000002abffe8a 0000000088c66b68 0000000000000000
[25014.309370] ffffffff00000000 ffff88022d0d7d88 ffffffff81309b5f ffff88022d0d7db0
[25014.309371] ffffffff81312171 ffffffff81c04640 ffffffff81c04658 ffff88023d3023a0
[25014.309372] Call Trace:
[25014.309374] [<ffffffff81309b5f>] __delay+0xf/0x20
[25014.309376] [<ffffffff81312171>] do_raw_spin_lock+0xe1/0x130
[25014.309378] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25014.309380] [<ffffffff811dad04>] ? sync_inodes_sb+0x174/0x2a0
[25014.309382] [<ffffffff811dad04>] sync_inodes_sb+0x174/0x2a0
[25014.309384] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25014.309387] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25014.309390] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25014.309391] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25014.309393] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25014.309394] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25014.309396] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25014.309410] Code: 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 89 fc bf 01 00 00 00 53 e8 51 78 3e 00 e8 cc 9e 00 00 41 89 c5 0f 1f 00 0f ae e8 0f 31 <65> 4c 8b 3c 25 f0 b9 00 00 89 c3 eb 2f 0f 1f 40 00 bf 01 00 00
[25014.422967] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25014.434185] irq event stamp: 5462930
[25014.435696] hardirqs last enabled at (5462929): [<ffffffff816eda20>] restore_args+0x0/0x30
[25014.437815] hardirqs last disabled at (5462930): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25014.440029] softirqs last enabled at (5462928): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25014.442184] softirqs last disabled at (5462923): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25014.444272] CPU: 1 PID: 13421 Comm: trinity-main Not tainted 3.10.0-rc7+ #32
[25014.449318] task: ffff880240e2ca40 ti: ffff88022fa9e000 task.ti: ffff88022fa9e000
[25014.451386] RIP: 0010:[<ffffffff81309c2f>] [<ffffffff81309c2f>] delay_tsc+0x2f/0xe0
[25014.453506] RSP: 0018:ffff88022fa9fb88 EFLAGS: 00000202
[25014.455340] RAX: 00000000d044e152 RBX: 0000000000000000 RCX: 000000000000b910
[25014.457405] RDX: 0000000000003444 RSI: 0000000000000001 RDI: 0000000000000001
[25014.459454] RBP: ffff88022fa9fbb0 R08: 0000000000000000 R09: 0000000000000000
[25014.461510] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[25014.463531] R13: 0000000000000001 R14: ffffffff8130afce R15: ffff88022fa9fbc0
[25014.465529] FS: 00007f6a251c7740(0000) GS:ffff880244a00000(0000) knlGS:0000000000000000
[25014.467637] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25014.469461] CR2: 0000003850a74cf0 CR3: 000000023b828000 CR4: 00000000001407e0
[25014.471431] DR0: 0000000002015000 DR1: 0000000000000000 DR2: 0000000000000000
[25014.473371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25014.475280] Stack:
[25014.476570] ffffffff81c04640 000000002a9da73e 0000000088c66b68 ffffffff8181d140
[25014.478549] ffff88014d45c000 ffff88022fa9fbc0 ffffffff81309b5f ffff88022fa9fbe8
[25014.480522] ffffffff81312171 ffffffff81c04640 ffffffff81c04658 ffffffff8181d140
[25014.482498] Call Trace:
[25014.483853] [<ffffffff81309b5f>] __delay+0xf/0x20
[25014.485465] [<ffffffff81312171>] do_raw_spin_lock+0xe1/0x130
[25014.487195] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25014.488856] [<ffffffff811c908b>] ? evict+0x6b/0x1a0
[25014.490448] [<ffffffff811c908b>] evict+0x6b/0x1a0
[25014.492003] [<ffffffff811c9a55>] iput+0xf5/0x190
[25014.493541] [<ffffffff811c5c88>] shrink_dentry_list+0x4a8/0x600
[25014.495243] [<ffffffff811c57e5>] ? shrink_dentry_list+0x5/0x600
[25014.496934] [<ffffffff811c60e6>] shrink_dcache_parent+0x266/0x300
[25014.498655] [<ffffffff81220e86>] proc_flush_task+0xb6/0x1b0
[25014.500308] [<ffffffff8104eace>] release_task+0xbe/0x690
[25014.501924] [<ffffffff8104ea29>] ? release_task+0x19/0x690
[25014.503557] [<ffffffff810508a8>] wait_consider_task+0xb18/0xee0
[25014.505230] [<ffffffff810503c0>] ? wait_consider_task+0x630/0xee0
[25014.506937] [<ffffffff81050d70>] do_wait+0x100/0x370
[25014.508501] [<ffffffff81051414>] SyS_wait4+0x64/0xe0
[25014.510064] [<ffffffff8104e5e0>] ? task_stopped_code+0x60/0x60
[25014.511720] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25014.513261] Code: 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 89 fc bf 01 00 00 00 53 e8 51 78 3e 00 e8 cc 9e 00 00 41 89 c5 0f 1f 00 0f ae e8 0f 31 <65> 4c 8b 3c 25 f0 b9 00 00 89 c3 eb 2f 0f 1f 40 00 bf 01 00 00
[25018.277105] BUG: soft lockup - CPU#2 stuck for 23s! [trinity-child2:354]
[25018.278898] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25018.290239] irq event stamp: 6233275
[25018.291784] hardirqs last enabled at (6233274): [<ffffffff816eda20>] restore_args+0x0/0x30
[25018.293927] hardirqs last disabled at (6233275): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25018.296166] softirqs last enabled at (6232502): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25018.298341] softirqs last disabled at (6232505): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25018.300465] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[25018.305556] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[25018.307628] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[25018.309774] RSP: 0018:ffff880244c03f08 EFLAGS: 00000202
[25018.311559] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[25018.313539] RDX: 00000000000031b0 RSI: ffff88020d7a07f0 RDI: ffff88020d7a0000
[25018.315494] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[25018.317427] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[25018.319343] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[25018.321247] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[25018.323282] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25018.325044] CR2: 00007f7117e6eaf0 CR3: 00000001b0481000 CR4: 00000000001407e0
[25018.326966] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[25018.328874] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25018.330758] Stack:
[25018.332025] 0000000a00406040 000000010025bd35 ffff880165dfffd8 ffff880165dfffd8
[25018.333971] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[25018.335906] ffff88020d7a0000 0000000000000000 0000000000000000 0000000000000002
[25018.337851] Call Trace:
[25018.339170] <IRQ>

[25018.340624] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25018.342091] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[25018.343872] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[25018.345608] <EOI>

[25018.347056] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[25018.348615] [<ffffffff810b9da6>] ? lock_acquire+0xa6/0x1f0
[25018.350295] [<ffffffff811dad52>] ? sync_inodes_sb+0x1c2/0x2a0
[25018.352004] [<ffffffff816ec960>] _raw_spin_lock+0x40/0x80
[25018.353665] [<ffffffff811dad52>] ? sync_inodes_sb+0x1c2/0x2a0
[25018.355363] [<ffffffff811dad52>] sync_inodes_sb+0x1c2/0x2a0
[25018.357024] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25018.358756] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25018.360399] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25018.362093] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25018.363769] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25018.365416] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25018.366992] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25018.368570] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[25042.263653] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:354]
[25042.265498] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25042.277003] irq event stamp: 8090805
[25042.278588] hardirqs last enabled at (8090804): [<ffffffff816eda20>] restore_args+0x0/0x30
[25042.280769] hardirqs last disabled at (8090805): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25042.283040] softirqs last enabled at (8090032): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25042.285263] softirqs last disabled at (8090035): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25042.287422] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[25042.292564] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[25042.293647] BUG: soft lockup - CPU#3 stuck for 22s! [trinity-child1:785]
[25042.293670] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25042.293670] irq event stamp: 28128
[25042.293675] hardirqs last enabled at (28127): [<ffffffff816eda20>] restore_args+0x0/0x30
[25042.293677] hardirqs last disabled at (28128): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25042.293679] softirqs last enabled at (28126): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25042.293681] softirqs last disabled at (28121): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25042.293683] CPU: 3 PID: 785 Comm: trinity-child1 Not tainted 3.10.0-rc7+ #32
[25042.293685] task: ffff88017f254a40 ti: ffff880189b9a000 task.ti: ffff880189b9a000
[25042.293689] RIP: 0010:[<ffffffff81313b0c>] [<ffffffff81313b0c>] debug_smp_processor_id+0x1c/0xf0
[25042.293689] RSP: 0018:ffff880189b9baf0 EFLAGS: 00000297
[25042.293690] RAX: 0000000000000002 RBX: ffff880189b9ba78 RCX: 000000000000b910
[25042.293690] RDX: 0000000000004140 RSI: 0000000000000001 RDI: 0000000000000001
[25042.293691] RBP: ffff880189b9baf8 R08: 0000000000000000 R09: 0000000000000000
[25042.293691] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880189b9a000
[25042.293692] R13: ffff88017f254a40 R14: 0000000000000000 R15: 0000000000000000
[25042.293693] FS: 00007fe7a0216740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
[25042.293693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25042.293694] CR2: 0000003850ae6500 CR3: 00000002416fb000 CR4: 00000000001407e0
[25042.293694] DR0: 0000000000ae4000 DR1: 0000000000000000 DR2: 0000000000000000
[25042.293695] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25042.293695] Stack:
[25042.293697] ffffffff81c04640 ffff880189b9bb30 ffffffff81309c24 ffffffff81c04640
[25042.293698] 000000002a9962a7 0000000088c66b68 0000000000000311 0000000000000000
[25042.293700] ffff880189b9bb40 ffffffff81309b5f ffff880189b9bb68 ffffffff81312171
[25042.293700] Call Trace:
[25042.293703] [<ffffffff81309c24>] delay_tsc+0x24/0xe0
[25042.293705] [<ffffffff81309b5f>] __delay+0xf/0x20
[25042.293707] [<ffffffff81312171>] do_raw_spin_lock+0xe1/0x130
[25042.293710] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25042.293712] [<ffffffff811c7dc9>] ? inode_sb_list_add+0x19/0x50
[25042.293713] [<ffffffff811c7dc9>] inode_sb_list_add+0x19/0x50
[25042.293715] [<ffffffff811ca409>] new_inode+0x29/0x30
[25042.293717] [<ffffffff8121f58f>] proc_pid_make_inode+0x1f/0x250
[25042.293719] [<ffffffff8121f7db>] proc_pid_instantiate+0x1b/0xd0
[25042.293721] [<ffffffff812210bc>] proc_pid_lookup+0x13c/0x200
[25042.293722] [<ffffffff8122100e>] ? proc_pid_lookup+0x8e/0x200
[25042.293724] [<ffffffff8121b52f>] proc_root_lookup+0x2f/0x40
[25042.293726] [<ffffffff811b77dd>] lookup_real+0x1d/0x50
[25042.293727] [<ffffffff811b7dd3>] __lookup_hash+0x33/0x40
[25042.293730] [<ffffffff816e36b5>] lookup_slow+0x44/0xa9
[25042.293731] [<ffffffff811ba453>] link_path_walk+0x733/0x900
[25042.293733] [<ffffffff811bd704>] path_openat+0x94/0x530
[25042.293736] [<ffffffff8100a384>] ? native_sched_clock+0x24/0x80
[25042.293739] [<ffffffff81091db5>] ? sched_clock_cpu+0xb5/0x100
[25042.293741] [<ffffffff81091db5>] ? sched_clock_cpu+0xb5/0x100
[25042.293742] [<ffffffff811be228>] do_filp_open+0x38/0x80
[25042.293744] [<ffffffff816eccf1>] ? _raw_spin_unlock+0x31/0x60
[25042.293745] [<ffffffff811ccd7f>] ? __alloc_fd+0xaf/0x200
[25042.293747] [<ffffffff811ac669>] do_sys_open+0xe9/0x1c0
[25042.293749] [<ffffffff811ac75e>] SyS_open+0x1e/0x20
[25042.293750] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25042.293765] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 66 90 55 48 89 e5 53 65 48 8b 04 25 f0 b9 00 00 8b 80 44 e0 ff ff 65 8b 1c 25 1c b0 00 00 <85> c0 74 05 89 d8 5b 5d c3 9c 58 f6 c4 02 74 f4 89 d8 8b 15 c4
[25042.408673] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[25042.410647] RSP: 0018:ffff880244c03f08 EFLAGS: 00000206
[25042.412286] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[25042.414133] RDX: 0000000000000450 RSI: ffff88020d7a07b8 RDI: ffff88020d7a0000
[25042.415957] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[25042.417765] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[25042.419569] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[25042.421352] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[25042.423233] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25042.424814] CR2: 00007f711c934088 CR3: 00000001b0481000 CR4: 00000000001407e0
[25042.426540] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[25042.428252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25042.429950] Stack:
[25042.431040] 0000000a00406040 000000010025c695 ffff880165dfffd8 ffff880165dfffd8
[25042.432823] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[25042.434610] ffff88020d7a0000 0000000000000000 ffff8802361ac630 ffff880195249b80
[25042.436381] Call Trace:
[25042.437508] <IRQ>

[25042.438752] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25042.439989] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[25042.441543] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[25042.443048] <EOI>

[25042.444272] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[25042.445626] [<ffffffff811dad2f>] ? sync_inodes_sb+0x19f/0x2a0
[25042.447124] [<ffffffff811dad28>] ? sync_inodes_sb+0x198/0x2a0
[25042.448614] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25042.450140] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25042.451578] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25042.453070] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25042.454549] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25042.455993] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25042.457357] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25042.458714] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[25070.218002] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:701]
[25070.218004] BUG: soft lockup - CPU#1 stuck for 24s! [trinity-child1:725]
[25070.218028] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25070.218028] irq event stamp: 43906
[25070.218033] hardirqs last enabled at (43905): [<ffffffff816eda20>] restore_args+0x0/0x30
[25070.218036] hardirqs last disabled at (43906): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25070.218039] softirqs last enabled at (43904): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25070.218041] softirqs last disabled at (43899): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25070.218043] CPU: 1 PID: 725 Comm: trinity-child1 Not tainted 3.10.0-rc7+ #32
[25070.218045] task: ffff8801b0768000 ti: ffff88022d1ca000 task.ti: ffff88022d1ca000
[25070.218047] RIP: 0010:[<ffffffff816eb89f>] [<ffffffff816eb89f>] preempt_schedule+0xf/0x60
[25070.218048] RSP: 0018:ffff88022d1cbd48 EFLAGS: 00000202
[25070.218049] RAX: ffff88022d1cbfd8 RBX: ffff88022d1cbcc8 RCX: 000000000000b910
[25070.218049] RDX: 0000000000000015 RSI: 0000000000000001 RDI: 0000000000000001
[25070.218050] RBP: ffff88022d1cbd78 R08: 0000000000000000 R09: 0000000000000000
[25070.218050] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022d1ca000
[25070.218051] R13: ffff8801b0768000 R14: 0000000000000000 R15: 0000000000000000
[25070.218051] FS: 00007f641bcda740(0000) GS:ffff880244a00000(0000) knlGS:0000000000000000
[25070.218052] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25070.218053] CR2: 00007ffd6497f000 CR3: 000000022f848000 CR4: 00000000001407e0
[25070.218053] DR0: 0000000002015000 DR1: 0000000000000000 DR2: 0000000000000000
[25070.218054] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25070.218054] Stack:
[25070.218056] ffffffff81309cd5 ffffffff81c04640 000000002dd7c28f 0000000088c66b68
[25070.218057] 0000000000000000 ffffffff00000000 ffff88022d1cbd88 ffffffff81309b5f
[25070.218059] ffff88022d1cbdb0 ffffffff81312171 ffffffff81c04640 ffffffff81c04658
[25070.218059] Call Trace:
[25070.218062] [<ffffffff81309cd5>] ? delay_tsc+0xd5/0xe0
[25070.218064] [<ffffffff81309b5f>] __delay+0xf/0x20
[25070.218067] [<ffffffff81312171>] do_raw_spin_lock+0xe1/0x130
[25070.218068] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25070.218071] [<ffffffff811dad04>] ? sync_inodes_sb+0x174/0x2a0
[25070.218072] [<ffffffff811dad04>] sync_inodes_sb+0x174/0x2a0
[25070.218074] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25070.218077] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25070.218078] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25070.218079] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25070.218082] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25070.218083] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25070.218085] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25070.218099] Code: 1f 44 00 00 48 8d 47 18 48 39 47 18 75 e4 e9 71 ff ff ff 66 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 f0 b9 00 00 8b b0 44 e0 ff ff <85> f6 74 01 c3 9c 58 f6 c4 02 74 f8 55 48 89 e5 41 55 41 54 53
[25070.247984] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:354]
[25070.248006] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25070.248009] irq event stamp: 10251323
[25070.248015] hardirqs last enabled at (10251322): [<ffffffff816eda20>] restore_args+0x0/0x30
[25070.248017] hardirqs last disabled at (10251323): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25070.248020] softirqs last enabled at (10250550): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25070.248021] softirqs last disabled at (10250553): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25070.248025] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[25070.248027] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[25070.248029] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[25070.248030] RSP: 0018:ffff880244c03f08 EFLAGS: 00000206
[25070.248030] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[25070.248031] RDX: 00000000000031b0 RSI: ffff88020d7a07f0 RDI: ffff88020d7a0000
[25070.248031] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[25070.248032] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[25070.248032] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[25070.248033] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[25070.248034] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25070.248034] CR2: 00007f71154f01e8 CR3: 00000001b0481000 CR4: 00000000001407e0
[25070.248035] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[25070.248035] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25070.248035] Stack:
[25070.248037] 0000000a00406040 000000010025d182 ffff880165dfffd8 ffff880165dfffd8
[25070.248038] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[25070.248040] ffff88020d7a0000 0000000000000000 ffff8802361ac630 ffff88018edd1b80
[25070.248040] Call Trace:
[25070.248041] <IRQ>
[25070.248043] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25070.248046] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[25070.248048] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[25070.248048] <EOI>
[25070.248050] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[25070.248052] [<ffffffff811dad63>] ? sync_inodes_sb+0x1d3/0x2a0
[25070.248054] [<ffffffff811dad52>] ? sync_inodes_sb+0x1c2/0x2a0
[25070.248056] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25070.248058] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25070.248059] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25070.248060] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25070.248062] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25070.248064] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25070.248065] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25070.248080] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[25070.392616] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25070.403930] irq event stamp: 113174
[25070.405459] hardirqs last enabled at (113173): [<ffffffff816eda20>] restore_args+0x0/0x30
[25070.407576] hardirqs last disabled at (113174): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25070.409799] softirqs last enabled at (113172): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25070.411953] softirqs last disabled at (113167): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25070.414053] CPU: 0 PID: 701 Comm: trinity-child0 Not tainted 3.10.0-rc7+ #32
[25070.419119] task: ffff880235f24a40 ti: ffff880218608000 task.ti: ffff880218608000
[25070.421195] RIP: 0010:[<ffffffff816f14ac>] [<ffffffff816f14ac>] add_preempt_count+0x3c/0xf0
[25070.423422] RSP: 0018:ffff880218609d38 EFLAGS: 00000213
[25070.425265] RAX: ffff880218609fd8 RBX: ffff880218608000 RCX: 000000000000b910
[25070.427315] RDX: 000000000000201f RSI: 0000000000000001 RDI: 0000000000000001
[25070.429340] RBP: ffff880218609d40 R08: 0000000000000000 R09: 0000000000000000
[25070.431355] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[25070.433347] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[25070.435305] FS: 00007f6072db9740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
[25070.437378] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25070.439137] CR2: 0000000000000008 CR3: 00000001a5868000 CR4: 00000000001407f0
[25070.441055] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[25070.442959] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25070.444853] Stack:
[25070.446144] ffffffff81c04640 ffff880218609d78 ffffffff81309c1f ffffffff81c04640
[25070.448124] 0000000031a04c45 0000000088c66b68 ffff88023c0323c0 ffff88023c0325d0
[25070.450099] ffff880218609d88 ffffffff81309b5f ffff880218609db0 ffffffff81312171
[25070.452063] Call Trace:
[25070.453384] [<ffffffff81309c1f>] delay_tsc+0x1f/0xe0
[25070.455000] [<ffffffff81309b5f>] __delay+0xf/0x20
[25070.456574] [<ffffffff81312171>] do_raw_spin_lock+0xe1/0x130
[25070.458252] [<ffffffff816ec980>] _raw_spin_lock+0x60/0x80
[25070.459904] [<ffffffff811dadb8>] ? sync_inodes_sb+0x228/0x2a0
[25070.461586] [<ffffffff811dadb8>] sync_inodes_sb+0x228/0x2a0
[25070.463255] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25070.464981] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25070.466616] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25070.468308] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25070.469979] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25070.471607] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25070.473179] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25070.474728] Code: 25 f0 b9 00 00 48 89 e5 53 89 fb 45 85 c0 75 57 8b b8 44 e0 ff ff 85 ff 0f 88 85 00 00 00 01 98 44 e0 ff ff 80 b8 44 e0 ff ff f4 <76> 40 e8 5d a8 c1 ff 85 c0 74 37 83 3d 42 2c 4a 01 00 75 2e 48
[25094.234554] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:354]
[25094.236292] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25094.247304] irq event stamp: 12162607
[25094.248770] hardirqs last enabled at (12162606): [<ffffffff816eda20>] restore_args+0x0/0x30
[25094.250836] hardirqs last disabled at (12162607): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25094.252983] softirqs last enabled at (12161816): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25094.255077] softirqs last disabled at (12161819): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25094.257124] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[25094.262081] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[25094.264133] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[25094.266253] RSP: 0018:ffff880244c03f08 EFLAGS: 00000202
[25094.268011] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[25094.269985] RDX: 00000000000031b0 RSI: ffff88020d7a07f0 RDI: ffff88020d7a0000
[25094.271969] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[25094.273936] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[25094.275879] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[25094.277805] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[25094.279828] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25094.281562] CR2: 0000000000000000 CR3: 00000001b0481000 CR4: 00000000001407e0
[25094.283440] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[25094.285329] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25094.287188] Stack:
[25094.288433] 0000000a00406040 000000010025dae5 ffff880165dfffd8 ffff880165dfffd8
[25094.290369] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[25094.292299] ffff88020d7a0000 0000000000000000 ffff8802361ac630 ffff880231d46e00
[25094.294223] Call Trace:
[25094.295504] <IRQ>

[25094.296897] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25094.298296] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[25094.299998] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[25094.301657] <EOI>

[25094.303033] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[25094.304534] [<ffffffff813121dd>] ? do_raw_spin_trylock+0x1d/0x50
[25094.306217] [<ffffffff816ec968>] _raw_spin_lock+0x48/0x80
[25094.307822] [<ffffffff811dad52>] ? sync_inodes_sb+0x1c2/0x2a0
[25094.309471] [<ffffffff811dad52>] sync_inodes_sb+0x1c2/0x2a0
[25094.311088] [<ffffffff816ea06f>] ? wait_for_completion+0xdf/0x110
[25094.312772] [<ffffffff8108cf3d>] ? get_parent_ip+0xd/0x50
[25094.314363] [<ffffffff811e0950>] ? generic_write_sync+0x70/0x70
[25094.316035] [<ffffffff811e0969>] sync_inodes_one_sb+0x19/0x20
[25094.317660] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25094.319256] [<ffffffff811e0bd5>] sys_sync+0x35/0x90
[25094.320770] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25094.322280] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74
[25118.221102] BUG: soft lockup - CPU#2 stuck for 23s! [trinity-child2:354]
[25118.222941] Modules linked in: 8021q garp snd_seq_dummy bnep fuse bridge stp rfcomm tun hidp nfnetlink scsi_transport_iscsi ipt_ULOG can_raw phonet af_rxrpc af_key nfc irda can_bcm bluetooth rose llc2 pppoe pppox ppp_generic slhc rfkill x25 atm rds netrom caif_socket ax25 caif crc_ccitt can af_802154 ipx p8023 p8022 appletalk psnap llc coretemp hwmon kvm_intel snd_hda_codec_realtek kvm crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
[25118.234444] irq event stamp: 14004875
[25118.236035] hardirqs last enabled at (14004874): [<ffffffff816eda20>] restore_args+0x0/0x30
[25118.238227] hardirqs last disabled at (14004875): [<ffffffff816f652a>] apic_timer_interrupt+0x6a/0x80
[25118.240511] softirqs last enabled at (14004102): [<ffffffff810542e4>] __do_softirq+0x194/0x440
[25118.242739] softirqs last disabled at (14004105): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25118.244897] CPU: 2 PID: 354 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #32
[25118.250066] task: ffff88020d7a0000 ti: ffff880165dfe000 task.ti: ffff880165dfe000
[25118.252172] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
[25118.254352] RSP: 0018:ffff880244c03f08 EFLAGS: 00000202
[25118.256186] RAX: ffff88020d7a0000 RBX: ffffffff816eda20 RCX: 0000000000000002
[25118.258219] RDX: 0000000000003330 RSI: ffff88020d7a07b8 RDI: ffff88020d7a0000
[25118.260223] RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
[25118.262217] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
[25118.264180] R13: ffffffff816f652f R14: ffff880244c03f70 R15: 0000000000000000
[25118.266121] FS: 00007f6a251c7740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
[25118.268180] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25118.269970] CR2: 0000000000000008 CR3: 00000001b0481000 CR4: 00000000001407e0
[25118.271920] DR0: 0000000001c0f000 DR1: 0000000000000000 DR2: 0000000000000000
[25118.273877] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[25118.275814] Stack:
[25118.277129] 0000000a00406040 000000010025e445 ffff880165dfffd8 ffff880165dfffd8
[25118.279108] ffff880165dfffd8 ffff88020d7a03f8 ffff880165dfffd8 ffffffff00000002
[25118.281090] ffff88020d7a0000 0000000000000000 0000000000000003 0000000000000000
[25118.283056] Call Trace:
[25118.284399] <IRQ>

[25118.285876] [<ffffffff8105474d>] irq_exit+0xcd/0xe0
[25118.287351] [<ffffffff816f73cb>] smp_apic_timer_interrupt+0x6b/0x9b
[25118.289144] [<ffffffff816f652f>] apic_timer_interrupt+0x6f/0x80
[25118.290897] <EOI>

[25118.292358] [<ffffffff816eda20>] ? retint_restore_args+0xe/0xe
[25118.293932] [<ffffffff816ecdc7>] ? _raw_spin_unlock_irqrestore+0x67/0x80
[25118.295768] [<ffffffff810869f4>] __wake_up+0x44/0x50
[25118.297420] [<ffffffffa009923f>] xlog_cil_push+0x38f/0x3d0 [xfs]
[25118.299193] [<ffffffffa00999a8>] xlog_cil_force_lsn+0x1a8/0x1d0 [xfs]
[25118.300991] [<ffffffff816eaeae>] ? __schedule+0x46e/0xa40
[25118.302666] [<ffffffff811e0ad0>] ? do_fsync+0x80/0x80
[25118.304314] [<ffffffffa0097b21>] _xfs_log_force+0x61/0x290 [xfs]
[25118.306053] [<ffffffff816f15d1>] ? sub_preempt_count+0x71/0x100
[25118.307778] [<ffffffff811e0ad0>] ? do_fsync+0x80/0x80
[25118.309419] [<ffffffffa0097d76>] xfs_log_force+0x26/0x170 [xfs]
[25118.311161] [<ffffffffa002b4dd>] xfs_fs_sync_fs+0x2d/0x50 [xfs]
[25118.312887] [<ffffffff811e0af0>] sync_fs_one_sb+0x20/0x30
[25118.314546] [<ffffffff811b1272>] iterate_supers+0xb2/0x110
[25118.316213] [<ffffffff811e0bf5>] sys_sync+0x55/0x90
[25118.317804] [<ffffffff816f5ad4>] tracesys+0xdd/0xe2
[25118.319388] Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 07 36 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74

2013-06-24 02:00:32

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:

> Could you please do the following:
>
> 1. # cd /sys/kernel/debug/tracing
> # echo 0 >> options/function-trace
> # echo preemptirqsoff >> current_tracer

dammit.

WARNING: at include/linux/list.h:385 rb_head_page_deactivate.isra.39+0x61/0x80()
check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
Modules linked in: coretemp hwmon snd_hda_codec_realtek kvm_intel kvm snd_hda_codec_hdmi crc32c_intel ghash_clmulni_intel microcode snd_hda_intel snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc ptp snd_timer snd pps_core soundcore xfs libcrc32c
CPU: 1 PID: 474 Comm: bash Not tainted 3.10.0-rc7+ #33
ffffffff81a1762e ffff88023c659ce0 ffffffff816e467d ffff88023c659d18
ffffffff8104a0c1 ffff88023b8a1a08 ffff8802434b9040 ffff8802434b8fe8
0000000000000282 ffff88023109a620 ffff88023c659d78 ffffffff8104a12c
Call Trace:
[<ffffffff816e467d>] dump_stack+0x19/0x1b
[<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
[<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
[<ffffffff81113353>] ring_buffer_reset_cpu+0xa3/0x250
[<ffffffff81111f4a>] ? ring_buffer_time_stamp+0x1a/0x30
[<ffffffff81119d1a>] tracing_reset_online_cpus+0x5a/0xb0
[<ffffffff8111ee13>] tracing_set_tracer+0x223/0x310
[<ffffffff8111f16d>] tracing_set_trace_write+0xad/0xf0
[<ffffffff811ad6bb>] ? vfs_write+0x1bb/0x1f0
[<ffffffff812b9ca1>] ? security_file_permission+0x21/0xa0
[<ffffffff811ad5c0>] vfs_write+0xc0/0x1f0
[<ffffffff811cc487>] ? fget_light+0x387/0x4f0
[<ffffffff811adfcc>] SyS_write+0x4c/0xa0
[<ffffffff816f5d94>] tracesys+0xdd/0xe2
---[ end trace d4694abee6ec3f10 ]---


hopefully despite that it'll actually function as intended.

Dave

2013-06-24 14:45:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/23, Dave Jones wrote:
>
> On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
>
> > Could you please do the following:
> >
> > 1. # cd /sys/kernel/debug/tracing
> > # echo 0 >> options/function-trace
> > # echo preemptirqsoff >> current_tracer
>
> dammit.
>
> WARNING: at include/linux/list.h:385 rb_head_page_deactivate.isra.39+0x61/0x80()

Hmmm. Which kernel do you use?

380 #define list_for_each(pos, head) \
381 for (pos = (head)->next; pos != (head); pos = pos->next)
382
383 /**
384 * __list_for_each - iterate over a list
385 * @pos: the &struct list_head to use as a loop cursor.
386 * @head: the head for your list.
387 *
388 * This variant doesn't differ from list_for_each() any more.
389 * We don't do prefetching in either case.
390 */
391 #define __list_for_each(pos, head) \
392 for (pos = (head)->next; pos != (head); pos = pos->next)
393
394 /**
395 * list_for_each_prev - iterate over a list backwards
396 * @pos: the &struct list_head to use as a loop cursor.
397 * @head: the head for your list.
398 */
399 #define list_for_each_prev(pos, head) \
400 for (pos = (head)->prev; pos != (head); pos = pos->prev)

On 9e895ace5d8 (Linux 3.10-rc7).

> check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).

Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...

> [<ffffffff816e467d>] dump_stack+0x19/0x1b
> [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80

How? rb_list_head_clear() just modifies list->next directly.

> hopefully despite that it'll actually function as intended.

Yes ;)

Oleg.

2013-06-24 14:52:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, 2013-06-24 at 16:39 +0200, Oleg Nesterov wrote:
> On 06/23, Dave Jones wrote:
> >
> > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
> >
> > > Could you please do the following:
> > >
> > > 1. # cd /sys/kernel/debug/tracing
> > > # echo 0 >> options/function-trace
> > > # echo preemptirqsoff >> current_tracer
> >
> > dammit.
> >
> > WARNING: at include/linux/list.h:385 rb_head_page_deactivate.isra.39+0x61/0x80()
>
> Hmmm. Which kernel do you use?
>
> 380 #define list_for_each(pos, head) \
> 381 for (pos = (head)->next; pos != (head); pos = pos->next)
> 382
> 383 /**
> 384 * __list_for_each - iterate over a list
> 385 * @pos: the &struct list_head to use as a loop cursor.
> 386 * @head: the head for your list.
> 387 *
> 388 * This variant doesn't differ from list_for_each() any more.
> 389 * We don't do prefetching in either case.
> 390 */
> 391 #define __list_for_each(pos, head) \
> 392 for (pos = (head)->next; pos != (head); pos = pos->next)
> 393
> 394 /**
> 395 * list_for_each_prev - iterate over a list backwards
> 396 * @pos: the &struct list_head to use as a loop cursor.
> 397 * @head: the head for your list.
> 398 */
> 399 #define list_for_each_prev(pos, head) \
> 400 for (pos = (head)->prev; pos != (head); pos = pos->prev)
>
> On 9e895ace5d8 (Linux 3.10-rc7).

Right, and on 3.10-rc6:

382
383 /**
384 * __list_for_each - iterate over a list
385 * @pos: the &struct list_head to use as a loop cursor.
386 * @head: the head for your list.
387 *
388 * This variant doesn't differ from list_for_each() any more.
389 * We don't do prefetching in either case.
390 */
391 #define __list_for_each(pos, head) \
392 for (pos = (head)->next; pos != (head); pos = pos->next)
393


>
> > check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
>
> Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...
>
> > [<ffffffff816e467d>] dump_stack+0x19/0x1b
> > [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> > [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
>
> How? rb_list_head_clear() just modifies list->next directly.
>
> > hopefully despite that it'll actually function as intended.
>
> Yes ;)

I'm curious to what happened.

-- Steve

2013-06-24 15:58:55

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
> > [11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child2:14482]
> > [11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp bnep rfcomm can_raw ipt_ULOG can_bcm nfnetlink af_rxrpc llc2 rose caif_socket caif can netrom appletalk af_802154 scsi_transport_iscsi nfc pppoe pppox ppp_generic slhc ipx p8023 psnap p8022 llc ax25 irda crc_ccitt af_key bluetooth rfkill x25 rds atm phonet coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi microcode snd_hda_intel snd_hda_codec pcspkr snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc ptp snd_timer pps_core snd soundcore xfs libcrc32c
> > [11054.905490] irq event stamp: 3857095
> > [11054.905926] hardirqs last enabled at (3857094): [<ffffffff816ed9a0>] restore_args+0x0/0x30
> > [11054.906945] hardirqs last disabled at (3857095): [<ffffffff816f64aa>] apic_timer_interrupt+0x6a/0x80
> > [11054.908054] softirqs last enabled at (3856322): [<ffffffff810542e4>] __do_softirq+0x194/0x440
> > [11054.909102] softirqs last disabled at (3856325): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
> > [11054.910088] CPU: 2 PID: 14482 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #31
> > [11054.912900] task: ffff8801ae44ca40 ti: ffff88021fe60000 task.ti: ffff88021fe60000
> > [11054.913800] RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
>
> OK, __do_softirq() again. But this doesn't necessarily mean it
> is the offender.
>
> Just in case, did you change /proc/sys/kernel/watchdog_thresh ?
> This times the numbers look different.
>
> Could you please do the following:
>
> 1. # cd /sys/kernel/debug/tracing
> # echo 0 >> options/function-trace
> # echo preemptirqsoff >> current_tracer
>
> 2. reproduce the lockup again
>
> 3. show the result of
> # cat trace

Not sure this is helpful, but..

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
# --------------------------------------------------------------------
# latency: 165015310 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: trinity-child1-3173 (uid:1000 nice:19 policy:0 rt_prio:0)
# -----------------
# => started at: vprintk_emit
# => ended at: vprintk_emit
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
trinity--3173 1dNh1 0us!: console_unlock <-vprintk_emit
trinity--3173 1dNh1 165015310us : console_unlock <-vprintk_emit
trinity--3173 1dNh1 165015311us+: stop_critical_timings <-vprintk_emit
trinity--3173 1dNh1 165015315us : <stack trace>
=> console_unlock
=> vprintk_emit
=> printk
=> watchdog_timer_fn
=> __run_hrtimer
=> hrtimer_interrupt
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> _raw_spin_lock
=> sync_inodes_sb
=> sync_inodes_one_sb
=> iterate_supers
=> sys_sync
=> tracesys

2013-06-24 16:01:44

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote:

> > > check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
> >
> > Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...
> >
> > > [<ffffffff816e467d>] dump_stack+0x19/0x1b
> > > [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > > [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> > > [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
> >
> > How? rb_list_head_clear() just modifies list->next directly.
> >
> > > hopefully despite that it'll actually function as intended.
> >
> > Yes ;)
>
> I'm curious to what happened.

Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
It's doing the same prev->next next->prev checking as list_add and friends.
I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
go into 3.11
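
A rough sketch of the kind of check it does on every hop of a walk (the names
and exact layout here are made up for illustration, not the real patch; kernel
context with <linux/list.h> and <linux/bug.h> assumed):

static inline void check_list_nodes(struct list_head *prev,
				    struct list_head *next)
{
	/* same invariant that the list_add()/list_del() debug checks enforce */
	WARN(next->prev != prev,
	     "check_list_nodes corruption. next->prev should be prev (%p), "
	     "but was %p. (next=%p).\n", prev, next->prev, next);
}

#define checked_list_for_each(pos, head)				  \
	for (check_list_nodes((head), (head)->next), pos = (head)->next; \
	     pos != (head);						  \
	     check_list_nodes(pos, pos->next), pos = pos->next)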

Dave

2013-06-24 16:24:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, 2013-06-24 at 12:00 -0400, Dave Jones wrote:
> On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote:
>
> > > > check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
> > >
> > > Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...
> > >
> > > > [<ffffffff816e467d>] dump_stack+0x19/0x1b
> > > > [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > > > [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> > > > [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
> > >
> > > How? rb_list_head_clear() just modifies list->next directly.
> > >
> > > > hopefully despite that it'll actually function as intended.
> > >
> > > Yes ;)
> >
> > I'm curious to what happened.
>
> Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> It's doing the same prev->next next->prev checking as list_add and friends.
> I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> go into 3.11

OK, and you may need to make an exception for the ring buffer. To do a
lockless swap out of the reader page for one of the pages in the buffer,
it uses the 2 LSB as flags. Notice the "next=ffff880243288001", that "1"
is a flag that states the next page is the "header" page (next to be
read). We use cmpxchg to update the pages to handle races between the
reader and writer.

In fact, the above code is done on reset of the ring buffer, where we
clear those bits:

list_for_each(hd, cpu_buffer->pages)
rb_list_head_clear(hd);

The function that was flagged was the one going through and clearing
the bits.

It seems that your new tool is failing due to this trick.
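
To make the trick concrete, the clearing is roughly this (a sketch from
memory, constant name approximate; the two low bits are the flag mask):

#define RB_FLAG_MASK	3UL

static void rb_list_head_clear(struct list_head *list)
{
	unsigned long val = (unsigned long)list->next;

	/* strip the flag bits stored in ->next, leaving a plain list pointer */
	list->next = (struct list_head *)(val & ~RB_FLAG_MASK);
}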

-- Steve

2013-06-24 16:41:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/24, Dave Jones wrote:
>
> On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote:
>
> > > > check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
> > >
> > > Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...
> > >
> > > > [<ffffffff816e467d>] dump_stack+0x19/0x1b
> > > > [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > > > [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> > > > [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
> > >
> > > How? rb_list_head_clear() just modifies list->next directly.
> > >
> > > > hopefully despite that it'll actually function as intended.
> > >
> > > Yes ;)
> >
> > I'm curious to what happened.
>
> Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> It's doing the same prev->next next->prev checking as list_add and friends.
> I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> go into 3.11

So you used the patched kernel when you hit the lockups ?

What else was changed?

Oleg.

2013-06-24 16:49:37

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 06:37:08PM +0200, Oleg Nesterov wrote:
> On 06/24, Dave Jones wrote:
> >
> > On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote:
> >
> > > > > check_list_nodes corruption. next->prev should be prev (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001).
> > > >
> > > > Can't find "check_list_nodes" in lib/list_debug.c or elsewhere...
> > > >
> > > > > [<ffffffff816e467d>] dump_stack+0x19/0x1b
> > > > > [<ffffffff8104a0c1>] warn_slowpath_common+0x61/0x80
> > > > > [<ffffffff8104a12c>] warn_slowpath_fmt+0x4c/0x50
> > > > > [<ffffffff81112c61>] rb_head_page_deactivate.isra.39+0x61/0x80
> > > >
> > > > How? rb_list_head_clear() just modifies list->next directly.
> > > >
> > > > > hopefully despite that it'll actually function as intended.
> > > >
> > > > Yes ;)
> > >
> > > I'm curious to what happened.
> >
> > Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> > It's doing the same prev->next next->prev checking as list_add and friends.
> > I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> > go into 3.11
>
> So you used the patched kernel when you hit the lockups ?
>
> What else was changed?

I hit the lockups on the vanilla tree too.

Dave



2013-06-24 16:51:57

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote:

> > Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> > It's doing the same prev->next next->prev checking as list_add and friends.
> > I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> > go into 3.11
>
> OK, and you may need to make an exception for the ring buffer. To do a
> lockless swap out of the reader page for one of the pages in the buffer,
> it uses the 2 LSB as flags. Notice the "next=ffff880243288001", that "1"
> is a flag that states the next page is the "header" page (next to be
> read). We use cmpxchg to update the pages to handle races between the
> reader and writer.

I just had a plumber come visit to replace my toilet.
I think even he would say "dude, gross" about that hack.

Dave

2013-06-24 17:04:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, 2013-06-24 at 12:51 -0400, Dave Jones wrote:
> On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote:
>
> > > Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> > > It's doing the same prev->next next->prev checking as list_add and friends.
> > > I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> > > go into 3.11
> >
> > OK, and you may need to make an exception for the ring buffer. To do a
> > lockless swap out of the reader page for one of the pages in the buffer,
> > it uses the 2 LSB as flags. Notice the "next=ffff880243288001", that "1"
> > is a flag that states the next page is the "header" page (next to be
> > read). We use cmpxchg to update the pages to handle races between the
> > reader and writer.
>
> I just had a plumber come visit to replace my toilet.
> I think even he would say "dude, gross" about that hack.
>

Wow, that hack made you so sick you needed to replace your toilet?

Note, the idea of using the 2 LSB bits of pointers came from -rt. Where
we do the same with the rt_mutex owner.

-- Steve

2013-06-24 17:40:40

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/24, Dave Jones wrote:
>
> On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
> >
> > Could you please do the following:
> >
> > 1. # cd /sys/kernel/debug/tracing
> > # echo 0 >> options/function-trace
> > # echo preemptirqsoff >> current_tracer
> >
> > 2. reproduce the lockup again
> >
> > 3. show the result of
> > # cat trace
>
> Not sure this is helpful, but..

This makes me think that something is seriously broken.

Or I do not understand this stuff at all. Quite possible too.
Steven, could you please help?

> # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
> # --------------------------------------------------------------------
> # latency: 165015310 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)

OK, 165015310/1000000 = 165, nice.

> # -----------------
> # | task: trinity-child1-3173 (uid:1000 nice:19 policy:0 rt_prio:0)
> # -----------------
> # => started at: vprintk_emit
> # => ended at: vprintk_emit
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / delay
> # cmd pid ||||| time | caller
> # \ / ||||| \ | /
> trinity--3173 1dNh1 0us!: console_unlock <-vprintk_emit
> trinity--3173 1dNh1 165015310us : console_unlock <-vprintk_emit
> trinity--3173 1dNh1 165015311us+: stop_critical_timings <-vprintk_emit
> trinity--3173 1dNh1 165015315us : <stack trace>
> => console_unlock
> => vprintk_emit
> => printk
> => watchdog_timer_fn

But this is already called in the non-preemptible context, how can
'started at' blame vprintk_emit?

> => __run_hrtimer
> => hrtimer_interrupt
> => smp_apic_timer_interrupt
> => apic_timer_interrupt
> => _raw_spin_lock

This is where start_critical_timing() should be called?

Or by TRACE_IRQS_OFF in apic_timer_interrupt...

> => sync_inodes_sb
> => sync_inodes_one_sb
> => iterate_supers
> => sys_sync
> => tracesys

Also. watchdog_timer_fn() calls printk() only if it detects the
lockup, so I assume you hit another one?

Oleg.

2013-06-24 17:44:48

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 07:35:10PM +0200, Oleg Nesterov wrote:

> > Not sure this is helpful, but..
>
> This makes me think that something is seriously broken.
>
> Or I do not understand this stuff at all. Quite possible too.
> Steven, could you please help?
>
> But this is already called in the non-preemptible context, how can
> 'started at' blame vprintk_emit?
>

I'll do another run on just plain rc7 again, just to rule out my debug patches.

Dave

2013-06-24 17:53:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, 2013-06-24 at 19:35 +0200, Oleg Nesterov wrote:
> On 06/24, Dave Jones wrote:
> >
> > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
> > >
> > > Could you please do the following:
> > >
> > > 1. # cd /sys/kernel/debug/tracing
> > > # echo 0 >> options/function-trace
> > > # echo preemptirqsoff >> current_tracer
> > >
> > > 2. reproduce the lockup again
> > >
> > > 3. show the result of
> > > # cat trace
> >
> > Not sure this is helpful, but..
>
> This makes me think that something is seriously broken.
>
> Or I do not understand this stuff at all. Quite possible too.
> Steven, could you please help?
>
> > # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
> > # --------------------------------------------------------------------
> > # latency: 165015310 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
>
> OK, 165015310/1000000 = 165, nice.

9600 baud modem serial console?

>
> > # -----------------
> > # | task: trinity-child1-3173 (uid:1000 nice:19 policy:0 rt_prio:0)
> > # -----------------
> > # => started at: vprintk_emit
> > # => ended at: vprintk_emit
> > #
> > #
> > # _------=> CPU#
> > # / _-----=> irqs-off
> > # | / _----=> need-resched
> > # || / _---=> hardirq/softirq
> > # ||| / _--=> preempt-depth
> > # |||| / delay
> > # cmd pid ||||| time | caller
> > # \ / ||||| \ | /
> > trinity--3173 1dNh1 0us!: console_unlock <-vprintk_emit
> > trinity--3173 1dNh1 165015310us : console_unlock <-vprintk_emit
> > trinity--3173 1dNh1 165015311us+: stop_critical_timings <-vprintk_emit
> > trinity--3173 1dNh1 165015315us : <stack trace>
> > => console_unlock
> > => vprintk_emit
> > => printk
> > => watchdog_timer_fn
>
> But this is already called in the non-preemptible context, how can
> 'started at' blame vprintk_emit?

Well, it looks to really have started with console_unlock() not
vprintk_emit.

>
> > => __run_hrtimer
> > => hrtimer_interrupt
> > => smp_apic_timer_interrupt
> > => apic_timer_interrupt
> > => _raw_spin_lock
>
> This is where start_critical_timing() should be called?
>
> Or by TRACE_IRQS_OFF in apic_timer_interrupt...

Also, which _raw_spin_lock is that? Unless the interrupt triggered at the
start of the completion spin lock (before it disabled interrupts), it
could have happened while spinning on inode_sb_list_lock?

But you are correct, the critical timing should have started with the
entering of smp_apic_timer_interrupt. Did anything re-enable interrupts?


>
> > => sync_inodes_sb
> > => sync_inodes_one_sb
> > => iterate_supers
> > => sys_sync
> > => tracesys
>
> Also. watchdog_timer_fn() calls printk() only if it detects the
> lockup, so I assume you hit another one?

Probably.

-- Steve

2013-06-24 18:01:21

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 01:53:11PM -0400, Steven Rostedt wrote:

> > Also. watchdog_timer_fn() calls printk() only if it detects the
> > lockup, so I assume you hit another one?
>
> Probably.

Yeah, unfortunately it happened while I was travelling home to the box,
so I couldn't stop it after the first instance.

Hopefully better traces now that I'm in front of it.

Dave

2013-06-25 15:36:06

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

Took a lot longer to trigger this time. (13 hours of runtime).

This trace may still not be from the first lockup, as a flood of
them happened at the same time.


# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
# --------------------------------------------------------------------
# latency: 389877255 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: trinity-main-9252 (uid:1000 nice:19 policy:0 rt_prio:0)
# -----------------
# => started at: vprintk_emit
# => ended at: vprintk_emit
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
trinity--9252 1dNh1 0us!: console_unlock <-vprintk_emit
trinity--9252 1dNh1 389877255us : console_unlock <-vprintk_emit
trinity--9252 1dNh1 389877255us+: stop_critical_timings <-vprintk_emit
trinity--9252 1dNh1 389877261us : <stack trace>
=> console_unlock
=> vprintk_emit
=> printk
=> rcu_check_callbacks
=> update_process_times
=> tick_sched_handle.isra.16
=> tick_sched_timer
=> __run_hrtimer
=> hrtimer_interrupt
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> _raw_spin_lock
=> evict
=> iput
=> dput
=> proc_flush_task
=> release_task
=> wait_consider_task
=> do_wait
=> SyS_wait4
=> tracesys

2013-06-25 16:23:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 11:35 -0400, Dave Jones wrote:
> Took a lot longer to trigger this time. (13 hours of runtime).
>
> This trace may still not be from the first lockup, as a flood of
> them happened at the same time.
>
>
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
> # --------------------------------------------------------------------
> # latency: 389877255 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> # -----------------
> # | task: trinity-main-9252 (uid:1000 nice:19 policy:0 rt_prio:0)
> # -----------------
> # => started at: vprintk_emit
> # => ended at: vprintk_emit
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / delay
> # cmd pid ||||| time | caller
> # \ / ||||| \ | /
> trinity--9252 1dNh1 0us!: console_unlock <-vprintk_emit
> trinity--9252 1dNh1 389877255us : console_unlock <-vprintk_emit
> trinity--9252 1dNh1 389877255us+: stop_critical_timings <-vprintk_emit
> trinity--9252 1dNh1 389877261us : <stack trace>
> => console_unlock
> => vprintk_emit
> => printk

This is the same as the last one, with no new info to why it started the
tracing at console_unlock :-/

Now, what we can try to do as well, is to add a trigger to disable
tracing, which should (I need to check the code) stop tracing on printk.
To do so:

# echo printk:traceoff > /sys/kernel/debug/tracing/set_ftrace_filter

This will add a trigger to the printk function that when called, will
disable tracing. If it is hit before you get your trace, you can just
re-enable tracing with:

# echo 1 > /sys/kernel/debug/tracing/tracing_on

Hmm, no it needs a fix to make this work. I applied a patch below that
should do this correctly (and will put this into my 3.11 queue).

If you run the test again with this change and with the above filter, it
should stop the trace before overwriting the first dump, as it should
ignore the printk output.

-- Steve


> => rcu_check_callbacks
> => update_process_times
> => tick_sched_handle.isra.16
> => tick_sched_timer
> => __run_hrtimer
> => hrtimer_interrupt
> => smp_apic_timer_interrupt
> => apic_timer_interrupt
> => _raw_spin_lock
> => evict
> => iput
> => dput
> => proc_flush_task
> => release_task
> => wait_consider_task
> => do_wait
> => SyS_wait4
> => tracesys
>

diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index b19d065..2aefbee 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -373,7 +373,7 @@ start_critical_timing(unsigned long ip, unsigned long parent_ip)
 	struct trace_array_cpu *data;
 	unsigned long flags;
 
-	if (likely(!tracer_enabled))
+	if (!tracer_enabled || !tracing_is_enabled())
 		return;
 
 	cpu = raw_smp_processor_id();
@@ -416,7 +416,7 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 	else
 		return;
 
-	if (!tracer_enabled)
+	if (!tracer_enabled || !tracing_is_enabled())
 		return;
 
 	data = per_cpu_ptr(tr->trace_buffer.data, cpu);

2013-06-25 16:56:14

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jun 24, 2013 at 01:04:36PM -0400, Steven Rostedt wrote:
> On Mon, 2013-06-24 at 12:51 -0400, Dave Jones wrote:
> > On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote:
> >
> > > > Ah, this is the first victim of my new 'check sanity of nodes during list walks' patch.
> > > > It's doing the same prev->next next->prev checking as list_add and friends.
> > > > I'm looking at getting it into shape for a 3.12 merge after some other preparatory patches
> > > > go into 3.11
> > >
> > > OK, and you may need to make an exception for the ring buffer. To do a
> > > lockless swap out of the reader page for one of the pages in the buffer,
> > > it uses the 2 LSB as flags. Notice the "next=ffff880243288001", that "1"
> > > is a flag that states the next page is the "header" page (next to be
> > > read). We use cmpxchg to update the pages to handle races between the
> > > reader and writer.
> >
> > I just had a plumber come visit to replace my toilet.
> > I think even he would say "dude, gross" about that hack.
>
> Wow, that hack made you so sick you needed to replace your toilet?
>
> Note, the idea of using the 2 LSB bits of pointers came from -rt. Where
> we do the same with the rt_mutex owner.

While I've been spinning wheels trying to reproduce that softlockup bug,
On another machine I've been refining my list-walk debug patch.
I added an ugly "ok, the ringbuffer is playing games with lower two bits" special case.

But what the hell is going on here ?

next->prev should be prev (ffff88023c6cdd18), but was 00ffff88023c6cdd. (next=ffff880243288001).

(trace comes from the same ringbuffer code)

Dave

2013-06-25 17:21:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote:

> While I've been spinning wheels trying to reproduce that softlockup bug,
> On another machine I've been refining my list-walk debug patch.
> I added an ugly "ok, the ringbuffer is playing games with lower two bits" special case.
>
> But what the hell is going on here ?
>
> next->prev should be prev (ffff88023c6cdd18), but was 00ffff88023c6cdd. (next=ffff880243288001).
>
> (trace comes from the same ringbuffer code)

What's the above saying? ffff880243288000->prev == 00ffff88023c6cdd but
it should have been ffff88023c6cdd18? That is: ffff88023c6cdd18->next ==
ffff880243288001?

Not sure how that would mess up. The ring-buffer code has lots of
integrity checks to make sure nothing like this breaks.

-- Steve

2013-06-25 17:23:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote:

> Not sure how that would mess up. The ring-buffer code has lots of
> integrity checks to make sure nothing like this breaks.

See rb_check_pages() and rb_check_list().

-- Steve

2013-06-25 17:26:31

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jun 25, 2013 at 01:21:30PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote:
>
> > While I've been spinning wheels trying to reproduce that softlockup bug,
> > On another machine I've been refining my list-walk debug patch.
> > I added an ugly "ok, the ringbuffer is playing games with lower two bits" special case.
> >
> > But what the hell is going on here ?
> >
> > next->prev should be prev (ffff88023c6cdd18), but was 00ffff88023c6cdd. (next=ffff880243288001).
> >
> > (trace comes from the same ringbuffer code)
>
> What's the above saying? ffff880243288000->prev == 00ffff88023c6cdd but
> it should have been ffff88023c6cdd18? That is: ffff88023c6cdd18->next ==
> ffff880243288001?

It's saying something has done >>8 on a pointer, and stuck it in a list head.
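
A quick standalone illustration with the two values from the warning (a
userspace toy, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long prev = 0xffff88023c6cdd18UL;	/* expected ->prev */
	unsigned long seen = 0x00ffff88023c6cddUL;	/* value actually read */

	/* the bogus value is the good pointer shifted right by one byte */
	printf("%s\n", (prev >> 8) == seen ? "shifted by one byte" : "no match");
	return 0;
}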

> Not sure how that would mess up. The ring-buffer code has lots of
> integrity checks to make sure nothing like this breaks.

My integrity checks can beat up your integrity checks.

Dave

2013-06-25 17:29:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote:
> On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote:
>
> > While I've been spinning wheels trying to reproduce that softlockup bug,
> > On another machine I've been refining my list-walk debug patch.
> > I added an ugly "ok, the ringbuffer is playing games with lower two bits" special case.
> >
> > But what the hell is going on here ?
> >
> > next->prev should be prev (ffff88023c6cdd18), but was 00ffff88023c6cdd. (next=ffff880243288001).

Ah you didn't handle the bit set case. I just noticed "00" in
00ffff88023c6cdd. To test this, you really need to do a "next & ~3", to
clear the pointer.

Perhaps it's best to have just a "raw_list_for_each" that doesn't do any
check, and have the ring buffer use that instead. The
rb_head_page_deactivate() is usually followed by an integrity check
anyway.
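
Something as dumb as this would do (a sketch; it's just the classic
list_for_each() body without whatever checking your patch bolts on):

#define raw_list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)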

-- Steve

2013-06-25 17:31:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 13:26 -0400, Dave Jones wrote:

> > Not sure how that would mess up. The ring-buffer code has lots of
> > integrity checks to make sure nothing like this breaks.
>
> My integrity checks can beat up your integrity checks.

I don't know. It looks like my code is beating up yours ;-)

-- Steve

2013-06-25 17:32:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, 2013-06-25 at 13:26 -0400, Dave Jones wrote:
>
> > What's the above saying? ffff880243288000->prev == 00ffff88023c6cdd but
> > it should have been ffff88023c6cdd18? That is: ffff88023c6cdd18->next ==
> > ffff880243288001?
>
> It's saying something has done >>8 on a pointer, and stuck it in a list head.
>

Yes, that's from the "01" in ffff880243288001. The flags held in the
pointer are never used as an address. That's always encapsulated with
rb_list_head(), which clears the flags and returns the actual address.
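
Roughly this (from memory, a sketch rather than the exact source):

static struct list_head *rb_list_head(struct list_head *list)
{
	unsigned long val = (unsigned long)list;

	/* mask off the two low flag bits to recover the real pointer */
	return (struct list_head *)(val & ~3UL);
}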

-- Steve

2013-06-25 17:35:09

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jun 25, 2013 at 01:29:54PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote:
> > On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote:
> >
> > > While I've been spinning wheels trying to reproduce that softlockup bug,
> > > On another machine I've been refining my list-walk debug patch.
> > > I added an ugly "ok, the ringbuffer is playing games with lower two bits" special case.
> > >
> > > But what the hell is going on here ?
> > >
> > > next->prev should be prev (ffff88023c6cdd18), but was 00ffff88023c6cdd. (next=ffff880243288001).
>
> Ah you didn't handle the bit set case. I just noticed "00" in
> 00ffff88023c6cdd. To test this, you really need to do a "next & ~3", to
> clear the pointer.
>
> Perhaps it's best to have just a "raw_list_for_each" that doesn't do any
> check, and have the ring buffer use that instead. The
> rb_head_page_deactivate() is usually followed by an integrity check
> anyway.

I think that's probably the best way forward. The ring buffer code does
so many weird things with list heads that it's almost its own ADT.

Dave

2013-06-26 05:24:11

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote:

> Now, what we can try to do as well, is to add a trigger to disable
> tracing, which should (I need to check the code) stop tracing on printk.
> To do so:
>
> # echo printk:traceoff > /sys/kernel/debug/tracing/set_ftrace_filter
>
> This will add a trigger to the printk function that when called, will
> disable tracing. If it is hit before you get your trace, you can just
> re-enable tracing with:
>
> # echo 1 > /sys/kernel/debug/tracing/tracing_on
>
> Hmm, no it needs a fix to make this work. I applied a patch below that
> should do this correctly (and will put this into my 3.11 queue).
>
> If you run the test again with this change and with the above filter, it
> should stop the trace before overwriting the first dump, as it should
> ignore the printk output.

I think something isn't right with this patch.
After 10 hours, I hit the bug again, but...

(01:21:28:root@binary:tracing)# cat trace
# tracer: preemptirqsoff
#
(01:21:30:root@binary:tracing)#

Dave

2013-06-26 05:48:26

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-25 at 11:35 -0400, Dave Jones wrote:
> > Took a lot longer to trigger this time. (13 hours of runtime).
> >
> > This trace may still not be from the first lockup, as a flood of
> > them happened at the same time.
> >
> >
> > # tracer: preemptirqsoff
> > #
> > # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
> > # --------------------------------------------------------------------
> > # latency: 389877255 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> > # -----------------
> > # | task: trinity-main-9252 (uid:1000 nice:19 policy:0 rt_prio:0)
> > # -----------------
> > # => started at: vprintk_emit
> > # => ended at: vprintk_emit
> > #
> > #
> > # _------=> CPU#
> > # / _-----=> irqs-off
> > # | / _----=> need-resched
> > # || / _---=> hardirq/softirq
> > # ||| / _--=> preempt-depth
> > # |||| / delay
> > # cmd pid ||||| time | caller
> > # \ / ||||| \ | /
> > trinity--9252 1dNh1 0us!: console_unlock <-vprintk_emit
> > trinity--9252 1dNh1 389877255us : console_unlock <-vprintk_emit
> > trinity--9252 1dNh1 389877255us+: stop_critical_timings <-vprintk_emit
> > trinity--9252 1dNh1 389877261us : <stack trace>
> > => console_unlock
> > => vprintk_emit
> > => printk
>
> This is the same as the last one, with no new info as to why it started the
> tracing at console_unlock :-/
>
> Now, what we can try to do as well, is to add a trigger to disable
> tracing, which should (I need to check the code) stop tracing on printk.
> To do so:
>
> # echo printk:traceoff > /sys/kernel/debug/tracing/set_ftrace_filter
>
> This will add a trigger to the printk function that when called, will
> disable tracing. If it is hit before you get your trace, you can just
> re-enable tracing with:
>
> # echo 1 > /sys/kernel/debug/tracing/tracing_on
>
> Hmm, no it needs a fix to make this work. I applied a patch below that
> should do this correctly (and will put this into my 3.11 queue).
>
> If you run the test again with this change and with the above filter, it
> should stop the trace before overwriting the first dump, as it should
> ignore the printk output.

More puzzling. Rebooted the machine, restarted the test, and hit it pretty
quickly.. Though different backtrace this time..

[ 1583.293902] BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:28]
[ 1583.293905] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:7]
[ 1583.293932] Modules linked in: tun hidp bnep nfnetlink rfcomm scsi_transport_iscsi ipt_ULOG can_raw af_rxrpc netrom nfc can_bcm can appletalk ipx p8023 af_key psnap irda p8022 rose caif_socket caif ax25 crc_ccitt llc2 af_802154 llc rds bluetooth phonet rfkill pppoe pppox atm ppp_generic slhc x25 coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode snd_hda_intel pcspkr snd_hda_codec snd_hwdep snd_seq snd_seq_device e1000e snd_pcm ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
[ 1583.293932] irq event stamp: 108950
[ 1583.293937] hardirqs last enabled at (108949): [<ffffffff816e9320>] restore_args+0x0/0x30
[ 1583.293940] hardirqs last disabled at (108950): [<ffffffff816f1dea>] apic_timer_interrupt+0x6a/0x80
[ 1583.293943] softirqs last enabled at (108948): [<ffffffff810541d4>] __do_softirq+0x194/0x440
[ 1583.293945] softirqs last disabled at (108943): [<ffffffff8105463d>] irq_exit+0xcd/0xe0
[ 1583.293947] CPU: 0 PID: 7 Comm: migration/0 Not tainted 3.10.0-rc7+ #5
[ 1583.293948] task: ffff880244190000 ti: ffff88024418a000 task.ti: ffff88024418a000
[ 1583.293952] RIP: 0010:[<ffffffff810dd856>] [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110
[ 1583.293953] RSP: 0018:ffff88024418bce8 EFLAGS: 00000293
[ 1583.293953] RAX: 0000000000000001 RBX: ffffffff816e9320 RCX: 0000000000000000
[ 1583.293954] RDX: ffff88024418bf00 RSI: ffffffff81084b2c RDI: ffff8801ecd51b88
[ 1583.293954] RBP: ffff88024418bd10 R08: 0000000000000001 R09: 0000000000000000
[ 1583.293955] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88024418bc58
[ 1583.293956] R13: ffff88024418bfd8 R14: ffff88024418a000 R15: 0000000000000046
[ 1583.293956] FS: 0000000000000000(0000) GS:ffff880245600000(0000) knlGS:0000000000000000
[ 1583.293957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1583.293958] CR2: 0000000006ff0158 CR3: 00000001ea0df000 CR4: 00000000001407f0
[ 1583.293958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1583.293959] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1583.293959] Stack:
[ 1583.293961] ffff8802457cd880 ffff8801ecd51ac0 ffff8801ecd51b88 ffffffff810dd7d0
[ 1583.293963] ffff88024418bfd8 ffff88024418bde0 ffffffff810dd6ed ffff88024418bfd8
[ 1583.293965] ffff8802457cd8d0 000000000000017f ffff88024418bd48 00000007810b463e
[ 1583.293965] Call Trace:
[ 1583.293967] [<ffffffff810dd7d0>] ? cpu_stopper_thread+0x180/0x180
[ 1583.293969] [<ffffffff810dd6ed>] cpu_stopper_thread+0x9d/0x180
[ 1583.293971] [<ffffffff816e86d5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[ 1583.293973] [<ffffffff810b75b5>] ? trace_hardirqs_on_caller+0x115/0x1e0
[ 1583.293976] [<ffffffff81084b2c>] smpboot_thread_fn+0x1ac/0x320
[ 1583.293978] [<ffffffff81084980>] ? lg_global_unlock+0xe0/0xe0
[ 1583.293981] [<ffffffff8107a88d>] kthread+0xed/0x100
[ 1583.293983] [<ffffffff816e597f>] ? wait_for_completion+0xdf/0x110
[ 1583.293985] [<ffffffff8107a7a0>] ? insert_kthread_work+0x80/0x80
[ 1583.293987] [<ffffffff816f10dc>] ret_from_fork+0x7c/0xb0
[ 1583.293989] [<ffffffff8107a7a0>] ? insert_kthread_work+0x80/0x80
[ 1583.294007] Code: f0 ff 4b 24 0f 94 c2 84 d2 44 89 e0 74 12 8b 43 20 8b 53 10 83 c0 01 89 53 24 89 43 20 44 89 e0 83 f8 04 74 20 f3 90 44 8b 63 20 <41> 39 c4 74 f0 41 83 fc 02 75 bf fa e8 49 66 fd ff eb c2 0f 1f

Same trace for all 4 cpus occurred.

'trace' file contained this...

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
# --------------------------------------------------------------------
# latency: 478 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: sshd-405 (uid:1000 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: fget_light
# => ended at: fget_light
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
sshd-405 3...1 0us!: rcu_lockdep_current_cpu_online <-fget_light
sshd-405 3...1 478us+: rcu_lockdep_current_cpu_online <-fget_light
sshd-405 3...1 481us+: trace_preempt_on <-fget_light
sshd-405 3...1 496us : <stack trace>
=> sub_preempt_count
=> rcu_lockdep_current_cpu_online
=> fget_light
=> do_select
=> core_sys_select
=> SyS_select
=> tracesys

which looks pretty unhelpful to me ?

Dave

2013-06-26 19:23:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On 06/25, Dave Jones wrote:
>
> Took a lot longer to trigger this time. (13 hours of runtime).

And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
time to hit the same bug ;)

Dave, I am not going to "deny the problem". We should investigate it
anyway. And yes, 8aac6270 is not as trivial as it looks.

But so far it is absolutely unclear how it can trigger such a problem,
and none of the traces _look_ as if it should be blamed.

Probably I should try to reproduce too...

Oleg.

2013-06-26 19:40:57

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> On 06/25, Dave Jones wrote:
> >
> > Took a lot longer to trigger this time. (13 hours of runtime).
>
> And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> time to hit the same bug ;)
>
> Dave, I am not going to "deny the problem". We should investigate it
> anyway. And yes, 8aac6270 is not as trivial as it looks.
>
> But so far it is absolutely unclear how it can trigger such a problem,
> and none of the traces _look_ as if it should be blamed.
>
> Probably I should try to reproduce too...

I've tried so many different kernels this last week that I've forgotten
just how long I left the build without 8aac6270 running, but I'm pretty
sure it included an overnight run last weekend. Given I have two identical
machines that can reproduce it, I'll leave one running for a few days
with that reverted just to see if it turns up.

(Though, I seem to be walking into other issues [like the perf overflow bug
I reported last night] before seeing the lockups reproduce, so I'll keep an eye on it).

Dave

2013-06-26 19:52:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, 2013-06-26 at 01:23 -0400, Dave Jones wrote:
> On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote:
>
> > Now, what we can try to do as well, is to add a trigger to disable
> > tracing, which should (I need to check the code) stop tracing on printk.
> > To do so:
> >
> > # echo printk:traceoff > /sys/kernel/debug/tracing/set_ftrace_filter
> >
> > This will add a trigger to the printk function that when called, will
> > disable tracing. If it is hit before you get your trace, you can just
> > re-enable tracing with:
> >
> > # echo 1 > /sys/kernel/debug/tracing/tracing_on
> >
> > Hmm, no it needs a fix to make this work. I applied a patch below that
> > should do this correctly (and will put this into my 3.11 queue).
> >
> > If you run the test again with this change and with the above filter, it
> > should stop the trace before overwriting the first dump, as it should
> > ignore the printk output.
>
> I think something isn't right with this patch.
> After 10 hours, I hit the bug again, but...
>
> (01:21:28:root@binary:tracing)# cat trace
> # tracer: preemptirqsoff
> #
> (01:21:30:root@binary:tracing)#
>

Did you apply the patch I added to my last email? It should have prevented
that from happening :-/

-- Steve

2013-06-26 20:00:41

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 26, 2013 at 03:52:15PM -0400, Steven Rostedt wrote:

> > > Hmm, no it needs a fix to make this work. I applied a patch below that
> > > should do this correctly (and will put this into my 3.11 queue).
> > >
> > > If you run the test again with this change and with the above filter, it
> > > should stop the trace before overwriting the first dump, as it should
> > > ignore the printk output.
> >
> > I think something isn't right with this patch.
> > After 10 hours, I hit the bug again, but...
> >
> > (01:21:28:root@binary:tracing)# cat trace
> > # tracer: preemptirqsoff
> > #
> > (01:21:30:root@binary:tracing)#
>
> Did you apply the patch I added to my last email? It should have prevented
> that from happening :-/

Yeah, that's what I meant by "this patch".
To reduce ambiguity, I mean the one below.. There wasn't another patch
that I missed right ?

Dave


diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index b19d065..2aefbee 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -373,7 +373,7 @@ start_critical_timing(unsigned long ip, unsigned long parent_ip)
 	struct trace_array_cpu *data;
 	unsigned long flags;
 
-	if (likely(!tracer_enabled))
+	if (!tracer_enabled || !tracing_is_enabled())
 		return;
 
 	cpu = raw_smp_processor_id();
@@ -416,7 +416,7 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 	else
 		return;
 
-	if (!tracer_enabled)
+	if (!tracer_enabled || !tracing_is_enabled())
 		return;
 
 	data = per_cpu_ptr(tr->trace_buffer.data, cpu);

2013-06-27 00:23:15

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> On 06/25, Dave Jones wrote:
> >
> > Took a lot longer to trigger this time. (13 hours of runtime).
>
> And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> time to hit the same bug ;)

Ok, that didn't take long. 4 hours in, and I hit it on rc7 with 8aac6270 reverted.
So that's the 2nd commit I've mistakenly blamed for this bug.

Crap. I'm going to have to redo the bisecting, and give it a whole day
at each step to be sure. That's going to take a while.

Anyone got any ideas better than a week of non-stop bisecting ?

What I've gathered so far:

- Only affects two machines I have (both Intel Quad core Haswell, one with SSD, one with hybrid SSD)
- One machine is XFS, the other EXT4.
- When the lockup occurs, it happens on all cores.
- It's nearly always a sync() call that triggers it looking like this..

irq event stamp: 8465043
hardirqs last enabled at (8465042): [<ffffffff816ebc60>] restore_args+0x0/0x30
hardirqs last disabled at (8465043): [<ffffffff816f476a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (8464292): [<ffffffff81054204>] __do_softirq+0x194/0x440
softirqs last disabled at (8464295): [<ffffffff8105466d>] irq_exit+0xcd/0xe0
RIP: 0010:[<ffffffff81054121>] [<ffffffff81054121>] __do_softirq+0xb1/0x440

Call Trace:
<IRQ>
[<ffffffff8105466d>] irq_exit+0xcd/0xe0
[<ffffffff816f560b>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff816f476f>] apic_timer_interrupt+0x6f/0x80
<EOI>
[<ffffffff816ebc60>] ? retint_restore_args+0xe/0xe
[<ffffffff810b9c56>] ? lock_acquire+0xa6/0x1f0
[<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
[<ffffffff816eaba0>] _raw_spin_lock+0x40/0x80
[<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
[<ffffffff811da892>] sync_inodes_sb+0x1c2/0x2a0
[<ffffffff816e8206>] ? wait_for_completion+0x36/0x110
[<ffffffff811e04f0>] ? generic_write_sync+0x70/0x70
[<ffffffff811e0509>] sync_inodes_one_sb+0x19/0x20
[<ffffffff811b0e62>] iterate_supers+0xb2/0x110
[<ffffffff811e0775>] sys_sync+0x35/0x90
[<ffffffff816f3d14>] tracesys+0xdd/0xe2


I'll work on trying to narrow down what trinity is doing. That might at least
make it easier to reproduce it in a shorter timeframe.

Dave

2013-06-27 01:07:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

Dave Jones <[email protected]> writes:

> On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> > On 06/25, Dave Jones wrote:
> > >
> > > Took a lot longer to trigger this time. (13 hours of runtime).
> >
> > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> > time to hit the same bug ;)
>
> Ok, that didn't take long. 4 hours in, and I hit it on rc7 with 8aac6270 reverted.
> So that's the 2nd commit I've mistakenly blamed for this bug.
>
> Crap. I'm going to have to redo the bisecting, and give it a whole day
> at each step to be sure. That's going to take a while.
>
> Anyone got any ideas better than a week of non-stop bisecting ?

Just based on the last trace and your observation that it seems to be
vfs/block layer related I am going to mildly suggest that Jens and Tejun
might have a clue. Tejun made a transformation of the threads used for
writeback from a custom thread pool to the generic mechanism. So it
seems worth asking the question could it have been in Jens block merge
of 4de13d7aa8f4d02f4dc99d4609575659f92b3c5a.

Eric

> What I've gathered so far:
>
> - Only affects two machines I have (both Intel Quad core Haswell, one with SSD, one with hybrid SSD)
> - One machine is XFS, the other EXT4.
> - When the lockup occurs, it happens on all cores.
> - It's nearly always a sync() call that triggers it looking like this..
>
> irq event stamp: 8465043
> hardirqs last enabled at (8465042): [<ffffffff816ebc60>] restore_args+0x0/0x30
> hardirqs last disabled at (8465043): [<ffffffff816f476a>] apic_timer_interrupt+0x6a/0x80
> softirqs last enabled at (8464292): [<ffffffff81054204>] __do_softirq+0x194/0x440
> softirqs last disabled at (8464295): [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> RIP: 0010:[<ffffffff81054121>] [<ffffffff81054121>] __do_softirq+0xb1/0x440
>
> Call Trace:
> <IRQ>
> [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> [<ffffffff816f560b>] smp_apic_timer_interrupt+0x6b/0x9b
> [<ffffffff816f476f>] apic_timer_interrupt+0x6f/0x80
> <EOI>
> [<ffffffff816ebc60>] ? retint_restore_args+0xe/0xe
> [<ffffffff810b9c56>] ? lock_acquire+0xa6/0x1f0
> [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff816eaba0>] _raw_spin_lock+0x40/0x80
> [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff811da892>] sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff816e8206>] ? wait_for_completion+0x36/0x110
> [<ffffffff811e04f0>] ? generic_write_sync+0x70/0x70
> [<ffffffff811e0509>] sync_inodes_one_sb+0x19/0x20
> [<ffffffff811b0e62>] iterate_supers+0xb2/0x110
> [<ffffffff811e0775>] sys_sync+0x35/0x90
> [<ffffffff816f3d14>] tracesys+0xdd/0xe2
>
>
> I'll work on trying to narrow down what trinity is doing. That might at least
> make it easier to reproduce it in a shorter timeframe.

2013-06-27 02:32:53

by Tejun Heo

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

Hello,

On Wed, Jun 26, 2013 at 06:06:45PM -0700, Eric W. Biederman wrote:
> Just based on the last trace and your observation that it seems to be
> vfs/block layer related I am going to mildly suggest that Jens and Tejun
> might have a clue. Tejun made a transformation of the threads used for
> writeback from a custom thread pool to the generic mechanism. So it
> seems worth asking the question could it have been in Jens block merge
> of 4de13d7aa8f4d02f4dc99d4609575659f92b3c5a.

If workqueue was suffering deadlock or starvation, it'd lock up in a
lot more boring places, unless it's somehow completely messing up blk
softirq completion logic.

Given the wide variety of places, it's likely something lower level.
Maybe something is raising an interrupt and then its softirq raises it
again right away so that it evades the screaming irq detection but
still live-locks the CPU? It'd definitely be interesting to watch the
interrupt and softirq counters. Something like the following may
help.

Thanks.

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 05039e3..21d4e82 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -261,6 +261,49 @@ static void watchdog_interrupt_count(void)
 static int watchdog_nmi_enable(unsigned int cpu);
 static void watchdog_nmi_disable(unsigned int cpu);
 
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel_stat.h>
+static void dump_irqs(void)
+{
+	static u64 hirq_sums[NR_IRQS];
+	static u64 sirq_sums[NR_SOFTIRQS];
+	int i, cpu;
+
+	for (i = 0; i < nr_irqs; i++) {
+		struct irq_desc *desc = irq_to_desc(i);
+		u64 sum = 0;
+
+		if (!desc)
+			continue;
+
+		for_each_online_cpu(cpu)
+			sum += kstat_irqs_cpu(i, cpu);
+		if (sum == hirq_sums[i])
+			continue;
+
+		printk("XXX HIRQ %2d %8s : sum_delta=%8llu on_this_cpu=%8u\n",
+		       i, desc->action ? desc->action->name : "",
+		       sum - hirq_sums[i],
+		       kstat_irqs_cpu(i, smp_processor_id()));
+		hirq_sums[i] = sum;
+	}
+
+	for (i = 0; i < NR_SOFTIRQS; i++) {
+		u64 sum = 0;
+
+		for_each_online_cpu(cpu)
+			sum += kstat_softirqs_cpu(i, cpu);
+		if (sum == sirq_sums[i])
+			continue;
+
+		printk("XXX SIRQ %2d %8s : sum_delta=%8llu on_this_cpu=%8u\n",
+		       i, softirq_to_name[i], sum - sirq_sums[i],
+		       kstat_softirqs_cpu(i, smp_processor_id()));
+		sirq_sums[i] = sum;
+	}
+}
+
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
@@ -301,6 +344,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 */
 	duration = is_softlockup(touch_ts);
 	if (unlikely(duration)) {
+		static int warned_on_cpu = -1;
+
 		/*
 		 * If a virtual machine is stopped by the host it can look to
 		 * the watchdog like a soft lockup, check to see if the host
@@ -309,10 +354,14 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		if (kvm_check_and_clear_guest_paused())
 			return HRTIMER_RESTART;
 
-		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true)
+		/* let's follow the first cpu */
+		if (warned_on_cpu >= 0 && warned_on_cpu != smp_processor_id())
 			return HRTIMER_RESTART;
 
+		/* only warn once */
+		//if (__this_cpu_read(soft_watchdog_warn) == true)
+		//	return HRTIMER_RESTART;
+
 		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
@@ -323,6 +372,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		else
 			dump_stack();
 
+		dump_irqs();
+
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
 		__this_cpu_write(soft_watchdog_warn, true);

2013-06-27 03:01:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, 2013-06-26 at 16:00 -0400, Dave Jones wrote:
> On Wed, Jun 26, 2013 at 03:52:15PM -0400, Steven Rostedt wrote:

> Yeah, that's what I meant by "this patch".
> To reduce ambiguity, I mean the one below.. There wasn't another patch
> that I missed right ?
>

No other patch, but I've found issues with the patch I sent you. This
one seems to pass my initial tests, and I'm working to get it into 3.11.
It is still a beta patch.

-- Steve

Index: linux-trace.git/kernel/trace/trace.c
===================================================================
--- linux-trace.git.orig/kernel/trace/trace.c
+++ linux-trace.git/kernel/trace/trace.c
@@ -217,7 +217,13 @@ cycle_t ftrace_now(int cpu)

int tracing_is_enabled(void)
{
- return tracing_is_on();
+ /*
+ * For quick access (irqsoff uses this in fast path), just
+ * return the mirror variable of the state of the ring buffer.
+ * It's a little racy, but we don't really care.
+ */
+ smp_rmb();
+ return !global_trace.buffer_disabled;
}

/*
@@ -330,6 +336,23 @@ unsigned long trace_flags = TRACE_ITER_P
TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE |
TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | TRACE_ITER_FUNCTION;

+void tracer_tracing_on(struct trace_array *tr)
+{
+ if (tr->trace_buffer.buffer)
+ ring_buffer_record_on(tr->trace_buffer.buffer);
+ /*
+ * This flag is looked at when buffers haven't been allocated
+ * yet, or by some tracers (like irqsoff), that just want to
+ * know if the ring buffer has been disabled, but it can handle
+ * races of where it gets disabled but we still do a record.
+ * As the check is in the fast path of the tracers, it is more
+ * important to be fast than accurate.
+ */
+ tr->buffer_disabled = 0;
+ /* Make the flag seen by readers */
+ smp_wmb();
+}
+
/**
* tracing_on - enable tracing buffers
*
@@ -338,15 +361,7 @@ unsigned long trace_flags = TRACE_ITER_P
*/
void tracing_on(void)
{
- if (global_trace.trace_buffer.buffer)
- ring_buffer_record_on(global_trace.trace_buffer.buffer);
- /*
- * This flag is only looked at when buffers haven't been
- * allocated yet. We don't really care about the race
- * between setting this flag and actually turning
- * on the buffer.
- */
- global_trace.buffer_disabled = 0;
+ tracer_tracing_on(&global_trace);
}
EXPORT_SYMBOL_GPL(tracing_on);

@@ -540,6 +555,23 @@ void tracing_snapshot_alloc(void)
EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
#endif /* CONFIG_TRACER_SNAPSHOT */

+void tracer_tracing_off(struct trace_array *tr)
+{
+ if (tr->trace_buffer.buffer)
+ ring_buffer_record_off(tr->trace_buffer.buffer);
+ /*
+ * This flag is looked at when buffers haven't been allocated
+ * yet, or by some tracers (like irqsoff), that just want to
+ * know if the ring buffer has been disabled, but it can handle
+ * races of where it gets disabled but we still do a record.
+ * As the check is in the fast path of the tracers, it is more
+ * important to be fast than accurate.
+ */
+ tr->buffer_disabled = 1;
+ /* Make the flag seen by readers */
+ smp_wmb();
+}
+
/**
* tracing_off - turn off tracing buffers
*
@@ -550,26 +582,23 @@ EXPORT_SYMBOL_GPL(tracing_snapshot_alloc
*/
void tracing_off(void)
{
- if (global_trace.trace_buffer.buffer)
- ring_buffer_record_off(global_trace.trace_buffer.buffer);
- /*
- * This flag is only looked at when buffers haven't been
- * allocated yet. We don't really care about the race
- * between setting this flag and actually turning
- * on the buffer.
- */
- global_trace.buffer_disabled = 1;
+ tracer_tracing_off(&global_trace);
}
EXPORT_SYMBOL_GPL(tracing_off);

+int tracer_tracing_is_on(struct trace_array *tr)
+{
+ if (tr->trace_buffer.buffer)
+ return ring_buffer_record_is_on(tr->trace_buffer.buffer);
+ return !tr->buffer_disabled;
+}
+
/**
* tracing_is_on - show state of ring buffers enabled
*/
int tracing_is_on(void)
{
- if (global_trace.trace_buffer.buffer)
- return ring_buffer_record_is_on(global_trace.trace_buffer.buffer);
- return !global_trace.buffer_disabled;
+ return tracer_tracing_is_on(&global_trace);
}
EXPORT_SYMBOL_GPL(tracing_is_on);

@@ -3939,7 +3968,7 @@ static int tracing_wait_pipe(struct file
*
* iter->pos will be 0 if we haven't read anything.
*/
- if (!tracing_is_enabled() && iter->pos)
+ if (!tracing_is_on() && iter->pos)
break;
}

@@ -5612,15 +5641,10 @@ rb_simple_read(struct file *filp, char _
size_t cnt, loff_t *ppos)
{
struct trace_array *tr = filp->private_data;
- struct ring_buffer *buffer = tr->trace_buffer.buffer;
char buf[64];
int r;

- if (buffer)
- r = ring_buffer_record_is_on(buffer);
- else
- r = 0;
-
+ r = tracer_tracing_is_on(tr);
r = sprintf(buf, "%d\n", r);

return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
@@ -5642,11 +5666,11 @@ rb_simple_write(struct file *filp, const
if (buffer) {
mutex_lock(&trace_types_lock);
if (val) {
- ring_buffer_record_on(buffer);
+ tracer_tracing_on(tr);
if (tr->current_trace->start)
tr->current_trace->start(tr);
} else {
- ring_buffer_record_off(buffer);
+ tracer_tracing_off(tr);
if (tr->current_trace->stop)
tr->current_trace->stop(tr);
}
Index: linux-trace.git/kernel/trace/trace_irqsoff.c
===================================================================
--- linux-trace.git.orig/kernel/trace/trace_irqsoff.c
+++ linux-trace.git/kernel/trace/trace_irqsoff.c
@@ -373,7 +373,7 @@ start_critical_timing(unsigned long ip,
struct trace_array_cpu *data;
unsigned long flags;

- if (likely(!tracer_enabled))
+ if (!tracer_enabled || !tracing_is_enabled())
return;

cpu = raw_smp_processor_id();
@@ -416,7 +416,7 @@ stop_critical_timing(unsigned long ip, u
else
return;

- if (!tracer_enabled)
+ if (!tracer_enabled || !tracing_is_enabled())
return;

data = per_cpu_ptr(tr->trace_buffer.data, cpu);

2013-06-27 07:55:52

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote:
> On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> > On 06/25, Dave Jones wrote:
> > >
> > > Took a lot longer to trigger this time. (13 hours of runtime).
> >
> > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> > time to hit the same bug ;)
....
> What I've gathered so far:
>
> - Only affects two machines I have (both Intel Quad core Haswell, one with SSD, one with hybrid SSD)
> - One machine is XFS, the other EXT4.
> - When the lockup occurs, it happens on all cores.
> - It's nearly always a sync() call that triggers it looking like this..
>
> irq event stamp: 8465043
> hardirqs last enabled at (8465042): [<ffffffff816ebc60>] restore_args+0x0/0x30
> hardirqs last disabled at (8465043): [<ffffffff816f476a>] apic_timer_interrupt+0x6a/0x80
> softirqs last enabled at (8464292): [<ffffffff81054204>] __do_softirq+0x194/0x440
> softirqs last disabled at (8464295): [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> RIP: 0010:[<ffffffff81054121>] [<ffffffff81054121>] __do_softirq+0xb1/0x440
>
> Call Trace:
> <IRQ>
> [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> [<ffffffff816f560b>] smp_apic_timer_interrupt+0x6b/0x9b
> [<ffffffff816f476f>] apic_timer_interrupt+0x6f/0x80
> <EOI>
> [<ffffffff816ebc60>] ? retint_restore_args+0xe/0xe
> [<ffffffff810b9c56>] ? lock_acquire+0xa6/0x1f0
> [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff816eaba0>] _raw_spin_lock+0x40/0x80
> [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff811da892>] sync_inodes_sb+0x1c2/0x2a0
> [<ffffffff816e8206>] ? wait_for_completion+0x36/0x110
> [<ffffffff811e04f0>] ? generic_write_sync+0x70/0x70
> [<ffffffff811e0509>] sync_inodes_one_sb+0x19/0x20
> [<ffffffff811b0e62>] iterate_supers+0xb2/0x110
> [<ffffffff811e0775>] sys_sync+0x35/0x90
> [<ffffffff816f3d14>] tracesys+0xdd/0xe2

Is this just a soft lockup warning? Or is the system hung?

I mean, what you see here is probably sync_inodes_sb() having called
wait_sb_inodes() and is spinning on the inode_sb_list_lock.

There's nothing stopping multiple sys_sync() calls from executing on
the same superblock simultaneously, and if there's lots of cached
inodes on a single filesystem and nothing much to write back then
concurrent sync() calls will enter wait_sb_inodes() concurrently and
contend on the inode_sb_list_lock.

Get enough sync() calls running at the same time, and you'll see
this. e.g. I just ran a parallel find/stat workload over a
filesystem with 50 million inodes in it, and once that had reached a
steady state of about 2 million cached inodes in RAM:

$ for i in `seq 0 1 100`; do time sync & done
.....

real 0m38.508s
user 0m0.000s
sys 0m2.849s
$

While the syncs were running the system is essentially unresponsive.
It takes seconds to respond to a single keypress, and it's
completely CPU bound. What is running on each CPU? From echo l >
/proc/sysrq-trigger:

[ 4864.252344] SysRq : Show backtrace of all active CPUs
[ 4864.253565] CPU0:
[ 4864.254037] dead000000200200 ffff88003ec03f58 ffffffff817f2864 ffff88003ec03f58
[ 4864.255770] ffff88003ec16e28 ffff88003ec03f98 ffffffff810e32fe ffff88003ec03f68
[ 4864.256010] ffff88003ec03f68 ffff88003ec03f98 ffffffff820f53c0 0000000000004f02
[ 4864.256010] Call Trace:
[ 4864.256010] <IRQ> [<ffffffff817f2864>] showacpu+0x54/0x70
[ 4864.256010] [<ffffffff810e32fe>] generic_smp_call_function_single_interrupt+0xbe/0x130
[ 4864.256010] [<ffffffff81067577>] smp_call_function_interrupt+0x27/0x40
[ 4864.256010] [<ffffffff81c3f15d>] call_function_interrupt+0x6d/0x80
[ 4864.256010] <EOI> [<ffffffff817622c1>] ? do_raw_spin_lock+0xb1/0x110
[ 4864.256010] [<ffffffff817622c1>] ? do_raw_spin_lock+0xb1/0x110
[ 4864.256010] [<ffffffff810b85a0>] ? try_to_wake_up+0x2f0/0x2f0
[ 4864.256010] [<ffffffff81c358ae>] _raw_spin_lock+0x1e/0x20
[ 4864.256010] [<ffffffff811af74c>] sync_inodes_sb+0xdc/0x200
[ 4864.256010] [<ffffffff811b6010>] ? fdatawrite_one_bdev+0x20/0x20
[ 4864.256010] [<ffffffff811b6029>] sync_inodes_one_sb+0x19/0x20
[ 4864.256010] [<ffffffff8118ab59>] iterate_supers+0xe9/0xf0
[ 4864.256010] [<ffffffff811b61f5>] sys_sync+0x35/0x90
[ 4864.256010] [<ffffffff81c3e2d9>] system_call_fastpath+0x16/0x1b
[ 4864.256032] CPU7:
[ 4864.256032] 0000000000000000 ffff88011fd03f58 ffffffff817f2864 ffff88011fd03f58
[ 4864.256032] ffff88011fd16e28 ffff88011fd03f98 ffffffff810e32fe ffff88011fd03f68
[ 4864.256032] ffff88011fd03f68 ffff88011fd03f98 ffffffff820f53c0 ffff88001eea7fd8
[ 4864.256032] Call Trace:
[ 4864.256032] <IRQ> [<ffffffff817f2864>] showacpu+0x54/0x70
[ 4864.256032] [<ffffffff810e32fe>] generic_smp_call_function_single_interrupt+0xbe/0x130
[ 4864.256032] [<ffffffff81067577>] smp_call_function_interrupt+0x27/0x40
[ 4864.256032] [<ffffffff81c3f15d>] call_function_interrupt+0x6d/0x80
[ 4864.256032] <EOI> [<ffffffff81144a55>] ? pagevec_lookup_tag+0x25/0x40
[ 4864.256032] [<ffffffff81759d90>] ? delay_tsc+0x30/0xc0
[ 4864.256032] [<ffffffff81759c9f>] __delay+0xf/0x20
[ 4864.256032] [<ffffffff817622d5>] do_raw_spin_lock+0xc5/0x110
[ 4864.256032] [<ffffffff81c358ae>] _raw_spin_lock+0x1e/0x20
[ 4864.256032] [<ffffffff811af808>] sync_inodes_sb+0x198/0x200
[ 4864.256032] [<ffffffff811b6010>] ? fdatawrite_one_bdev+0x20/0x20
[ 4864.256032] [<ffffffff811b6029>] sync_inodes_one_sb+0x19/0x20
[ 4864.256032] [<ffffffff8118ab59>] iterate_supers+0xe9/0xf0
[ 4864.256032] [<ffffffff811b61f5>] sys_sync+0x35/0x90
[ 4864.256032] [<ffffffff81c3e2d9>] system_call_fastpath+0x16/0x1b
.....

Yup, it's doing exactly what your system is doing - smashing the
inode_sb_list_lock as hard as it can scanning every cached inode in
the system to check whether it is dirty. Well, there ain't no dirty
inodes to be found, so this is just plain wasted CPU.
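
For reference, this is roughly the loop they're all stuck in - a
paraphrased sketch of the 3.10-era wait_sb_inodes() in fs/fs-writeback.c,
so treat it as illustrative rather than the exact source:

static void wait_sb_inodes(struct super_block *sb)
{
	struct inode *inode, *old_inode = NULL;

	spin_lock(&inode_sb_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		struct address_space *mapping = inode->i_mapping;

		spin_lock(&inode->i_lock);
		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
		    (mapping->nrpages == 0)) {
			/* clean or dying inode: skip it, but we've already
			 * paid for the list walk under the global lock */
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_sb_list_lock);

		/* wait for any pages under writeback on this inode */
		filemap_fdatawait(mapping);

		iput(old_inode);
		old_inode = inode;

		cond_resched();
		spin_lock(&inode_sb_list_lock);	/* ...and contend again */
	}
	spin_unlock(&inode_sb_list_lock);
	iput(old_inode);
}

Every concurrent sync() repeats that walk over every cached inode on the
superblock, so with a couple of million clean inodes in RAM the lock hold
times add up very quickly.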

So, guess what I've been beating the tux3 people up over today? Their
attempt to "optimise" the lock contention problems in
wait_sb_inodes() by providing a method for tux3 to avoid needing to
call wait_sb_inodes() rather than actually fixing the lock
contention problem. Thread here:

https://lkml.org/lkml/2013/6/26/181

I didn't realise that just calling sync caused this lock contention
problem until I read this thread, so fixing this just went up
several levels of priority given the effect an unprivileged user can
have on the system just by running lots of concurrent sync calls.

> I'll work on trying to narrow down what trinity is doing. That might at least
> make it easier to reproduce it in a shorter timeframe.

This is only occurring on your new machines, right? They have more
memory than your old machines, and faster drives? So the caches are
larger and the IO completion faster? Those combinations will put
more pressure on wait_sb_inodes() from concurrent sync operations...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-27 10:06:20

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote:
> On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote:
> > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> > > On 06/25, Dave Jones wrote:
> > > >
> > > > Took a lot longer to trigger this time. (13 hours of runtime).
> > >
> > > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> > > time to hit the same bug ;)
> ....
> > What I've gathered so far:
> >
> > - Only affects two machines I have (both Intel Quad core Haswell, one with SSD, one with hybrid SSD)
> > - One machine is XFS, the other EXT4.
> > - When the lockup occurs, it happens on all cores.
> > - It's nearly always a sync() call that triggers it looking like this..
> >
> > irq event stamp: 8465043
> > hardirqs last enabled at (8465042): [<ffffffff816ebc60>] restore_args+0x0/0x30
> > hardirqs last disabled at (8465043): [<ffffffff816f476a>] apic_timer_interrupt+0x6a/0x80
> > softirqs last enabled at (8464292): [<ffffffff81054204>] __do_softirq+0x194/0x440
> > softirqs last disabled at (8464295): [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> > RIP: 0010:[<ffffffff81054121>] [<ffffffff81054121>] __do_softirq+0xb1/0x440
> >
> > Call Trace:
> > <IRQ>
> > [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> > [<ffffffff816f560b>] smp_apic_timer_interrupt+0x6b/0x9b
> > [<ffffffff816f476f>] apic_timer_interrupt+0x6f/0x80
> > <EOI>
> > [<ffffffff816ebc60>] ? retint_restore_args+0xe/0xe
> > [<ffffffff810b9c56>] ? lock_acquire+0xa6/0x1f0
> > [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> > [<ffffffff816eaba0>] _raw_spin_lock+0x40/0x80
> > [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> > [<ffffffff811da892>] sync_inodes_sb+0x1c2/0x2a0
> > [<ffffffff816e8206>] ? wait_for_completion+0x36/0x110
> > [<ffffffff811e04f0>] ? generic_write_sync+0x70/0x70
> > [<ffffffff811e0509>] sync_inodes_one_sb+0x19/0x20
> > [<ffffffff811b0e62>] iterate_supers+0xb2/0x110
> > [<ffffffff811e0775>] sys_sync+0x35/0x90
> > [<ffffffff816f3d14>] tracesys+0xdd/0xe2
>
> Is this just a soft lockup warning? Or is the system hung?
>
> I mean, what you see here is probably sync_inodes_sb() having called
> wait_sb_inodes() and is spinning on the inode_sb_list_lock.
>
> There's nothing stopping multiple sys_sync() calls from executing on
> the same superblock simultaneously, and if there's lots of cached
> inodes on a single filesystem and nothing much to write back then
> concurrent sync() calls will enter wait_sb_inodes() concurrently and
> contend on the inode_sb_list_lock.
>
> Get enough sync() calls running at the same time, and you'll see
> this. e.g. I just ran a parallel find/stat workload over a
> filesystem with 50 million inodes in it, and once that had reached a
> steady state of about 2 million cached inodes in RAM:
>
> $ for i in `seq 0 1 100`; do time sync & done
> .....
>
> real 0m38.508s
> user 0m0.000s
> sys 0m2.849s
> $

Got a prototype patch for this. It's a bit nasty - it hooks the
test_set_page_writeback/test_clear_page_writeback and adds another
listhead to the struct inode, but:

$ time (for i in `seq 0 1 1000`; do sync & done ; wait)

real 0m0.552s
user 0m0.097s
sys 0m1.555s
$

Yup, that's about three orders of magnitude faster on this
workload....

Lightly smoke tested patch below - it passed the first round of
XFS data integrity tests in xfstests, so it's not completely
busted...

Cheers,

Dave.
--
Dave Chinner
[email protected]

writeback: store inodes under writeback on a separate list

From: Dave Chinner <[email protected]>

When there are lots of cached inodes, a sync(2) operation walks all
of them to try to find which ones are under writeback and wait for
IO completion on them. Run enough load, and this causes catastrophic
lock contention on the inode_sb_list_lock.

Try to fix this problem by tracking inodes under data IO and wait
specifically for only those inodes that haven't completed their data
IO in wait_sb_inodes().

This is a bit hacky and messy, but demonstrates one method of
solving this problem....

XXX: This will catch data IO - do we need to catch actual inode
writeback (i.e. the metadata) here? I'm pretty sure we don't need to
because the existing code just calls filemap_fdatawait() and that
doesn't wait for the inode metadata writeback to occur....

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 44 +++++++++++++++++++++++++++++++++++++------
fs/inode.c | 1 +
include/linux/backing-dev.h | 3 +++
include/linux/fs.h | 3 ++-
mm/backing-dev.c | 2 ++
mm/page-writeback.c | 22 ++++++++++++++++++++++
6 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..15e6394 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1208,7 +1208,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);

static void wait_sb_inodes(struct super_block *sb)
{
- struct inode *inode, *old_inode = NULL;
+ struct backing_dev_info *bdi = sb->s_bdi;
+ struct inode *old_inode = NULL;
+ LIST_HEAD(sync_list);

/*
* We need to be protected against the filesystem going from
@@ -1216,7 +1218,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- spin_lock(&inode_sb_list_lock);

/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1224,8 +1225,29 @@ static void wait_sb_inodes(struct super_block *sb)
* call, but which had writeout started before we write it out.
* In which case, the inode may not be on the dirty list, but
* we still have to wait for that writeout.
+ *
+ * To avoid syncing inodes put under IO after we have started here,
+ * splice the io list to a temporary list head and walk that. Newly
+ * dirtied inodes will go onto the primary list so we won't wait for
+ * them.
+ *
+ * Inodes that have pages under writeback after we've finished the wait
+ * may or may not be on the primary list. They had pages put under IO
+ * after we started our wait, so we need to make sure the next sync or
+ * IO completion treats them correctly. Move them back to the primary
+ * list and restart the walk.
+ *
+ * Inodes that are clean after we have waited for them don't belong
+ * on any list, and the cleaning of them should have removed them from
+ * the temporary list. Check this is true, and restart the walk.
*/
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_splice_init(&bdi->wb.b_wb, &sync_list);
+
+ while (!list_empty(&sync_list)) {
+ struct inode *inode = list_first_entry(&sync_list, struct inode,
+ i_io_list);
struct address_space *mapping = inode->i_mapping;

spin_lock(&inode->i_lock);
@@ -1236,7 +1258,7 @@ static void wait_sb_inodes(struct super_block *sb)
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock(&bdi->wb.wb_list_lock);

/*
* We hold a reference to 'inode' so it couldn't have been
@@ -1253,9 +1275,19 @@ static void wait_sb_inodes(struct super_block *sb)

cond_resched();

- spin_lock(&inode_sb_list_lock);
+ /*
+ * the inode has been written back now, so check whether we
+ * still have pages under IO and move it back to the primary
+ * list if necessary
+ */
+ spin_lock(&bdi->wb.wb_list_lock);
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ WARN_ON(list_empty(&inode->i_io_list));
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
+ } else
+ WARN_ON(!list_empty(&inode->i_io_list));
}
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock(&bdi->wb.wb_list_lock);
iput(old_inode);
}

diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..f25c1ca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_wb_list);
+ INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_lru);
address_space_init_once(&inode->i_data);
i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..4a6283c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,9 @@ struct bdi_writeback {
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
spinlock_t list_lock; /* protects the b_* lists */
+
+ spinlock_t wb_list_lock; /* writeback list lock */
+ struct list_head b_wb; /* under writeback, for wait_sb_inodes */
};

struct backing_dev_info {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63cac31..7861017 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -573,7 +573,8 @@ struct inode {
unsigned long dirtied_when; /* jiffies of first dirtying */

struct hlist_node i_hash;
- struct list_head i_wb_list; /* backing dev IO list */
+ struct list_head i_wb_list; /* backing dev WB list */
+ struct list_head i_io_list; /* backing dev IO list */
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..896b8f5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ INIT_LIST_HEAD(&wb->b_wb);
spin_lock_init(&wb->list_lock);
+ spin_lock_init(&wb->wb_list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..4c411fe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
__dec_bdi_stat(bdi, BDI_WRITEBACK);
__bdi_writeout_inc(bdi);
}
+ if (mapping->host &&
+ !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_del_init(&inode->i_io_list);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
@@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
+ bool on_wblist;
+
+ /* __swap_writepage comes through here */
+ on_wblist = mapping_tagged(mapping,
+ PAGECACHE_TAG_WRITEBACK);
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ if (!on_wblist && mapping->host) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(!list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,

2013-06-27 12:52:46

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 08:06:12PM +1000, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote:
> > On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote:
> > > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote:
> > > > On 06/25, Dave Jones wrote:
> > > > >
> > > > > Took a lot longer to trigger this time. (13 hours of runtime).
> > > >
> > > > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more
> > > > time to hit the same bug ;)
> > ....
> > > What I've gathered so far:
> > >
> > > - Only affects two machines I have (both Intel Quad core Haswell, one with SSD, one with hybrid SSD)
> > > - One machine is XFS, the other EXT4.
> > > - When the lockup occurs, it happens on all cores.
> > > - It's nearly always a sync() call that triggers it looking like this..
> > >
> > > irq event stamp: 8465043
> > > hardirqs last enabled at (8465042): [<ffffffff816ebc60>] restore_args+0x0/0x30
> > > hardirqs last disabled at (8465043): [<ffffffff816f476a>] apic_timer_interrupt+0x6a/0x80
> > > softirqs last enabled at (8464292): [<ffffffff81054204>] __do_softirq+0x194/0x440
> > > softirqs last disabled at (8464295): [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> > > RIP: 0010:[<ffffffff81054121>] [<ffffffff81054121>] __do_softirq+0xb1/0x440
> > >
> > > Call Trace:
> > > <IRQ>
> > > [<ffffffff8105466d>] irq_exit+0xcd/0xe0
> > > [<ffffffff816f560b>] smp_apic_timer_interrupt+0x6b/0x9b
> > > [<ffffffff816f476f>] apic_timer_interrupt+0x6f/0x80
> > > <EOI>
> > > [<ffffffff816ebc60>] ? retint_restore_args+0xe/0xe
> > > [<ffffffff810b9c56>] ? lock_acquire+0xa6/0x1f0
> > > [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> > > [<ffffffff816eaba0>] _raw_spin_lock+0x40/0x80
> > > [<ffffffff811da892>] ? sync_inodes_sb+0x1c2/0x2a0
> > > [<ffffffff811da892>] sync_inodes_sb+0x1c2/0x2a0
> > > [<ffffffff816e8206>] ? wait_for_completion+0x36/0x110
> > > [<ffffffff811e04f0>] ? generic_write_sync+0x70/0x70
> > > [<ffffffff811e0509>] sync_inodes_one_sb+0x19/0x20
> > > [<ffffffff811b0e62>] iterate_supers+0xb2/0x110
> > > [<ffffffff811e0775>] sys_sync+0x35/0x90
> > > [<ffffffff816f3d14>] tracesys+0xdd/0xe2
> >
> > Is this just a soft lockup warning? Or is the system hung?
> >
> > I mean, what you see here is probably sync_inodes_sb() having called
> > wait_sb_inodes() and is spinning on the inode_sb_list_lock.
> >
> > There's nothing stopping multiple sys_sync() calls from executing on
> > the same superblock simultaneously, and if there's lots of cached
> > inodes on a single filesystem and nothing much to write back then
> > concurrent sync() calls will enter wait_sb_inodes() concurrently and
> > contend on the inode_sb_list_lock.
> >
> > Get enough sync() calls running at the same time, and you'll see
> > this. e.g. I just ran a parallel find/stat workload over a
> > filesystem with 50 million inodes in it, and once that had reached a
> > steady state of about 2 million cached inodes in RAM:
> >
> > $ for i in `seq 0 1 100`; do time sync & done
> > .....
> >
> > real 0m38.508s
> > user 0m0.000s
> > sys 0m2.849s
> > $
>
> Got a prototype patch for this. It's a bit nasty - it hooks the
> test_set_page_writeback/test_clear_page_writeback and adds another
> listhead to the struct inode, but:
>
> $ time (for i in `seq 0 1 1000`; do sync & done ; wait)
>
> real 0m0.552s
> user 0m0.097s
> sys 0m1.555s
> $
>
> Yup, that's about three orders of magnitude faster on this
> workload....
>
> Lightly smoke tested patch below - it passed the first round of
> XFS data integrity tests in xfstests, so it's not completely
> busted...

And now with even less smoke out than the first version. This one
gets through a full xfstests run...

Cheers,

Dave.
--
Dave Chinner
[email protected]

writeback: store inodes under writeback on a separate list

From: Dave Chinner <[email protected]>

When there are lots of cached inodes, a sync(2) operation walks all
of them to try to find which ones are under writeback and wait for
IO completion on them. Run enough load, and this causes catastrophic
lock contention on the inode_sb_list_lock.

Try to fix this problem by tracking inodes under data IO and wait
specifically for only those inodes that haven't completed their data
IO in wait_sb_inodes().

This is a bit hacky and messy, but demonstrates one method of
solving this problem....

XXX: This will catch data IO - do we need to catch actual inode
writeback (i.e. the metadata) here? I'm pretty sure we don't need to
because the existing code just calls filemap_fdatawait() and that
doesn't wait for the inode metadata writeback to occur....

[v2 - needs spin_lock_irq variants in wait_sb_inodes.
- move freeing inodes back to primary list, we don't wait for
them
- take mapping lock in wait_sb_inodes when requeuing.]

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 50 ++++++++++++++++++++++++++++++++++++-------
fs/inode.c | 1 +
include/linux/backing-dev.h | 3 +++
include/linux/fs.h | 3 ++-
mm/backing-dev.c | 2 ++
mm/page-writeback.c | 22 +++++++++++++++++++
6 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..b0e943a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);

static void wait_sb_inodes(struct super_block *sb)
{
- struct inode *inode, *old_inode = NULL;
+ struct backing_dev_info *bdi = sb->s_bdi;
+ struct inode *old_inode = NULL;
+ unsigned long flags;
+ LIST_HEAD(sync_list);

/*
* We need to be protected against the filesystem going from
@@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- spin_lock(&inode_sb_list_lock);

/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1224,19 +1226,40 @@ static void wait_sb_inodes(struct super_block *sb)
* call, but which had writeout started before we write it out.
* In which case, the inode may not be on the dirty list, but
* we still have to wait for that writeout.
+ *
+ * To avoid syncing inodes put under IO after we have started here,
+ * splice the io list to a temporary list head and walk that. Newly
+ * dirtied inodes will go onto the primary list so we won't wait for
+ * them.
+ *
+ * Inodes that have pages under writeback after we've finished the wait
+ * may or may not be on the primary list. They had pages put under IO
+ * after we started our wait, so we need to make sure the next sync or
+ * IO completion treats them correctly. Move them back to the primary
+ * list and restart the walk.
+ *
+ * Inodes that are clean after we have waited for them don't belong
+ * on any list, and the cleaning of them should have removed them from
+ * the temporary list. Check this is true, and restart the walk.
*/
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
+ list_splice_init(&bdi->wb.b_wb, &sync_list);
+
+ while (!list_empty(&sync_list)) {
+ struct inode *inode = list_first_entry(&sync_list, struct inode,
+ i_io_list);
struct address_space *mapping = inode->i_mapping;

spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
- (mapping->nrpages == 0)) {
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
spin_unlock(&inode->i_lock);
continue;
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);

/*
* We hold a reference to 'inode' so it couldn't have been
@@ -1253,9 +1276,20 @@ static void wait_sb_inodes(struct super_block *sb)

cond_resched();

- spin_lock(&inode_sb_list_lock);
+ /*
+ * the inode has been written back now, so check whether we
+ * still have pages under IO and move it back to the primary
+ * list if necessary
+ */
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock(&bdi->wb.wb_list_lock);
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ WARN_ON(list_empty(&inode->i_io_list));
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
+ }
+ spin_unlock(&mapping->tree_lock);
}
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
iput(old_inode);
}

diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..f25c1ca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_wb_list);
+ INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_lru);
address_space_init_once(&inode->i_data);
i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..4a6283c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,9 @@ struct bdi_writeback {
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
spinlock_t list_lock; /* protects the b_* lists */
+
+ spinlock_t wb_list_lock; /* writeback list lock */
+ struct list_head b_wb; /* under writeback, for wait_sb_inodes */
};

struct backing_dev_info {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63cac31..7861017 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -573,7 +573,8 @@ struct inode {
unsigned long dirtied_when; /* jiffies of first dirtying */

struct hlist_node i_hash;
- struct list_head i_wb_list; /* backing dev IO list */
+ struct list_head i_wb_list; /* backing dev WB list */
+ struct list_head i_io_list; /* backing dev IO list */
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..896b8f5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ INIT_LIST_HEAD(&wb->b_wb);
spin_lock_init(&wb->list_lock);
+ spin_lock_init(&wb->wb_list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..4c411fe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
__dec_bdi_stat(bdi, BDI_WRITEBACK);
__bdi_writeout_inc(bdi);
}
+ if (mapping->host &&
+ !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_del_init(&inode->i_io_list);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
@@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
+ bool on_wblist;
+
+ /* __swap_writepage comes through here */
+ on_wblist = mapping_tagged(mapping,
+ PAGECACHE_TAG_WRITEBACK);
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ if (!on_wblist && mapping->host) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(!list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,

2013-06-27 14:31:44

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote:

> Is this just a soft lockup warning? Or is the system hung?

I've only seen it completely lock up the box 2-3 times out of dozens
of times I've seen this, and tbh that could have been a different bug.

> I mean, what you see here is probably sync_inodes_sb() having called
> wait_sb_inodes() and is spinning on the inode_sb_list_lock.
>
> There's nothing stopping multiple sys_sync() calls from executing on
> the same superblock simultaneously, and if there's lots of cached
> inodes on a single filesystem and nothing much to write back then
> concurrent sync() calls will enter wait_sb_inodes() concurrently and
> contend on the inode_sb_list_lock.
>
> Get enough sync() calls running at the same time, and you'll see
> this. e.g. I just ran a parallel find/stat workload over a
> filesystem with 50 million inodes in it, and once that had reached a
> steady state of about 2 million cached inodes in RAM:

It's not even just sync calls, it seems. Here's the latest victim from
last night's overnight run, failing in a hugetlb mmap.
Same lock, but we got there by a different path. (I suppose it could be
that the other CPUs were running sync() at the time of this mmap call)
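
For context, here's a paraphrased-from-memory sketch of the relevant
bits of fs/inode.c in this era (illustrative only, not the exact code):
new_inode() ends up in inode_sb_list_add(), which takes the very same
global inode_sb_list_lock that wait_sb_inodes() hammers, so an mmap that
creates a hugetlbfs inode contends with every concurrent sync():

void inode_sb_list_add(struct inode *inode)
{
	spin_lock(&inode_sb_list_lock);		/* global, shared by all sbs */
	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
	spin_unlock(&inode_sb_list_lock);
}

struct inode *new_inode(struct super_block *sb)
{
	struct inode *inode = new_inode_pseudo(sb);

	if (inode)
		inode_sb_list_add(inode);	/* where the trace below spins */
	return inode;
}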

BUG: soft lockup - CPU#1 stuck for 22s! [trinity-child1:24304]
BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:25603]
Modules linked in: bridge stp dlci fuse tun snd_seq_dummy bnep rfcomm hidp nfnetlink scsi_transport_iscsi can_raw rose pppoe pppox ppp_generic caif_socket slhc caif nfc ipx p8023 x25 p8022 bluetooth irda crc_ccitt rfkill llc2 netrom rds ax25 af_key ipt_ULOG af_rxrpc can_bcm phonet can appletalk af_802154 psnap llc atm coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device e1000e snd_pcm ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
irq event stamp: 83048
hardirqs last enabled at (83047): [<ffffffff816e9560>] restore_args+0x0/0x30
hardirqs last disabled at (83048): [<ffffffff816f202a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (83046): [<ffffffff810541d4>] __do_softirq+0x194/0x440
softirqs last disabled at (83023): [<ffffffff8105463d>] irq_exit+0xcd/0xe0
CPU: 0 PID: 25603 Comm: trinity-child0 Not tainted 3.10.0-rc7+ #6
task: ffff880232a7a3e0 ti: ffff88023aec0000 task.ti: ffff88023aec0000
RIP: 0010:[<ffffffff81306143>] [<ffffffff81306143>] delay_tsc+0x73/0xe0
RSP: 0018:ffff88023aec1db8 EFLAGS: 00000202
RAX: 0000000005cee9e0 RBX: ffffffff816e9560 RCX: 000000000000b910
RDX: 000000000000197f RSI: 0000000000000001 RDI: 0000000000000001
RBP: ffff88023aec1de0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88023aec1d28
R13: ffff88023aec1fd8 R14: ffff88023aec0000 R15: 0000000000000046
FS: 00007f5b482e3740(0000) GS:ffff880245600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5b482e7070 CR3: 000000023f9ef000 CR4: 00000000001407f0
DR0: 0000000000cbb000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffffffff81c04640 000000002a279c22 0000000088c74df8 0000000000000000
ffff880243c8b000 ffff88023aec1df0 ffffffff8130602f ffff88023aec1e18
ffffffff8130e641 ffffffff81c04640 ffffffff81c04658 0000000000000000
Call Trace:
[<ffffffff8130602f>] __delay+0xf/0x20
[<ffffffff8130e641>] do_raw_spin_lock+0xe1/0x130
[<ffffffff816e84e0>] _raw_spin_lock+0x60/0x80
[<ffffffff811c4429>] ? inode_sb_list_add+0x19/0x50
[<ffffffff811c4429>] inode_sb_list_add+0x19/0x50
[<ffffffff811c6a69>] new_inode+0x29/0x30
[<ffffffff8128ed00>] hugetlbfs_get_inode+0x20/0x140
[<ffffffff8128fd5f>] hugetlb_file_setup+0xff/0x2d0
[<ffffffff81176a32>] SyS_mmap_pgoff+0xb2/0x250
[<ffffffff8100f388>] ? syscall_trace_enter+0x18/0x290
[<ffffffff81007232>] SyS_mmap+0x22/0x30
[<ffffffff816f15d4>] tracesys+0xdd/0xe2
Code: 3e 00 49 8b 87 38 e0 ff ff a8 08 75 5a f3 90 bf 01 00 00 00 e8 6f 6e 3e 00 e8 8a 9e 00 00 41 39 c5 75 4b 0f 1f 00 0f ae e8 0f 31 <48> c1 e2 20 89 c0 48 09 c2 41 89 d6 29 da 44 39 e2 72 ba bf 01
XXX HIRQ 0 timer : sum_delta= 20 on_this_cpu= 20
XXX HIRQ 1 i8042 : sum_delta= 10 on_this_cpu= 10
XXX HIRQ 8 rtc0 : sum_delta= 1 on_this_cpu= 1
XXX HIRQ 16 ehci_hcd:usb1 : sum_delta= 138 on_this_cpu= 33
XXX HIRQ 23 ehci_hcd:usb2 : sum_delta= 124 on_this_cpu= 109
XXX HIRQ 42 i915 : sum_delta= 80 on_this_cpu= 80
XXX HIRQ 43 ahci : sum_delta= 89683 on_this_cpu= 3163
XXX HIRQ 45 enp0s25 : sum_delta= 30655 on_this_cpu= 20
XXX HIRQ 46 snd_hda_intel : sum_delta= 457 on_this_cpu= 457
XXX HIRQ 47 snd_hda_intel : sum_delta= 2961 on_this_cpu= 2961
XXX SIRQ 1 TIMER : sum_delta= 3397163 on_this_cpu= 1219258
XXX SIRQ 2 NET_TX : sum_delta= 2447 on_this_cpu= 2440
XXX SIRQ 3 NET_RX : sum_delta= 31103 on_this_cpu= 150
XXX SIRQ 4 BLOCK : sum_delta= 72285 on_this_cpu= 2950
XXX SIRQ 6 TASKLET : sum_delta= 1 on_this_cpu= 0
XXX SIRQ 7 SCHED : sum_delta= 1103706 on_this_cpu= 393605
XXX SIRQ 8 HRTIMER : sum_delta= 16656 on_this_cpu= 3663
XXX SIRQ 9 RCU : sum_delta= 2123140 on_this_cpu= 576876
BUG: soft lockup - CPU#2 stuck for 22s! [trinity-child3:24661]


> I didn't realise that just calling sync caused this lock contention
> problem until I read this thread, so fixing this just went up
> several levels of priority given the affect an unprivileged user can
> have on the system just by running lots of concurrent sync calls.
>
> > I'll work on trying to narrow down what trinity is doing. That might at least
> > make it easier to reproduce it in a shorter timeframe.
>
> This is only occurring on your new machines, right? They have more
> memory than your old machines, and faster drives? So the caches are
> larger and the IO completion faster? Those combinations will put
> more pressure on wait_sb_inodes() from concurrent sync operations...

Sounds feasible. Maybe I should add something to trinity to create more
dirty pages, perhaps that would have triggered this faster.

8GB RAM, 80MB/s SSDs, nothing exciting there (compared to my other machines),
so I think it's purely down to the CPUs being faster, or some other architectural
improvement with Haswell that increases parallelism.

Dave

2013-06-27 15:22:16

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:


> > Yup, that's about three orders of magnitude faster on this
> > workload....
> >
> > Lightly smoke tested patch below - it passed the first round of
> > XFS data integrity tests in xfstests, so it's not completely
> > busted...
>
> And now with even less smoke out than the first version. This one
> gets through a full xfstests run...

:sadface:

[ 567.680836] ======================================================
[ 567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[ 567.682389] 3.10.0-rc7+ #9 Not tainted
[ 567.682862] ------------------------------------------------------
[ 567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 567.684464] (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
[ 567.685632]
and this task is already holding:
[ 567.686334] (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
[ 567.687506] which would create a new lock dependency:
[ 567.688115] (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
[ 567.689188]
but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 567.690137] (&(&wb->wb_list_lock)->rlock){..-...}
... which became SOFTIRQ-irq-safe at:
[ 567.691151] [<ffffffff810b7f05>] __lock_acquire+0x595/0x1af0
[ 567.691866] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.692539] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.693798] [<ffffffff81153fce>] test_clear_page_writeback+0x11e/0x230
[ 567.695193] [<ffffffff81146b30>] end_page_writeback+0x20/0x50
[ 567.696497] [<ffffffff811e10c3>] end_buffer_async_write+0x1a3/0x2b0
[ 567.697865] [<ffffffff811df04c>] end_bio_bh_io_sync+0x2c/0x60
[ 567.699164] [<ffffffff811e3a8d>] bio_endio+0x1d/0x30
[ 567.700361] [<ffffffff812d6172>] blk_update_request+0xc2/0x5f0
[ 567.701675] [<ffffffff812d66bc>] blk_update_bidi_request+0x1c/0x80
[ 567.703009] [<ffffffff812d673f>] blk_end_bidi_request+0x1f/0x60
[ 567.704294] [<ffffffff812d6790>] blk_end_request+0x10/0x20
[ 567.705524] [<ffffffff814978c3>] scsi_io_completion+0xf3/0x6e0
[ 567.706791] [<ffffffff8148d8f0>] scsi_finish_command+0xb0/0x110
[ 567.708071] [<ffffffff814976cf>] scsi_softirq_done+0x12f/0x160
[ 567.709339] [<ffffffff812de0b8>] blk_done_softirq+0x88/0xa0
[ 567.710574] [<ffffffff8105413f>] __do_softirq+0xff/0x440
[ 567.711799] [<ffffffff8105463d>] irq_exit+0xcd/0xe0
[ 567.712965] [<ffffffff816f1fb6>] do_IRQ+0x56/0xc0
[ 567.714108] [<ffffffff816e866f>] ret_from_intr+0x0/0x13
[ 567.715305] [<ffffffff813062f9>] __const_udelay+0x29/0x30
[ 567.716514] [<ffffffff81075fd4>] __rcu_read_unlock+0x54/0xa0
[ 567.717753] [<ffffffff811c35dd>] __d_lookup+0x14d/0x320
[ 567.718939] [<ffffffff811b598a>] lookup_fast+0x16a/0x2d0
[ 567.720129] [<ffffffff811b669b>] link_path_walk+0x1ab/0x900
[ 567.721350] [<ffffffff811b9ed4>] path_openat+0x94/0x530
[ 567.722524] [<ffffffff811ba9f8>] do_filp_open+0x38/0x80
[ 567.723701] [<ffffffff811a8e39>] do_sys_open+0xe9/0x1c0
[ 567.724875] [<ffffffff811a8f2e>] SyS_open+0x1e/0x20
[ 567.726007] [<ffffffff816f0794>] tracesys+0xdd/0xe2
[ 567.727138]
to a SOFTIRQ-irq-unsafe lock:
[ 567.728823] (&sb->s_type->i_lock_key#3){+.+...}
... which became SOFTIRQ-irq-unsafe at:
[ 567.730875] ... [<ffffffff810b7f59>] __lock_acquire+0x5e9/0x1af0
[ 567.732150] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.733339] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.734533] [<ffffffff811c6b88>] new_inode_pseudo+0x28/0x60
[ 567.735736] [<ffffffff811c6bd9>] new_inode+0x19/0x30
[ 567.736853] [<ffffffff811d27c1>] mount_pseudo+0xb1/0x180
[ 567.738008] [<ffffffff811e6a04>] bd_mount+0x24/0x30
[ 567.739090] [<ffffffff811ae089>] mount_fs+0x39/0x1b0
[ 567.740172] [<ffffffff811cb623>] vfs_kern_mount+0x63/0xf0
[ 567.741314] [<ffffffff811cb6c9>] kern_mount_data+0x19/0x30
[ 567.742450] [<ffffffff81effaab>] bdev_cache_init+0x56/0x80
[ 567.743571] [<ffffffff81efe1cf>] vfs_caches_init+0x9e/0x115
[ 567.744691] [<ffffffff81ed8dd0>] start_kernel+0x3a2/0x3fe
[ 567.745790] [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
[ 567.747011] [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
[ 567.748167]
other info that might help us debug this:

[ 567.750396] Possible interrupt unsafe locking scenario:

[ 567.752062] CPU0 CPU1
[ 567.753025] ---- ----
[ 567.753981] lock(&sb->s_type->i_lock_key#3);
[ 567.754969] local_irq_disable();
[ 567.756085] lock(&(&wb->wb_list_lock)->rlock);
[ 567.757368] lock(&sb->s_type->i_lock_key#3);
[ 567.758642] <Interrupt>
[ 567.759370] lock(&(&wb->wb_list_lock)->rlock);
[ 567.760379]
*** DEADLOCK ***

[ 567.762337] 2 locks held by trinity-child2/8665:
[ 567.763297] #0: (&type->s_umount_key#23){++++..}, at: [<ffffffff811ada2c>] iterate_supers+0x9c/0x110
[ 567.764898] #1: (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
[ 567.766558]
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[ 567.768362] -> (&(&wb->wb_list_lock)->rlock){..-...} ops: 395 {
[ 567.769581] IN-SOFTIRQ-W at:
[ 567.770428] [<ffffffff810b7f05>] __lock_acquire+0x595/0x1af0
[ 567.771803] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.773128] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.774457] [<ffffffff81153fce>] test_clear_page_writeback+0x11e/0x230
[ 567.775923] [<ffffffff81146b30>] end_page_writeback+0x20/0x50
[ 567.777303] [<ffffffff811e10c3>] end_buffer_async_write+0x1a3/0x2b0
[ 567.778745] [<ffffffff811df04c>] end_bio_bh_io_sync+0x2c/0x60
[ 567.780125] [<ffffffff811e3a8d>] bio_endio+0x1d/0x30
[ 567.781428] [<ffffffff812d6172>] blk_update_request+0xc2/0x5f0
[ 567.782820] [<ffffffff812d66bc>] blk_update_bidi_request+0x1c/0x80
[ 567.784245] [<ffffffff812d673f>] blk_end_bidi_request+0x1f/0x60
[ 567.785634] [<ffffffff812d6790>] blk_end_request+0x10/0x20
[ 567.786971] [<ffffffff814978c3>] scsi_io_completion+0xf3/0x6e0
[ 567.788349] [<ffffffff8148d8f0>] scsi_finish_command+0xb0/0x110
[ 567.789740] [<ffffffff814976cf>] scsi_softirq_done+0x12f/0x160
[ 567.791140] [<ffffffff812de0b8>] blk_done_softirq+0x88/0xa0
[ 567.792502] [<ffffffff8105413f>] __do_softirq+0xff/0x440
[ 567.793845] [<ffffffff8105463d>] irq_exit+0xcd/0xe0
[ 567.795127] [<ffffffff816f1fb6>] do_IRQ+0x56/0xc0
[ 567.796375] [<ffffffff816e866f>] ret_from_intr+0x0/0x13
[ 567.797685] [<ffffffff813062f9>] __const_udelay+0x29/0x30
[ 567.799016] [<ffffffff81075fd4>] __rcu_read_unlock+0x54/0xa0
[ 567.800381] [<ffffffff811c35dd>] __d_lookup+0x14d/0x320
[ 567.801692] [<ffffffff811b598a>] lookup_fast+0x16a/0x2d0
[ 567.803013] [<ffffffff811b669b>] link_path_walk+0x1ab/0x900
[ 567.804361] [<ffffffff811b9ed4>] path_openat+0x94/0x530
[ 567.805668] [<ffffffff811ba9f8>] do_filp_open+0x38/0x80
[ 567.806974] [<ffffffff811a8e39>] do_sys_open+0xe9/0x1c0
[ 567.808275] [<ffffffff811a8f2e>] SyS_open+0x1e/0x20
[ 567.809534] [<ffffffff816f0794>] tracesys+0xdd/0xe2
[ 567.810809] INITIAL USE at:
[ 567.811668] [<ffffffff810b7c55>] __lock_acquire+0x2e5/0x1af0
[ 567.813038] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.814364] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.815688] [<ffffffff81151975>] test_set_page_writeback+0x155/0x200
[ 567.817128] [<ffffffff811e18c0>] __block_write_full_page+0x140/0x3a0
[ 567.818565] [<ffffffff811e1d29>] block_write_full_page_endio+0xf9/0x120
[ 567.820035] [<ffffffff811e1d65>] block_write_full_page+0x15/0x20
[ 567.821436] [<ffffffff811e6d58>] blkdev_writepage+0x18/0x20
[ 567.822782] [<ffffffff811515b6>] __writepage+0x16/0x50
[ 567.824081] [<ffffffff811520fb>] write_cache_pages+0x27b/0x630
[ 567.825460] [<ffffffff811524f3>] generic_writepages+0x43/0x60
[ 567.826819] [<ffffffff81153e51>] do_writepages+0x21/0x50
[ 567.828128] [<ffffffff81148739>] __filemap_fdatawrite_range+0x59/0x60
[ 567.829562] [<ffffffff8114883d>] filemap_write_and_wait_range+0x2d/0x70
[ 567.831010] [<ffffffff811e69ab>] blkdev_fsync+0x1b/0x50
[ 567.832292] [<ffffffff811dd346>] do_fsync+0x56/0x80
[ 567.833514] [<ffffffff811dd600>] SyS_fsync+0x10/0x20
[ 567.834741] [<ffffffff816f0794>] tracesys+0xdd/0xe2
[ 567.835955] }
[ 567.836583] ... key at: [<ffffffff82a38c89>] __key.27437+0x0/0x8
[ 567.837800] ... acquired at:
[ 567.838584] [<ffffffff810b6f3a>] check_irq_usage+0x4a/0xc0
[ 567.839712] [<ffffffff810b8783>] __lock_acquire+0xe13/0x1af0
[ 567.840877] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.841991] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.843107] [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
[ 567.844245] [<ffffffff811dd209>] sync_inodes_one_sb+0x19/0x20
[ 567.845399] [<ffffffff811ada42>] iterate_supers+0xb2/0x110
[ 567.846526] [<ffffffff811dd475>] sys_sync+0x35/0x90
[ 567.847580] [<ffffffff816f0794>] tracesys+0xdd/0xe2

[ 567.849239]
the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[ 567.851102] -> (&sb->s_type->i_lock_key#3){+.+...} ops: 1230 {
[ 567.852314] HARDIRQ-ON-W at:
[ 567.853160] [<ffffffff810b7f2c>] __lock_acquire+0x5bc/0x1af0
[ 567.854498] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.855796] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.857105] [<ffffffff811c6b88>] new_inode_pseudo+0x28/0x60
[ 567.858437] [<ffffffff811c6bd9>] new_inode+0x19/0x30
[ 567.859693] [<ffffffff811d27c1>] mount_pseudo+0xb1/0x180
[ 567.861008] [<ffffffff811e6a04>] bd_mount+0x24/0x30
[ 567.862264] [<ffffffff811ae089>] mount_fs+0x39/0x1b0
[ 567.863529] [<ffffffff811cb623>] vfs_kern_mount+0x63/0xf0
[ 567.864846] [<ffffffff811cb6c9>] kern_mount_data+0x19/0x30
[ 567.866174] [<ffffffff81effaab>] bdev_cache_init+0x56/0x80
[ 567.867502] [<ffffffff81efe1cf>] vfs_caches_init+0x9e/0x115
[ 567.868830] [<ffffffff81ed8dd0>] start_kernel+0x3a2/0x3fe
[ 567.870125] [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
[ 567.871554] [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
[ 567.872906] SOFTIRQ-ON-W at:
[ 567.873731] [<ffffffff810b7f59>] __lock_acquire+0x5e9/0x1af0
[ 567.875068] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.876362] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.877669] [<ffffffff811c6b88>] new_inode_pseudo+0x28/0x60
[ 567.878996] [<ffffffff811c6bd9>] new_inode+0x19/0x30
[ 567.880251] [<ffffffff811d27c1>] mount_pseudo+0xb1/0x180
[ 567.881560] [<ffffffff811e6a04>] bd_mount+0x24/0x30
[ 567.882802] [<ffffffff811ae089>] mount_fs+0x39/0x1b0
[ 567.884053] [<ffffffff811cb623>] vfs_kern_mount+0x63/0xf0
[ 567.885353] [<ffffffff811cb6c9>] kern_mount_data+0x19/0x30
[ 567.886659] [<ffffffff81effaab>] bdev_cache_init+0x56/0x80
[ 567.887964] [<ffffffff81efe1cf>] vfs_caches_init+0x9e/0x115
[ 567.889276] [<ffffffff81ed8dd0>] start_kernel+0x3a2/0x3fe
[ 567.890571] [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
[ 567.891995] [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
[ 567.893350] INITIAL USE at:
[ 567.894182] [<ffffffff810b7c55>] __lock_acquire+0x2e5/0x1af0
[ 567.895503] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.896782] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.898068] [<ffffffff811c6b88>] new_inode_pseudo+0x28/0x60
[ 567.899375] [<ffffffff811c6bd9>] new_inode+0x19/0x30
[ 567.900622] [<ffffffff811d27c1>] mount_pseudo+0xb1/0x180
[ 567.901906] [<ffffffff811e6a04>] bd_mount+0x24/0x30
[ 567.903140] [<ffffffff811ae089>] mount_fs+0x39/0x1b0
[ 567.904379] [<ffffffff811cb623>] vfs_kern_mount+0x63/0xf0
[ 567.905663] [<ffffffff811cb6c9>] kern_mount_data+0x19/0x30
[ 567.906957] [<ffffffff81effaab>] bdev_cache_init+0x56/0x80
[ 567.908248] [<ffffffff81efe1cf>] vfs_caches_init+0x9e/0x115
[ 567.909555] [<ffffffff81ed8dd0>] start_kernel+0x3a2/0x3fe
[ 567.910850] [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
[ 567.912247] [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
[ 567.913576] }
[ 567.914201] ... key at: [<ffffffff81c6fd68>] bd_type+0x68/0x80
[ 567.915389] ... acquired at:
[ 567.916163] [<ffffffff810b6f3a>] check_irq_usage+0x4a/0xc0
[ 567.917282] [<ffffffff810b8783>] __lock_acquire+0xe13/0x1af0
[ 567.918426] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.919527] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.920639] [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
[ 567.921793] [<ffffffff811dd209>] sync_inodes_one_sb+0x19/0x20
[ 567.922959] [<ffffffff811ada42>] iterate_supers+0xb2/0x110
[ 567.924087] [<ffffffff811dd475>] sys_sync+0x35/0x90
[ 567.925140] [<ffffffff816f0794>] tracesys+0xdd/0xe2

[ 567.926805]
stack backtrace:
[ 567.928164] CPU: 2 PID: 8665 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #9
[ 567.931829] ffffffff824d5e70 ffff88022fe7db30 ffffffff816df0ad ffff88022fe7dc30
[ 567.933224] ffffffff810b6ee5 0000000000000000 ffffffff00000000 0000000700000001
[ 567.934620] ffff88022fe7db80 ffff88022fe7dbc0 ffffffff81a129af ffff88022fe7db80
[ 567.936023] Call Trace:
[ 567.936804] [<ffffffff816df0ad>] dump_stack+0x19/0x1b
[ 567.937906] [<ffffffff810b6ee5>] check_usage+0x4d5/0x4e0
[ 567.939042] [<ffffffff81091cff>] ? local_clock+0x3f/0x50
[ 567.940176] [<ffffffff810b6f3a>] check_irq_usage+0x4a/0xc0
[ 567.941332] [<ffffffff810b8783>] __lock_acquire+0xe13/0x1af0
[ 567.942507] [<ffffffff81091cff>] ? local_clock+0x3f/0x50
[ 567.943643] [<ffffffff816e4cdd>] ? wait_for_completion+0x4d/0x110
[ 567.944875] [<ffffffff810b741b>] ? mark_held_locks+0xbb/0x140
[ 567.946065] [<ffffffff816e7b0c>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 567.947296] [<ffffffff810b9c11>] lock_acquire+0x91/0x1f0
[ 567.948430] [<ffffffff811d74e5>] ? sync_inodes_sb+0x225/0x3b0
[ 567.949612] [<ffffffff816e7660>] _raw_spin_lock+0x40/0x80
[ 567.950752] [<ffffffff811d74e5>] ? sync_inodes_sb+0x225/0x3b0
[ 567.951932] [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
[ 567.953091] [<ffffffff816e4d6f>] ? wait_for_completion+0xdf/0x110
[ 567.954320] [<ffffffff811dd1f0>] ? generic_write_sync+0x70/0x70
[ 567.955526] [<ffffffff811dd209>] sync_inodes_one_sb+0x19/0x20
[ 567.956714] [<ffffffff811ada42>] iterate_supers+0xb2/0x110
[ 567.957873] [<ffffffff811dd475>] sys_sync+0x35/0x90
[ 567.958960] [<ffffffff816f0794>] tracesys+0xdd/0xe2

2013-06-28 01:13:09

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
>
>
> > > Yup, that's about three of orders of magnitude faster on this
> > > workload....
> > >
> > > Lightly smoke tested patch below - it passed the first round of
> > > XFS data integrity tests in xfstests, so it's not completely
> > > busted...
> >
> > And now with even less smoke out that the first version. This one
> > gets though a full xfstests run...
>
> :sadface:
>
> [ 567.680836] ======================================================
> [ 567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> [ 567.682389] 3.10.0-rc7+ #9 Not tainted
> [ 567.682862] ------------------------------------------------------
> [ 567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 567.684464] (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> [ 567.685632]
> and this task is already holding:
> [ 567.686334] (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> [ 567.687506] which would create a new lock dependency:
> [ 567.688115] (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}

.....

> other info that might help us debug this:
>
> [ 567.750396] Possible interrupt unsafe locking scenario:
>
> [ 567.752062] CPU0 CPU1
> [ 567.753025] ---- ----
> [ 567.753981] lock(&sb->s_type->i_lock_key#3);
> [ 567.754969] local_irq_disable();
> [ 567.756085] lock(&(&wb->wb_list_lock)->rlock);
> [ 567.757368] lock(&sb->s_type->i_lock_key#3);
> [ 567.758642] <Interrupt>
> [ 567.759370] lock(&(&wb->wb_list_lock)->rlock);

Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
loop to use a spin_trylock(&inode->i_lock), moving the inode to
the end of the sync list, dropping all locks and starting again...
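
In loop form, that's roughly the shape of it - the full patch follows
further down the thread, and the list/lock names below are the ones it
introduces:

	spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list,
						struct inode, i_io_list);

		if (!spin_trylock(&inode->i_lock)) {
			/*
			 * Requeue at the tail and drop everything so that
			 * whoever holds i_lock - or an interrupt waiting
			 * on wb_list_lock - can make progress.
			 */
			list_move(&inode->i_io_list, &bdi->wb.b_wb);
			spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
			cpu_relax();
			spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
			continue;
		}
		/* ... the rest of the wait_sb_inodes() body as before ... */
		spin_unlock(&inode->i_lock);
	}
	spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);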

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-28 01:18:56

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 10:30:55AM -0400, Dave Jones wrote:
> On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote:
>
> > Is this just a soft lockup warning? Or is the system hung?
>
> I've only seen it completely lock up the box 2-3 times out of dozens
> of times I've seen this, and tbh that could have been a different bug.
>
> > I mean, what you see here is probably sync_inodes_sb() having called
> > wait_sb_inodes() and is spinning on the inode_sb_list_lock.
> >
> > There's nothing stopping multiple sys_sync() calls from executing on
> > the same superblock simulatenously, and if there's lots of cached
> > inodes on a single filesystem and nothing much to write back then
> > concurrent sync() calls will enter wait_sb_inodes() concurrently and
> > contend on the inode_sb_list_lock.
> >
> > Get enough sync() calls running at the same time, and you'll see
> > this. e.g. I just ran a parallel find/stat workload over a
> > filesystem with 50 million inodes in it, and once that had reached a
> > steady state of about 2 million cached inodes in RAM:
>
> It's not even just sync calls it seems. Here's the latest victim from
> last nights overnight run, failing in hugetlb mmap.
> Same lock, but we got there by different way. (I suppose it could be
> that the other CPUs were running sync() at the time of this mmap call)

Right, that will be what is happening - the entire system will go
unresponsive when a sync call happens, so it's entirely possible
to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
trying to get the lock because of the way ticket spinlocks work...
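
For background, a ticket spinlock is just a FIFO ticket dispenser, so one
long hold (the sync walk) stalls every later arrival for the full length
of the queue. A toy userspace sketch of the mechanism - illustrative only,
not the kernel's implementation:

#include <stdatomic.h>
#include <sched.h>

/*
 * Toy ticket lock: acquirers are served strictly in arrival order.
 * Initialise both counters to zero before use.
 */
struct ticket_lock {
	atomic_uint next_ticket;	/* next ticket to hand out */
	atomic_uint now_serving;	/* ticket allowed to hold the lock */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next_ticket, 1);

	while (atomic_load(&l->now_serving) != me)
		sched_yield();	/* the real lock just spins/pauses here */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->now_serving, 1);
}

The fairness is normally a feature; it only hurts like this when one
holder keeps the lock for a very long time.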

> > I didn't realise that just calling sync caused this lock contention
> > problem until I read this thread, so fixing this just went up
> > several levels of priority given the affect an unprivileged user can
> > have on the system just by running lots of concurrent sync calls.
> >
> > > I'll work on trying to narrow down what trinity is doing. That might at least
> > > make it easier to reproduce it in a shorter timeframe.
> >
> > This is only occurring on your new machines, right? They have more
> > memory than your old machines, and faster drives? So the caches are
> > larger and the IO completion faster? Those combinations will put
> > more pressure on wait_sb_inodes() from concurrent sync operations...
>
> Sounds feasible. Maybe I should add something to trinity to create more
> dirty pages, perhaps that would have triggered this faster.

Creating more cached -clean, empty- inodes will make it happen
faster. The trigger for long lock holds is clean inodes that have no
cached pages on them (i.e. they hit the mapping->nrpages == 0 shortcut)...
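
A rough userspace sketch of that kind of load (tree path and child count
are arbitrary): keep the inode cache full of clean inodes with a stat walk
while a handful of children hammer sync():

#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>

/* nftw() already stat()s every entry, which is all we need to keep the
 * inode cache populated with clean inodes. */
static int touch_inode(const char *path, const struct stat *sb,
		       int type, struct FTW *ftw)
{
	(void)path; (void)sb; (void)type; (void)ftw;
	return 0;
}

int main(int argc, char **argv)
{
	const char *root = argc > 1 ? argv[1] : "/usr";	/* any big tree */
	int i;

	/* a few children calling sync() in a tight loop... */
	for (i = 0; i < 4; i++) {
		if (fork() == 0)
			for (;;)
				sync();
	}

	/* ...while the parent keeps walking the tree */
	for (;;)
		nftw(root, touch_inode, 64, FTW_PHYS);
}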

> 8gb ram, 80MB/s SSD's, nothing exciting there (compared to my other machines)
> so I think it's purely down to the CPUs being faster, or some other architectural
> improvement with Haswell that increases parallelism.

Possibly - I'm reproducing it here with 8GB RAM, and the disk speed
doesn't really matter as I'm seeing it with a workload that doesn't
dirty any data or inodes at all...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-28 02:54:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
>
> Right, that will be what is happening - the entire system will go
> unresponsive when a sync call happens, so it's entirely possible
> to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> trying to get the lock because of the way ticket spinlocks work...

So what made it all start happening now? I don't recall us having had
these kinds of issues before..

Linus

2013-06-28 03:54:46

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> >
> > Right, that will be what is happening - the entire system will go
> > unresponsive when a sync call happens, so it's entirely possible
> > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> > trying to get the lock because of the way ticket spinlocks work...
>
> So what made it all start happening now? I don't recall us having had
> these kinds of issues before..

Not sure - it's a sudden surprise for me, too. Then again, I haven't
been looking at sync from a performance or lock contention point of
view any time recently. The algorithm that wait_sb_inodes() uses is
effectively unchanged since at least 2009, so it's probably a case
of it having been protected from contention by some external factor
we've fixed/removed recently. Perhaps the bdi-flusher thread
replacement in -rc1 has changed the timing sufficiently that it no
longer serialises concurrent sync calls as much....

However, the inode_sb_list_lock is known to be a badly contended
lock from a create/unlink fastpath for XFS, so it's not like this sort
of thing is completely unexpected. It sits behind only the dentry
cache LRU lock on my most contended VFS lock list, so it's been on
my radar for a while. With the work to remove the global dentry LRU
lock currently in -mm, this was always going to be the next lock I
looked at....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-28 03:59:00

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> >
> >
> > > > Yup, that's about three of orders of magnitude faster on this
> > > > workload....
> > > >
> > > > Lightly smoke tested patch below - it passed the first round of
> > > > XFS data integrity tests in xfstests, so it's not completely
> > > > busted...
> > >
> > > And now with even less smoke out that the first version. This one
> > > gets though a full xfstests run...
> >
> > :sadface:
> >
> > [ 567.680836] ======================================================
> > [ 567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > [ 567.682389] 3.10.0-rc7+ #9 Not tainted
> > [ 567.682862] ------------------------------------------------------
> > [ 567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > [ 567.684464] (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > [ 567.685632]
> > and this task is already holding:
> > [ 567.686334] (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > [ 567.687506] which would create a new lock dependency:
> > [ 567.688115] (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
>
> .....
>
> > other info that might help us debug this:
> >
> > [ 567.750396] Possible interrupt unsafe locking scenario:
> >
> > [ 567.752062] CPU0 CPU1
> > [ 567.753025] ---- ----
> > [ 567.753981] lock(&sb->s_type->i_lock_key#3);
> > [ 567.754969] local_irq_disable();
> > [ 567.756085] lock(&(&wb->wb_list_lock)->rlock);
> > [ 567.757368] lock(&sb->s_type->i_lock_key#3);
> > [ 567.758642] <Interrupt>
> > [ 567.759370] lock(&(&wb->wb_list_lock)->rlock);
>
> Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> loop to use a spin_trylock(&inode->i_lock), moving the inode to
> the end of the sync list, dropping all locks and starting again...

New version below, went through xfstests with lockdep enabled this
time....

Cheers,

Dave.
--
Dave Chinner
[email protected]

writeback: store inodes under writeback on a separate list

From: Dave Chinner <[email protected]>

When there are lots of cached inodes, a sync(2) operation walks all
of them to try to find which ones are under writeback and wait for
IO completion on them. Run enough load, and this caused catastrophic
lock contention on the inode_sb_list_lock.

Try to fix this problem by tracking inodes under data IO and wait
specifically for only those inodes that haven't completed their data
IO in wait_sb_inodes().

This is a bit hacky and messy, but demonstrates one method of
solving this problem....

XXX: This will catch data IO - do we need to catch actual inode
writeback (i.e. the metadata) here? I'm pretty sure we don't need to
because the existing code just calls filemap_fdatawait() and that
doesn't wait for the inode metadata writeback to occur....

[v3 - avoid deadlock due to interrupt while holding inode->i_lock]

[v2 - needs spin_lock_irq variants in wait_sb_inodes.
- move freeing inodes back to primary list, we don't wait for
them
- take mapping lock in wait_sb_inodes when requeuing.]

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 70 +++++++++++++++++++++++++++++++++++++------
fs/inode.c | 1 +
include/linux/backing-dev.h | 3 ++
include/linux/fs.h | 3 +-
mm/backing-dev.c | 2 ++
mm/page-writeback.c | 22 ++++++++++++++
6 files changed, 91 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..589c40b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);

static void wait_sb_inodes(struct super_block *sb)
{
- struct inode *inode, *old_inode = NULL;
+ struct backing_dev_info *bdi = sb->s_bdi;
+ struct inode *old_inode = NULL;
+ unsigned long flags;
+ LIST_HEAD(sync_list);

/*
* We need to be protected against the filesystem going from
@@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- spin_lock(&inode_sb_list_lock);

/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
* call, but which had writeout started before we write it out.
* In which case, the inode may not be on the dirty list, but
* we still have to wait for that writeout.
+ *
+ * To avoid syncing inodes put under IO after we have started here,
+ * splice the io list to a temporary list head and walk that. Newly
+ * dirtied inodes will go onto the primary list so we won't wait for
+ * them.
+ *
+ * Inodes that have pages under writeback after we've finished the wait
+ * may or may not be on the primary list. They had pages put under IO
+ * after we started our wait, so we need to make sure the next sync or IO
+ * completion treats them correctly. Move them back to the primary list
+ * and restart the walk.
+ *
+ * Inodes that are clean after we have waited for them don't belong
+ * on any list, and the cleaning of them should have removed them from
+ * the temporary list. Check this is true, and restart the walk.
*/
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
+ list_splice_init(&bdi->wb.b_wb, &sync_list);
+
+ while (!list_empty(&sync_list)) {
+ struct inode *inode = list_first_entry(&sync_list, struct inode,
+ i_io_list);
struct address_space *mapping = inode->i_mapping;

- spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
- (mapping->nrpages == 0)) {
+ /*
+ * we are nesting the inode->i_lock inside a IRQ disabled
+ * section here, so there's the possibility that we could have
+ * a lock inversion due to an interrupt while holding the
+ * inode->i_lock elsewhere. This is the only place we take the
+ * inode->i_lock inside the wb_list_lock, so we need to use a
+ * trylock to avoid a deadlock. If we fail to get the lock,
+ * the only way to make progress is to also drop the
+ * wb_list_lock so the interrupt trying to get it can make
+ * progress.
+ */
+ if (!spin_trylock(&inode->i_lock)) {
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
+ spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
+ cpu_relax();
+ spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
+ continue;
+ }
+
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
spin_unlock(&inode->i_lock);
continue;
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);

/*
* We hold a reference to 'inode' so it couldn't have been
@@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)

cond_resched();

- spin_lock(&inode_sb_list_lock);
+ /*
+ * the inode has been written back now, so check whether we
+ * still have pages under IO and move it back to the primary
+ * list if necessary
+ */
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock(&bdi->wb.wb_list_lock);
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ WARN_ON(list_empty(&inode->i_io_list));
+ list_move(&inode->i_io_list, &bdi->wb.b_wb);
+ }
+ spin_unlock(&mapping->tree_lock);
}
- spin_unlock(&inode_sb_list_lock);
+ spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
iput(old_inode);
}

diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..f25c1ca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_wb_list);
+ INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_lru);
address_space_init_once(&inode->i_data);
i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..4a6283c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,9 @@ struct bdi_writeback {
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
spinlock_t list_lock; /* protects the b_* lists */
+
+ spinlock_t wb_list_lock; /* writeback list lock */
+ struct list_head b_wb; /* under writeback, for wait_sb_inodes */
};

struct backing_dev_info {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63cac31..7861017 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -573,7 +573,8 @@ struct inode {
unsigned long dirtied_when; /* jiffies of first dirtying */

struct hlist_node i_hash;
- struct list_head i_wb_list; /* backing dev IO list */
+ struct list_head i_wb_list; /* backing dev WB list */
+ struct list_head i_io_list; /* backing dev IO list */
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..896b8f5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ INIT_LIST_HEAD(&wb->b_wb);
spin_lock_init(&wb->list_lock);
+ spin_lock_init(&wb->wb_list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..4c411fe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
__dec_bdi_stat(bdi, BDI_WRITEBACK);
__bdi_writeout_inc(bdi);
}
+ if (mapping->host &&
+ !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_del_init(&inode->i_io_list);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
@@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
+ bool on_wblist;
+
+ /* __swap_writepage comes through here */
+ on_wblist = mapping_tagged(mapping,
+ PAGECACHE_TAG_WRITEBACK);
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ if (!on_wblist && mapping->host) {
+ struct inode *inode = mapping->host;
+
+ WARN_ON(!list_empty(&inode->i_io_list));
+ spin_lock(&bdi->wb.wb_list_lock);
+ list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
+ spin_unlock(&bdi->wb.wb_list_lock);
+ }
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,

2013-06-28 05:59:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
>>
>> So what made it all start happening now? I don't recall us having had
>> these kinds of issues before..
>
> Not sure - it's a sudden surprise for me, too. Then again, I haven't
> been looking at sync from a performance or lock contention point of
> view any time recently. The algorithm that wait_sb_inodes() is
> effectively unchanged since at least 2009, so it's probably a case
> of it having been protected from contention by some external factor
> we've fixed/removed recently. Perhaps the bdi-flusher thread
> replacement in -rc1 has changed the timing sufficiently that it no
> longer serialises concurrent sync calls as much....
>
> However, the inode_sb_list_lock is known to be a badly contended
> lock from a create/unlink fastpath for XFS, so it's not like this sort
> of thing is completely unexpected.

That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb
one? No, that won't fix all problems, but it might at least help a
*bit*.

Also, looking some more now at that wait_sb_inodes logic, I have to
say that if the problem is primarily the inode->i_lock, then that's
just crazy. We normally shouldn't even *need* that lock, since we
could do a totally unlocked iget() as long as the count is non-zero.

And no, I don't think we really need the i_lock for checking
"mapping->nrpages == 0" or the magical "inode is being freed" bits
either. Or at least we could easily do some of this optimistically for
the common cases.

So instead of doing

struct address_space *mapping = inode->i_mapping;

spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
(mapping->nrpages == 0)) {
spin_unlock(&inode->i_lock);
continue;
}
__iget(inode);
spin_unlock(&inode->i_lock);

I really think we could do that without getting the inode lock at
*all* in the common case.
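
(The attached patch.diff isn't inlined in this archive, but Jan Kara's
reply below quotes the hunk. The idea is roughly a helper like this -
the function name here is made up:)

static bool grab_inode_for_sync(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	/*
	 * Unlocked peeks are fine for a data integrity sync: we only care
	 * about pages that were dirtied before the sync started, so racing
	 * with new pages or a still-I_NEW inode doesn't matter.
	 */
	if (!ACCESS_ONCE(mapping->nrpages))
		return false;
	if (ACCESS_ONCE(inode->i_state) & I_NEW)
		return false;

	/*
	 * If i_count is already non-zero, I_FREEING/I_WILL_FREE cannot be
	 * set, so the common case needs no inode->i_lock at all.
	 */
	if (atomic_inc_not_zero(&inode->i_count))
		return true;

	/* Rare slow path: fall back to the old locked checks. */
	spin_lock(&inode->i_lock);
	if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
	    mapping->nrpages == 0) {
		spin_unlock(&inode->i_lock);
		return false;
	}
	__iget(inode);
	spin_unlock(&inode->i_lock);
	return true;
}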

I'm attaching a pretty trivial patch, which may obviously be trivially
totally flawed. I have not tested this in any way, but half the new
lines are comments about why it's doing what it is doing. And I
really think that it should make the "actually take the inode lock" be
something quite rare.

And quite frankly, I'd much rather get *rid* of crazy i_lock accesses,
than try to be clever and use a whole different list at this point.
Not that I disagree that it wouldn't be much nicer to use a separate
list in the long run, but for a short-term solution I'd much rather
keep the old logic and just tweak it to be much more usable..

Hmm? Al? Jan? Comments?

Linus


Attachments:
patch.diff (2.30 kB)

2013-06-28 07:21:46

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 07:59:50PM -1000, Linus Torvalds wrote:
> On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> >>
> >> So what made it all start happening now? I don't recall us having had
> >> these kinds of issues before..
> >
> > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > been looking at sync from a performance or lock contention point of
> > view any time recently. The algorithm that wait_sb_inodes() is
> > effectively unchanged since at least 2009, so it's probably a case
> > of it having been protected from contention by some external factor
> > we've fixed/removed recently. Perhaps the bdi-flusher thread
> > replacement in -rc1 has changed the timing sufficiently that it no
> > longer serialises concurrent sync calls as much....
> >
> > However, the inode_sb_list_lock is known to be a badly contended
> > lock from a create/unlink fastpath for XFS, so it's not like this sort
> > of thing is completely unexpected.
>
> That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb
> one? No, that won't fix all problems, but it might at least help a
> *bit*.

Historic. That's how we initially split up the old global inode_lock
in 2.6.38 in preparation for the RCU dentry walk code. It was never
intended as a long term solution.....

Besides, making the inode_sb_list_lock per sb won't help solve this
problem, anyway. The case that I'm testing involves a filesystem
that contains 99.97% of all inodes cached by the system. This is a
pretty common situation....

> Also, looking some more now at that wait_sb_inodes logic, I have to
> say that if the problem is primarily the inode->i_lock, then that's
> just crazy. We normally shouldn't even *need* that lock, since we
> could do a totally unlocked iget() as long as the count is non-zero.

The problem is not the inode->i_lock. lockstat is pretty clear on
that...

> And no, I don't think really need the i_lock for checking
> "mapping->nrpages == 0" or the magical "inode is being freed" bits
> either. Or at least we could easily do some of this optimistically for
> the common cases.

Right, we could check some of it optimistically, but we'd still be
walking millions of inodes under the inode_sb_list_lock on each
sync() call just to find the one inode that is dirty. It's like
polishing a turd - no matter how shiny you make it, it's still just
a pile of shit.

> I'm attaching a pretty trivial patch, which may obviously be trivially
> totally flawed. I have not tested this in any way, but half the new
> lines are comments about why it's doing what it is doing. And I
> really think that it should make the "actually take the inode lock" be
> something quite rare.

It looks ok, but I still think it is solving the wrong problem.
FWIW, your optimisation has much wider application than just this
one place. I'll have a look to see how we can apply this approach
across all the inode lookup+validate code we currently have that
unconditionally takes the inode->i_lock....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-28 08:22:32

by Al Viro

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 07:59:50PM -1000, Linus Torvalds wrote:

> Also, looking some more now at that wait_sb_inodes logic, I have to
> say that if the problem is primarily the inode->i_lock, then that's
> just crazy.


Looks more like contention on inode_sb_list_lock, actually...

> And no, I don't think really need the i_lock for checking
> "mapping->nrpages == 0" or the magical "inode is being freed" bits
> either. Or at least we could easily do some of this optimistically for
> the common cases.

> I'm attaching a pretty trivial patch, which may obviously be trivially
> totally flawed. I have not tested this in any way, but half the new
> lines are comments about why it's doing what it is doing. And I
> really think that it should make the "actually take the inode lock" be
> something quite rare.
>
> And quite frankly, I'd much rather get *rid* of crazy i_lock accesses,
> than try to be clever and use a whole different list at this point.
> Not that I disagree that it wouldn't be much nicer to use a separate
> list in the long run, but for a short-term solution I'd much rather
> keep the old logic and just tweak it to be much more usable..
>
> Hmm? Al? Jan? Comments?

Patch seems to be sane, but I'm not sure how much it will buy in that
case.

2013-06-28 08:22:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 9:21 PM, Dave Chinner <[email protected]> wrote:
>
> Besides, making the inode_sb_list_lock per sb won't help solve this
> problem, anyway. The case that I'm testing involves a filesystem
> that contains 99.97% of all inodes cached by the system. This is a
> pretty common situation....

Yeah..

> The problem is not the inode->i_lock. lockstat is pretty clear on
> that...

So the problem is that we're at -rc7, and apparently this has
magically gotten much worse. I'd *really* prefer to polish some turds
here over being fancy.

> Right, we could check some of it optimisitcally, but we'd still be
> walking millions of inodes under the inode_sb_list_lock on each
> sync() call just to find the one inode that is dirty. It's like
> polishing a turd - no matter how shiny you make it, it's still just
> a pile of shit.

Agreed. But it's not a _new_ pile of shit, and so I'm looking for
something less scary than a whole new list with totally new locking.
If we could make the cost of walking the (many) inodes sufficiently
lower so that we can paper over things for now, that would be lovely.

And with the inode i_lock we might well get into some kind of lockstep
worst-case behavior wrt the sb_lock too. I was hoping that making the
inner loop more optimized would possibly improve the contention case -
or at least push it out a bit (which is presumably what the situation
*used* to be).

> It looks ok, but I still think it is solving the wrong problem.
> FWIW, your optimisation has much wider application that just this
> one place. I'll have a look to see how we can apply this approach
> across all the inode lookup+validate code we currently have that
> unconditionally takes the inode->i_lock....

Yes, I was looking at all the other cases that also seemed to be
testing i_state for those "about to go away" cases.

Linus

2013-06-28 08:32:31

by Al Viro

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu, Jun 27, 2013 at 10:22:45PM -1000, Linus Torvalds wrote:

> > It looks ok, but I still think it is solving the wrong problem.
> > FWIW, your optimisation has much wider application that just this
> > one place. I'll have a look to see how we can apply this approach
> > across all the inode lookup+validate code we currently have that
> > unconditionally takes the inode->i_lock....
>
> Yes, I was looking at all the other cases that also seemed to be
> testing i_state for those "about to go away" cases.

FWIW, there's a subtle issue here - something like ext2_new_inode()
starts with allocating an inode and putting it on the list (no I_NEW
yet), then decides what inumber it will have and calls insert_inode_locked(),
which sets I_NEW. Then we proceed with initializing the inode (and
eventually do unlock_new_inode(), which removes I_NEW). We depend on having
no pages in the pagecache of that sucker prior to the insert_inode_locked() call;
you really don't want to start playing with writeback on this half-initialized
in-core inode.
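
In sketch form (most of the real work elided - the point is where I_NEW
appears relative to the inode showing up on the sb list):

struct inode *example_new_inode(struct super_block *sb, unsigned long ino)
{
	struct inode *inode;

	inode = new_inode(sb);		/* on sb->s_inodes, no I_NEW yet */
	if (!inode)
		return ERR_PTR(-ENOMEM);

	/* ...pick the inumber from the allocation bitmaps... */
	inode->i_ino = ino;

	if (insert_inode_locked(inode) < 0) {	/* this is what sets I_NEW */
		iput(inode);
		return ERR_PTR(-EIO);
	}

	/* ...initialise the rest of the in-core inode... */

	unlock_new_inode(inode);	/* clears I_NEW, wakes any waiters */
	return inode;
}

So between new_inode() and insert_inode_locked() the inode is visible on
the sb list without I_NEW set, which is why it must not have any pagecache
(and must not be picked up for writeback) in that window.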

2013-06-28 09:49:22

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Thu 27-06-13 19:59:50, Linus Torvalds wrote:
> On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> >>
> >> So what made it all start happening now? I don't recall us having had
> >> these kinds of issues before..
> >
> > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > been looking at sync from a performance or lock contention point of
> > view any time recently. The algorithm that wait_sb_inodes() is
> > effectively unchanged since at least 2009, so it's probably a case
> > of it having been protected from contention by some external factor
> > we've fixed/removed recently. Perhaps the bdi-flusher thread
> > replacement in -rc1 has changed the timing sufficiently that it no
> > longer serialises concurrent sync calls as much....
> >
> > However, the inode_sb_list_lock is known to be a badly contended
> > lock from a create/unlink fastpath for XFS, so it's not like this sort
> > of thing is completely unexpected.
>
> That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb
> one? No, that won't fix all problems, but it might at least help a
> *bit*.
>
> Also, looking some more now at that wait_sb_inodes logic, I have to
> say that if the problem is primarily the inode->i_lock, then that's
> just crazy. We normally shouldn't even *need* that lock, since we
> could do a totally unlocked iget() as long as the count is non-zero.
>
> And no, I don't think really need the i_lock for checking
> "mapping->nrpages == 0" or the magical "inode is being freed" bits
> either. Or at least we could easily do some of this optimistically for
> the common cases.
>
> So instead of doing
>
> struct address_space *mapping = inode->i_mapping;
>
> spin_lock(&inode->i_lock);
> if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> (mapping->nrpages == 0)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
> __iget(inode);
> spin_unlock(&inode->i_lock);
>
> I really think we could do that without getting the inode lock at
> *all* in the common case.
>
> I'm attaching a pretty trivial patch, which may obviously be trivially
> totally flawed. I have not tested this in any way, but half the new
> lines are comments about why it's doing what it is doing. And I
> really think that it should make the "actually take the inode lock" be
> something quite rare.
>
> And quite frankly, I'd much rather get *rid* of crazy i_lock accesses,
> than try to be clever and use a whole different list at this point.
> Not that I disagree that it wouldn't be much nicer to use a separate
> list in the long run, but for a short-term solution I'd much rather
> keep the old logic and just tweak it to be much more usable..
>
> Hmm? Al? Jan? Comments?
Yeah, the patch looks good to me so if it helps Dave with his softlockups
I also think it's a safer alternative than Dave's patch for 3.10. BTW, one
suggestion for improvement below:

fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..3dcc8b202a40 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1206,6 +1206,52 @@ out_unlock_inode:
}
EXPORT_SYMBOL(__mark_inode_dirty);

+/*
+ * Do we want to get the inode for writeback?
+ */
+static int get_inode_for_writeback(struct inode *inode)
+{
+ struct address_space *mapping = inode->i_mapping;
+
+ /*
+ * It's a data integrity sync, but we don't care about
+ * racing with new pages - we're about data integrity
+ * of things in the past, not the future
+ */
+ if (!ACCESS_ONCE(mapping->nrpages))
+ return 0;
I think we can change the above condition to:
if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
return 0;

That should make us skip most of the inodes in the case Dave Chinner was
testing.

Honza
+
+ /* Similar logic wrt the I_NEW bit */
+ if (ACCESS_ONCE(inode->i_state) & I_NEW)
+ return 0;
+
+ /*
+ * When the inode count goes down to zero, the
+ * I_WILL_FREE and I_FREEING bits might get set.
+ * But not before.
+ *
+ * So if we get this, we know those bits are
+ * clear, and the inode is still interesting.
+ */
+ if (atomic_inc_not_zero(&inode->i_count))
+ return 1;
+
+ /*
+ * Slow path never happens normally, since any
+ * active inode will be referenced by a dentry
+ * and thus caught above
+ */
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ (mapping->nrpages == 0)) {
+ spin_unlock(&inode->i_lock);
+ return 0;
+ }
+ __iget(inode);
+ spin_unlock(&inode->i_lock);
+ return 1;
+}
+
static void wait_sb_inodes(struct super_block *sb)
{
struct inode *inode, *old_inode = NULL;
@@ -1226,16 +1272,8 @@ static void wait_sb_inodes(struct super_block *sb)
* we still have to wait for that writeout.
*/
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- struct address_space *mapping = inode->i_mapping;
-
- spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
- (mapping->nrpages == 0)) {
- spin_unlock(&inode->i_lock);
+ if (!get_inode_for_writeback(inode))
continue;
- }
- __iget(inode);
- spin_unlock(&inode->i_lock);
spin_unlock(&inode_sb_list_lock);

/*
@@ -1249,7 +1287,7 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
old_inode = inode;

- filemap_fdatawait(mapping);
+ filemap_fdatawait(inode->i_mapping);

cond_resched();



--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-06-28 10:28:25

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> > >
> > >
> > > > > Yup, that's about three of orders of magnitude faster on this
> > > > > workload....
> > > > >
> > > > > Lightly smoke tested patch below - it passed the first round of
> > > > > XFS data integrity tests in xfstests, so it's not completely
> > > > > busted...
> > > >
> > > > And now with even less smoke out that the first version. This one
> > > > gets though a full xfstests run...
> > >
> > > :sadface:
> > >
> > > [ 567.680836] ======================================================
> > > [ 567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > > [ 567.682389] 3.10.0-rc7+ #9 Not tainted
> > > [ 567.682862] ------------------------------------------------------
> > > [ 567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > [ 567.684464] (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > > [ 567.685632]
> > > and this task is already holding:
> > > [ 567.686334] (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > > [ 567.687506] which would create a new lock dependency:
> > > [ 567.688115] (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
> >
> > .....
> >
> > > other info that might help us debug this:
> > >
> > > [ 567.750396] Possible interrupt unsafe locking scenario:
> > >
> > > [ 567.752062] CPU0 CPU1
> > > [ 567.753025] ---- ----
> > > [ 567.753981] lock(&sb->s_type->i_lock_key#3);
> > > [ 567.754969] local_irq_disable();
> > > [ 567.756085] lock(&(&wb->wb_list_lock)->rlock);
> > > [ 567.757368] lock(&sb->s_type->i_lock_key#3);
> > > [ 567.758642] <Interrupt>
> > > [ 567.759370] lock(&(&wb->wb_list_lock)->rlock);
> >
> > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> > loop to use a spin_trylock(&inode->i_lock), moving the inode to
> > the end of the sync list, dropping all locks and starting again...
>
> New version below, went through xfstests with lockdep enabled this
> time....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> writeback: store inodes under writeback on a separate list
>
> From: Dave Chinner <[email protected]>
>
> When there are lots of cached inodes, a sync(2) operation walks all
> of them to try to find which ones are under writeback and wait for
> IO completion on them. Run enough load, and this caused catastrophic
> lock contention on the inode_sb_list_lock.
>
> Try to fix this problem by tracking inodes under data IO and wait
> specifically for only those inodes that haven't completed their data
> IO in wait_sb_inodes().
>
> This is a bit hacky and messy, but demonstrates one method of
> solving this problem....
>
> XXX: This will catch data IO - do we need to catch actual inode
> writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> because the existing code just calls filemap_fdatawait() and that
> doesn't wait for the inode metadata writeback to occur....
>
> [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
>
> [v2 - needs spin_lock_irq variants in wait_sb_inodes.
> - move freeing inodes back to primary list, we don't wait for
> them
> - take mapping lock in wait_sb_inodes when requeuing.]
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/fs-writeback.c | 70 +++++++++++++++++++++++++++++++++++++------
> fs/inode.c | 1 +
> include/linux/backing-dev.h | 3 ++
> include/linux/fs.h | 3 +-
> mm/backing-dev.c | 2 ++
> mm/page-writeback.c | 22 ++++++++++++++
> 6 files changed, 91 insertions(+), 10 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 3be5718..589c40b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>
> static void wait_sb_inodes(struct super_block *sb)
> {
> - struct inode *inode, *old_inode = NULL;
> + struct backing_dev_info *bdi = sb->s_bdi;
> + struct inode *old_inode = NULL;
> + unsigned long flags;
> + LIST_HEAD(sync_list);
>
> /*
> * We need to be protected against the filesystem going from
> @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - spin_lock(&inode_sb_list_lock);
>
> /*
> * Data integrity sync. Must wait for all pages under writeback,
> @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
> * call, but which had writeout started before we write it out.
> * In which case, the inode may not be on the dirty list, but
> * we still have to wait for that writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them.
> + *
> + * Inodes that have pages under writeback after we've finished the wait
> > + * may or may not be on the primary list. They had pages put under IO
> > + * after we started our wait, so we need to make sure the next sync or IO
> > + * completion treats them correctly. Move them back to the primary list
> > + * and restart the walk.
> + *
> + * Inodes that are clean after we have waited for them don't belong
> + * on any list, and the cleaning of them should have removed them from
> + * the temporary list. Check this is true, and restart the walk.
> */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> + list_splice_init(&bdi->wb.b_wb, &sync_list);
> +
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_io_list);
> struct address_space *mapping = inode->i_mapping;
>
> - spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + /*
> + * we are nesting the inode->i_lock inside a IRQ disabled
> + * section here, so there's the possibility that we could have
> + * a lock inversion due to an interrupt while holding the
> + * inode->i_lock elsewhere. This is the only place we take the
> + * inode->i_lock inside the wb_list_lock, so we need to use a
> + * trylock to avoid a deadlock. If we fail to get the lock,
> + * the only way to make progress is to also drop the
> + * wb_list_lock so the interrupt trying to get it can make
> + * progress.
> + */
> + if (!spin_trylock(&inode->i_lock)) {
> + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> + spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> + cpu_relax();
> + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> + continue;
> + }
> +
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> spin_unlock(&inode->i_lock);
> continue;
> }
Ugh, the locking looks ugly. Plus the list handling is buggy because the
first wait_sb_inodes() invocation will move all inodes to its private
sync_list so if there's another wait_sb_inodes() invocation racing with it,
it won't wait properly for all the inodes it should.

Won't it be easier to remove inodes from the b_wb list (btw, I'd slightly
prefer the name b_writeback) lazily instead of from
test_clear_page_writeback()? I mean we would remove inodes from the b_wb list
only in wait_sb_inodes() or when inodes get reclaimed from memory. That way
we save some work in test_clear_page_writeback(), which is a fast path, and
defer it to sync, which isn't that performance critical. Also we would avoid
those ugly games with irq-safe spinlocks.
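
A hedged sketch of that lazy variant, reusing the list/field names from
the patch (the helper name is made up); the caller holds wb_list_lock,
which with this scheme no longer needs to be irq-safe because the IO
completion path never touches the list:

/*
 * Returns true if the inode still has pages under writeback and should
 * be waited on; otherwise unhooks it from the writeback list here, so
 * test_clear_page_writeback() stays untouched.
 */
static bool inode_still_under_writeback(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
		return true;

	/* Already clean - prune it lazily instead of in the IO fast path. */
	list_del_init(&inode->i_io_list);
	return false;
}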

> __iget(inode);
> spin_unlock(&inode->i_lock);
> - spin_unlock(&inode_sb_list_lock);
> + spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>
> /*
> * We hold a reference to 'inode' so it couldn't have been
> @@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)
>
> cond_resched();
>
> - spin_lock(&inode_sb_list_lock);
> + /*
> + * the inode has been written back now, so check whether we
> + * still have pages under IO and move it back to the primary
> + * list if necessary
> + */
> + spin_lock_irqsave(&mapping->tree_lock, flags);
> + spin_lock(&bdi->wb.wb_list_lock);
> + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + WARN_ON(list_empty(&inode->i_io_list));
> + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> + }
> + spin_unlock(&mapping->tree_lock);
> }
> - spin_unlock(&inode_sb_list_lock);
> + spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> iput(old_inode);
> }
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 00d5fc3..f25c1ca 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
> INIT_HLIST_NODE(&inode->i_hash);
> INIT_LIST_HEAD(&inode->i_devices);
> INIT_LIST_HEAD(&inode->i_wb_list);
> + INIT_LIST_HEAD(&inode->i_io_list);
> INIT_LIST_HEAD(&inode->i_lru);
> address_space_init_once(&inode->i_data);
> i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..4a6283c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -59,6 +59,9 @@ struct bdi_writeback {
> struct list_head b_io; /* parked for writeback */
> struct list_head b_more_io; /* parked for more writeback */
> spinlock_t list_lock; /* protects the b_* lists */
> +
> + spinlock_t wb_list_lock; /* writeback list lock */
> + struct list_head b_wb; /* under writeback, for wait_sb_inodes */
> };
>
> struct backing_dev_info {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63cac31..7861017 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -573,7 +573,8 @@ struct inode {
> unsigned long dirtied_when; /* jiffies of first dirtying */
>
> struct hlist_node i_hash;
> - struct list_head i_wb_list; /* backing dev IO list */
> + struct list_head i_wb_list; /* backing dev WB list */
> + struct list_head i_io_list; /* backing dev IO list */
> struct list_head i_lru; /* inode LRU list */
> struct list_head i_sb_list;
> union {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5025174..896b8f5 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> INIT_LIST_HEAD(&wb->b_dirty);
> INIT_LIST_HEAD(&wb->b_io);
> INIT_LIST_HEAD(&wb->b_more_io);
> + INIT_LIST_HEAD(&wb->b_wb);
> spin_lock_init(&wb->list_lock);
> + spin_lock_init(&wb->wb_list_lock);
> INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> }
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..4c411fe 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
> __dec_bdi_stat(bdi, BDI_WRITEBACK);
> __bdi_writeout_inc(bdi);
> }
> + if (mapping->host &&
> + !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + struct inode *inode = mapping->host;
> +
> + WARN_ON(list_empty(&inode->i_io_list));
> + spin_lock(&bdi->wb.wb_list_lock);
> + list_del_init(&inode->i_io_list);
> + spin_unlock(&bdi->wb.wb_list_lock);
> + }
> }
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> } else {
> @@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
> spin_lock_irqsave(&mapping->tree_lock, flags);
> ret = TestSetPageWriteback(page);
> if (!ret) {
> + bool on_wblist;
> +
> + /* __swap_writepage comes through here */
> + on_wblist = mapping_tagged(mapping,
> + PAGECACHE_TAG_WRITEBACK);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> __inc_bdi_stat(bdi, BDI_WRITEBACK);
> + if (!on_wblist && mapping->host) {
> + struct inode *inode = mapping->host;
> +
> + WARN_ON(!list_empty(&inode->i_io_list));
> + spin_lock(&bdi->wb.wb_list_lock);
> + list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
> + spin_unlock(&bdi->wb.wb_list_lock);
> + }
> }
> if (!PageDirty(page))
> radix_tree_tag_clear(&mapping->page_tree,
I'm somewhat uneasy about this. Writeback code generally uses
inode_to_bdi() function to get from the mapping to backing_dev_info (which
uses inode->i_sb->s_bdi except for inodes on blockdev_superblock). That isn't
always the same as inode->i_mapping->backing_dev_info used here although
I now fail to remember a case where inode->i_mapping->backing_dev_info
would be a wrong bdi to use for sync purposes.
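
For reference, the distinction being made here, sketched (not a verbatim
copy of the 3.10 helper):

static struct backing_dev_info *sketch_inode_to_bdi(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	/*
	 * Inodes of the internal blockdev superblock carry their bdi on
	 * the mapping; everything else uses the superblock's bdi.
	 */
	if (sb == blockdev_superblock)
		return inode->i_mapping->backing_dev_info;
	return sb->s_bdi;
}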

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-06-29 03:39:34

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote:
> On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> > writeback: store inodes under writeback on a separate list
> >
> > From: Dave Chinner <[email protected]>
> >
> > When there are lots of cached inodes, a sync(2) operation walks all
> > of them to try to find which ones are under writeback and wait for
> > IO completion on them. Run enough load, and this caused catastrophic
> > lock contention on the inode_sb_list_lock.
> >
> > Try to fix this problem by tracking inodes under data IO and wait
> > specifically for only those inodes that haven't completed their data
> > IO in wait_sb_inodes().
> >
> > This is a bit hacky and messy, but demonstrates one method of
> > solving this problem....
> >
> > XXX: This will catch data IO - do we need to catch actual inode
> > writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> > because the existing code just calls filemap_fdatawait() and that
> > doesn't wait for the inode metadata writeback to occur....
> >
> > [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
> >
> > [v2 - needs spin_lock_irq variants in wait_sb_inodes.
> > - move freeing inodes back to primary list, we don't wait for
> > them
> > - take mapping lock in wait_sb_inodes when requeuing.]
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
> > fs/fs-writeback.c | 70 +++++++++++++++++++++++++++++++++++++------
> > fs/inode.c | 1 +
> > include/linux/backing-dev.h | 3 ++
> > include/linux/fs.h | 3 +-
> > mm/backing-dev.c | 2 ++
> > mm/page-writeback.c | 22 ++++++++++++++
> > 6 files changed, 91 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 3be5718..589c40b 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> >
> > static void wait_sb_inodes(struct super_block *sb)
> > {
> > - struct inode *inode, *old_inode = NULL;
> > + struct backing_dev_info *bdi = sb->s_bdi;
> > + struct inode *old_inode = NULL;
> > + unsigned long flags;
> > + LIST_HEAD(sync_list);
> >
> > /*
> > * We need to be protected against the filesystem going from
> > @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
> > */
> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >
> > - spin_lock(&inode_sb_list_lock);
> >
> > /*
> > * Data integrity sync. Must wait for all pages under writeback,
> > @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
> > * call, but which had writeout started before we write it out.
> > * In which case, the inode may not be on the dirty list, but
> > * we still have to wait for that writeout.
> > + *
> > + * To avoid syncing inodes put under IO after we have started here,
> > + * splice the io list to a temporary list head and walk that. Newly
> > + * dirtied inodes will go onto the primary list so we won't wait for
> > + * them.
> > + *
> > + * Inodes that have pages under writeback after we've finished the wait
> > > + * may or may not be on the primary list. They had pages put under IO
> > > + * after we started our wait, so we need to make sure the next sync or IO
> > > + * completion treats them correctly. Move them back to the primary list
> > > + * and restart the walk.
> > + *
> > + * Inodes that are clean after we have waited for them don't belong
> > + * on any list, and the cleaning of them should have removed them from
> > + * the temporary list. Check this is true, and restart the walk.
> > */
> > - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> > + list_splice_init(&bdi->wb.b_wb, &sync_list);
> > +
> > + while (!list_empty(&sync_list)) {
> > + struct inode *inode = list_first_entry(&sync_list, struct inode,
> > + i_io_list);
> > struct address_space *mapping = inode->i_mapping;
> >
> > - spin_lock(&inode->i_lock);
> > - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > - (mapping->nrpages == 0)) {
> > + /*
> > + * we are nesting the inode->i_lock inside a IRQ disabled
> > + * section here, so there's the possibility that we could have
> > + * a lock inversion due to an interrupt while holding the
> > + * inode->i_lock elsewhere. This is the only place we take the
> > + * inode->i_lock inside the wb_list_lock, so we need to use a
> > + * trylock to avoid a deadlock. If we fail to get the lock,
> > + * the only way to make progress is to also drop the
> > + * wb_list_lock so the interrupt trying to get it can make
> > + * progress.
> > + */
> > + if (!spin_trylock(&inode->i_lock)) {
> > + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> > + spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> > + cpu_relax();
> > + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> > + continue;
> > + }
> > +
> > + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> > spin_unlock(&inode->i_lock);
> > continue;
> > }
> Ugh, the locking looks ugly.

Yes, it is, and I don't really like it.

> Plus the list handling is buggy because the
> first wait_sb_inodes() invocation will move all inodes to its private
> sync_list so if there's another wait_sb_inodes() invocation racing with it,
> it won't wait properly for all the inodes it should.

Hmmmm - yeah, we only have implicit ordering of concurrent sync()
calls based on the serialisation of bdi-flusher work queuing and
dispatch. The waiting for IO completion is not serialised at all.
Seems like it's easy to fix with a per-sb sync mutex around the
dispatch and wait in sync_inodes_sb()....
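
Something like the following, where s_sync_lock is a new mutex that would
have to be added to struct super_block (it does not exist in 3.10) - a
sketch only:

void sync_inodes_sb(struct super_block *sb)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct wb_writeback_work work = {
		.sb		= sb,
		.sync_mode	= WB_SYNC_ALL,
		.nr_pages	= LONG_MAX,
		.done		= &done,
		.reason		= WB_REASON_SYNC,
	};

	WARN_ON(!rwsem_is_locked(&sb->s_umount));

	/*
	 * Hypothetical per-sb mutex: serialise concurrent sync(2) callers
	 * so a second caller can't race the splice in wait_sb_inodes() and
	 * miss inodes sitting on somebody else's private sync list.
	 */
	mutex_lock(&sb->s_sync_lock);

	bdi_queue_work(sb->s_bdi, &work);
	wait_for_completion(&done);
	wait_sb_inodes(sb);

	mutex_unlock(&sb->s_sync_lock);
}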

> Won't it be easier to remove inodes from b_wb list (btw, I'd slightly
> prefer name b_writeback)

Yeah, b_writeback would be nicer. It's messy, though - the writeback
structure uses b_io/b_more_io for stuff that is queued for writeback
(not actually under IO), while on the inode side that list is called i_wb_list.
Now we add a writeback list to the writeback structure for inodes
under IO, and call the inode list i_io_list. I think this needs to
be cleaned up as well...

> lazily instead of from
> test_clear_page_writeback()? I mean we would remove inodes from b_wb list
> only in wait_sb_inodes() or when inodes get reclaimed from memory. That way
> we save some work in test_clear_page_writeback() which is a fast path and
> defer it to sync which isn't that performance critical.

We could, but we just end up in the same place with sync as we are
now - with a long list of clean inodes with a few inodes hidden in
it that are under IO. i.e. we still have to walk lots of clean
inodes to find the dirty ones that we need to wait on....

> Also we would avoid
> that ugly games with irq safe spinlocks.

Yeah, that is a definite bonus to being lazy.

Hmmm - perhaps we could do a periodic cleanup of the list via the
periodic kupdate bdi-flusher pass? Just to clear off anything that
is clean, not to wait on anything that is under IO. That would stop
the entire inode cache migrating to this list over time, and
generally keep it down to a sane size for sync passes...
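
e.g. a periodic trim along these lines, run from the kupdate pass (helper
name made up; list names from the patch above, hence the irq-safe locking
since IO completion also takes wb_list_lock there):

static void wb_trim_writeback_list(struct bdi_writeback *wb)
{
	struct inode *inode, *next;
	unsigned long flags;

	spin_lock_irqsave(&wb->wb_list_lock, flags);
	list_for_each_entry_safe(inode, next, &wb->b_wb, i_io_list) {
		/* drop anything that has finished its IO; leave the rest */
		if (!mapping_tagged(inode->i_mapping,
				    PAGECACHE_TAG_WRITEBACK))
			list_del_init(&inode->i_io_list);
	}
	spin_unlock_irqrestore(&wb->wb_list_lock, flags);
}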

> > + PAGECACHE_TAG_WRITEBACK);
> > radix_tree_tag_set(&mapping->page_tree,
> > page_index(page),
> > PAGECACHE_TAG_WRITEBACK);
> > if (bdi_cap_account_writeback(bdi))
> > __inc_bdi_stat(bdi, BDI_WRITEBACK);
> > + if (!on_wblist && mapping->host) {
> > + struct inode *inode = mapping->host;
> > +
> > + WARN_ON(!list_empty(&inode->i_io_list));
> > + spin_lock(&bdi->wb.wb_list_lock);
> > + list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
> > + spin_unlock(&bdi->wb.wb_list_lock);
> > + }
> > }
> > if (!PageDirty(page))
> > radix_tree_tag_clear(&mapping->page_tree,
> I'm somewhat uneasy about this. Writeback code generally uses
> inode_to_bdi() function to get from the mapping to backing_dev_info (which
> uses inode->i_sb->s_bdi except for inodes on blockdev_superblock). That isn't
> always the same as inode->i_mapping->backing_dev_info used here although
> I now fail to remember a case where inode->i_mapping->backing_dev_info
> would be a wrong bdi to use for sync purposes.

I can change it - I'd forgotten about inode_to_bdi() - I would have
used it if I'd remembered about it...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-29 20:13:35

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Fri, Jun 28, 2013 at 01:58:25PM +1000, Dave Chinner wrote:

> > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> > loop to use a spin_trylock(&inode->i_lock), moving the inode to
> > the end of the sync list, dropping all locks and starting again...
>
> New version below, went through xfstests with lockdep enabled this
> time....

So with that patch, those two boxes have now been fuzzing away for
over 24hrs without seeing that specific sync related bug.

I did see the trace below, but I think that's a different problem..
Not sure who to point at for that one though. Linus?

I'm not sure why cpu_stopper_thread is even running at that point.

[ 1547.161076] hrtimer: interrupt took 4539 ns
[ 1583.293902] BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:28]
[ 1583.293905] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:7]
[ 1583.293932] Modules linked in: tun hidp bnep nfnetlink rfcomm scsi_transport_iscsi ipt_ULOG can_raw af_rxrpc netrom nfc can_bcm can appletalk ipx p8023 af_key psnap irda p8022 rose caif_socket caif ax25 crc_ccitt llc2 af_802154 llc rds bluetooth phonet rfkill pppoe pppox atm ppp_generic slhc x25 coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel snd_hda_codec_hdmi ghash_clmulni_intel microcode snd_hda_intel pcspkr snd_hda_codec snd_hwdep snd_seq snd_seq_device e1000e snd_pcm ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
[ 1583.293932] irq event stamp: 108950
[ 1583.293937] hardirqs last enabled at (108949): [<ffffffff816e9320>] restore_args+0x0/0x30
[ 1583.293940] hardirqs last disabled at (108950): [<ffffffff816f1dea>] apic_timer_interrupt+0x6a/0x80
[ 1583.293943] softirqs last enabled at (108948): [<ffffffff810541d4>] __do_softirq+0x194/0x440
[ 1583.293945] softirqs last disabled at (108943): [<ffffffff8105463d>] irq_exit+0xcd/0xe0
[ 1583.293947] CPU: 0 PID: 7 Comm: migration/0 Not tainted 3.10.0-rc7+ #5
[ 1583.293948] task: ffff880244190000 ti: ffff88024418a000 task.ti: ffff88024418a000
[ 1583.293952] RIP: 0010:[<ffffffff810dd856>] [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110
[ 1583.293953] RSP: 0018:ffff88024418bce8 EFLAGS: 00000293
[ 1583.293953] RAX: 0000000000000001 RBX: ffffffff816e9320 RCX: 0000000000000000
[ 1583.293954] RDX: ffff88024418bf00 RSI: ffffffff81084b2c RDI: ffff8801ecd51b88
[ 1583.293954] RBP: ffff88024418bd10 R08: 0000000000000001 R09: 0000000000000000
[ 1583.293955] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88024418bc58
[ 1583.293956] R13: ffff88024418bfd8 R14: ffff88024418a000 R15: 0000000000000046
[ 1583.293956] FS: 0000000000000000(0000) GS:ffff880245600000(0000) knlGS:0000000000000000
[ 1583.293957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1583.293958] CR2: 0000000006ff0158 CR3: 00000001ea0df000 CR4: 00000000001407f0
[ 1583.293958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1583.293959] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1583.293959] Stack:
[ 1583.293961] ffff8802457cd880 ffff8801ecd51ac0 ffff8801ecd51b88 ffffffff810dd7d0
[ 1583.293963] ffff88024418bfd8 ffff88024418bde0 ffffffff810dd6ed ffff88024418bfd8
[ 1583.293965] ffff8802457cd8d0 000000000000017f ffff88024418bd48 00000007810b463e
[ 1583.293965] Call Trace:
[ 1583.293967] [<ffffffff810dd7d0>] ? cpu_stopper_thread+0x180/0x180
[ 1583.293969] [<ffffffff810dd6ed>] cpu_stopper_thread+0x9d/0x180
[ 1583.293971] [<ffffffff816e86d5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[ 1583.293973] [<ffffffff810b75b5>] ? trace_hardirqs_on_caller+0x115/0x1e0
[ 1583.293976] [<ffffffff81084b2c>] smpboot_thread_fn+0x1ac/0x320
[ 1583.293978] [<ffffffff81084980>] ? lg_global_unlock+0xe0/0xe0
[ 1583.293981] [<ffffffff8107a88d>] kthread+0xed/0x100
[ 1583.293983] [<ffffffff816e597f>] ? wait_for_completion+0xdf/0x110
[ 1583.293985] [<ffffffff8107a7a0>] ? insert_kthread_work+0x80/0x80
[ 1583.293987] [<ffffffff816f10dc>] ret_from_fork+0x7c/0xb0
[ 1583.293989] [<ffffffff8107a7a0>] ? insert_kthread_work+0x80/0x80
[ 1583.294007] Code: f0 ff 4b 24 0f 94 c2 84 d2 44 89 e0 74 12 8b 43 20 8b 53 10 83 c0 01 89 53 24 89 43 20 44 89 e0 83 f8 04 74 20 f3 90 44 8b 63 20 <41> 39 c4 74 f0 41 83 fc 02 75 bf fa e8 49 66 fd ff eb c2 0f 1f

2013-06-29 22:23:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones <[email protected]> wrote:
>
> So with that patch, those two boxes have now been fuzzing away for
> over 24hrs without seeing that specific sync related bug.

Ok, so at least that confirms that yes, the problem is the excessive
contention on inode_sb_list_lock.

Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
is it scary, Andi pointed out that it's actively buggy and will miss
inodes that need writeback due to moving things to private lists.

So I suspect we'll have to do 3.10 with this starvation issue in
place, and mark for stable backporting whatever eventual fix we find.

> I did see the trace below, but I think that's a different problem..
> Not sure who to point at for that one though. Linus?

Hmm.

> [ 1583.293952] RIP: 0010:[<ffffffff810dd856>] [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110

I'm not sure how sane the watchdog is over stop_machine situations. I
think we disable the watchdog for suspend/resume exactly because
stop-machine can take almost arbitrarily long. I'm assuming you're
stress-testing (perhaps unintentionally) the cpu offlining/onlining
and/or memory migration, which is just fundamentally big expensive
things.

Does the machine recover? Because if it does, I'd be inclined to just
ignore it. Although it would be interesting to hear what triggers this
- normal users - and I'm assuming you're still running trinity as
non-root - generally should not be able to trigger stop-machine
events..

Linus

2013-06-29 23:45:17

by Dave Jones

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote:

> > So with that patch, those two boxes have now been fuzzing away for
> > over 24hrs without seeing that specific sync related bug.
>
> Ok, so at least that confirms that yes, the problem is the excessive
> contention on inode_sb_list_lock.
>
> Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
> is it scary, Andi pointed out that it's actively buggy and will miss
> inodes that need writeback due to moving things to private lists.
>
> So I suspect we'll have to do 3.10 with this starvation issue in
> place, and mark for stable backporting whatever eventual fix we find.

Given I'm the only person who seems to have been bitten by this,
I suspect it's not going to be a big deal. Worst case we can tell
people "yeah, just disable the soft watchdog until this is fixed".

> > I did see the trace below, but I think that's a different problem..
> > Not sure who to point at for that one though. Linus?
>
> Hmm.
>
> > [ 1583.293952] RIP: 0010:[<ffffffff810dd856>] [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110
>
> I'm not sure how sane the watchdog is over stop_machine situations. I
> think we disable the watchdog for suspend/resume exactly because
> stop-machine can take almost arbitrarily long. I'm assuming you're
> stress-testing (perhaps unintentionally) the cpu offlining/onlining
> and/or memory migration, which is just fundamentally big expensive
> things.
>
> Does the machine recover? Because if it does, I'd be inclined to just
> ignore it.

It did, after spewing that a few times, followed by this one..

BUG: soft lockup - CPU#2 stuck for 23s! [trinity-child3:2185]
Modules linked in: bridge stp dlci mpoa snd_seq_dummy sctp fuse hidp tun bnep nfnetlink scsi_transport_iscsi rfcomm can_raw can_bcm af_802154 appletalk caif_socket can caif ipt_ULOG x25 rose af_key pppoe pppox ipx phonet irda llc2 ppp_generic slhc p8023 psnap p8022 llc crc_ccitt atm bluetooth netrom ax25 nfc rfkill rds af_rxrpc coretemp hwmon kvm_intel kvm crc32c_intel snd_hda_codec_realtek ghash_clmulni_intel microcode pcspkr snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep usb_debug snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
irq event stamp: 2291065
hardirqs last enabled at (2291064): [<ffffffff816edca0>] restore_args+0x0/0x30
hardirqs last disabled at (2291065): [<ffffffff816f67aa>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (2290298): [<ffffffff810542e4>] __do_softirq+0x194/0x440
softirqs last disabled at (2290301): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
CPU: 2 PID: 2185 Comm: trinity-child3 Not tainted 3.10.0-rc7+ #37 [loadavg: 27.02 10.32 6.81 60/194 2646]
task: ffff8801023e4a40 ti: ffff88022c958000 task.ti: ffff88022c958000
RIP: 0010:[<ffffffff81054201>] [<ffffffff81054201>] __do_softirq+0xb1/0x440
RSP: 0000:ffff880244c03f08 EFLAGS: 00000206
RAX: ffff8801023e4a40 RBX: ffffffff816edca0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801023e4a40
RBP: ffff880244c03f70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880244c03e78
R13: ffffffff816f67af R14: ffff880244c03f70 R15: 0000000000000000
FS: 00007f0f89ffb740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002c1b000 CR3: 0000000210a2f000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000a00406040 00000001002e7923 ffff88022c959fd8 ffff88022c959fd8
ffff88022c959fd8 ffff8801023e4e38 ffff88022c959fd8 ffffffff00000002
ffff8801023e4a40 0000000000000000 0000000000000006 0000000001807000
Call Trace:
<IRQ>
[<ffffffff8105474d>] irq_exit+0xcd/0xe0
[<ffffffff816f764b>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff816f67af>] apic_timer_interrupt+0x6f/0x80
<EOI>
[<ffffffff816edca0>] ? retint_restore_args+0xe/0xe
[<ffffffff816eacf0>] ? wait_for_completion_interruptible+0x170/0x170
[<ffffffff816ebd93>] ? preempt_schedule_irq+0x53/0x90
[<ffffffff816eddb6>] retint_kernel+0x26/0x30
[<ffffffff81145ba7>] ? user_enter+0x87/0xd0
[<ffffffff816f1345>] do_page_fault+0x45/0x50
[<ffffffff816edee2>] page_fault+0x22/0x30
Code: 48 89 45 b8 48 89 45 b0 48 89 45 a8 66 0f 1f 44 00 00 65 c7 04 25 80 0f 1d 00 00 00 00 00 e8 d7 35 06 00 fb 49 c7 c6 00 41 c0 81 <eb> 0e 0f 1f 44 00 00 49 83 c6 08 41 d1 ef 74 6c 41 f6 c7 01 74

But after that, and one more from stop_machine, it's been quiet since, still chugging along.

> Although it would be interesting to hear what triggers this
> - normal users - and I'm assuming you're still running trinity as
> non-root - generally should not be able to trigger stop-machine
> events..

Yeah, this is running as a user. Those don't sound like things that should
be possible. What instrumentation could I add to figure out why
that kthread got awakened ?

Dave

2013-06-30 00:17:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, 2013-06-29 at 15:23 -0700, Linus Torvalds wrote:

> Does the machine recover? Because if it does, I'd be inclined to just
> ignore it. Although it would be interesting to hear what triggers this
> - normal users - and I'm assuming you're still running trinity as
> non-root - generally should not be able to trigger stop-machine
> events..

It may be that we are live locking the system. If you enable too much
debug (I see lockdep is on as well), when the system is stressed, it
starts doing too much debugging and stops forward progress.

Dave, can you reproduce these hangs when you disable lockdep?

-- Steve

2013-06-30 00:21:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, 2013-06-29 at 19:44 -0400, Dave Jones wrote:

> Yeah, this is running as a user. Those don't sound like things that should
> be possible. What instrumentation could I add to figure out why
> that kthread got awakened ?


trace-cmd record -e sched_wakeup -f 'comm ~ "migrati*"'

Add "-O stacktrace" to see where they got woken up too.

-- Steve

2013-06-30 02:05:40

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote:
> On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones <[email protected]> wrote:
> >
> > So with that patch, those two boxes have now been fuzzing away for
> > over 24hrs without seeing that specific sync related bug.
>
> Ok, so at least that confirms that yes, the problem is the excessive
> contention on inode_sb_list_lock.
>
> Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
> is it scary, Andi pointed out that it's actively buggy and will miss
> inodes that need writeback due to moving things to private lists.

Right - it was just a quick hack for proof of concept... :)

> So I suspect we'll have to do 3.10 with this starvation issue in
> place, and mark for stable backporting whatever eventual fix we find.

I can reproduce the contention problem on both 3.8 and 3.9 kernels,
so this isn't a recent regression, and as such it's likely I'll be
able to reproduce it on any kernel since the global inode_lock
breakup was done back in 2.6.38.

Hence I don't think there is significant urgency to fix it 3.10.
I'll have a bit more of a think about how to address this, because
we really need to make the inode_sb_list_lock disappear from the
create/unlink paths as well.

There are several "walk all cached inodes on the superblock"
algorithms in the kernel that also need fixing, too. Hence
I'm tempted just to turn this list into another list_lru (even
though we wouldn't use the LRU capabilities of the interface) and use
the list walk interface it has to hide the fact it is actually
using per-node lists and locks...
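
Roughly like this - note the list_lru API below is the one from the
pending per-node LRU patchset, so treat the exact signatures as an
assumption rather than what will finally land:

/* sb->s_inodes becomes a list_lru rather than a plain list_head */
static enum lru_status inode_walk_cb(struct list_head *item,
				     spinlock_t *lock, void *cb_arg)
{
	struct inode *inode = container_of(item, struct inode, i_sb_list);

	/* per-inode work goes here, e.g. the wait_sb_inodes() checks */
	(void)inode;
	return LRU_SKIP;		/* leave the inode on the list */
}

static void walk_sb_inodes(struct super_block *sb)
{
	/* the walker never sees the per-node lists or locks directly */
	list_lru_walk(&sb->s_inodes_lru, inode_walk_cb, NULL, ULONG_MAX);
}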

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-30 02:34:28

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sun, Jun 30, 2013 at 12:05:31PM +1000, Dave Chinner wrote:
> On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones <[email protected]> wrote:
> > >
> > > So with that patch, those two boxes have now been fuzzing away for
> > > over 24hrs without seeing that specific sync related bug.
> >
> > Ok, so at least that confirms that yes, the problem is the excessive
> > contention on inode_sb_list_lock.
> >
> > Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
> > is it scary, Andi pointed out that it's actively buggy and will miss
> > inodes that need writeback due to moving things to private lists.
>
> Right - it was just a quick hack for proof of concept... :)
>
> > So I suspect we'll have to do 3.10 with this starvation issue in
> > place, and mark for stable backporting whatever eventual fix we find.
>
> I can reproduce the contention problem on both 3.8 and 3.9 kernels,
> so this isn't a recent regression, and as such it's likely I'll be
> able to reproduce it on any kernel since the global inode_lock
> breakup was done back in 2.6.38.

Just as a data point - I just found a machine running a 3.4 kernel
and I can reproduce the inode_sb_list_lock contention problem on it,
too. It's definitely not a new problem...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-01 12:08:17

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat 29-06-13 13:39:24, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote:
> > On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> > > writeback: store inodes under writeback on a separate list
> > >
> > > From: Dave Chinner <[email protected]>
> > >
> > > When there are lots of cached inodes, a sync(2) operation walks all
> > > of them to try to find which ones are under writeback and wait for
> > > IO completion on them. Run enough load, and this caused catastrophic
> > > lock contention on the inode_sb_list_lock.
> > >
> > > Try to fix this problem by tracking inodes under data IO and wait
> > > specifically for only those inodes that haven't completed their data
> > > IO in wait_sb_inodes().
> > >
> > > This is a bit hacky and messy, but demonstrates one method of
> > > solving this problem....
> > >
> > > XXX: This will catch data IO - do we need to catch actual inode
> > > writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> > > because the existing code just calls filemap_fdatawait() and that
> > > doesn't wait for the inode metadata writeback to occur....
> > >
> > > [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
> > >
> > > [v2 - needs spin_lock_irq variants in wait_sb_inodes.
> > > - move freeing inodes back to primary list, we don't wait for
> > > them
> > > - take mapping lock in wait_sb_inodes when requeuing.]
> > >
> > > Signed-off-by: Dave Chinner <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 70 +++++++++++++++++++++++++++++++++++++------
> > > fs/inode.c | 1 +
> > > include/linux/backing-dev.h | 3 ++
> > > include/linux/fs.h | 3 +-
> > > mm/backing-dev.c | 2 ++
> > > mm/page-writeback.c | 22 ++++++++++++++
> > > 6 files changed, 91 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 3be5718..589c40b 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> > >
> > > static void wait_sb_inodes(struct super_block *sb)
> > > {
> > > - struct inode *inode, *old_inode = NULL;
> > > + struct backing_dev_info *bdi = sb->s_bdi;
> > > + struct inode *old_inode = NULL;
> > > + unsigned long flags;
> > > + LIST_HEAD(sync_list);
> > >
> > > /*
> > > * We need to be protected against the filesystem going from
> > > @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
> > > */
> > > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >
> > > - spin_lock(&inode_sb_list_lock);
> > >
> > > /*
> > > * Data integrity sync. Must wait for all pages under writeback,
> > > @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
> > > * call, but which had writeout started before we write it out.
> > > * In which case, the inode may not be on the dirty list, but
> > > * we still have to wait for that writeout.
> > > + *
> > > + * To avoid syncing inodes put under IO after we have started here,
> > > + * splice the io list to a temporary list head and walk that. Newly
> > > + * dirtied inodes will go onto the primary list so we won't wait for
> > > + * them.
> > > + *
> > > + * Inodes that have pages under writeback after we've finished the wait
> > > + * may or may not be on the primary list. They had pages put under IO
> > > + * after we started our wait, so we need to make sure the next sync or IO
> > > + * completion treats them correctly. Move them back to the primary list
> > > + * and restart the walk.
> > > + *
> > > + * Inodes that are clean after we have waited for them don't belong
> > > + * on any list, and the cleaning of them should have removed them from
> > > + * the temporary list. Check this is true, and restart the walk.
> > > */
> > > - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > > + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> > > + list_splice_init(&bdi->wb.b_wb, &sync_list);
> > > +
> > > + while (!list_empty(&sync_list)) {
> > > + struct inode *inode = list_first_entry(&sync_list, struct inode,
> > > + i_io_list);
> > > struct address_space *mapping = inode->i_mapping;
> > >
> > > - spin_lock(&inode->i_lock);
> > > - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > > - (mapping->nrpages == 0)) {
> > > + /*
> > > + * we are nesting the inode->i_lock inside a IRQ disabled
> > > + * section here, so there's the possibility that we could have
> > > + * a lock inversion due to an interrupt while holding the
> > > + * inode->i_lock elsewhere. This is the only place we take the
> > > + * inode->i_lock inside the wb_list_lock, so we need to use a
> > > + * trylock to avoid a deadlock. If we fail to get the lock,
> > > + * the only way to make progress is to also drop the
> > > + * wb_list_lock so the interrupt trying to get it can make
> > > + * progress.
> > > + */
> > > + if (!spin_trylock(&inode->i_lock)) {
> > > + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> > > + spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> > > + cpu_relax();
> > > + spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> > > + continue;
> > > + }
> > > +
> > > + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > > + list_move(&inode->i_io_list, &bdi->wb.b_wb);
> > > spin_unlock(&inode->i_lock);
> > > continue;
> > > }
> > Ugh, the locking looks ugly.
>
> Yes, it is, and I don't really like it.
>
> > Plus the list handling is buggy because the
> > first wait_sb_inodes() invocation will move all inodes to its private
> > sync_list so if there's another wait_sb_inodes() invocation racing with it,
> > it won't wait properly for all the inodes it should.
>
> Hmmmm - yeah, we only have implicit ordering of concurrent sync()
> calls based on the serialisation of bdi-flusher work queuing and
> dispatch. The waiting for IO completion is not serialised at all.
> Seems like it's easy to fix with a per-sb sync mutex around the
> dispatch and wait in sync_inodes_sb()....
>
> > Won't it be easier to remove inodes from b_wb list (btw, I'd slightly
> > prefer name b_writeback)
>
> Yeah, b_writeback would be nicer. It's messy, though - the writeback
> structure uses b_io/b_more_io for stuff that is queued for writeback
> (not actually under IO), while the inode calls that the i_wb_list.
> Now we add a writeback list to the writeback structure for inodes
> under IO, and call the inode list i_io_list. I think this needs to
> be cleaned up as well...
Good point. The naming is somewhat inconsistent and would use a cleanup.

> > lazily instead of from
> > test_clear_page_writeback()? I mean we would remove inodes from b_wb list
> > only in wait_sb_inodes() or when inodes get reclaimed from memory. That way
> > we save some work in test_clear_page_writeback() which is a fast path and
> > defer it to sync which isn't that performance critical.
>
> We could, but we just end up in the same place with sync as we are
> now - with a long list of clean inodes with a few inodes hidden in
> it that are under IO. i.e. we still have to walk lots of clean
> inodes to find the dirty ones that we need to wait on....
If the syncs are rare then yes. If they are relatively frequent, you
would win because the first sync will cleanup the list and subsequent ones
will be fast.

> > Also we would avoid
> > those ugly games with irq safe spinlocks.
>
> Yeah, that is a definite bonus to being lazy.
>
> Hmmm - perhaps we could do a periodic cleanup of the list via the
> periodic kupdate bdi-flusher pass? Just to clear off anything that
> is clean, not to wait on anything that is under IO. That would stop
> the entire inode cache migrating to this list over time, and
> generally keep it down to a sane size for sync passes...
I like this idea. That will keep the scanning of even the first sync after
a long time in reasonable bounds.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-01 12:49:37

by Pavel Machek

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Sat 2013-06-29 19:44:49, Dave Jones wrote:
> On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote:
>
> > > So with that patch, those two boxes have now been fuzzing away for
> > > over 24hrs without seeing that specific sync related bug.
> >
> > Ok, so at least that confirms that yes, the problem is the excessive
> > contention on inode_sb_list_lock.
> >
> > Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
> > is it scary, Andi pointed out that it's actively buggy and will miss
> > inodes that need writeback due to moving things to private lists.
> >
> > So I suspect we'll have to do 3.10 with this starvation issue in
> > place, and mark for stable backporting whatever eventual fix we find.
>
> Given I'm the only person who seems to have been bitten by this,
> I suspect it's not going to be a big deal. Worst case we can tell
> people "yeah, just disable the soft watchdog until this is fixed".

Actually... I don't think you are alone. I was doing big dd's in
an attempt to debug the bad sectors (on 3.10-rc), and got soft-lockups
too... by stuff as simple as "read the disk in the background and try
to work" and "write zeros to disk in the background and try to work".

But as the machine survived, I figured I was simply loading the machine
too much.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2013-07-01 17:58:02

by Dave Jones

[permalink] [raw]
Subject: block layer softlockup

On Fri, Jun 28, 2013 at 01:54:37PM +1000, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> > >
> > > Right, that will be what is happening - the entire system will go
> > > unresponsive when a sync call happens, so it's entirely possible
> > > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> > > trying to get the lock because of the way ticket spinlocks work...
> >
> > So what made it all start happening now? I don't recall us having had
> > these kinds of issues before..
>
> Not sure - it's a sudden surprise for me, too. Then again, I haven't
> been looking at sync from a performance or lock contention point of
> view any time recently. The algorithm that wait_sb_inodes() is
> effectively unchanged since at least 2009, so it's probably a case
> of it having been protected from contention by some external factor
> we've fixed/removed recently. Perhaps the bdi-flusher thread
> replacement in -rc1 has changed the timing sufficiently that it no
> longer serialises concurrent sync calls as much....

This mornings new trace reminded me of this last sentence. Related ?

BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child1:7219]
Modules linked in: lec sctp dlci 8021q garp mpoa dccp_ipv4 dccp bridge stp tun snd_seq_dummy fuse bnep rfcomm nfnetlink scsi_transport_iscsi hidp ipt_ULOG can_raw can_bcm af_key af_rxrpc rose ipx p8023 p8022 atm llc2 pppoe pppox ppp_generic slhc bluetooth rds af_802154 appletalk nfc psnap phonet llc rfkill netrom x25 ax25 irda can caif_socket caif crc_ccitt coretemp hwmon kvm_intel kvm snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi microcode pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device usb_debug snd_pcm e1000e snd_page_alloc snd_timer snd ptp pps_core soundcore xfs libcrc32c
irq event stamp: 3181543
hardirqs last enabled at (3181542): [<ffffffff816edc60>] restore_args+0x0/0x30
hardirqs last disabled at (3181543): [<ffffffff816f676a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (1794686): [<ffffffff810542e4>] __do_softirq+0x194/0x440
softirqs last disabled at (1794689): [<ffffffff8105474d>] irq_exit+0xcd/0xe0
CPU: 0 PID: 7219 Comm: trinity-child1 Not tainted 3.10.0+ #38
task: ffff8801d3a0ca40 ti: ffff88022e07e000 task.ti: ffff88022e07e000
RIP: 0010:[<ffffffff816ed037>] [<ffffffff816ed037>] _raw_spin_unlock_irqrestore+0x67/0x80
RSP: 0018:ffff880244803db0 EFLAGS: 00000286
RAX: ffff8801d3a0ca40 RBX: ffffffff816edc60 RCX: 0000000000000002
RDX: 0000000000004730 RSI: ffff8801d3a0d1c0 RDI: ffff8801d3a0ca40
RBP: ffff880244803dc0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff880244803d28
R13: ffffffff816f676f R14: ffff880244803dc0 R15: ffff88023d1307f8
FS: 00007f00e7ab6740(0000) GS:ffff880244800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002f18000 CR3: 000000022d31b000 CR4: 00000000001407f0
DR0: 0000000000ad9000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff88023e21a680 0000000000000000 ffff880244803df0 ffffffff812da4c1
ffff88023e21a680 0000000000000000 0000000000000000 0000000000000000
ffff880244803e00 ffffffff812da4e0 ffff880244803e60 ffffffff8149ba13
Call Trace:
<IRQ>

[<ffffffff812da4c1>] blk_end_bidi_request+0x51/0x60
[<ffffffff812da4e0>] blk_end_request+0x10/0x20
[<ffffffff8149ba13>] scsi_io_completion+0xf3/0x6e0
[<ffffffff81491a60>] scsi_finish_command+0xb0/0x110
[<ffffffff8149b81f>] scsi_softirq_done+0x12f/0x160
[<ffffffff812e1e08>] blk_done_softirq+0x88/0xa0
[<ffffffff8105424f>] __do_softirq+0xff/0x440
[<ffffffff8105474d>] irq_exit+0xcd/0xe0
[<ffffffff816f760b>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff816f676f>] apic_timer_interrupt+0x6f/0x80
<EOI>

[<ffffffff816edc60>] ? retint_restore_args+0xe/0xe
[<ffffffff812ff465>] ? idr_find_slowpath+0x115/0x150
[<ffffffff812ff475>] ? idr_find_slowpath+0x125/0x150
[<ffffffff8108ceb0>] ? scheduler_tick_max_deferment+0x60/0x60
[<ffffffff816f1765>] ? add_preempt_count+0xa5/0xf0
[<ffffffff810fc8ea>] rcu_lockdep_current_cpu_online+0x3a/0xa0
[<ffffffff812ff475>] idr_find_slowpath+0x125/0x150
[<ffffffff812a3879>] ipcget+0x89/0x380
[<ffffffff810b76e5>] ? trace_hardirqs_on_caller+0x115/0x1e0
[<ffffffff812a4f76>] SyS_msgget+0x56/0x60
[<ffffffff812a4560>] ? rcu_read_lock+0x80/0x80
[<ffffffff812a43a0>] ? sysvipc_msg_proc_show+0xd0/0xd0
[<ffffffff816f5d14>] tracesys+0xdd/0xe2
[<ffffffffa00000b4>] ? libcrc32c_mod_fini+0x48/0xf94 [libcrc32c]
Code: 00 e8 9e 47 00 00 65 48 8b 04 25 f0 b9 00 00 48 8b 80 38 e0 ff ff a8 08 75 13 5b 41 5c 5d c3 0f 1f 44 00 00 e8 7b a7 9c ff 53 9d <eb> cf 0f 1f 80 00 00 00 00 e8 bb ea ff ff eb df 66 0f 1f 84 00


My read of this is that the block layer was taking a *long* time to do something,
and prevented the msgget from progressing within the watchdog cutoff time.

Plausible ?

Dave

2013-07-02 02:07:50

by Dave Chinner

[permalink] [raw]
Subject: Re: block layer softlockup

On Mon, Jul 01, 2013 at 01:57:34PM -0400, Dave Jones wrote:
> On Fri, Jun 28, 2013 at 01:54:37PM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> > > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> > > >
> > > > Right, that will be what is happening - the entire system will go
> > > > unresponsive when a sync call happens, so it's entirely possible
> > > > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> > > > trying to get the lock because of the way ticket spinlocks work...
> > >
> > > So what made it all start happening now? I don't recall us having had
> > > these kinds of issues before..
> >
> > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > been looking at sync from a performance or lock contention point of
> > view any time recently. The algorithm that wait_sb_inodes() is
> > effectively unchanged since at least 2009, so it's probably a case
> > of it having been protected from contention by some external factor
> > we've fixed/removed recently. Perhaps the bdi-flusher thread
> > replacement in -rc1 has changed the timing sufficiently that it no
> > longer serialises concurrent sync calls as much....
>
> This mornings new trace reminded me of this last sentence. Related ?

Was this running the last patch I posted, or a vanilla kernel?

> BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child1:7219]
....
> CPU: 0 PID: 7219 Comm: trinity-child1 Not tainted 3.10.0+ #38
.....
> RIP: 0010:[<ffffffff816ed037>] [<ffffffff816ed037>] _raw_spin_unlock_irqrestore+0x67/0x80
.....
> <IRQ>
>
> [<ffffffff812da4c1>] blk_end_bidi_request+0x51/0x60
> [<ffffffff812da4e0>] blk_end_request+0x10/0x20
> [<ffffffff8149ba13>] scsi_io_completion+0xf3/0x6e0
> [<ffffffff81491a60>] scsi_finish_command+0xb0/0x110
> [<ffffffff8149b81f>] scsi_softirq_done+0x12f/0x160
> [<ffffffff812e1e08>] blk_done_softirq+0x88/0xa0
> [<ffffffff8105424f>] __do_softirq+0xff/0x440
> [<ffffffff8105474d>] irq_exit+0xcd/0xe0
> [<ffffffff816f760b>] smp_apic_timer_interrupt+0x6b/0x9b
> [<ffffffff816f676f>] apic_timer_interrupt+0x6f/0x80
> <EOI>

That's doing IO completion processing in softirq time, and the lock
it just dropped was the q->queue_lock. But that lock is held over
end IO processing, so it is possible that the way my POC patch handles
the page writeback transition caused this.

FWIW, I've attached a simple patch you might like to try to see if
it *minimises* the inode_sb_list_lock contention problems. All it
does is try to prevent concurrent entry in wait_sb_inodes() for a
given superblock and hence only have one walker on the contending
filesystem at a time. Replace the previous one I sent with it. If
that doesn't work, I have another simple patch that makes the
inode_sb_list_lock per-sb to take this isolation even further....

Cheers,

Dave.
--
Dave Chinner
[email protected]

sync: serialise per-superblock sync operations

From: Dave Chinner <[email protected]>

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls the take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around sync_inodes_sb() so that we only execute one sync(2)
operation at a time per superblock and hence mostly avoid
contention.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 9 ++++++++-
fs/super.c | 1 +
include/linux/fs.h | 2 ++
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 996f91a..4d7a90c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1353,7 +1353,12 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
* @sb: the superblock
*
* This function writes and waits on any dirty inode belonging to this
- * super_block.
+ * super_block. The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * This also allows us to optimise wait_sb_inodes() to use private dirty lists
+ * as subsequent sync calls will block waiting for @s_sync_lock and hence always
+ * wait for the inodes in the private sync lists to be completed before they do
+ * their own private wait.
*/
void sync_inodes_sb(struct super_block *sb)
{
@@ -1372,10 +1377,12 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

+ mutex_lock(&sb->s_sync_lock);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);

wait_sb_inodes(sb);
+ mutex_unlock(&sb->s_sync_lock);
}
EXPORT_SYMBOL(sync_inodes_sb);

diff --git a/fs/super.c b/fs/super.c
index 7465d43..887bfbe 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -181,6 +181,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
+ mutex_init(&s->s_sync_lock);
INIT_LIST_HEAD(&s->s_dentry_lru);
INIT_LIST_HEAD(&s->s_inode_lru);
spin_lock_init(&s->s_inode_lru_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 41f0945..74ba328 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1257,6 +1257,8 @@ struct super_block {
const struct xattr_handler **s_xattr;

struct list_head s_inodes; /* all inodes */
+ struct mutex s_sync_lock; /* sync serialisation lock */
+
struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
struct list_head __percpu *s_files;

2013-07-02 06:02:09

by Dave Jones

[permalink] [raw]
Subject: Re: block layer softlockup

On Tue, Jul 02, 2013 at 12:07:41PM +1000, Dave Chinner wrote:
> On Mon, Jul 01, 2013 at 01:57:34PM -0400, Dave Jones wrote:
> > On Fri, Jun 28, 2013 at 01:54:37PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> > > > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> > > > >
> > > > > Right, that will be what is happening - the entire system will go
> > > > > unresponsive when a sync call happens, so it's entirely possible
> > > > > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> > > > > trying to get the lock because of the way ticket spinlocks work...
> > > >
> > > > So what made it all start happening now? I don't recall us having had
> > > > these kinds of issues before..
> > >
> > > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > > been looking at sync from a performance or lock contention point of
> > > view any time recently. The algorithm that wait_sb_inodes() is
> > > effectively unchanged since at least 2009, so it's probably a case
> > > of it having been protected from contention by some external factor
> > > we've fixed/removed recently. Perhaps the bdi-flusher thread
> > > replacement in -rc1 has changed the timing sufficiently that it no
> > > longer serialises concurrent sync calls as much....
> >
> > This mornings new trace reminded me of this last sentence. Related ?
>
> Was this running the last patch I posted, or a vanilla kernel?

yeah, this had v2 of your patch (the one post lockdep warnings)

> That's doing IO completion processing in softirq time, and the lock
> it just dropped was the q->queue_lock. But that lock is held over
> end IO processing, so it is possible that the way the page writeback
> transition handling of my POC patch caused this.
>
> FWIW, I've attached a simple patch you might like to try to see if
> it *minimises* the inode_sb_list_lock contention problems. All it
> does is try to prevent concurrent entry in wait_sb_inodes() for a
> given superblock and hence only have one walker on the contending
> filesystem at a time. Replace the previous one I sent with it. If
> that doesn't work, I have another simple patch that makes the
> inode_sb_list_lock per-sb to take this isolation even further....

I can try it, though as always, proving a negative....

Dave

2013-07-02 06:30:10

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Mon, Jul 01, 2013 at 02:00:37PM +0200, Jan Kara wrote:
> On Sat 29-06-13 13:39:24, Dave Chinner wrote:
> > On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote:
> > > On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> > > > writeback: store inodes under writeback on a separate list
> > > >
> > > > From: Dave Chinner <[email protected]>
> > > >
> > > > When there are lots of cached inodes, a sync(2) operation walks all
> > > > of them to try to find which ones are under writeback and wait for
> > > > IO completion on them. Run enough load, and this caused catastrophic
> > > > lock contention on the inode_sb_list_lock.
.....
> > > Ugh, the locking looks ugly.
> >
> > Yes, it is, and I don't really like it.
> >
> > > Plus the list handling is buggy because the
> > > first wait_sb_inodes() invocation will move all inodes to its private
> > > sync_list so if there's another wait_sb_inodes() invocation racing with it,
> > > it won't wait properly for all the inodes it should.
> >
> > Hmmmm - yeah, we only have implicit ordering of concurrent sync()
> > calls based on the serialisation of bdi-flusher work queuing and
> > dispatch. The waiting for IO completion is not serialised at all.
> > Seems like it's easy to fix with a per-sb sync mutex around the
> > dispatch and wait in sync_inodes_sb()....

SO I have a patchset that does this, then moves to per-sb inode list
locks, then does....

> > > Won't it be easier to remove inodes from b_wb list (btw, I'd slightly
> > > prefer name b_writeback)
> >
> > Yeah, b_writeback would be nicer. It's messy, though - the writeback
> > structure uses b_io/b_more_io for stuff that is queued for writeback
> > (not actually under IO), while the inode calls that the i_wb_list.
> > Now we add a writeback list to the writeback structure for inodes
> > under IO, and call the inode list i_io_list. I think this needs to
> > be cleaned up as well...
> Good point. The naming is somewhat inconsistent and would use a cleanup.

... this, and then does....
>
> > > lazily instead of from
> > > test_clear_page_writeback()? I mean we would remove inodes from b_wb list
> > > only in wait_sb_inodes() or when inodes get reclaimed from memory. That way
> > > we save some work in test_clear_page_writeback() which is a fast path and
> > > defer it to sync which isn't that performance critical.

... this.

> >
> > We could, but we just end up in the same place with sync as we are
> > now - with a long list of clean inodes with a few inodes hidden in
> > it that are under IO. i.e. we still have to walk lots of clean
> > inodes to find the dirty ones that we need to wait on....
> If the syncs are rare then yes. If they are relatively frequent, you
> would win because the first sync will cleanup the list and subsequent ones
> will be fast.

I haven't done this yet, because I've found an interesting
performance problem with our sync implementation. Basically, sync(2)
on a filesystem that is being constantly dirtied blocks the flusher
thread waiting for IO completion like so:

# echo w > /proc/sysrq-trigger
[ 1968.031001] SysRq : Show Blocked State
[ 1968.032748] task PC stack pid father
[ 1968.034534] kworker/u19:2 D ffff8800bed13140 3448 4830 2 0x00000000
[ 1968.034534] Workqueue: writeback bdi_writeback_workfn (flush-253:32)
[ 1968.034534] ffff8800bdca3998 0000000000000046 ffff8800bd1cae20 ffff8800bdca3fd8
[ 1968.034534] ffff8800bdca3fd8 ffff8800bdca3fd8 ffff88003ea10000 ffff8800bd1cae20
[ 1968.034534] ffff8800bdca3968 ffff8800bd1cae20 ffff8800bed139a0 0000000000000002
[ 1968.034534] Call Trace:
[ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
[ 1968.034534] [<ffffffff81bff89f>] io_schedule+0x8f/0xd0
[ 1968.034534] [<ffffffff8113263e>] sleep_on_page+0xe/0x20
[ 1968.034534] [<ffffffff81bfd030>] __wait_on_bit+0x60/0x90
[ 1968.034534] [<ffffffff81132770>] wait_on_page_bit+0x80/0x90
[ 1968.034534] [<ffffffff81132881>] filemap_fdatawait_range+0x101/0x190
[ 1968.034534] [<ffffffff81132937>] filemap_fdatawait+0x27/0x30
[ 1968.034534] [<ffffffff811a7f18>] __writeback_single_inode+0x1b8/0x220
[ 1968.034534] [<ffffffff811a88ab>] writeback_sb_inodes+0x27b/0x410
[ 1968.034534] [<ffffffff811a8c00>] wb_writeback+0xf0/0x2c0
[ 1968.034534] [<ffffffff811aa668>] wb_do_writeback+0xb8/0x210
[ 1968.034534] [<ffffffff811aa832>] bdi_writeback_workfn+0x72/0x160
[ 1968.034534] [<ffffffff8109e487>] process_one_work+0x177/0x400
[ 1968.034534] [<ffffffff8109eb82>] worker_thread+0x122/0x380
[ 1968.034534] [<ffffffff810a5508>] kthread+0xd8/0xe0
[ 1968.034534] [<ffffffff81c091ec>] ret_from_fork+0x7c/0xb0

i.e. this code:

static int
__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;

WARN_ON(!(inode->i_state & I_SYNC));

trace_writeback_single_inode_start(inode, wbc, nr_to_write);

ret = do_writepages(mapping, wbc);

/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
* I/O completion.
*/
if (wbc->sync_mode == WB_SYNC_ALL) {
int err = filemap_fdatawait(mapping);
if (ret == 0)
ret = err;
}
....

It completely serialises IO dispatch during sync. We are not
batching IO submission at all - we are dispatching it one inode at a
time and then waiting for it to complete. Guess where in the
benchmark run I ran sync:

FSUse% Count Size Files/sec App Overhead
.....
0 640000 4096 35154.6 1026984
0 720000 4096 36740.3 1023844
0 800000 4096 36184.6 916599
0 880000 4096 1282.7 1054367
0 960000 4096 3951.3 918773
0 1040000 4096 40646.2 996448
0 1120000 4096 43610.1 895647
0 1200000 4096 40333.1 921048

sync absolutely *murders* asynchronous IO performance right now
because it stops background writeback completely and stalls all new
writes in balance_dirty_pages like:

[ 1968.034534] fs_mark D ffff88007ed13140 3680 9219 7127 0x00000000
[ 1968.034534] ffff88005a279a38 0000000000000046 ffff880040318000 ffff88005a279fd8
[ 1968.034534] ffff88005a279fd8 ffff88005a279fd8 ffff88003e9fdc40 ffff880040318000
[ 1968.034534] ffff88005a279a28 ffff88005a279a70 ffff88007e9e0000 0000000100065d20
[ 1968.034534] Call Trace:
[ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
[ 1968.034534] [<ffffffff81bfcd3b>] schedule_timeout+0x10b/0x1f0
[ 1968.034534] [<ffffffff81bfe492>] io_schedule_timeout+0xa2/0x100
[ 1968.034534] [<ffffffff8113d6fb>] balance_dirty_pages_ratelimited+0x37b/0x7a0
[ 1968.034534] [<ffffffff811322e8>] generic_file_buffered_write+0x1b8/0x280
[ 1968.034534] [<ffffffff8144e649>] xfs_file_buffered_aio_write+0x109/0x1a0
[ 1968.034534] [<ffffffff8144e7ae>] xfs_file_aio_write+0xce/0x140
[ 1968.034534] [<ffffffff8117f4b0>] do_sync_write+0x80/0xb0
[ 1968.034534] [<ffffffff811801c1>] vfs_write+0xc1/0x1c0
[ 1968.034534] [<ffffffff81180642>] SyS_write+0x52/0xa0
[ 1968.034534] [<ffffffff81c09299>] system_call_fastpath+0x16/0x1b

IOWs, blocking the flusher thread for IO completion on WB_SYNC_ALL
writeback is very harmful. Given that we rely on ->sync_fs to
guarantee all inode metadata is written back - the async pass up
front doesn't do any waiting so any inode metadata updates done
after IO completion have to be caught by ->sync_fs - why are we
doing IO completion waiting here for sync(2) writeback?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-02 07:30:27

by Dave Chinner

[permalink] [raw]
Subject: Re: block layer softlockup

On Tue, Jul 02, 2013 at 02:01:46AM -0400, Dave Jones wrote:
> On Tue, Jul 02, 2013 at 12:07:41PM +1000, Dave Chinner wrote:
> > On Mon, Jul 01, 2013 at 01:57:34PM -0400, Dave Jones wrote:
> > > On Fri, Jun 28, 2013 at 01:54:37PM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> > > > > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> > > > > >
> > > > > > Right, that will be what is happening - the entire system will go
> > > > > > unresponsive when a sync call happens, so it's entirely possible
> > > > > > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
> > > > > > trying to get the lock because of the way ticket spinlocks work...
> > > > >
> > > > > So what made it all start happening now? I don't recall us having had
> > > > > these kinds of issues before..
> > > >
> > > > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > > > been looking at sync from a performance or lock contention point of
> > > > view any time recently. The algorithm that wait_sb_inodes() is
> > > > effectively unchanged since at least 2009, so it's probably a case
> > > > of it having been protected from contention by some external factor
> > > > we've fixed/removed recently. Perhaps the bdi-flusher thread
> > > > replacement in -rc1 has changed the timing sufficiently that it no
> > > > longer serialises concurrent sync calls as much....
> > >
> > > This mornings new trace reminded me of this last sentence. Related ?
> >
> > Was this running the last patch I posted, or a vanilla kernel?
>
> yeah, this had v2 of your patch (the one post lockdep warnings)

Ok, I can see how that one might cause that issue to occur. The
current patchset I'm working on doesn't have all the nasty io
completion time stuff in it, so shouldn't cause any problems like
this...

>
> > That's doing IO completion processing in softirq time, and the lock
> > it just dropped was the q->queue_lock. But that lock is held over
> > end IO processing, so it is possible that the way the page writeback
> > transition handling of my POC patch caused this.
> >
> > FWIW, I've attached a simple patch you might like to try to see if
> > it *minimises* the inode_sb_list_lock contention problems. All it
> > does is try to prevent concurrent entry in wait_sb_inodes() for a
> > given superblock and hence only have one walker on the contending
> > filesystem at a time. Replace the previous one I sent with it. If
> > that doesn't work, I have another simple patch that makes the
> > inode_sb_list_lock per-sb to take this isolation even further....
>
> I can try it, though as always, proving a negative....

Very true, though all I'm really interested in is whether you see
the soft lockup warnings or not. i.e. if you don't see them, then we
have a minimal patch that might be sufficient for -stable kernels...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-02 08:19:47

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue 02-07-13 16:29:54, Dave Chinner wrote:
> > > We could, but we just end up in the same place with sync as we are
> > > now - with a long list of clean inodes with a few inodes hidden in
> > > it that are under IO. i.e. we still have to walk lots of clean
> > > inodes to find the dirty ones that we need to wait on....
> > If the syncs are rare then yes. If they are relatively frequent, you
> > would win because the first sync will cleanup the list and subsequent ones
> > will be fast.
>
> I haven't done this yet, because I've found an interesting
> performance problem with our sync implementation. Basically, sync(2)
> on a filesystem that is being constantly dirtied blocks the flusher
> thread waiting for IO completion like so:
>
> # echo w > /proc/sysrq-trigger
> [ 1968.031001] SysRq : Show Blocked State
> [ 1968.032748] task PC stack pid father
> [ 1968.034534] kworker/u19:2 D ffff8800bed13140 3448 4830 2 0x00000000
> [ 1968.034534] Workqueue: writeback bdi_writeback_workfn (flush-253:32)
> [ 1968.034534] ffff8800bdca3998 0000000000000046 ffff8800bd1cae20 ffff8800bdca3fd8
> [ 1968.034534] ffff8800bdca3fd8 ffff8800bdca3fd8 ffff88003ea10000 ffff8800bd1cae20
> [ 1968.034534] ffff8800bdca3968 ffff8800bd1cae20 ffff8800bed139a0 0000000000000002
> [ 1968.034534] Call Trace:
> [ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
> [ 1968.034534] [<ffffffff81bff89f>] io_schedule+0x8f/0xd0
> [ 1968.034534] [<ffffffff8113263e>] sleep_on_page+0xe/0x20
> [ 1968.034534] [<ffffffff81bfd030>] __wait_on_bit+0x60/0x90
> [ 1968.034534] [<ffffffff81132770>] wait_on_page_bit+0x80/0x90
> [ 1968.034534] [<ffffffff81132881>] filemap_fdatawait_range+0x101/0x190
> [ 1968.034534] [<ffffffff81132937>] filemap_fdatawait+0x27/0x30
> [ 1968.034534] [<ffffffff811a7f18>] __writeback_single_inode+0x1b8/0x220
> [ 1968.034534] [<ffffffff811a88ab>] writeback_sb_inodes+0x27b/0x410
> [ 1968.034534] [<ffffffff811a8c00>] wb_writeback+0xf0/0x2c0
> [ 1968.034534] [<ffffffff811aa668>] wb_do_writeback+0xb8/0x210
> [ 1968.034534] [<ffffffff811aa832>] bdi_writeback_workfn+0x72/0x160
> [ 1968.034534] [<ffffffff8109e487>] process_one_work+0x177/0x400
> [ 1968.034534] [<ffffffff8109eb82>] worker_thread+0x122/0x380
> [ 1968.034534] [<ffffffff810a5508>] kthread+0xd8/0xe0
> [ 1968.034534] [<ffffffff81c091ec>] ret_from_fork+0x7c/0xb0
>
> i.e. this code:
>
> static int
> __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> {
> struct address_space *mapping = inode->i_mapping;
> long nr_to_write = wbc->nr_to_write;
> unsigned dirty;
> int ret;
>
> WARN_ON(!(inode->i_state & I_SYNC));
>
> trace_writeback_single_inode_start(inode, wbc, nr_to_write);
>
> ret = do_writepages(mapping, wbc);
>
> /*
> * Make sure to wait on the data before writing out the metadata.
> * This is important for filesystems that modify metadata on data
> * I/O completion.
> */
> if (wbc->sync_mode == WB_SYNC_ALL) {
> int err = filemap_fdatawait(mapping);
> if (ret == 0)
> ret = err;
> }
> ....
>
> > It completely serialises IO dispatch during sync. We are not
> > batching IO submission at all - we are dispatching it one inode at a
> > time and then waiting for it to complete. Guess where in the
> benchmark run I ran sync:
>
> FSUse% Count Size Files/sec App Overhead
> .....
> 0 640000 4096 35154.6 1026984
> 0 720000 4096 36740.3 1023844
> 0 800000 4096 36184.6 916599
> 0 880000 4096 1282.7 1054367
> 0 960000 4096 3951.3 918773
> 0 1040000 4096 40646.2 996448
> 0 1120000 4096 43610.1 895647
> 0 1200000 4096 40333.1 921048
>
> sync absolutely *murders* asynchronous IO performance right now
> because it stops background writeback completely and stalls all new
> writes in balance_dirty_pages like:
>
> [ 1968.034534] fs_mark D ffff88007ed13140 3680 9219 7127 0x00000000
> [ 1968.034534] ffff88005a279a38 0000000000000046 ffff880040318000 ffff88005a279fd8
> [ 1968.034534] ffff88005a279fd8 ffff88005a279fd8 ffff88003e9fdc40 ffff880040318000
> [ 1968.034534] ffff88005a279a28 ffff88005a279a70 ffff88007e9e0000 0000000100065d20
> [ 1968.034534] Call Trace:
> [ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
> [ 1968.034534] [<ffffffff81bfcd3b>] schedule_timeout+0x10b/0x1f0
> [ 1968.034534] [<ffffffff81bfe492>] io_schedule_timeout+0xa2/0x100
> [ 1968.034534] [<ffffffff8113d6fb>] balance_dirty_pages_ratelimited+0x37b/0x7a0
> [ 1968.034534] [<ffffffff811322e8>] generic_file_buffered_write+0x1b8/0x280
> [ 1968.034534] [<ffffffff8144e649>] xfs_file_buffered_aio_write+0x109/0x1a0
> [ 1968.034534] [<ffffffff8144e7ae>] xfs_file_aio_write+0xce/0x140
> [ 1968.034534] [<ffffffff8117f4b0>] do_sync_write+0x80/0xb0
> [ 1968.034534] [<ffffffff811801c1>] vfs_write+0xc1/0x1c0
> [ 1968.034534] [<ffffffff81180642>] SyS_write+0x52/0xa0
> [ 1968.034534] [<ffffffff81c09299>] system_call_fastpath+0x16/0x1b
>
> IOWs, blocking the flusher thread for IO completion on WB_SYNC_ALL
> writeback is very harmful. Given that we rely on ->sync_fs to
> guarantee all inode metadata is written back - the async pass up
> front doesn't do any waiting so any inode metadata updates done
> after IO completion have to be caught by ->sync_fs - why are we
> doing IO completion waiting here for sync(2) writeback?
So I did a bit of digging in history and the wait in
__writeback_single_inode() (at that time it was just
writeback_single_inode()) was introduced by Christoph in commit
26821ed40. It is there for calls like sync_inode() or write_inode_now()
where it really is necessary.

You are right that for syncing the whole filesystem like sync(2) does, the
wait in __writeback_single_inode() isn't necessary. After all, all the
inode data might have been written back asynchronously, so we may never see
the inode in __writeback_single_inode() with sync_mode == WB_SYNC_ALL, and
each filesystem has to make sure inode metadata is correctly on disk. So
removing that wait for global sync isn't going to introduce any new
problem.

Now how to implement that practically - __writeback_single_inode() is
called from two places. From writeback_single_inode() - that is the
function for writing only a single inode, and from writeback_sb_inodes()
- that is used by flusher threads. I'd be inclined to move the do_writepages()
call from __writeback_single_inode() to the callsites and move the wait to
writeback_single_inode() only. But that would also mean moving the
tracepoints and it starts to get a bit ugly. Maybe we could instead create
a new flag in struct writeback_control to indicate that we are doing a global
sync - a for_sync flag? Then we could use that flag in
__writeback_single_inode() to decide whether to wait or not.
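
I.e. just a sketch of the idea (flag name illustrative, not a tested
patch):

	/*
	 * Add a for_sync flag to struct writeback_control, set it only
	 * from the sync(2)/sync_inodes_sb() path, and skip the per-inode
	 * wait for that case in __writeback_single_inode():
	 */
	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
		int err = filemap_fdatawait(mapping);
		if (ret == 0)
			ret = err;
	}

wait_sb_inodes() still does the data waiting for sync(2) and ->sync_fs()
catches the metadata, so nothing should be lost for the global sync case.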

As a bonus filesystems could also optimize their write_inode() methods when
they know ->sync_fs() is going to happen in future. E.g. ext4 wouldn't have
to do the stupid ext4_force_commit() after each written inode in
WB_SYNC_ALL mode.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-02 12:38:47

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 02, 2013 at 10:19:37AM +0200, Jan Kara wrote:
> On Tue 02-07-13 16:29:54, Dave Chinner wrote:
> > > > We could, but we just end up in the same place with sync as we are
> > > > now - with a long list of clean inodes with a few inodes hidden in
> > > > it that are under IO. i.e. we still have to walk lots of clean
> > > > inodes to find the dirty ones that we need to wait on....
> > > If the syncs are rare then yes. If they are relatively frequent, you
> > > would win because the first sync will cleanup the list and subsequent ones
> > > will be fast.
> >
> > I haven't done this yet, because I've found an interesting
> > performance problem with our sync implementation. Basically, sync(2)
> > on a filesystem that is being constantly dirtied blocks the flusher
> > thread waiting for IO completion like so:
> >
> > # echo w > /proc/sysrq-trigger
> > [ 1968.031001] SysRq : Show Blocked State
> > [ 1968.032748] task PC stack pid father
> > [ 1968.034534] kworker/u19:2 D ffff8800bed13140 3448 4830 2 0x00000000
> > [ 1968.034534] Workqueue: writeback bdi_writeback_workfn (flush-253:32)
> > [ 1968.034534] ffff8800bdca3998 0000000000000046 ffff8800bd1cae20 ffff8800bdca3fd8
> > [ 1968.034534] ffff8800bdca3fd8 ffff8800bdca3fd8 ffff88003ea10000 ffff8800bd1cae20
> > [ 1968.034534] ffff8800bdca3968 ffff8800bd1cae20 ffff8800bed139a0 0000000000000002
> > [ 1968.034534] Call Trace:
> > [ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
> > [ 1968.034534] [<ffffffff81bff89f>] io_schedule+0x8f/0xd0
> > [ 1968.034534] [<ffffffff8113263e>] sleep_on_page+0xe/0x20
> > [ 1968.034534] [<ffffffff81bfd030>] __wait_on_bit+0x60/0x90
> > [ 1968.034534] [<ffffffff81132770>] wait_on_page_bit+0x80/0x90
> > [ 1968.034534] [<ffffffff81132881>] filemap_fdatawait_range+0x101/0x190
> > [ 1968.034534] [<ffffffff81132937>] filemap_fdatawait+0x27/0x30
> > [ 1968.034534] [<ffffffff811a7f18>] __writeback_single_inode+0x1b8/0x220
> > [ 1968.034534] [<ffffffff811a88ab>] writeback_sb_inodes+0x27b/0x410
> > [ 1968.034534] [<ffffffff811a8c00>] wb_writeback+0xf0/0x2c0
> > [ 1968.034534] [<ffffffff811aa668>] wb_do_writeback+0xb8/0x210
> > [ 1968.034534] [<ffffffff811aa832>] bdi_writeback_workfn+0x72/0x160
> > [ 1968.034534] [<ffffffff8109e487>] process_one_work+0x177/0x400
> > [ 1968.034534] [<ffffffff8109eb82>] worker_thread+0x122/0x380
> > [ 1968.034534] [<ffffffff810a5508>] kthread+0xd8/0xe0
> > [ 1968.034534] [<ffffffff81c091ec>] ret_from_fork+0x7c/0xb0
> >
> > i.e. this code:
> >
> > static int
> > __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > struct address_space *mapping = inode->i_mapping;
> > long nr_to_write = wbc->nr_to_write;
> > unsigned dirty;
> > int ret;
> >
> > WARN_ON(!(inode->i_state & I_SYNC));
> >
> > trace_writeback_single_inode_start(inode, wbc, nr_to_write);
> >
> > ret = do_writepages(mapping, wbc);
> >
> > /*
> > * Make sure to wait on the data before writing out the metadata.
> > * This is important for filesystems that modify metadata on data
> > * I/O completion.
> > */
> > if (wbc->sync_mode == WB_SYNC_ALL) {
> > int err = filemap_fdatawait(mapping);
> > if (ret == 0)
> > ret = err;
> > }
> > ....
> >
> > It's completely serialising IO dispatch during sync. We are not
> > batching IO submission at all - we are dispatching it one inode at a
> > time and then waiting for it to complete. Guess where in the
> > benchmark run I ran sync:
> >
> > FSUse% Count Size Files/sec App Overhead
> > .....
> > 0 640000 4096 35154.6 1026984
> > 0 720000 4096 36740.3 1023844
> > 0 800000 4096 36184.6 916599
> > 0 880000 4096 1282.7 1054367
> > 0 960000 4096 3951.3 918773
> > 0 1040000 4096 40646.2 996448
> > 0 1120000 4096 43610.1 895647
> > 0 1200000 4096 40333.1 921048
> >
> > sync absolutely *murders* asynchronous IO performance right now
> > because it stops background writeback completely and stalls all new
> > writes in balance_dirty_pages like:
> >
> > [ 1968.034534] fs_mark D ffff88007ed13140 3680 9219 7127 0x00000000
> > [ 1968.034534] ffff88005a279a38 0000000000000046 ffff880040318000 ffff88005a279fd8
> > [ 1968.034534] ffff88005a279fd8 ffff88005a279fd8 ffff88003e9fdc40 ffff880040318000
> > [ 1968.034534] ffff88005a279a28 ffff88005a279a70 ffff88007e9e0000 0000000100065d20
> > [ 1968.034534] Call Trace:
> > [ 1968.034534] [<ffffffff81bff7c9>] schedule+0x29/0x70
> > [ 1968.034534] [<ffffffff81bfcd3b>] schedule_timeout+0x10b/0x1f0
> > [ 1968.034534] [<ffffffff81bfe492>] io_schedule_timeout+0xa2/0x100
> > [ 1968.034534] [<ffffffff8113d6fb>] balance_dirty_pages_ratelimited+0x37b/0x7a0
> > [ 1968.034534] [<ffffffff811322e8>] generic_file_buffered_write+0x1b8/0x280
> > [ 1968.034534] [<ffffffff8144e649>] xfs_file_buffered_aio_write+0x109/0x1a0
> > [ 1968.034534] [<ffffffff8144e7ae>] xfs_file_aio_write+0xce/0x140
> > [ 1968.034534] [<ffffffff8117f4b0>] do_sync_write+0x80/0xb0
> > [ 1968.034534] [<ffffffff811801c1>] vfs_write+0xc1/0x1c0
> > [ 1968.034534] [<ffffffff81180642>] SyS_write+0x52/0xa0
> > [ 1968.034534] [<ffffffff81c09299>] system_call_fastpath+0x16/0x1b
> >
> > IOWs, blocking the flusher thread for IO completion on WB_SYNC_ALL
> > writeback is very harmful. Given that we rely on ->sync_fs to
> > guarantee all inode metadata is written back - the async pass up
> > front doesn't do any waiting so any inode metadata updates done
> > after IO completion have to be caught by ->sync_fs - why are we
> > doing IO completion waiting here for sync(2) writeback?
> So I did a bit of digging in history and the wait in
> __writeback_single_inode() (at that time it was just
> writeback_single_inode()) has been introduced by Christoph in commit
> 26821ed40. It is there for calls like sync_inode() or write_inode_now()
> where it really is necessary.
>
> You are right that for syncing the whole filesystem like sync(2) does, the
> wait in __writeback_single_inode() isn't necessary. After all, all the
> inode data might have been written back asynchronously so never have to see
> the inode in __writeback_single_inode() when sync_mode == WB_SYNC_ALL and
> each filesystem has to make sure inode metadata is correctly on disk. So
> removing that wait for global sync isn't going to introduce any new
> problem.
>
> Now how to implement that practically - __writeback_single_inode() is
> called from two places. From writeback_single_inode() - that is the
> function for writing only a single inode, and from writeback_sb_inodes()
> - that is used by flusher threads. I'd be inclined to move do_writepages()
> call from __writeback_single_inode() to the callsites and move the wait to
> writeback_single_inode() only. But that would mean also moving of
> tracepoints and it starts to get a bit ugly. Maybe we could instead create
> a new flag in struct writeback_control to indicate that we are doing global
> sync - for_sync flag? Then we could use that flag in
> __writeback_single_inode() to decide whether to wait or not.

Snap!

I wrote a patch a few hours ago that does exactly this, right down
to the "for_sync" variable name. :)
See below.

> As a bonus filesystems could also optimize their write_inode() methods when
> they know ->sync_fs() is going to happen in future. E.g. ext4 wouldn't have
> to do the stupid ext4_force_commit() after each written inode in
> WB_SYNC_ALL mode.

Yeah, that's true.

Since XFS now catches all metadata modifications in its journal, it
doesn't have a ->write_inode method anymore. Only ->fsync,
->sync_fs and ->commit_metadata are defined as data integrity
operations that require metadata to be synchronised and we ensure the
journal is committed in those methods. WB_SYNC_ALL writeback is
really only a method for getting data dispatched to disk, so I
suspect that we should ensure that waiting for data IO completion
happens at higher levels, not be hidden deep in the guts of writing
back inode metadata..

Cheers,

Dave.
--
Dave Chinner
[email protected]

sync: don't block the flusher thread waiting on IO

From: Dave Chinner <[email protected]>

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse% Count Size Files/sec App Overhead
.....
0 640000 4096 35154.6 1026984
0 720000 4096 36740.3 1023844
0 800000 4096 36184.6 916599
0 880000 4096 1282.7 1054367
0 960000 4096 3951.3 918773
0 1040000 4096 40646.2 996448
0 1120000 4096 43610.1 895647
0 1200000 4096 40333.1 921048

And a single sync pass took:

real 0m52.407s
user 0m0.000s
sys 0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

real 0m6.930s
user 0m0.000s
sys 0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 9 +++++++--
include/linux/writeback.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 25a766c..ea56583 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,6 +45,7 @@ struct wb_writeback_work {
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
unsigned int for_background:1;
+ unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -476,9 +477,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
- * I/O completion.
+ * I/O completion. We don't do it for sync(2) writeback because it has a
+ * separate, external IO completion path and ->sync_fs for guaranteeing
+ * inode metadata is written back correctly.
*/
- if (wbc->sync_mode == WB_SYNC_ALL) {
+ if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
int err = filemap_fdatawait(mapping);
if (ret == 0)
ret = err;
@@ -611,6 +614,7 @@ static long writeback_sb_inodes(struct super_block *sb,
.tagged_writepages = work->tagged_writepages,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
+ .for_sync = work->for_sync,
.range_cyclic = work->range_cyclic,
.range_start = 0,
.range_end = LLONG_MAX,
@@ -1442,6 +1446,7 @@ void sync_inodes_sb(struct super_block *sb)
.range_cyclic = 0,
.done = &done,
.reason = WB_REASON_SYNC,
+ .for_sync = 1,
};

/* Nothing to do? */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a500..abfe117 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,7 @@ struct writeback_control {
unsigned tagged_writepages:1; /* tag-and-write to avoid livelock */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
};

/*
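For reference, wait_sb_inodes() (which the patch delegates the waiting to)
boils down to roughly the following; this is a heavily simplified sketch of
the fs/fs-writeback.c function, with the locking, inode refcounting and
I_FREEING/I_NEW state checks omitted:

static void wait_sb_inodes_sketch(struct super_block *sb)
{
        struct inode *inode;

        /* the real code holds inode_sb_list_lock and pins each inode */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;

                if (mapping->nrpages == 0)
                        continue;       /* no data pages, nothing to wait on */

                /* wait for all data IO in flight against this inode */
                filemap_fdatawait(mapping);
        }
}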

2013-07-02 14:05:15

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue 02-07-13 22:38:35, Dave Chinner wrote:
> > As a bonus filesystems could also optimize their write_inode() methods when
> > they know ->sync_fs() is going to happen in future. E.g. ext4 wouldn't have
> > to do the stupid ext4_force_commit() after each written inode in
> > WB_SYNC_ALL mode.
>
> Yeah, that's true.
>
> Since XFS now catches all metadata modifications in its journal, it
> doesn't have a ->write_inode method anymore. Only ->fsync,
> ->sync_fs and ->commit_metadata are defined as data integrity
> operations that require metadata to be synchronised and we ensure the
> journal is committed in those methods. WB_SYNC_ALL writeback is
> really only a method for getting data dispatched to disk, so I
> suspect that we should ensure that waiting for data IO completion
> happens at higher levels, not be hidden deep in the guts of writing
> back inode metadata..
Yeah. Ext4 could probably do the same, just no one took the time to audit
everything properly and remove the historical heritage... That being said
there are tricky things like making sure write_inode_now() from
iput_final() will do the right thing so it's not completely obvious.

> --
> Dave Chinner
> [email protected]
>
> sync: don't block the flusher thread waiting on IO
>
> From: Dave Chinner <[email protected]>
>
> When sync does its WB_SYNC_ALL writeback, it issues data IO and
> then immediately waits for IO completion. This is done in the
> context of the flusher thread, and hence completely ties up the
> flusher thread for the backing device until all the dirty inodes
> have been synced. On filesystems that are dirtying inodes constantly
> and quickly, this means the flusher thread can be tied up for
> minutes per sync call and hence badly affect system level write IO
> performance as the page cache cannot be cleaned quickly.
>
> We already have a wait loop for IO completion for sync(2), so cut
> this out of the flusher thread and delegate it to wait_sb_inodes().
> Hence we can do rapid IO submission, and then wait for it all to
> complete.
>
> Effect of sync on fsmark before the patch:
>
> FSUse% Count Size Files/sec App Overhead
> .....
> 0 640000 4096 35154.6 1026984
> 0 720000 4096 36740.3 1023844
> 0 800000 4096 36184.6 916599
> 0 880000 4096 1282.7 1054367
> 0 960000 4096 3951.3 918773
> 0 1040000 4096 40646.2 996448
> 0 1120000 4096 43610.1 895647
> 0 1200000 4096 40333.1 921048
>
> And a single sync pass took:
>
> real 0m52.407s
> user 0m0.000s
> sys 0m0.090s
>
> After the patch, there is no impact on fsmark results, and each
> individual sync(2) operation run concurrently with the same fsmark
> workload takes roughly 7s:
>
> real 0m6.930s
> user 0m0.000s
> sys 0m0.039s
>
> IOWs, sync is 7-8x faster on a busy filesystem and does not have an
> adverse impact on ongoing async data write operations.
The patch looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/fs-writeback.c | 9 +++++++--
> include/linux/writeback.h | 1 +
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 25a766c..ea56583 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -45,6 +45,7 @@ struct wb_writeback_work {
> unsigned int for_kupdate:1;
> unsigned int range_cyclic:1;
> unsigned int for_background:1;
> + unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> enum wb_reason reason; /* why was writeback initiated? */
>
> struct list_head list; /* pending work list */
> @@ -476,9 +477,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> /*
> * Make sure to wait on the data before writing out the metadata.
> * This is important for filesystems that modify metadata on data
> - * I/O completion.
> + * I/O completion. We don't do it for sync(2) writeback because it has a
> + * separate, external IO completion path and ->sync_fs for guaranteeing
> + * inode metadata is written back correctly.
> */
> - if (wbc->sync_mode == WB_SYNC_ALL) {
> + if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
> int err = filemap_fdatawait(mapping);
> if (ret == 0)
> ret = err;
> @@ -611,6 +614,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> .tagged_writepages = work->tagged_writepages,
> .for_kupdate = work->for_kupdate,
> .for_background = work->for_background,
> + .for_sync = work->for_sync,
> .range_cyclic = work->range_cyclic,
> .range_start = 0,
> .range_end = LLONG_MAX,
> @@ -1442,6 +1446,7 @@ void sync_inodes_sb(struct super_block *sb)
> .range_cyclic = 0,
> .done = &done,
> .reason = WB_REASON_SYNC,
> + .for_sync = 1,
> };
>
> /* Nothing to do? */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 579a500..abfe117 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -78,6 +78,7 @@ struct writeback_control {
> unsigned tagged_writepages:1; /* tag-and-write to avoid livelock */
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned range_cyclic:1; /* range_start is cyclic */
> + unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> };
>
> /*
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-02 16:13:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 2, 2013 at 7:05 AM, Jan Kara <[email protected]> wrote:
> On Tue 02-07-13 22:38:35, Dave Chinner wrote:
>>
>> IOWs, sync is 7-8x faster on a busy filesystem and does not have an
>> adverse impact on ongoing async data write operations.
> The patch looks good. You can add:
> Reviewed-by: Jan Kara <[email protected]>

Ok, I'm going to take this patch asap. Should we also mark it for
stable? It doesn't look like a regression in that particular code, but
it sounds like it might be a regression when paired with the way the
flusher threads interact. Or is this really some long-time performance
problem?

I'm also wondering if we should just change all callers - remove that
"wait for writeback to complete" from writeback_one_inode()
completely, and just make sure that *all* callers that use WB_SYNC_ALL
do the "wait for writeback" in a separate stage, the way "sync()"
already does? That whole

if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {

test doesn't really look all that sane (..so thanks Dave for adding a
comment above it)

Linus

2013-07-02 17:22:10

by Jan Kara

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue 02-07-13 09:13:43, Linus Torvalds wrote:
> On Tue, Jul 2, 2013 at 7:05 AM, Jan Kara <[email protected]> wrote:
> > On Tue 02-07-13 22:38:35, Dave Chinner wrote:
> >>
> >> IOWs, sync is 7-8x faster on a busy filesystem and does not have an
> >> adverse impact on ongoing async data write operations.
> > The patch looks good. You can add:
> > Reviewed-by: Jan Kara <[email protected]>
>
> Ok, I'm going to take this patch asap. Should we also mark it for
> stable? It doesn't look like a regression in that particular code, but
> it sounds like it might be a regression when paired with the way the
> flusher threads interact. Or is this really some long-time performance
> problem?
sync(2) was always slow in the presence of heavy concurrent IO so I don't
think this is a stable material.

> I'm also wondering if we should just change all callers - remove that
> "wait for writeback to complete" from writeback_one_inode()
> completely, and just make sure that *all* callers that use WB_SYNC_ALL
> do the "wait for writeback" in a separate stage, the way "sync()"
> already does?
The trouble is with callers like write_inode_now() from iput_final().
For write_inode_now() to work correctly in that place, you must make sure
page writeback is finished before calling ->write_inode() because
filesystems may (and do) dirty the inode in their ->end_io callbacks. If
you don't wait you risk calling ->evict_inode() on a dirty inode and thus
losing some updates.
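A minimal sketch of the ordering that iput_final() path relies on
(simplified from what __writeback_single_inode() effectively does for
write_inode_now(); error handling is trimmed and names are illustrative):

static int sync_one_inode_sketch(struct inode *inode,
                                 struct writeback_control *wbc)
{
        struct address_space *mapping = inode->i_mapping;
        int ret, err;

        ret = do_writepages(mapping, wbc);      /* dispatch dirty data */

        /*
         * Wait for the data IO first: ->end_io callbacks may redirty
         * the inode, and ->write_inode() below must see that metadata
         * or it is lost once ->evict_inode() runs.
         */
        err = filemap_fdatawait(mapping);
        if (ret == 0)
                ret = err;

        if (inode->i_sb->s_op->write_inode) {
                err = inode->i_sb->s_op->write_inode(inode, wbc);
                if (ret == 0)
                        ret = err;
        }
        return ret;
}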

> That whole
>
> if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
>
> test doesn't really look all that sane (..so thanks Dave for adding a
> comment above it)
I agree the condition looks a bit fishy so it definitely deserves that
comment. The only way I see to avoid this strange condition is to move
do_writepages() from __writeback_single_inode() into the callers
(writeback_single_inode() and writeback_sb_inodes()) and the condition with
the wait would then be only in writeback_single_inode(). But we would also
have to duplicate the tracepoints, so the current solution looked a tad
better to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-02 17:38:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 2, 2013 at 9:57 AM, Jan Kara <[email protected]> wrote:
>
> sync(2) was always slow in the presence of heavy concurrent IO so I don't
> think this is a stable material.

It's not the "sync being slow" part I personally react to. I don't
care that much about that.

It's the "sync slows down other things" part that makes me go "Hmm,
this may be due to interactions with the flushing changes". A slow
sync is fine - a sync that causes the global disk throughput to go
down by also stopping *other* writeback is not.

So it's the effect on the normal background writeback that makes me go
"hmm - have we really always had that, or is this an effect of the old
sync logic _mixed_ with all the bdflush -> worker changes"

The thing is, it used to be that bdflush didn't much care what a sync
by another user was doing. But bdflush doesn't exist any more, it's
all worker threads..

> The trouble is with callers like write_inode_now() from iput_final().
> For write_inode_now() to work correctly in that place, you must make sure
> page writeback is finished before calling ->write_inode() because
> filesystems may (and do) dirty the inode in their ->end_io callbacks. If
> you don't wait you risk calling ->evict_inode() on a dirty inode and thus
> losing some updates.

My point was - why don't we move that sync thing into the caller (so
write_inode_now() in this case)?

IOW, I'm not disputing the need for filemap_fdatawait() in the data
paths. I'm just saying that maybe we could split things up - including
that whole "write_inode()" call. Some users clearly want to do this in
different orders.

That said, we might also just want to change the "sync_mode" thing.
The thing that I dislike about this patch (even though I applied it)
is that odd

if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {

test. It doesn't make sense to me. It's a hack saying "I know that
'sync' does something special and doesn't actually want this
particular WB_SYNC_ALL behavior at all". That's hacky. Moving that
kind of "I know what the caller *really* meant" logic into the callers
- by splitting up the logic - would get rid of the hacky part.

But another approach of getting rid of the hacky part might be to
simple split - and rename - that "WB_SYNC_ALL" thing, and simply say
"clearly 'sync()' and individual callers of 'write_inode_now()' have
totally different expectations of the semantics of WB_SYNC_ALL". Which
means that they really shouldn't share the same "sync_mode" at all.

So maybe we could just extend that "sync_mode", and have the ones that
want to do _one_ inode synchronously use "WB_SYNC_SINGLE" to make it
clear that they are syncing a single inode. Vs "WB_SYNC_ALL" that
would be used for "I'm syncing all inodes, and I'll do a separate
second pass for syncing".

Then that test would become

if (wbc->sync_mode == WB_SYNC_SINGLE) {

instead, and now "sync_mode" would actually describe what mode of
syncing the caller wants, without that hacky special "we know what the
caller _really_ meant by looking at *which* caller it is".
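(For concreteness, a hypothetical version of that; WB_SYNC_SINGLE does not
exist in the kernel, it is just the name suggested here:)

/* include/linux/writeback.h, hypothetically extended */
enum writeback_sync_modes {
        WB_SYNC_NONE,   /* Don't wait on anything */
        WB_SYNC_ALL,    /* Sync all inodes; sync(2) waits separately in wait_sb_inodes() */
        WB_SYNC_SINGLE, /* Sync one inode and wait for its data right here */
};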

See what my objection to the code is? And maybe there is yet another
solution to the oddity, I've just outlined two possible ones..

Linus

2013-07-03 03:08:22

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 02, 2013 at 10:38:20AM -0700, Linus Torvalds wrote:
> On Tue, Jul 2, 2013 at 9:57 AM, Jan Kara <[email protected]> wrote:
> >
> > sync(2) was always slow in the presence of heavy concurrent IO so I don't
> > think this is a stable material.
>
> It's not the "sync being slow" part I personally react to. I don't
> care that much about that.
>
> It's the "sync slows down other things" part that makes me go "Hmm,
> this may be due to interactions with the flushing changes". A slow
> sync is fine - a sync that causes the global disk throughput to go
> down by also stopping *other* writeback is not.

Agreed, but none of this is stable stuff at this point. sync(2)
modifications need a fair bit of testing, and I haven't really done
any failure testing to determine whether it is completely safe or
not yet. That's why I haven't posted my entire patch series yet.

As it is, I suspect this patch has a negative effect on NFS client
behaviour on sync, because NFS relies on ->write_inode to send a
commit to the server to transition pages from unstable to clean.
The NFS client has no ->sync_fs method, and hence it appears to be
completely relying on __writeback_single_inode() always waiting for
IO completion before calling ->write_inode for this to work
correctly.

> So it's the effect on the normal background writeback that makes me go
> "hmm - have we really always had that, or is this an effect of the old
> sync logic _mixed_ with all the bdflush -> worker changes"

We've always had it, but I've never really cared that much about
sync(2) performance. What is different right now is that I've been
running new tests that cause sync(2) to behave in nasty ways that
I've never seen before, and so I'm noticing certain behaviours for
the first time...

> The thing is, it used to be that bdflush didn't much care what a sync
> by another user was doing. But bdflush doesn't exist any more, it's
> all worker threads..

The flusher threads don't care what users are doing, either. All
they are supposed to do is dispatch IO efficiently. We've broken
that by adding a blocking path into the IO dispatch...

> > The trouble is with callers like write_inode_now() from iput_final().
> > For write_inode_now() to work correctly in that place, you must make sure
> > page writeback is finished before calling ->write_inode() because
> > filesystems may (and do) dirty the inode in their ->end_io callbacks. If
> > you don't wait you risk calling ->evict_inode() on a dirty inode and thus
> > losing some updates.
>
> My point was - why don't we move that sync thing into the caller (so
> write_inode_now() in this case)?

We could, but it's a mess of twisty passages. The problem is that
we've got several paths that end up in __writeback_single_inode(),
and some require blocking and some don't. And write_inode_now()
needs the I_SYNC synchronisation that writeback_single_inode()
provides. That's why it all funnels in to a single path that tries
to do everything for everyone.

> IOW, I'm not disputing the need for filemap_fdatawait() in the data
> paths. I'm just saying that maybe we could split things up - including
> that whole "write_inode()" call. Some users clearly want to do this in
> different orders.

Yes, we need to do that, but it's pretty major surgery because
it implies a separation of data and metadata writeback.

>
> That said, we might also just want to change the "sync_mode" thing.
> The thing that I dislike about this patch (even though I applied it)
> is that odd
>
> if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
>
> test. It doesn't make sense to me. It's a hack saying "I know that
> 'sync' does something special and doesn't actually want this
> particular WB_SYNC_ALL behavior at all". That's hacky. Moving that
> kind of "I know what the caller *really* meant" logic into the callers
> - by splitting up the logic - would get rid of the hacky part.

Yes, that would be nice, and something I'd like to do. But it's
pretty major surgery, and not something that I want to do under time
pressure.

> But another approach of getting rid of the hacky part might be to
> simple split - and rename - that "WB_SYNC_ALL" thing, and simply say
> "clearly 'sync()' and individual callers of 'write_inode_now()' have
> totally different expectations of the semantics of WB_SYNC_ALL". Which
> means that they really shouldn't share the same "sync_mode" at all.

*nod*

That's the fundamental problem - WB_SYNC_ALL means one thing
for filemap_fdatawrite() callers, something different for sync(2)
callers, and something different again for all other callers...

> So maybe we could just extend that "sync_mode", and have the ones that
> want to do _one_ inode synchronously use "WB_SYNC_SINGLE" to make it
> clear that they are syncing a single inode. Vs "WB_SYNC_ALL" that
> would be used for "I'm syncing all inodes, and I'll do a separate
> second pass for syncing".
>
> Then that test would become
>
> if (wbc->sync_mode == WB_SYNC_SINGLE) {
>
> instead, and now "sync_mode" would actually describe what mode of
> syncing the caller wants, without that hacky special "we know what the
> caller _really_ meant by looking at *which* caller it is".

The problem is that all the code that currently looks for
WB_SYNC_ALL for its behavioural cue during writeback now has
multiple different modes they have to handle. IOWs, it's not a
straightforward conversion process. WB_SYNC_ALL reaches right down
into filesystem ->writepages implementations and they all need to be
changed if we make up a new sync_mode behaviour.

> See what my objection to the code is? And maybe there is yet another
> solution to the oddity, I've just outlined two possible ones..

They are two possibilities that I've been considering over the past few
days - I just haven't vocalised them because I haven't thought them
through completely yet.

I agree with you that this patch is a quick hack that works around
the underlying problem. But I don't have months of spare time
right now to do a complete overhaul of the inode writeback code to
fix the underlying problem. I've basically resigned myself to
spending a day a week over the next few months cleaning this cruft
up. We're not going to fix the problem in a single patchset for
3.11...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-03 03:28:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner <[email protected]> wrote:
>>
>> Then that test would become
>>
>> if (wbc->sync_mode == WB_SYNC_SINGLE) {
>>
>> instead, and now "sync_mode" would actually describe what mode of
>> syncing the caller wants, without that hacky special "we know what the
>> caller _really_ meant by looking at *which* caller it is".
>
> The problem is that all the code that currently looks for
> WB_SYNC_ALL for its behavioural cue during writeback now has
> multiple different modes they have to handle. IOWs, it's not a
> straightforward conversion process. WB_SYNC_ALL reaches right down
> into filesystem ->writepages implementations and they all need to be
> changed if we make up a new sync_mode behaviour.

I have to admit that I absolutely detest our current "sync_mode" to
begin with, so I'd personally be happy to see some major surgery in
this area.

For example, maybe we'd be much better off with something that has
various behavioral flags rather than distinct "mode values". So
instead of being an enum of different reasons for syncing, it would be
a set of bitmasks for specific sync behavior. We have a much better
sync model in our sync_file_range() model, where we have flags like
SYNC_FILE_RANGE_WAIT_xxx (where xxx is BEFORE, WRITE, AFTER to
describe whether you should wait for old writes, start new writes, or
wait after the newly started writes).

That's a very powerful model, and it's also much more easy to think
about. So the above test could become

if (wbc->sync_mode & WB_SYNC_AFTER) {
int err = filemap_fdatawait(mapping);
....

in that kind of model, and the code actually looks sensible. It reads
like "if the caller asked us to synchronize after writing, then we do
an fdatawait on the mapping".

So I think something like that might make sense. And there aren't
_that_ many users of WB_SYNC_xxx, and the patch should be pretty
straightforward. WB_SYNC_NONE semantics would presumably be "just
start writeout" (so it would become WB_SYNC_WRITE), while WB_SYNC_ALL
would become (WB_SYNC_BEFORE | WB_SYNC_WRITE | WB_SYNC_AFTER), but
then the "for_sync" case would remove WB_SYNC_AFTER, because it does
its own waiting after.
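As a sketch of what that could look like (none of these bit names exist
upstream, they only illustrate the sync_file_range()-style proposal):

/* hypothetical flag-based sync behaviour */
#define WB_SYNC_BEFORE  (1 << 0)   /* wait for previously started writeback first */
#define WB_SYNC_WRITE   (1 << 1)   /* start writeout of dirty pages */
#define WB_SYNC_AFTER   (1 << 2)   /* wait for the writeout started here */

/*
 * Rough mapping of the current modes:
 *   WB_SYNC_NONE -> WB_SYNC_WRITE
 *   WB_SYNC_ALL  -> WB_SYNC_BEFORE | WB_SYNC_WRITE | WB_SYNC_AFTER
 *   for_sync     -> WB_SYNC_BEFORE | WB_SYNC_WRITE
 *                   (sync(2) does its own waiting in wait_sb_inodes())
 */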

Sounds fairly sensible and straightforward to me. Much more
self-explanatory than the current "WB_SYNC_NONE/ALL" distinction,
methinks (well, you'd still have to explain what the point of
BEFORE/AFTER is, and how it interacts with starting writeout, but
especially since we already have that concept for sync_file_range(), I
think that's not too bad).

Linus

2013-07-03 04:49:16

by Dave Chinner

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Tue, Jul 02, 2013 at 08:28:42PM -0700, Linus Torvalds wrote:
> On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner <[email protected]> wrote:
> >>
> >> Then that test would become
> >>
> >> if (wbc->sync_mode == WB_SYNC_SINGLE) {
> >>
> >> instead, and now "sync_mode" would actually describe what mode of
> >> syncing the caller wants, without that hacky special "we know what the
> >> caller _really_ meant by looking at *which* caller it is".
> >
> > The problem is that all the code that currently looks for
> > WB_SYNC_ALL for its behavioural cue during writeback now has
> > multiple different modes they have to handle. IOWs, it's not a
> > straightforward conversion process. WB_SYNC_ALL reaches right down
> > into filesystem ->writepages implementations and they all need to be
> > changed if we make up a new sync_mode behaviour.
>
> I have to admit that I absolutely detest our current "sync_mode" to
> begin with, so I'd personally be happy to see some major surgery in
> this area.
>
> For example, maybe we'd be much better off with something that has
> various behavioral flags rather than distinct "mode values". So
> instead of being an enum of different reasons for syncing, it would be
> a set of bitmasks for specific sync behavior. We have a much better
> sync model in our sync_file_range() model, where we have flags like
> SYNC_FILE_RANGE_WAIT_xxx (where xxx is BEFORE, WRITE, AFTER to
> describe whether you should wait for old writes, start new writes, or
> wait after the newly started writes).

I agree that the flag model for writeback behaviour that it uses is
a good example to follow.

> That's a very powerful model, and it's also much more easy to think
> about. So the above test could become
>
> if (wbc->sync_mode & WB_SYNC_AFTER) {
> int err = filemap_fdatawait(mapping);
> ....
>
> in that kind of model, and the code actually looks sensible. It reads
> like "if the caller asked us to synchronize after writing, then we do
> an fdatawait on the mapping".

*nod*

> So I think something like that might make sense. And there aren't
> _that_ many users of WB_SYNC_xxx, and the patch should be pretty
> straightforward.

Patching might be straightforward, but the testing isn't - it's
spread across 30 different filesystems...

> WB_SYNC_NONE semantics would presumably be "just
> start writeout" (so it would become WB_SYNC_WRITE), while WB_SYNC_ALL
> would become (WB_SYNC_BEFORE | WB_SYNC_WRITE | WB_SYNC_AFTER), but
> then the "for_sync" case would remove WB_SYNC_AFTER, because it does
> its own waiting after.

Not exactly. WB_SYNC_NONE currently means "best effort writeback"
which means if we are going to block on a lock, it's ok to abort
writeback as we can try again later. For writeback, WB_SYNC_ALL
means "write everything that is dirty" and means we must block
waiting for locks to do the required writeback. So there's WB_WRITE
and WB_WRITE_SYNC behaviours at minimum.
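To illustrate the difference, a simplified sketch of the kind of check a
writeback loop makes today (loosely modelled on write_cache_pages(), not a
verbatim copy of it):

static int writeback_one_page_sketch(struct page *page,
                                     struct writeback_control *wbc)
{
        lock_page(page);

        if (PageWriteback(page)) {
                if (wbc->sync_mode == WB_SYNC_NONE) {
                        /* best effort: skip the busy page, a later pass gets it */
                        unlock_page(page);
                        return 0;
                }
                /* data integrity: we must not skip anything that is dirty */
                wait_on_page_writeback(page);
        }

        /* ... hand the locked page to ->writepage(), which unlocks it ... */
        return 0;
}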

Also, we don't currently have a WB_SYNC_BEFORE use case, so I'm not
sure we want to add code for something nobody currently uses. We can
add that later if anyone ever needs it to be implemented....

> Sounds fairly sensible and straightforward to me. Much more
> self-explanatory than the current "WB_SYNC_NONE/ALL" distinction,
> methinks (well, you'd still have to explain what the point of
> BEFORE/AFTER is, and how it interacts with starting writeout, but
> especially since we already have that concept for sync_file_range(), I
> think that's not too bad).

Yeah, it makes sense from a high-level perspective, but we've also
got to handle inode metadata writeback as well (which can be done
separately to data writeback via sync_inode_metadata()) and that
means we probably need WB_WRITE_META flags as well.

I suspect that we will also need to have filesystems set their
default behaviour flags on the superblock as well so that we can
make things like filemap_fdatawrite() and friends do exactly the
right thing for each filesystem without callers needing to pass
in flags. E.g. XFS doesn't require any data waiting or inode
metadata writeback, while ext2 requires inode metadata writeback
after data writeback dispatch and NFS needs WB_SYNC_AFTER behaviour
before inode metadata writeback....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-04 07:19:50

by Andrew Morton

[permalink] [raw]
Subject: Re: frequent softlockups with 3.10rc6.

On Wed, 3 Jul 2013 14:49:01 +1000 Dave Chinner <[email protected]> wrote:

> On Tue, Jul 02, 2013 at 08:28:42PM -0700, Linus Torvalds wrote:
> > On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner <[email protected]> wrote:
> > >>
> > >> Then that test would become
> > >>
> > >> if (wbc->sync_mode == WB_SYNC_SINGLE) {
> > >>
> > >> instead, and now "sync_mode" would actually describe what mode of
> > >> syncing the caller wants, without that hacky special "we know what the
> > >> caller _really_ meant by looking at *which* caller it is".
> > >
> > > The problem is that all the code that currently looks for
> > > WB_SYNC_ALL for its behavioural cue during writeback now has
> > > multiple different modes they have to handle. IOWs, it's not a
> > > straightforward conversion process. WB_SYNC_ALL reaches right down
> > > into filesystem ->writepages implementations and they all need to be
> > > changed if we make up a new sync_mode behaviour.
> >
> > I have to admit that I absolutely detest our current "sync_mode" to
> > begin with, so I'd personally be happy to see some major surgery in
> > this area.

Forgive me, I was young.

>
> > WB_SYNC_NONE semantics would presumably be "just
> > start writeout" (so it would become WB_SYNC_WRITE), while WB_SYNC_ALL
> > would become (WB_SYNC_BEFORE | WB_SYNC_WRITE | WB_SYNC_AFTER), but
> > then the "for_sync" case would remove WB_SYNC_AFTER, because it does
> > its own waiting after.
>
> Not exactly. WB_SYNC_NONE currently means "best effort writeback"

Yup. WB_SYNC_NONE means "this is for memory cleaning" and WB_SYNC_ALL
means "this is for data integrity". They're two quite different
concepts whose implementations share a ton of code.

That being said, yes, sync_mode is pretty dorky and switching to a set
of very carefully defined flags makes sense.