2016-05-02 07:36:00

by Mkrtchyan, Tigran

Subject: blocked on the spin lock



Hi Dros et al.

We have seen the following stack trace on one of our systems:


Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s! [tee:13755]
Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
Apr 28 22:09:51 bird510 kernel: CPU 7
Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
Apr 28 22:09:51 bird510 kernel:
Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P -- ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>] [<ffffffff8153bde1>] _spin_lock+0x21/0x30
Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX: 0000000000000000
Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI: ffff880159aeb8d0
Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09: 0000000000000002
Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12: 0000000000000001
Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15: 0000000000000000
Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4: 00000000000007e0
Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo ffff880352814000, task ffff880636064ab0)
Apr 28 22:09:51 bird510 kernel: Stack:
Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1 ffff880159aeb8d0 ffff880159aeb770
Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000 ffffffffa04bd040 ffffffffa048f39a
Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0 ffff880352817a48 ffffffffa0490ec8
Apr 28 22:09:51 bird510 kernel: Call Trace:
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
Apr 28 22:09:51 bird510 kernel: Call Trace:
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b




The questionable spin lock was removed by commit 411a99adf. Was there a problem?
Do we need a backport in the RHEL kernel?

Thanks a lot,
Tigran.


2016-05-02 13:30:32

by Weston Andros Adamson

Subject: Re: blocked on the spin lock


> On May 2, 2016, at 3:29 AM, Mkrtchyan, Tigran <[email protected]> wrote:
>
>
>
> Hi Dros et al.
>
> We have seen the following stack trace on one of our systems:
>
>
> Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s! [tee:13755]
> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
> Apr 28 22:09:51 bird510 kernel: CPU 7
> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
> Apr 28 22:09:51 bird510 kernel:
> Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P -- ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
> Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>] [<ffffffff8153bde1>] _spin_lock+0x21/0x30
> Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
> Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI: ffff880159aeb8d0
> Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09: 0000000000000002
> Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12: 0000000000000001
> Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
> Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4: 00000000000007e0
> Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo ffff880352814000, task ffff880636064ab0)
> Apr 28 22:09:51 bird510 kernel: Stack:
> Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1 ffff880159aeb8d0 ffff880159aeb770
> Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000 ffffffffa04bd040 ffffffffa048f39a
> Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0 ffff880352817a48 ffffffffa0490ec8
> Apr 28 22:09:51 bird510 kernel: Call Trace:
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
> Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
> Apr 28 22:09:51 bird510 kernel: Call Trace:
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>
>
>
>
> The questionable spin lock was removed by commit 411a99adf. Was there a problem?
> Do we need a backport in the RHEL kernel?
>
> Thanks a lot,
> Tigran.
>

Hey Tigran,

It’s been a while since I looked at this, so I don’t know off the top of my head, but IIRC the patch series that 411a99adf is part of contains some important bug fixes for the effort to split requests into sub-page regions:

411a99adffb4 nfs: clear_request_commit while holding i_lock
e6cf82d1830f pnfs: add pnfs_put_lseg_async
02d1426c7053 pnfs: find swapped pages on pnfs commit lists too
b412ddf0661e nfs: fix comment and add warn_on for PG_INODE_REF
e7029206ff43 nfs: check wait_on_bit_lock err in page_group_lock

Also, if you are using the pnfs_nfs commit path, there are several recent fixes that are pretty important. Until recently, the only server to do NFS backed pnfs did only STABLE writes, so that stuff was pretty much untested…

In fact, I have a patch to post today that fixes a problem in the commit path, although it probably isn’t an issue in RHEL as it was a regression introduced by a recent patch.

-dros


2016-05-02 20:07:18

by Mkrtchyan, Tigran

Subject: Re: blocked on the spin lock

Hi Dros,

As you can see from the stack trace, this is NFSv3.
For now, it has happened only once. Maybe there is some change in
our workflow. I will let you know if we see it again.


Thanks,
Tigran.

----- Original Message -----
> From: "Weston Andros Adamson" <[email protected]>
> To: "Tigran Mkrtchyan" <[email protected]>
> Cc: "linux-nfs list" <[email protected]>, "Steve Dickson" <[email protected]>, "Andy Adamson"
> <[email protected]>
> Sent: Monday, May 2, 2016 3:30:29 PM
> Subject: Re: blocked on the spin lock

>> On May 2, 2016, at 3:29 AM, Mkrtchyan, Tigran <[email protected]> wrote:
>>
>>
>>
>> Hi Dros et al.
>>
>> We have seen the following stack trace on one of our systems:
>>
>>
>> Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s!
>> [tee:13755]
>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu
>> nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat
>> usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc
>> cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support
>> bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2
>> mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib
>> ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
>> ipmi_devintf]
>> Apr 28 22:09:51 bird510 kernel: CPU 7
>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu
>> nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat
>> usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc
>> cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support
>> bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2
>> mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib
>> ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
>> ipmi_devintf]
>> Apr 28 22:09:51 bird510 kernel:
>> Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P --
>> ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
>> Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>]
>> [<ffffffff8153bde1>] _spin_lock+0x21/0x30
>> Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
>> Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX:
>> 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI:
>> ffff880159aeb8d0
>> Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09:
>> 0000000000000002
>> Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12:
>> 0000000000000001
>> Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15:
>> 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000)
>> GS:ffff880028260000(0000) knlGS:0000000000000000
>> Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
>> 000000008005003b
>> Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4:
>> 00000000000007e0
>> Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo
>> ffff880352814000, task ffff880636064ab0)
>> Apr 28 22:09:51 bird510 kernel: Stack:
>> Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1
>> ffff880159aeb8d0 ffff880159aeb770
>> Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000
>> ffffffffa04bd040 ffffffffa048f39a
>> Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0
>> ffff880352817a48 ffffffffa0490ec8
>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ?
>> nfs_clear_request_commit+0xa1/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ?
>> nfs_page_find_request_locked+0x2a/0x40 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ?
>> nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ?
>> nfs_invalidate_page+0x47/0x80 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ?
>> do_invalidatepage+0x25/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ?
>> truncate_inode_page+0xa2/0xc0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ?
>> truncate_inode_pages_range+0x16f/0x500
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ?
>> mempool_free_slab+0x17/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ?
>> truncate_inode_pages+0x15/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ?
>> truncate_pagecache+0x4f/0x70
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ?
>> nfs_setattr_update_inode+0xb9/0x150 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ?
>> nfs3_proc_setattr+0xdb/0x120 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50
>> [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170
>> [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ?
>> security_inode_permission+0x1f/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ?
>> strncpy_from_user+0x4a/0x90
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ?
>> system_call_fastpath+0x16/0x1b
>> Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5
>> 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90
>> 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ?
>> nfs_clear_request_commit+0xa1/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ?
>> nfs_page_find_request_locked+0x2a/0x40 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ?
>> nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ?
>> nfs_invalidate_page+0x47/0x80 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ?
>> do_invalidatepage+0x25/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ?
>> truncate_inode_page+0xa2/0xc0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ?
>> truncate_inode_pages_range+0x16f/0x500
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ?
>> mempool_free_slab+0x17/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ?
>> truncate_inode_pages+0x15/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ?
>> truncate_pagecache+0x4f/0x70
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ?
>> nfs_setattr_update_inode+0xb9/0x150 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ?
>> nfs3_proc_setattr+0xdb/0x120 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50
>> [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170
>> [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ?
>> security_inode_permission+0x1f/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ?
>> strncpy_from_user+0x4a/0x90
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ?
>> system_call_fastpath+0x16/0x1b
>>
>>
>>
>>
>> The questionable spin lock was removed by commit 411a99adf. Was there a
>> problem?
>> Do we need a backport in the RHEL kernel?
>>
>> Thanks a lot,
>> Tigran.
>>
>
> Hey Tigran,
>
> It’s been a while since I looked at this, so I don’t know off the top of my
> head, but IIRC the patch series that 411a99adf is part of contains some important
> bug fixes for the effort to split requests into sub-page regions:
>
> 411a99adffb4 nfs: clear_request_commit while holding i_lock
> e6cf82d1830f pnfs: add pnfs_put_lseg_async
> 02d1426c7053 pnfs: find swapped pages on pnfs commit lists too
> b412ddf0661e nfs: fix comment and add warn_on for PG_INODE_REF
> e7029206ff43 nfs: check wait_on_bit_lock err in page_group_lock
>
> Also, if you are using the pnfs_nfs commit path, there are several recent fixes
> that are pretty important. Until recently, the only server to do NFS backed
> pnfs did only STABLE writes, so that stuff was pretty much untested…
>
> In fact, I have a patch to post today that fixes a problem in the commit path,
> although it probably isn’t an issue in RHEL as it was a regression introduced
> by a recent patch.
>
> -dros

2016-05-02 21:08:51

by Weston Andros Adamson

Subject: Re: blocked on the spin lock



> On May 2, 2016, at 4:07 PM, Mkrtchyan, Tigran <[email protected]> wrote:
>
> Hi Dros,
>
> As you can see from the stack trace, this is NFSv3.

Oh, I didn't notice that!

> For now, it has happened only once. Maybe there is some change in
> our workflow. I will let you know if we see it again.

OK. If you do, try applying that patch set and see if it’s still reproducible.

-dros

>
>
> Thanks,
> Tigran.
>
> ----- Original Message -----
>> From: "Weston Andros Adamson" <[email protected]>
>> To: "Tigran Mkrtchyan" <[email protected]>
>> Cc: "linux-nfs list" <[email protected]>, "Steve Dickson" <[email protected]>, "Andy Adamson"
>> <[email protected]>
>> Sent: Monday, May 2, 2016 3:30:29 PM
>> Subject: Re: blocked on the spin lock
>
>>> On May 2, 2016, at 3:29 AM, Mkrtchyan, Tigran <[email protected]> wrote:
>>>
>>>
>>>
>>> Hi Dros et al.
>>>
>>> We have seen the following stack trace on one of our systems:
>>>
>>>
>>> Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s!
>>> [tee:13755]
>>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu
>>> nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat
>>> usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc
>>> cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support
>>> bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2
>>> mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib
>>> ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
>>> ipmi_devintf]
>>> Apr 28 22:09:51 bird510 kernel: CPU 7
>>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu
>>> nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat
>>> usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc
>>> cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support
>>> bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2
>>> mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib
>>> ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
>>> ipmi_devintf]
>>> Apr 28 22:09:51 bird510 kernel:
>>> Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P --
>>> ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
>>> Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>]
>>> [<ffffffff8153bde1>] _spin_lock+0x21/0x30
>>> Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
>>> Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX:
>>> 0000000000000000
>>> Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI:
>>> ffff880159aeb8d0
>>> Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09:
>>> 0000000000000002
>>> Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12:
>>> 0000000000000001
>>> Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15:
>>> 0000000000000000
>>> Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000)
>>> GS:ffff880028260000(0000) knlGS:0000000000000000
>>> Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>> Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4:
>>> 00000000000007e0
>>> Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>> 0000000000000400
>>> Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo
>>> ffff880352814000, task ffff880636064ab0)
>>> Apr 28 22:09:51 bird510 kernel: Stack:
>>> Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1
>>> ffff880159aeb8d0 ffff880159aeb770
>>> Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000
>>> ffffffffa04bd040 ffffffffa048f39a
>>> Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0
>>> ffff880352817a48 ffffffffa0490ec8
>>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ?
>>> nfs_clear_request_commit+0xa1/0xf0 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ?
>>> nfs_page_find_request_locked+0x2a/0x40 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ?
>>> nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ?
>>> nfs_invalidate_page+0x47/0x80 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ?
>>> do_invalidatepage+0x25/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ?
>>> truncate_inode_page+0xa2/0xc0
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ?
>>> truncate_inode_pages_range+0x16f/0x500
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ?
>>> mempool_free_slab+0x17/0x20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ?
>>> truncate_inode_pages+0x15/0x20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ?
>>> truncate_pagecache+0x4f/0x70
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ?
>>> nfs_setattr_update_inode+0xb9/0x150 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ?
>>> nfs3_proc_setattr+0xdb/0x120 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50
>>> [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170
>>> [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ?
>>> security_inode_permission+0x1f/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ?
>>> strncpy_from_user+0x4a/0x90
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ?
>>> system_call_fastpath+0x16/0x1b
>>> Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5
>>> 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90
>>> 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
>>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ?
>>> nfs_clear_request_commit+0xa1/0xf0 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ?
>>> nfs_page_find_request_locked+0x2a/0x40 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ?
>>> nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ?
>>> nfs_invalidate_page+0x47/0x80 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ?
>>> do_invalidatepage+0x25/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ?
>>> truncate_inode_page+0xa2/0xc0
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ?
>>> truncate_inode_pages_range+0x16f/0x500
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ?
>>> mempool_free_slab+0x17/0x20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ?
>>> truncate_inode_pages+0x15/0x20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ?
>>> truncate_pagecache+0x4f/0x70
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ?
>>> nfs_setattr_update_inode+0xb9/0x150 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ?
>>> nfs3_proc_setattr+0xdb/0x120 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50
>>> [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170
>>> [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ?
>>> security_inode_permission+0x1f/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ?
>>> strncpy_from_user+0x4a/0x90
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ?
>>> system_call_fastpath+0x16/0x1b
>>>
>>>
>>>
>>>
>>> The questionable spin lock was removed by commit 411a99adf. Was there a
>>> problem?
>>> Do we need a backport in the RHEL kernel?
>>>
>>> Thanks a lot,
>>> Tigran.
>>>
>>
>> Hey Tigran,
>>
>> It’s been a while since I looked at this, so I don’t know off the top of my
>> head, but IIRC the patch series that 411a99adf is part of contains some important
>> bug fixes for the effort to split requests into sub-page regions:
>>
>> 411a99adffb4 nfs: clear_request_commit while holding i_lock
>> e6cf82d1830f pnfs: add pnfs_put_lseg_async
>> 02d1426c7053 pnfs: find swapped pages on pnfs commit lists too
>> b412ddf0661e nfs: fix comment and add warn_on for PG_INODE_REF
>> e7029206ff43 nfs: check wait_on_bit_lock err in page_group_lock
>>
>> Also, if you are using the pnfs_nfs commit path, there are several recent fixes
>> that are pretty important. Until recently, the only server to do NFS backed
>> pnfs did only STABLE writes, so that stuff was pretty much untested…
>>
>> In fact, I have a patch to post today that fixes a problem in the commit path,
>> although it probably isn’t an issue in RHEL as it was a regression introduced
>> by a recent patch.
>>
>> -dros


2016-05-16 19:02:52

by David Wysochanski

Subject: Re: blocked on the spin lock

On Mon, 2016-05-02 at 09:29 +0200, Mkrtchyan, Tigran wrote:
>
> Hi Dros et al.
>
> We have seen the following stack trace on one of our systems:
>
>
> Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s! [tee:13755]
> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
> Apr 28 22:09:51 bird510 kernel: CPU 7
> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
> Apr 28 22:09:51 bird510 kernel:
> Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P -- ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
> Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>] [<ffffffff8153bde1>] _spin_lock+0x21/0x30
> Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
> Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI: ffff880159aeb8d0
> Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09: 0000000000000002
> Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12: 0000000000000001
> Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
> Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4: 00000000000007e0
> Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo ffff880352814000, task ffff880636064ab0)
> Apr 28 22:09:51 bird510 kernel: Stack:
> Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1 ffff880159aeb8d0 ffff880159aeb770
> Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000 ffffffffa04bd040 ffffffffa048f39a
> Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0 ffff880352817a48 ffffffffa0490ec8
> Apr 28 22:09:51 bird510 kernel: Call Trace:
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
> Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
> Apr 28 22:09:51 bird510 kernel: Call Trace:
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>
>
>
>
> The questionable spin lock was removed by commit 411a99adf. Was there a problem?
> Do we need a backport in the RHEL kernel?
>


This is a known bug, 1322823. It should be fixed in the RHEL6.7
kernel-2.6.32-573.27.1.el6 or above, as well as the RHEL6.8
kernel-2.6.32-642.el6 or higher. It only affects kernels from
2.6.32-573.10.2.el6 until kernel-2.6.32-573.27.1.el6. The bug is
private, but see the description below.

This bug was an unfortunate side-effect of the fix for another nasty bug,
Bug 1135601 - System panics with "kernel BUG at fs/nfs/inode.c:113!".
Bug 1135601 was fixed by a RHEL6 equivalent of the following upstream
patch: 3e21704 nfs: handle multiple reqs in nfs_wb_page_cancel

Saving &inode->i_lock in an NFS-specific structure seems problematic,
which may be why it was missed in the backport and partially removed in
commit 411a99adf. I posted a related cleanup patch in this area:
https://www.spinics.net/lists/linux-nfs/msg57194.html



Bug 1322823 - RHEL6.7.z: nfs client kernel deadlock due to inode->i_lock
== nfs_commit_info.lock in code path nfs_wb_page_cancel ->
nfs_clear_request_commit due to patch to fix bug 1135601

Dave Wysochanski 2016-03-31 07:50:18 EDT

Description of problem:
Repeated kernel deadlock hit multiple times by a customer, caused by the current fix for bug 1135601. Basically, the NFS client / RPC deadlocks because a spinlock is taken twice. We have vmcores.

I think the fix for 1135601 missed the following "finer point" of the
NFS client: inode->i_lock == nfs_commit_info.lock. If that's true it
may be OK, but it's not obvious, and it's unclear why the code isn't
just using inode->i_lock instead of apparently copying its address into
nfs_commit_info.lock. Indeed, looking at the code we have:
~~~
static void nfs_init_cinfo_from_inode(struct nfs_commit_info *cinfo,
struct inode *inode)
{
cinfo->lock = &inode->i_lock;
cinfo->mds = &NFS_I(inode)->commit_info;
cinfo->ds = pnfs_get_ds_info(inode);
cinfo->dreq = NULL;
cinfo->completion_ops = &nfs_commit_completion_ops;
}
"fs/nfs/write.c" 1927L, 50230C
~~~


== Technical explanation of deadlock seen in vmcore ==

OK, with the understanding that inode->i_lock == nfs_commit_info.lock, take a look at pid 11305:
~~~
crash> bt
PID: 11305 TASK: ffff881c7715aab0 CPU: 8 COMMAND: "sas"
#0 [ffff881078806e90] crash_nmi_callback at ffffffff81033cf6
#1 [ffff881078806ea0] notifier_call_chain at ffffffff8153f545
#2 [ffff881078806ee0] atomic_notifier_call_chain at ffffffff8153f5aa
#3 [ffff881078806ef0] notify_die at ffffffff810a783e
#4 [ffff881078806f20] do_nmi at ffffffff8153d203
#5 [ffff881078806f50] nmi at ffffffff8153cac0
[exception RIP: _spin_lock+0x1c]
RIP: ffffffff8153c32c RSP: ffff88104032fb58 RFLAGS: 00000297
RAX: 00000000000009c1 RBX: ffff880a82b84bc0 RCX: 0000000000000000
RDX: 00000000000009c0 RSI: ffff8809ff681820 RDI: ffff8809ff6818d0
RBP: ffff88104032fb58 R8: ffff88104032fc80 R9: 0000000000000002
R10: ffffea0022eec790 R11: 0000000000000003 R12: ffff88104032fb68
R13: 0000000000000000 R14: ffff8809ff6818d0 R15: 0000000000000020
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88104032fb58] _spin_lock at ffffffff8153c32c
#7 [ffff88104032fb60] nfs_clear_request_commit at ffffffffa03e9dc1 [nfs]
#8 [ffff88104032fbb0] nfs_wb_page_cancel at ffffffffa03e9ec8 [nfs]
#9 [ffff88104032fbe0] nfs_invalidate_page at ffffffffa03d74f7 [nfs]
#10 [ffff88104032fc00] do_invalidatepage at ffffffff8113edb5
#11 [ffff88104032fc10] truncate_inode_page at ffffffff8113f0d2
#12 [ffff88104032fc30] truncate_inode_pages_range at ffffffff8113f47f
#13 [ffff88104032fd20] truncate_inode_pages at ffffffff8113f8a5
#14 [ffff88104032fd30] truncate_pagecache at ffffffff8113f8ff
#15 [ffff88104032fd60] nfs_setattr_update_inode at ffffffffa03d9eb9 [nfs]
#16 [ffff88104032fd90] nfs3_proc_setattr at ffffffffa03eca6b [nfs]
#17 [ffff88104032fe20] nfs_setattr at ffffffffa03db610 [nfs]
#18 [ffff88104032fe60] notify_change at ffffffff811b1338
#19 [ffff88104032fed0] do_truncate at ffffffff8118fee4
#20 [ffff88104032ff40] sys_ftruncate at ffffffff81190290
#21 [ffff88104032ff80] system_call_fastpath at ffffffff8100b0d2
RIP: 00002b53f548a4d7 RSP: 00002b543fb36b50 RFLAGS: 00010246
RAX: 000000000000004d RBX: ffffffff8100b0d2 RCX: 00002b5426a29c48
RDX: 00002b543f11a880 RSI: 0000000000020000 RDI: 0000000000000028
RBP: 00002b543f11a880 R8: 0000000000010000 R9: 00002b544e8a63b0
R10: 000000000000002a R11: 0000000000000206 R12: 00002b543f11ad88
R13: 00002b53f9ae7fe0 R14: 00002b53fa43af20 R15: 00002b543f11a880
ORIG_RAX: 000000000000004d CS: 0033 SS: 002b
crash> dis -lr ffffffffa03e9ec8 | tail
0xffffffffa03e9eb6 <nfs_wb_page_cancel+0xa6>: pop %r13
0xffffffffa03e9eb8 <nfs_wb_page_cancel+0xa8>: pop %r14
0xffffffffa03e9eba <nfs_wb_page_cancel+0xaa>: leaveq
0xffffffffa03e9ebb <nfs_wb_page_cancel+0xab>: retq
0xffffffffa03e9ebc <nfs_wb_page_cancel+0xac>: nopl 0x0(%rax)
/usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/fs/nfs/write.c: 1787
0xffffffffa03e9ec0 <nfs_wb_page_cancel+0xb0>: mov %rbx,%rdi
0xffffffffa03e9ec3 <nfs_wb_page_cancel+0xb3>: callq 0xffffffffa03e9d20 <nfs_clear_request_commit>
/usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/arch/x86/include/asm/spinlock.h: 174
0xffffffffa03e9ec8 <nfs_wb_page_cancel+0xb8>: mov %r14,%rax
crash> dis -lr ffffffffa03e9dc1 | tail
0xffffffffa03e9dab <nfs_clear_request_commit+0x8b>: mov -0x10(%rbp),%rbx
0xffffffffa03e9daf <nfs_clear_request_commit+0x8f>: mov -0x8(%rbp),%r12
0xffffffffa03e9db3 <nfs_clear_request_commit+0x93>: leaveq
0xffffffffa03e9db4 <nfs_clear_request_commit+0x94>: retq
0xffffffffa03e9db5 <nfs_clear_request_commit+0x95>: nopl (%rax)
/usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/fs/nfs/write.c: 558
0xffffffffa03e9db8 <nfs_clear_request_commit+0x98>: mov -0x40(%rbp),%rdi
0xffffffffa03e9dbc <nfs_clear_request_commit+0x9c>: callq 0xffffffff8153c310 <_spin_lock>
/usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/arch/x86/include/asm/bitops.h: 252
0xffffffffa03e9dc1 <nfs_clear_request_commit+0xa1>: lock btrl $0x2,0x40(%rbx)
~~~


Now look at the code, specifically nfs_wb_page_cancel and nfs_clear_request_commit. On line 1780 we take inode->i_lock.
~~~
fs/nfs/write.c
1771
1772 int nfs_wb_page_cancel(struct inode *inode, struct page *page)
1773 {
1774 struct nfs_page *req;
1775 int ret = 0;
1776
1777 BUG_ON(!PageLocked(page));
1778 for (;;) {
1779 wait_on_page_writeback(page);
1780 spin_lock(&inode->i_lock); <----------- Takes spinlock the first time
1781 req = nfs_page_find_request_locked(page);
1782 if (req == NULL) {
1783 spin_unlock(&inode->i_lock);
1784 break;
1785 }
1786 if (nfs_lock_request(req)) {
1787 nfs_clear_request_commit(req); <---------- executing inside here
1788 spin_unlock(&inode->i_lock);


549 static void
550 nfs_clear_request_commit(struct nfs_page *req)
551 {
552 if (test_bit(PG_CLEAN, &req->wb_flags)) {
553 struct inode *inode = req->wb_context->dentry->d_inode;
554 struct nfs_commit_info cinfo;
555
556 nfs_init_cinfo_from_inode(&cinfo, inode);
557 if (!pnfs_clear_request_commit(req, &cinfo)) {
558--> spin_lock(cinfo.lock); <-------------- Tries to take spinlock a second time
559 nfs_request_remove_commit_list(req, &cinfo);
560 spin_unlock(cinfo.lock);
561 }
562 nfs_clear_page_commit(req->wb_page);
563 }
564 }
~~~
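
To make the failure mode concrete outside the kernel, here is a minimal
userspace sketch. This is an illustration only, not the actual NFS client
code: the struct and function names merely mirror the kernel call path
above, and an error-checking pthread mutex stands in for the spinlock so
the second acquisition returns EDEADLK instead of spinning forever the
way _spin_lock does in the soft lockup trace.
~~~
/* Illustrative userspace sketch of the self-deadlock described above.
 * Names mirror the kernel functions for readability only. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct fake_inode {
    pthread_mutex_t i_lock;        /* stands in for inode->i_lock */
};

struct fake_commit_info {
    pthread_mutex_t *lock;         /* copied pointer, like cinfo->lock */
};

/* Mirrors nfs_init_cinfo_from_inode(): copies the address of i_lock. */
static void init_cinfo_from_inode(struct fake_commit_info *cinfo,
                                  struct fake_inode *inode)
{
    cinfo->lock = &inode->i_lock;
}

/* Mirrors nfs_clear_request_commit(): takes cinfo->lock, unaware that
 * the caller already holds the very same lock. */
static void clear_request_commit(struct fake_inode *inode)
{
    struct fake_commit_info cinfo;

    init_cinfo_from_inode(&cinfo, inode);
    int err = pthread_mutex_lock(cinfo.lock);      /* second acquisition */
    if (err)
        printf("second lock attempt failed: %s\n", strerror(err));
    else
        pthread_mutex_unlock(cinfo.lock);
}

/* Mirrors nfs_wb_page_cancel(): takes i_lock, then calls the helper. */
static void wb_page_cancel(struct fake_inode *inode)
{
    pthread_mutex_lock(&inode->i_lock);            /* first acquisition */
    clear_request_commit(inode);                   /* would deadlock on a spinlock */
    pthread_mutex_unlock(&inode->i_lock);
}

int main(void)
{
    struct fake_inode inode;
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&inode.i_lock, &attr);

    wb_page_cancel(&inode);   /* prints "second lock attempt failed: ..." */

    pthread_mutex_destroy(&inode.i_lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
~~~
A real spinlock has no such self-deadlock check: the second spin_lock()
simply spins on a lock its own CPU already holds, which is exactly the
soft lockup reported at the top of this thread.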

Version-Release number of selected component (if applicable):
2.6.32-573.18.1.el6

How reproducible:
multiple times by one customer so far

Steps to Reproduce:
TBD

Actual results:
NFS client system deadlocks



2016-05-16 19:15:51

by Weston Andros Adamson

Subject: Re: blocked on the spin lock


> On May 16, 2016, at 3:02 PM, Dave Wysochanski <[email protected]> wrote:
>
> On Mon, 2016-05-02 at 09:29 +0200, Mkrtchyan, Tigran wrote:
>>
>> Hi Dros et al.
>>
>> We have seen the following stack trace on one of our systems:
>>
>>
>> Apr 28 22:09:51 bird510 kernel: BUG: soft lockup - CPU#7 stuck for 67s! [tee:13755]
>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
>> Apr 28 22:09:51 bird510 kernel: CPU 7
>> Apr 28 22:09:51 bird510 kernel: Modules linked in: ipmi_devintf dell_rbu nfs_layout_nfsv41_files fuse nfs lockd fscache auth_rpcgss nfs_acl vfat fat usb_storage mpt3sas mpt2sas raid_class mptctl openafs(P)(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf sg joy
>> dev power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 microcode dcdbas lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en p
>> tp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
>> Apr 28 22:09:51 bird510 kernel:
>> Apr 28 22:09:51 bird510 kernel: Pid: 13755, comm: tee Tainted: P -- ------------ 2.6.32-573.12.1.el6.x86_64 #1 Dell Inc. PowerEdge M610/0N582M
>> Apr 28 22:09:51 bird510 kernel: RIP: 0010:[<ffffffff8153bde1>] [<ffffffff8153bde1>] _spin_lock+0x21/0x30
>> Apr 28 22:09:51 bird510 kernel: RSP: 0018:ffff8803528179c8 EFLAGS: 00000297
>> Apr 28 22:09:51 bird510 kernel: RAX: 00000000000003b1 RBX: ffff8803528179c8 RCX: 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: RDX: 00000000000003b0 RSI: ffff880159aeb820 RDI: ffff880159aeb8d0
>> Apr 28 22:09:51 bird510 kernel: RBP: ffffffff8100bc0e R08: ffff880352817af0 R09: 0000000000000002
>> Apr 28 22:09:51 bird510 kernel: R10: ffffea0008efc428 R11: 0000000000000003 R12: 0000000000000001
>> Apr 28 22:09:51 bird510 kernel: R13: 001310facb5806bd R14: 0000000000000000 R15: 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: FS: 00007f3e78118700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
>> Apr 28 22:09:51 bird510 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> Apr 28 22:09:51 bird510 kernel: CR2: 00007fefa21ff000 CR3: 0000000352959000 CR4: 00000000000007e0
>> Apr 28 22:09:51 bird510 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Apr 28 22:09:51 bird510 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Apr 28 22:09:51 bird510 kernel: Process tee (pid: 13755, threadinfo ffff880352814000, task ffff880636064ab0)
>> Apr 28 22:09:51 bird510 kernel: Stack:
>> Apr 28 22:09:51 bird510 kernel: ffff880352817a18 ffffffffa0490dc1 ffff880159aeb8d0 ffff880159aeb770
>> Apr 28 22:09:51 bird510 kernel: <d> 0000000000000000 0000000000000000 ffffffffa04bd040 ffffffffa048f39a
>> Apr 28 22:09:51 bird510 kernel: <d> ffff8803378761c0 ffffea000a13dcc0 ffff880352817a48 ffffffffa0490ec8
>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>> Apr 28 22:09:51 bird510 kernel: Code: 01 74 05 e8 52 1c d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 <eb> f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f
>> Apr 28 22:09:51 bird510 kernel: Call Trace:
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490dc1>] ? nfs_clear_request_commit+0xa1/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f39a>] ? nfs_page_find_request_locked+0x2a/0x40 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0490ec8>] ? nfs_wb_page_cancel+0xb8/0xf0 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa047e4f7>] ? nfs_invalidate_page+0x47/0x80 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113edb5>] ? do_invalidatepage+0x25/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f0d2>] ? truncate_inode_page+0xa2/0xc0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f47f>] ? truncate_inode_pages_range+0x16f/0x500
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81129cd7>] ? mempool_free_slab+0x17/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8a5>] ? truncate_inode_pages+0x15/0x20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8113f8ff>] ? truncate_pagecache+0x4f/0x70
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480eb9>] ? nfs_setattr_update_inode+0xb9/0x150 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0493a6b>] ? nfs3_proc_setattr+0xdb/0x120 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa048f6b3>] ? nfs_wb_all+0x43/0x50 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0482610>] ? nfs_setattr+0xf0/0x170 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1118>] ? notify_change+0x168/0x340
>> Apr 28 22:09:51 bird510 kernel: [<ffffffffa0480de8>] ? nfs_open+0x78/0x90 [nfs]
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118fcb4>] ? do_truncate+0x64/0xa0
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff81232b9f>] ? security_inode_permission+0x1f/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811a4fc1>] ? do_filp_open+0x861/0xd20
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8129df8a>] ? strncpy_from_user+0x4a/0x90
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff811b1ff2>] ? alloc_fd+0x92/0x160
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118e967>] ? do_sys_open+0x67/0x130
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8118ea70>] ? sys_open+0x20/0x30
>> Apr 28 22:09:51 bird510 kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>>
>>
>>
>>
>> The questionable spin lock was removed by commit 411a99adf. Was there a problem?
>> Do we need a backport in the RHEL kernel?
>>
>
>
> This is a known bug, 1322823. It should be fixed in RHEL6.7
> kernel-2.6.32-573.27.1.el6 or later, as well as in RHEL6.8
> kernel-2.6.32-642.el6 or later. It only affects kernels from
> 2.6.32-573.10.2.el6 up to, but not including, kernel-2.6.32-573.27.1.el6.
> The bug is private, but see the description below.

Thanks for the detailed explanation!

-dros

>
> This bug was an unfortunate side-effect of a fix for another nasty bug,
> Bug 1135601 - System panics with "kernel BUG at fs/nfs/inode.c:113!".
> Bug 1135601 was fixed by a RHEL6 equivalent of the following upstream
> patch: 3e21704 nfs: handle multiple reqs in nfs_wb_page_cancel
>
> Saving &inode->i_lock in an NFS-specific structure seems problematic,
> which may be why it was missed in the backport and partially removed in
> commit 411a99adf. I posted a related cleanup patch in this area:
> https://www.spinics.net/lists/linux-nfs/msg57194.html
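>
> To give a sense of the direction such a cleanup could take (a hypothetical
> sketch only, not the posted patch), the commit info could carry the inode
> itself rather than a copy of the address of its spinlock, so locking sites
> visibly take inode->i_lock:
> ~~~
> /* Hypothetical variant of nfs_commit_info: keep the inode, not a
>  * pointer to its lock, so callers spell out inode->i_lock. */
> struct nfs_commit_info {
>         struct inode                    *inode;  /* instead of: spinlock_t *lock */
>         struct nfs_mds_commit_info      *mds;
>         struct pnfs_ds_commit_info      *ds;
>         struct nfs_direct_req           *dreq;
>         const struct nfs_commit_completion_ops *completion_ops;
> };
>
> static void nfs_init_cinfo_from_inode(struct nfs_commit_info *cinfo,
>                                       struct inode *inode)
> {
>         cinfo->inode = inode;           /* was: cinfo->lock = &inode->i_lock */
>         cinfo->mds = &NFS_I(inode)->commit_info;
>         cinfo->ds = pnfs_get_ds_info(inode);
>         cinfo->dreq = NULL;
>         cinfo->completion_ops = &nfs_commit_completion_ops;
> }
>
> /* ...and locking sites would then read spin_lock(&cinfo->inode->i_lock); */
> ~~~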
>
>
>
> Bug 1322823 - RHEL6.7.z: nfs client kernel deadlock due to inode->i_lock
> == nfs_commit_info.lock in code path nfs_wb_page_cancel ->
> nfs_clear_request_commit due to patch to fix bug 1135601
>
> Dave Wysochanski 2016-03-31 07:50:18 EDT
>
> Description of problem:
> Repeated kernel deadlock, hit multiple times by a customer, caused by the current fix for bug 1135601. Basically, the NFS client / RPC code deadlocks because a spinlock is taken twice. We have vmcores.
>
> I think the fix for 1135601 missed the following "finer point" of the NFS
> client: inode->i_lock and nfs_commit_info.lock are the same lock. If that
> is understood it may be OK, but it is not obvious, and it is unclear why
> the code does not just use inode->i_lock instead of copying its address
> into nfs_commit_info.lock. Indeed, looking at the code we have:
> ~~~
> static void nfs_init_cinfo_from_inode(struct nfs_commit_info *cinfo,
>                                       struct inode *inode)
> {
>         cinfo->lock = &inode->i_lock;
>         cinfo->mds = &NFS_I(inode)->commit_info;
>         cinfo->ds = pnfs_get_ds_info(inode);
>         cinfo->dreq = NULL;
>         cinfo->completion_ops = &nfs_commit_completion_ops;
> }
> "fs/nfs/write.c" 1927L, 50230C
> ~~~
>
>
> == Technical explanation of deadlock seen in vmcore ==
>
> OK, with the understanding that inode->i_lock == nfs_commit_info.lock, take a look at pid 11305:
> ~~~
> crash> bt
> PID: 11305 TASK: ffff881c7715aab0 CPU: 8 COMMAND: "sas"
> #0 [ffff881078806e90] crash_nmi_callback at ffffffff81033cf6
> #1 [ffff881078806ea0] notifier_call_chain at ffffffff8153f545
> #2 [ffff881078806ee0] atomic_notifier_call_chain at ffffffff8153f5aa
> #3 [ffff881078806ef0] notify_die at ffffffff810a783e
> #4 [ffff881078806f20] do_nmi at ffffffff8153d203
> #5 [ffff881078806f50] nmi at ffffffff8153cac0
> [exception RIP: _spin_lock+0x1c]
> RIP: ffffffff8153c32c RSP: ffff88104032fb58 RFLAGS: 00000297
> RAX: 00000000000009c1 RBX: ffff880a82b84bc0 RCX: 0000000000000000
> RDX: 00000000000009c0 RSI: ffff8809ff681820 RDI: ffff8809ff6818d0
> RBP: ffff88104032fb58 R8: ffff88104032fc80 R9: 0000000000000002
> R10: ffffea0022eec790 R11: 0000000000000003 R12: ffff88104032fb68
> R13: 0000000000000000 R14: ffff8809ff6818d0 R15: 0000000000000020
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #6 [ffff88104032fb58] _spin_lock at ffffffff8153c32c
> #7 [ffff88104032fb60] nfs_clear_request_commit at ffffffffa03e9dc1 [nfs]
> #8 [ffff88104032fbb0] nfs_wb_page_cancel at ffffffffa03e9ec8 [nfs]
> #9 [ffff88104032fbe0] nfs_invalidate_page at ffffffffa03d74f7 [nfs]
> #10 [ffff88104032fc00] do_invalidatepage at ffffffff8113edb5
> #11 [ffff88104032fc10] truncate_inode_page at ffffffff8113f0d2
> #12 [ffff88104032fc30] truncate_inode_pages_range at ffffffff8113f47f
> #13 [ffff88104032fd20] truncate_inode_pages at ffffffff8113f8a5
> #14 [ffff88104032fd30] truncate_pagecache at ffffffff8113f8ff
> #15 [ffff88104032fd60] nfs_setattr_update_inode at ffffffffa03d9eb9 [nfs]
> #16 [ffff88104032fd90] nfs3_proc_setattr at ffffffffa03eca6b [nfs]
> #17 [ffff88104032fe20] nfs_setattr at ffffffffa03db610 [nfs]
> #18 [ffff88104032fe60] notify_change at ffffffff811b1338
> #19 [ffff88104032fed0] do_truncate at ffffffff8118fee4
> #20 [ffff88104032ff40] sys_ftruncate at ffffffff81190290
> #21 [ffff88104032ff80] system_call_fastpath at ffffffff8100b0d2
> RIP: 00002b53f548a4d7 RSP: 00002b543fb36b50 RFLAGS: 00010246
> RAX: 000000000000004d RBX: ffffffff8100b0d2 RCX: 00002b5426a29c48
> RDX: 00002b543f11a880 RSI: 0000000000020000 RDI: 0000000000000028
> RBP: 00002b543f11a880 R8: 0000000000010000 R9: 00002b544e8a63b0
> R10: 000000000000002a R11: 0000000000000206 R12: 00002b543f11ad88
> R13: 00002b53f9ae7fe0 R14: 00002b53fa43af20 R15: 00002b543f11a880
> ORIG_RAX: 000000000000004d CS: 0033 SS: 002b
> crash> dis -lr ffffffffa03e9ec8 | tail
> 0xffffffffa03e9eb6 <nfs_wb_page_cancel+0xa6>: pop %r13
> 0xffffffffa03e9eb8 <nfs_wb_page_cancel+0xa8>: pop %r14
> 0xffffffffa03e9eba <nfs_wb_page_cancel+0xaa>: leaveq
> 0xffffffffa03e9ebb <nfs_wb_page_cancel+0xab>: retq
> 0xffffffffa03e9ebc <nfs_wb_page_cancel+0xac>: nopl 0x0(%rax)
> /usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/fs/nfs/write.c: 1787
> 0xffffffffa03e9ec0 <nfs_wb_page_cancel+0xb0>: mov %rbx,%rdi
> 0xffffffffa03e9ec3 <nfs_wb_page_cancel+0xb3>: callq 0xffffffffa03e9d20 <nfs_clear_request_commit>
> /usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/arch/x86/include/asm/spinlock.h: 174
> 0xffffffffa03e9ec8 <nfs_wb_page_cancel+0xb8>: mov %r14,%rax
> crash> dis -lr ffffffffa03e9dc1 | tail
> 0xffffffffa03e9dab <nfs_clear_request_commit+0x8b>: mov -0x10(%rbp),%rbx
> 0xffffffffa03e9daf <nfs_clear_request_commit+0x8f>: mov -0x8(%rbp),%r12
> 0xffffffffa03e9db3 <nfs_clear_request_commit+0x93>: leaveq
> 0xffffffffa03e9db4 <nfs_clear_request_commit+0x94>: retq
> 0xffffffffa03e9db5 <nfs_clear_request_commit+0x95>: nopl (%rax)
> /usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/fs/nfs/write.c: 558
> 0xffffffffa03e9db8 <nfs_clear_request_commit+0x98>: mov -0x40(%rbp),%rdi
> 0xffffffffa03e9dbc <nfs_clear_request_commit+0x9c>: callq 0xffffffff8153c310 <_spin_lock>
> /usr/src/debug/kernel-2.6.32-573.18.1.el6/linux-2.6.32-573.18.1.el6.x86_64/arch/x86/include/asm/bitops.h: 252
> 0xffffffffa03e9dc1 <nfs_clear_request_commit+0xa1>: lock btrl $0x2,0x40(%rbx)
> ~~~
>
>
> Now look at the code, specifically nfs_wb_page_cancel and nfs_clear_request_commit. On line 1780 we take inode->i_lock, and on line 558 nfs_clear_request_commit takes the same lock again through cinfo.lock.
> ~~~
> fs/nfs/write.c
> 1771
> 1772 int nfs_wb_page_cancel(struct inode *inode, struct page *page)
> 1773 {
> 1774         struct nfs_page *req;
> 1775         int ret = 0;
> 1776
> 1777         BUG_ON(!PageLocked(page));
> 1778         for (;;) {
> 1779                 wait_on_page_writeback(page);
> 1780                 spin_lock(&inode->i_lock);             <----------- Takes spinlock the first time
> 1781                 req = nfs_page_find_request_locked(page);
> 1782                 if (req == NULL) {
> 1783                         spin_unlock(&inode->i_lock);
> 1784                         break;
> 1785                 }
> 1786                 if (nfs_lock_request(req)) {
> 1787                         nfs_clear_request_commit(req); <---------- executing inside here
> 1788                         spin_unlock(&inode->i_lock);
>
>
>  549 static void
>  550 nfs_clear_request_commit(struct nfs_page *req)
>  551 {
>  552         if (test_bit(PG_CLEAN, &req->wb_flags)) {
>  553                 struct inode *inode = req->wb_context->dentry->d_inode;
>  554                 struct nfs_commit_info cinfo;
>  555
>  556                 nfs_init_cinfo_from_inode(&cinfo, inode);
>  557                 if (!pnfs_clear_request_commit(req, &cinfo)) {
>  558 -->                     spin_lock(cinfo.lock);         <-------------- Tries to take spinlock a second time
>  559                         nfs_request_remove_commit_list(req, &cinfo);
>  560                         spin_unlock(cinfo.lock);
>  561                 }
>  562                 nfs_clear_page_commit(req->wb_page);
>  563         }
>  564 }
> ~~~
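>
> To make the failure mode concrete, here is a minimal, self-contained
> userspace sketch (illustrative only, not kernel code): a non-recursive
> spinlock acquired twice by the same thread simply spins forever, which is
> what the stuck CPU in the traces above shows.
> ~~~
> /* Compile with: gcc -pthread deadlock.c -o deadlock
>  * Running it hangs on the second pthread_spin_lock(), mirroring
>  * nfs_wb_page_cancel() -> nfs_clear_request_commit() taking
>  * inode->i_lock (a.k.a. cinfo.lock) twice. */
> #include <pthread.h>
> #include <stdio.h>
>
> int main(void)
> {
>         pthread_spinlock_t lock;
>
>         pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
>
>         pthread_spin_lock(&lock);   /* first acquisition: spin_lock(&inode->i_lock) */
>         puts("outer lock held; second acquisition will spin forever");
>         pthread_spin_lock(&lock);   /* second acquisition: spin_lock(cinfo.lock)    */
>
>         /* never reached */
>         pthread_spin_unlock(&lock);
>         pthread_spin_destroy(&lock);
>         return 0;
> }
> ~~~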
>
> Version-Release number of selected component (if applicable):
> 2.6.32-573.18.1.el6
>
> How reproducible:
> multiple times by one customer so far
>
> Steps to Reproduce:
> TBD
>
> Actual results:
> NFS client system deadlocks

-dros


2016-05-16 20:59:46

by Mkrtchyan, Tigran

[permalink] [raw]
Subject: Re: blocked on the spin lock

Hi Dave,

thanks a lot for the update!

Tigran.

----- Original Message -----
> From: "Dave Wysochanski" <[email protected]>
> To: "Mkrtchyan, Tigran" <[email protected]>
> Cc: "Weston Andros Adamson" <[email protected]>, "Linux NFS Mailing List" <[email protected]>, "Steve
> Dickson" <[email protected]>, "Andy Adamson" <[email protected]>
> Sent: Monday, May 16, 2016 9:02:50 PM
> Subject: Re: blocked on the spin lock
