Date: Mon, 3 Aug 2015 12:49:07 -0400
From: Steven Rostedt <rostedt@goodmis.org>
To: Nikolay Borisov
Cc: linux-kernel@vger.kernel.org, oleg@redhat.com, riel@redhat.com,
 Peter Zijlstra
Subject: Re: HARD LOCKUP: Strange hard lock up on spin_lock(&sighand->siglock);
Message-ID: <20150803124907.13bfc3e4@gandalf.local.home>
In-Reply-To: <55B629E9.6020207@kyup.com>
References: <55B629E9.6020207@kyup.com>

On Mon, 27 Jul 2015 15:54:01 +0300
Nikolay Borisov wrote:

> Hello,
>
> I have a machine running kernel 3.13.3, in fact I have 3 such machines

3.13.3 is a very old kernel. A lot has changed since then. Also, there
are newer stable kernels of that release (3.13.11).

> which are connected via corosync in a cluster. On one of the machines
> I observed the following lock-up:
>
> Jul 26 06:01:14 shrek kernel: ------------[ cut here ]------------
> Jul 26 06:01:14 shrek kernel: NMI backtrace for cpu 8
> Jul 26 06:01:14 shrek kernel: CPU: 8 PID: 29979 Comm: sg-queue Tainted: W O 3.13.3-sg1 #6
> Jul 26 06:01:14 shrek kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
> Jul 26 06:01:14 shrek kernel: task: ffff880622796e40 ti: ffff880622797398 task.ti: ffff880622797398
> Jul 26 06:01:14 shrek kernel: RIP: 0010:[] [] ffffffff815ab0bd
> Jul 26 06:01:14 shrek kernel: RSP: 0018:ffff88067e6a1e68 EFLAGS: 00000097
> Jul 26 06:01:14 shrek kernel: RAX: 000000000000002e RBX: ffff880622796e40 RCX: ffff880622796e40
> Jul 26 06:01:14 shrek kernel: RDX: 0000000000000000 RSI: ffff88067e6a1eb8 RDI: ffff880621f44a08
> Jul 26 06:01:14 shrek kernel: RBP: ffff88067e6a1e68 R08: ffff88067e6a1eb8 R09: 000000000000751a
> Jul 26 06:01:14 shrek kernel: R10: 0000000000000008 R11: 0000000000000246 R12: ffff88067e6a1e98
> Jul 26 06:01:14 shrek kernel: R13: ffff88067e6a1eb8 R14: 0000000002d7c680 R15: 0000000002d7ccb0
> Jul 26 06:01:14 shrek kernel: FS: 0000037e10229700(0000) GS:ffff88087fc40000(0000) knlGS:0000000000000000
> Jul 26 06:01:14 shrek kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 26 06:01:14 shrek kernel: CR2: 0000037e0f8978d0 CR3: 00000000015c8000 CR4: 00000000001607f0
> Jul 26 06:01:14 shrek kernel: Stack:
> Jul 26 06:01:14 shrek kernel: ffff88067e6a1e88 ffffffff81053620 0000000000000000 0000000000000002
> Jul 26 06:01:14 shrek kernel: ffff88067e6a1ea8 ffffffff8105369a 0000000000000000 0000000002d7ccb0
> Jul 26 06:01:14 shrek kernel: ffff88067e6a1ef8 ffffffff81056fcb 0000000000000000 ffffffffffff4111
> Jul 26 06:01:14 shrek kernel: Call Trace:
> Jul 26 06:01:14 shrek kernel: [] __set_current_blocked+0x30/0x60
> Jul 26 06:01:14 shrek kernel: [] sigprocmask+0x4a/0x80
> Jul 26 06:01:14 shrek kernel: [] SyS_rt_sigprocmask+0xbb/0xd0
> Jul 26 06:01:14 shrek kernel: [] system_call_fastpath+0x16/0x1b
> Jul 26 06:01:14 shrek kernel: [] ? retint_swapgs+0xe/0x11
> Jul 26 06:01:14 shrek kernel: Code: 48 83 c4 08 b8 01 00 00 00 5b c9 c3 66 90 55 48 89 e5 fa b8 00 01 00 00 f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 75 04 c9 c3 f3 90 <0f> b6 07 38 c2 75 f7 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48
>
> My investigation showed that the CPUs are in the following state:
> CPUs 1, 2, 3, 6, 7, 8, 9, 10, 11, 18, 21, 22 are locked on the spinlock
> inside __lock_task_sighand, in a rcu_read_lock() section. Whereas

Except CPU 8 is blocked in sigprocmask(), not in __lock_task_sighand().

> CPUs 4, 5, 13, 14, 15, 16, 17, 19, 20, 23 are staying idle.
>
> From the logs I gather that 2 different subsystems detected that:
> the NMI watchdog and the RCU stall detector (CPUs 8 and 18).
> While looking at the code and googling around I discovered the
> following commit from PeterZ: https://lkml.org/lkml/2008/11/13/157
>
> Based on that, what I think could be happening is that the sighand
> itself is being freed while we are in the grace period inside
> __lock_task_sighand, but the slab page itself is not freed, as per the
> semantics of SLAB_DESTROY_BY_RCU. I looked up the source of this
> function in the latest kernels and saw that Oleg had put a comment
> clarifying the semantics, but I'm still not convinced that it is safe.
> What if we are trying to lock the spinlock before this particular slab
> is initialised with sighand_ctor? Looking at the code and reading
> Peter's explanation of SLAB_RCU, I cannot identify which portion

This confused me a while back, but it is safe. See this thread:

  https://lkml.org/lkml/2014/6/3/759

The ctor isn't called on each new allocation, but when the slab itself
is created, so the spinlock is always initialized even when the memory
is recycled for a new sighand.

But as I said, your kernel is very old and lots of things have changed.
Unless you can reproduce this on a newer kernel, most won't spend time
looking at it.

-- Steve

> of __lock_task_sighand implements the following logic:
>
>	if (!try_get_ref(obj)) // might fail for free objects
>		goto again;
>
> Also, what's more peculiar was the fact that when "Shrek" experienced
> this issue, resources were migrated to the 2nd machine and it
> also experienced the same behavior - CPUs locked on the sighand
> spinlock.
>
> That's the first time I've seen this particular issue, and I'm not
> really able to come up with a way in which a task's sighand could be
> corrupted in such a way as to prevent acquiring the lock.
>
> Regards,
> Nikolay