Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756935AbaKTPET (ORCPT ); Thu, 20 Nov 2014 10:04:19 -0500 Received: from mail-wg0-f49.google.com ([74.125.82.49]:37089 "EHLO mail-wg0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753977AbaKTPER (ORCPT ); Thu, 20 Nov 2014 10:04:17 -0500 Date: Thu, 20 Nov 2014 16:04:13 +0100 From: Frederic Weisbecker To: Dave Jones , Linus Torvalds , Don Zickus , Thomas Gleixner , Linux Kernel , the arch/x86 maintainers , Peter Zijlstra Subject: Re: frequent lockups in 3.18rc4 Message-ID: <20141120150411.GD2542@lerouge> References: <20141118020959.GA2091@redhat.com> <20141118023930.GA2871@redhat.com> <20141118145234.GA7487@redhat.com> <20141118215540.GD35311@redhat.com> <20141119021902.GA14216@redhat.com> <20141119145902.GA13387@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141119145902.GA13387@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote: > On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote: > > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones wrote: > > > > > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480] > > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140] > > > RIP: 0010:[] [] context_tracking_user_enter+0xa4/0x190 > > > Call Trace: > > > [] syscall_trace_leave+0xa5/0x160 > > > [] int_check_syscall_exit_work+0x34/0x3d > > > > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work. > > > > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you? > > > > That makes me wonder: does the problem go away if you disable NOHZ? > > Aparently not. > > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175] > CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945] > task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000 > RIP: 0010:[] [] context_tracking_user_exit+0x57/0x120 > RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246 > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47 > RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725 > RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503 > R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48 > FS: 00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 > Stack: > ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25 > 00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000 > 0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000 > Call Trace: > [] syscall_trace_enter_phase1+0x125/0x1a0 > [] ? trace_hardirqs_on_thunk+0x3a/0x3f > [] tracesys+0x14/0x4a > Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80 > sending NMI to other CPUs: > NMI backtrace for cpu 1 > CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945] > task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000 > RIP: 0010:[] [] generic_exec_single+0xee/0x1a0 > RSP: 0018:ffff8801a99a7d18 EFLAGS: 00000202 > RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038 > RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000 > RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003 > R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0 > FS: 00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 > Stack: > ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948 > 0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff > 0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001 > Call Trace: > [] ? perf_swevent_add+0x120/0x120 > [] ? perf_swevent_add+0x120/0x120 > [] smp_call_function_single+0x6a/0xe0 One thing that happens a lot in your crashes is a CPU sending IPIs. Maybe stuck polling on csd->lock or something. But's it's not the CPU that soft lockups. At least not the first that gets reported. > [] ? preempt_count_sub+0x7b/0x100 > [] perf_event_read+0xca/0xd0 > [] perf_event_read_value+0x90/0xe0 > [] perf_read+0x226/0x370 > [] ? security_file_permission+0x87/0xa0 > [] vfs_read+0x9f/0x180 > [] SyS_read+0x58/0xd0 > [] tracesys_phase2+0xd4/0xd9 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/