From: Andy Lutomirski
Date: Tue, 29 Dec 2015 22:43:26 -0800
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race
To: Dominique Martinet, Peter Zijlstra, Thomas Gleixner, Ingo Molnar
Cc: Al Viro, linux-kernel@vger.kernel.org, V9FS Developers, Linux FS Devel

[add cc's]

Hi scheduler people:

This is relatively easy for me to reproduce. Any hints for debugging
it? Could we really have a bug in which processes that become
schedulable as a result of a mutex unlock aren't always reliably
scheduled?

--Andy

On Thu, Dec 24, 2015 at 2:51 AM, Dominique Martinet wrote:
> Andy Lutomirski wrote on Thu, Dec 17, 2015:
>> This could be QEMU's analysis script screwing up. Is there a good way
>> for me to gather more info?
>
> I finally took some time to reproduce it (sorry for the delay).
>
> Using your config, virtme commit (17363c2), and kernel tag v4.4-rc3, I
> was able to reproduce it just fine with my qemu (2.4.90).
>
> Now for the fun bit...
> I ran it with a gdb server; attaching gdb and running "cont" always
> 'unblocks' it.
> Using the kernel gdb scripts (lx-ps) I see about 250 kworker threads
> running; the backtraces all look the same:
>
> [   20.273945]  [] schedule+0x30/0x80
> [   20.274644]  [] schedule_preempt_disabled+0x9/0x10
> [   20.275539]  [] __mutex_lock_slowpath+0x107/0x2f0
> [   20.276421]  [] ? lookup_fast+0xbe/0x320
> [   20.277195]  [] mutex_lock+0x15/0x30
> [   20.277916]  [] walk_component+0x1a7/0x270
>
> So, given that it unblocks after attaching gdb and continuing, I'm
> actually thinking this might be a pure scheduling issue (e.g. a thread
> is never re-scheduled, or something like that)?
> I can't see any task that isn't in schedule() in your sysrq task-dump
> transcript either.
>
> Not sure how to go about debugging that, to be honest.
> I've tried both the default single virtual CPU and -smp 3 or 4, and
> both reproduce it; CPU usage on the host is always low, so it doesn't
> look like there's any busy-polling involved. This is a pretty subtle
> bug we have here...
>
> --
> Dominique Martinet

--
Andy Lutomirski
AMA Capital Management, LLC