In-Reply-To: <5637C288.9070508@intel.com>
References: <87611kakr3.fsf@yhuang-dev.intel.com> <56379FDE.6010603@intel.com> <5637C288.9070508@intel.com>
Date: Tue, 3 Nov 2015 00:29:36 +0300
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
From: Andrey Ryabinin
To: Dave Hansen
Cc: kernel test robot, LKP, LKML, Andrew Morton, David Rientjes,
    Pekka Enberg, "H. Peter Anvin", Thomas Gleixner, Ingo Molnar,
    Andi Kleen, Joonsoo Kim, Christoph Lameter, Sasha Levin,
    Konstantin Khlebnikov, Yuri Gribov, Andrey Konovalov,
    Konstantin Serebryany, Dmitry Vyukov, Linus Torvalds

2015-11-02 23:07 GMT+03:00 Dave Hansen:
> On 11/02/2015 11:34 AM, Andrey Ryabinin wrote:
>>>> >>
>>>> >> [ 1.159450] augmented rbtree testing -> 23675 cycles
>>>> >> [ 1.864996]
>>>> >> It took less than a second, meanwhile in your case it didn't finish in
>>>> >> 22 seconds.
>>>> >>
>>>> >> This makes me think that your host is overloaded and the problem is on
>>>> >> your side.
>>> >
>>> > It's probably just a matter of putting some cond_resched()s in the test
>>> > code.
>> Yes, but is it worthwhile? It's very likely that lockup will just
>> trigger in another place.
>
> I'm guessing that the lockup here was because the tests were running for
> too long. If we cond_resched() in there often enough, the kernel won't
> detect a softlockup at all.

Sure, but why are these tests running so long?
In my setup it takes less than a second to finish these tests, on the
same kernel version and config of course. Although I might have more
powerful hardware, that doesn't explain such a huge difference. So these
are actually fast tests. I guess that the host is overloaded and the KVM
guest runs so slowly that even these simple tests start triggering the
softlockup detector.

> It won't shift somewhere else.

That's not what I mean. Sure, a cond_resched() in rbtree_test_init()
will fix this particular softlockup. But if even such normally fast
tests are now running too long, then a lot of other kernel code that
normally runs fast has likely become too slow on Ying's setup as well
and will trigger another softlockup. rbtree_test_init() is just the
first such place. In that case, sticking cond_resched()s across the
whole kernel is not a solution.
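
For reference, the pattern under discussion (a long-running loop that periodically offers to reschedule) can be sketched in user space roughly as follows. This is only an illustration, not the actual rbtree_test_init() code: cond_resched_stub(), run_long_test(), the interval parameter and the placeholder work are all hypothetical names invented for the sketch, and sched_yield() merely stands in for the kernel's cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for the kernel's cond_resched():
 * it yields the CPU and counts how many times it was called.
 */
static size_t resched_calls;

static void cond_resched_stub(void)
{
	resched_calls++;
	sched_yield();
}

/*
 * Run nr_iters iterations of placeholder work, offering to reschedule
 * once every `interval` iterations. Returns the accumulated checksum.
 */
static unsigned long run_long_test(size_t nr_iters, size_t interval)
{
	unsigned long sum = 0;
	size_t i;

	assert(interval > 0);
	for (i = 0; i < nr_iters; i++) {
		sum += i;	/* placeholder for one unit of test work */
		if ((i + 1) % interval == 0)
			cond_resched_stub();
	}
	return sum;
}
```

Calling run_long_test(1000, 100) would hit the yield point 10 times. Note that this only addresses the one loop it is added to, which is the limitation argued above: if the guest as a whole is starved, other loops without such yield points remain exposed.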