Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp882955yba; Fri, 26 Apr 2019 10:18:08 -0700 (PDT) X-Google-Smtp-Source: APXvYqyvkYIPjMxYHyz5jlggV0fuS6SlNbg5a+AYXihQ8ItMgaz6EEnc1BX1zPbB+pfkz3p2rcFD X-Received: by 2002:a63:c10e:: with SMTP id w14mr25036995pgf.206.1556299088732; Fri, 26 Apr 2019 10:18:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1556299088; cv=none; d=google.com; s=arc-20160816; b=iz+xhdHGoJuxgej9/sPXGHflmsRxwLo0QquX9NWYKF/zVM6ZWWQIkAfXT5M84jJLXV XrOC/4vPIdsm/zUQ8xF8JPc+19WtMpFqjnDqDfEreMTkhjI90vZghxdacKzWOsk6PV0d EPujpsrHVH7a3pdloTdNvwSlpgWRIdFwcIPmM2d7Zs6MjlH5ccsHdUFNaG4GaZf5EnBC JmYXSI+20uTyCfxq4daoh+QvnMgDlTs/aypPFzDQN3CYFNwY4b1/UOK1SDLJxFoodk0R KeYnLB2maoqgrY8cANEsFO7gHkWy5YTmdqnu1CSMMfaWU2yOdbvg+kqe/eP81q0GMfNa d/2w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=PHCJ82o03eSlcTdesJU3Jc3TZ9WbT04F2ZW8qqxQa2c=; b=lW9w8j3RBvbL30QoSfdYSfLNztnLsmUmbCMUPm2kEVLxBZwDvAphavtrSoT6kSH/Z0 UYTx/nP/inAU5JdTrTbtzLCXJ6/Go+BudYhbj2IWGBrnPzQ1Wet5l/aZFrTo3I94YKA3 eWokylznvBbCZ6yL3mpEFkS+a+mHlIUWXfwaigb/COLwFS4T0K5I6XsXNCDwMc2ecPkb KXstZodPjvRYY3gKH/WrKs2xb5L4MTCahQYGdom3Qqy2a3HW0UcDBBeBteMXwJZfyjCc CmU6f6LV7pc5xARcUbLgZUD3Hf6OsfdaQibr6lWXUGoP9Q/rJYJXFMAJppd+Cok1N+yC 48PA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i96si26694975plb.331.2019.04.26.10.17.53; Fri, 26 Apr 2019 10:18:08 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727046AbfDZRPM (ORCPT + 99 others); Fri, 26 Apr 2019 13:15:12 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33678 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726427AbfDZRPL (ORCPT ); Fri, 26 Apr 2019 13:15:11 -0400 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id EF6F4C0B2008; Fri, 26 Apr 2019 17:15:10 +0000 (UTC) Received: from prarit.bos.redhat.com (prarit-guest.khw1.lab.eng.bos.redhat.com [10.16.200.63]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3E9AC1001E8E; Fri, 26 Apr 2019 17:15:07 +0000 (UTC) Subject: Re: [-next] system hangs likely due to "modules: Only return -EEXIST for modules that have finished loading" To: Jessica Yu , Heiko Carstens Cc: linux-next@vger.kernel.org, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org, Cathy Avery References: <20190426130736.GB8646@osiris> <20190426150741.GD8646@osiris> <20190426160956.GA3827@linux-8ccs> From: Prarit Bhargava Message-ID: <8402e0ea-30dc-206c-1d2b-f9f4ba594391@redhat.com> Date: Fri, 26 Apr 2019 13:15:07 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 In-Reply-To: <20190426160956.GA3827@linux-8ccs> Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.31]); Fri, 26 Apr 2019 17:15:11 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4/26/19 12:09 PM, Jessica Yu wrote: > +++ Heiko Carstens [26/04/19 17:07 +0200]: >> On Fri, Apr 26, 2019 at 09:22:34AM -0400, Prarit Bhargava wrote: >>> On 4/26/19 9:07 AM, Heiko Carstens wrote: >>> > Hello Prarit, >>> > >>> > it looks like your commit f9a75c1d717f ("modules: Only return -EEXIST >>> > for modules that have finished loading") _sometimes_ causes hangs on >>> > s390. This is unfortunately not 100% reproducible, however the >>> > mentioned commit seems to be the only relevant one in modules.c. >>> > >>> > What I see is a hanging system with messages like this on the console: >>> > >>> > [   65.876040] rcu: INFO: rcu_sched self-detected stall on CPU >>> > [   65.876049] rcu:     7-....: (5999 ticks this GP) >>> idle=eae/1/0x4000000000000002 softirq=1181/1181 fqs=2729 >>> > [   65.876078]  (t=6000 jiffies g=-471 q=17196) >>> > [   65.876084] Task dump for CPU 7: >>> > [   65.876088] systemd-udevd   R  running task        0   731    721 >>> 0x06000004 >>> > [   65.876097] Call Trace: >>> > [   65.876113] ([<0000000000abb264>] __schedule+0x2e4/0x6e0) >>> > [   65.876122]  [<00000000001ee486>] finished_loading+0x4e/0xb0 >>> > [   65.876128]  [<00000000001f1ed6>] load_module+0xcce/0x27a0 >>> > [   65.876134]  [<00000000001f3af0>] __s390x_sys_init_module+0x148/0x178 >>> > [   65.876142]  [<0000000000ac0766>] system_call+0x2aa/0x2c8 >>> > I did not look any further into the dump, however since the commit >>> > touches exactly the code path which seems to be looping... ;) >>> > >>> >>> Ouch :(  I wonder if I exposed a further race or another bug.  Heiko, can you >>> determine which module is stuck?  Warning: I have not compiled this code. >> >> Here we go: >> >> [   11.716866] PRARIT: waiting for module s390_trng to load. >> [   11.716867] PRARIT: waiting for module s390_trng to load. >> [   11.716868] PRARIT: waiting for module s390_trng to load. >> [   11.716870] PRARIT: waiting for module s390_trng to load. >> [   11.716871] PRARIT: waiting for module s390_trng to load. >> [   11.716872] PRARIT: waiting for module s390_trng to load. >> [   11.716874] PRARIT: waiting for module s390_trng to load. >> [   11.716875] PRARIT: waiting for module s390_trng to load. >> [   11.716876] PRARIT: waiting for module s390_trng to load. >> [   16.726850] add_unformed_module: 31403529 callbacks suppressed >> [   16.726853] PRARIT: waiting for module s390_trng to load. >> [   16.726862] PRARIT: waiting for module s390_trng to load. >> [   16.726865] PRARIT: waiting for module s390_trng to load. >> [   16.726867] PRARIT: waiting for module s390_trng to load. >> [   16.726869] PRARIT: waiting for module s390_trng to load. >> >> If I'm not mistaken then there was _no_ corresponding message on the >> console stating that the module already exists. > > Hm, my current theory is that we have a module whose exit() function > is taking a while to run to completion. While it is doing so, the > module's state is already set to MODULE_STATE_GOING. > > With Prarit's patch, since this module is probably still in GOING, > add_unformed_module() will wait until the module is finally gone. If > this takes too long, we will keep trying to add ourselves to the > module list and hence stay in the loop in add_unformed_module(). > According to Documentation/RCU/stallwarn.txt, this looping in the > kernel may trigger an rcu stall warning (see bullet point stating "a > CPU looping anywhere in the kernel without invoking schedule()". > Yeah, that's what I'm thinking too. The question is, however, why that module is taking so long that it stalls the system. If the module state is going then the probe of the module has already failed. Could it be some weird bug in the notifier chains? And :) why aren't I seeing this all the time? P. > Heiko, could you modify the patch to print the module's state to > confirm? > > Thanks, > > Jessica