Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp1245256yba; Fri, 26 Apr 2019 17:23:36 -0700 (PDT) X-Google-Smtp-Source: APXvYqwvpeuOyTZgphS26tclf+2S1CKvqc9HxhljYnvM/zpwFIL95pRn9l22riLEtmOONEsHvSyU X-Received: by 2002:a62:b40b:: with SMTP id h11mr48451115pfn.133.1556324615980; Fri, 26 Apr 2019 17:23:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1556324615; cv=none; d=google.com; s=arc-20160816; b=rMH/7ttXrO1cIRWlei+PSJZOvhP8vuRxH7xVpyNJT5ru3W5GR9+8XQu0i/36qhjO96 K0A8HAedGTyzhjQ2YZoi1xyI/Xj6fcJ3fQ/bP13ODZQo8yVIG/f8Eay0Uzz5pk40sLDa opxZyXXYawFBFPUUITyeMn/UilCzJ2Nsxi4OUNxl4jMtjI/fU5kGzXeZujrWZhWp+HsR xl3fDtoNEfhE8k/4Lr/XUEKycCuZJCd4Zs/3KUWT/da3O7V2oemerVpTTHu1aNcVdkvG 5IbMPuZlMwYWTxT6eepFUGzIyjHIQJ7UnkuFLUW0TXPFzrMzvHY/o//xz6rwC+z/XkAO Mb/Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:references:cc:to:from:subject; bh=QLwl/wNmK9flY26sAiaf7oLIPBMnevhaX1WC8NKOdhc=; b=zGiyaFDH5fuwNs8iZq2haGrffRNyeRSKMashDY2JOX95m4CN5ZfeGaqW/P7PGyI3JJ M7gSeuvSTT2vGGCDpS33mH1ZXjoRpabCT5GwZ2kc/7ajKXeuA2eX3Od67TqMG4VO6Kv3 iB3ipHHZllOmK8dfs7xzc+2GbtJrb79+yzos1Z9ioyP/AUU0dfFiFMJtlELmxZSB/NK5 I7IQd1ofWqNj5BcaciW1ESYicAqmu5U7zc2WlX4x+ycyARl5UycudEzEH/44nz9e+fSk VXD3Cvo/7K33DNFGONh7tjeYK1Z2ccO6ag71AuLUYoTK95u3FIJXC8/krpAqnegPCToY zlHQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a22si26307824plm.263.2019.04.26.17.23.20; Fri, 26 Apr 2019 17:23:35 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727236AbfD0AUy (ORCPT + 99 others); Fri, 26 Apr 2019 20:20:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:44916 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726807AbfD0AUy (ORCPT ); Fri, 26 Apr 2019 20:20:54 -0400 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id D060030A996E; Sat, 27 Apr 2019 00:20:53 +0000 (UTC) Received: from prarit.bos.redhat.com (prarit-guest.khw1.lab.eng.bos.redhat.com [10.16.200.63]) by smtp.corp.redhat.com (Postfix) with ESMTP id 22D4B26197; Sat, 27 Apr 2019 00:20:53 +0000 (UTC) Subject: Re: [-next] system hangs likely due to "modules: Only return -EEXIST for modules that have finished loading" From: Prarit Bhargava To: Jessica Yu , Heiko Carstens Cc: linux-next@vger.kernel.org, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org, Cathy Avery References: <20190426130736.GB8646@osiris> <20190426150741.GD8646@osiris> <20190426160956.GA3827@linux-8ccs> <2e047a7e-bf08-be8c-bdd0-429464fa133d@redhat.com> <52c293e9-ddfa-426a-a8f1-2106e250e78d@redhat.com> Message-ID: Date: Fri, 26 Apr 2019 20:20:52 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 In-Reply-To: <52c293e9-ddfa-426a-a8f1-2106e250e78d@redhat.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.42]); Sat, 27 Apr 2019 00:20:53 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4/26/19 3:45 PM, Prarit Bhargava wrote: > > > On 4/26/19 2:10 PM, Prarit Bhargava wrote: >> >> >> On 4/26/19 12:09 PM, Jessica Yu wrote: >>> +++ Heiko Carstens [26/04/19 17:07 +0200]: >>>> On Fri, Apr 26, 2019 at 09:22:34AM -0400, Prarit Bhargava wrote: >>>>> On 4/26/19 9:07 AM, Heiko Carstens wrote: >>>>>> Hello Prarit, >>>>>> >>>>>> it looks like your commit f9a75c1d717f ("modules: Only return -EEXIST >>>>>> for modules that have finished loading") _sometimes_ causes hangs on >>>>>> s390. This is unfortunately not 100% reproducible, however the >>>>>> mentioned commit seems to be the only relevant one in modules.c. >>>>>> >>>>>> What I see is a hanging system with messages like this on the console: >>>>>> >>>>>> [ 65.876040] rcu: INFO: rcu_sched self-detected stall on CPU >>>>>> [ 65.876049] rcu: 7-....: (5999 ticks this GP) >>>>> idle=eae/1/0x4000000000000002 softirq=1181/1181 fqs=2729 >>>>>> [ 65.876078] (t=6000 jiffies g=-471 q=17196) >>>>>> [ 65.876084] Task dump for CPU 7: >>>>>> [ 65.876088] systemd-udevd R running task 0 731 721 >>>>> 0x06000004 >>>>>> [ 65.876097] Call Trace: >>>>>> [ 65.876113] ([<0000000000abb264>] __schedule+0x2e4/0x6e0) >>>>>> [ 65.876122] [<00000000001ee486>] finished_loading+0x4e/0xb0 >>>>>> [ 65.876128] [<00000000001f1ed6>] load_module+0xcce/0x27a0 >>>>>> [ 65.876134] [<00000000001f3af0>] __s390x_sys_init_module+0x148/0x178 >>>>>> [ 65.876142] [<0000000000ac0766>] system_call+0x2aa/0x2c8 >>>>>> I did not look any further into the dump, however since the commit >>>>>> touches exactly the code path which seems to be looping... ;) >>>>>> >>>>> >>>>> Ouch :( I wonder if I exposed a further race or another bug. Heiko, can you >>>>> determine which module is stuck? Warning: I have not compiled this code. >>>> >>>> Here we go: >>>> >>>> [ 11.716866] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716867] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716868] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716870] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716871] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716872] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716874] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716875] PRARIT: waiting for module s390_trng to load. >>>> [ 11.716876] PRARIT: waiting for module s390_trng to load. >>>> [ 16.726850] add_unformed_module: 31403529 callbacks suppressed >>>> [ 16.726853] PRARIT: waiting for module s390_trng to load. >>>> [ 16.726862] PRARIT: waiting for module s390_trng to load. >>>> [ 16.726865] PRARIT: waiting for module s390_trng to load. >>>> [ 16.726867] PRARIT: waiting for module s390_trng to load. >>>> [ 16.726869] PRARIT: waiting for module s390_trng to load. >>>> >>>> If I'm not mistaken then there was _no_ corresponding message on the >>>> console stating that the module already exists. >> >> Heiko, >> >> Where is the s390_trng module? I can't seem to find it in the tree. >> >> [02:06 PM root@ibm-z-108 linux-next]# find ./ -name *s390_trng* >> [02:06 PM root@ibm-z-108 linux-next]# git grep s390_trng >> [02:07 PM root@ibm-z-108 linux-next]# >> > > Never mind, I found it: s390-trng.c > > P. > >> P. >> Heiko and Jessica, The issue doesn't appear to be with my patch AFAICT. The s390_trng fails to load and then the kernel occasionally hangs (as Heiko mentioned) calling synchronize_rcu(). The call sequence is module_load() do_init_module() do_one_initcall(mod->init) which fails. The failure path in do_one_initcall() is entered and we start executing code at kernel/module.c:3541 fail_free_freeinit: kfree(freeinit); fail: /* Try to protect us from buggy refcounters. */ mod->state = MODULE_STATE_GOING; synchronize_rcu(); ^^^ the kernel hangs here. Sometimes it's very short and other times it seems to hang. I've left systems that appear to be hung and come back after 10 minutes to find that they've somehow made it through this call. Is there a known issue with RCU on s390 that is making this occur? P.