Date: Fri, 28 Jul 2023 22:50:01 -0700
From: "Paul E. McKenney"
To: Joel Fernandes
Cc: Guenter Roeck, Pavel Machek, Greg Kroah-Hartman,
    stable@vger.kernel.org, patches@lists.linux.dev,
    linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
    akpm@linux-foundation.org, shuah@kernel.org, patches@kernelci.org,
    lkft-triage@lists.linaro.org, jonathanh@nvidia.com,
    f.fainelli@gmail.com, sudipm.mukherjee@gmail.com,
    srw@sladewatkins.net, rwarsow@gmx.de, conor@kernel.org,
    rcu@vger.kernel.org
Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review
Reply-To: paulmck@kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 28, 2023 at 09:25:35PM -0400, Joel Fernandes wrote:
> On Fri, Jul 28, 2023 at 6:58 PM Paul E. McKenney wrote:
> >
> > On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
> > > On Jul 27, 2023, at 7:18 PM, Joel Fernandes wrote:
> > >
> > > On Jul 27, 2023, at 4:33 PM, Paul E. McKenney wrote:
> > >
> > > On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
> > >
> > > On 7/27/23 09:07, Paul E. McKenney wrote:
> > >
> > > [...]
> > >
> > > No. However, (unrelated) in linux-next, rcu tests sometimes result
> > > in apparent hangs or long runtime.
> > >
> > > [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> > > [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> > > [ 0.797998] Running RCU synchronous self tests
> > > [ 0.798209] Running RCU synchronous self tests
> > > [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
> > > [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
> > > [ 0.925419] Running RCU-tasks wait API self tests
> > >
> > > (hangs until aborted). This is primarily with Opteron CPUs, but also
> > > with others such as Haswell,
> [...]
> > > Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
> [...]
> > > I freely confess that I am having a hard time imagining what would
> > > be CPU dependent in that code. Timing, maybe? Whatever the reason,
> > > I am not seeing these failures in my testing.
> > >
> > > So which of the following Kconfig options is defined in your .config?
> > > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> > >
> > > If you have more than one of them, could you please apply this patch
> > > and show me the corresponding console output from the resulting hang?
> > >
> > > FWIW, I am not able to repro this issue either. If a .config can be
> > > shared of the problem system, I can try it out to see if it can be
> > > reproduced on my side.
> > >
> > > I do see this now on 5.15 stable:
> > >
> > > TASKS03 ------- 3089 GPs (0.858056/s)
> > > QEMU killed
> > > TASKS03 no success message, 64 successful version messages
> > > !!! PID 3309783 hung at 3781 vs. 3600 seconds
> > >
> > > I have not looked too closely yet. The full test artifacts are here:
> > >
> > > [1]Artifacts of linux-5.15.y 5.15.123 :
> > > /tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44 [Jenkins]
> > > [2]box.joelfernandes.org
> > > [3]apple-touch-icon.png
> > >
> > > Thanks,
> > >
> > > - Joel
> > >
> > > (Apologies if the email is html, I am sending from phone).
> >
> > Heh. I have a script that runs lynx. Which isn't perfect, but usually
> > makes things at least somewhat legible.
>
> Sorry I was too optimistic about the iPhone's capabilities when it
> came to mailing list emails.
> Here's what I said:
> --------------
> I do see this now on 5.15 stable:
>
> TASKS03 ------- 3089 GPs (0.858056/s)
> QEMU killed
> TASKS03 no success message, 64 successful version messages
> !!! PID 3309783 hung at 3781 vs. 3600 seconds
>
> Link to full logs/artifacts:
> http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> ----------------
>
> > This looks like the prototypical hard hang with interrupts disabled,
> > which could be anywhere in the kernel, including RCU. I am not seeing
> > this, but the usual cause when I have seen it in the past was deadlock
> > of irq-disabled locks. In one spectacular case, it was a timekeeping
> > failure that messed up a CPU-hotplug operation.
> >
> > If this is reproducible, one trick would be to have a script look at
> > the console.log file, and have it do something (NMI? sysrq? something
> > else?) to qemu if output ceased for too long.
> >
> > One way to do this without messing with the rcutorture scripting is to
> > grab the qemu-cmd file from this run, and then invoke that file from your
> > own script, possibly with suitable modifications to qemu's parameters.
>
> Would it be better to have such monitoring as part of rcutorture
> testing itself? Alternatively there is the NMI hardlockup detector
> which I believe should also detect such cases and dump stacks.

Quite possibly. But special-casing the prototype is probably going to
be a lot faster and easier. If it works, then it might make a lot of
sense to upgrade the scripting. If it doesn't work, then quite a bit
less time is wasted than would be by messing with the scripting from
the get-go.

Also, you have the option of making qemu be interactive and manually
triggering things, for example by checking up on the run near the end.
Or having something handing commands to qemu. Either way allows much
more interaction with qemu, and better experimentation, than could be
done reasonably with the scripts.

Thanx, Paul

> thanks,
>
> - Joel
>
> > Thoughts?
> >
> > Thanx, Paul
> >
> > > Cheers,
> > > - Joel
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit 709a917710dc01798e01750ea628ece4bfc42b7b
> > > Author: Paul E. McKenney
> > > Date: Thu Jul 27 13:13:46 2023 -0700
> > >
> > >     rcu-tasks: Add printk()s to localize boot-time self-test hang
> > >
> > >     Currently, rcu_tasks_initiate_self_tests() prints a message and then
> > >     initiates self tests on up to three different RCU Tasks flavors. If one
> > >     of the flavors has a grace-period hang, it is not easy to work out which
> > >     of the three hung. This commit therefore prints a message prior to each
> > >     individual test.
> > >
> > >     Reported-by: Guenter Roeck
> > >     Signed-off-by: Paul E. McKenney
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 56c470a489c8..427433c90935 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
> > >  static void rcu_tasks_initiate_self_tests(void)
> > >  {
> > > -	pr_info("Running RCU-tasks wait API self tests\n");
> > >  #ifdef CONFIG_TASKS_RCU
> > > +	pr_info("Running RCU Tasks wait API self tests\n");
> > >  	tests[0].runstart = jiffies;
> > >  	synchronize_rcu_tasks();
> > >  	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
> > >  #endif
> > >  #ifdef CONFIG_TASKS_RUDE_RCU
> > > +	pr_info("Running RCU Tasks Rude wait API self tests\n");
> > >  	tests[1].runstart = jiffies;
> > >  	synchronize_rcu_tasks_rude();
> > >  	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
> > >  #endif
> > >  #ifdef CONFIG_TASKS_TRACE_RCU
> > > +	pr_info("Running RCU Tasks Trace wait API self tests\n");
> > >  	tests[2].runstart = jiffies;
> > >  	synchronize_rcu_tasks_trace();
> > >  	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
> > >
> > > References
> > >
> > > Visible links:
> > > 1. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> > > 2. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> > > 3. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> > >
> > > Hidden links:
> > > 5. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
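
For what it is worth, the "have a script look at the console.log file"
idea from earlier in this message might be prototyped roughly as follows.
This is only a sketch under stated assumptions, not anything taken from
the rcutorture scripts: the names watch_console(),
seconds_since_last_output() and on_stall are made up for illustration,
and it assumes that qemu's console output lands in a regular file whose
modification time advances whenever new output arrives. What to actually
do to qemu on a stall (NMI? sysrq? something else?) is left to the
caller-supplied callback.

```python
import os
import time
from typing import Callable, Optional


def seconds_since_last_output(console_log: str,
                              now: Optional[float] = None) -> float:
    """Return how many seconds ago console_log was last modified."""
    if now is None:
        now = time.time()
    return now - os.stat(console_log).st_mtime


def watch_console(console_log: str,
                  stall_seconds: float,
                  on_stall: Callable[[float], None],
                  poll_seconds: float = 5.0,
                  max_polls: Optional[int] = None) -> bool:
    """Poll console_log and call on_stall(idle) once if output stops.

    Hypothetical helper: on_stall is where the caller would poke qemu.
    Returns True if a stall was seen, False if max_polls ran out first.
    With max_polls=None the loop runs until a stall is detected.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        idle = seconds_since_last_output(console_log)
        if idle >= stall_seconds:
            # Output has been quiet for too long: hand off to the caller.
            on_stall(idle)
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

A wrapper script could start the run's qemu-cmd file in the background
and then call something like watch_console("console.log", 300, poke),
where poke is whatever experiment is being tried on the hung guest. The
300-second threshold is an arbitrary placeholder and would need tuning
to stay clear of normal quiet periods in the test.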