Date: Wed, 24 Aug 2022 09:20:50 -0700
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Pingfan Liu <kernelfans@gmail.com>
Cc: LKML, rcu, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Thomas Gleixner, Steven Price, Mark Rutland,
	Kuppuswamy Sathyanarayanan, "Jason A. Donenfeld",
	boqun.feng@gmail.com
Donenfeld" , boqun.feng@gmail.com Subject: Re: [RFC 06/10] rcu/hotplug: Make rcutree_dead_cpu() parallel Message-ID: <20220824162050.GA6159@paulmck-ThinkPad-P17-Gen-1> Reply-To: paulmck@kernel.org References: <20220822021520.6996-1-kernelfans@gmail.com> <20220822021520.6996-7-kernelfans@gmail.com> <20220822024528.GC6159@paulmck-ThinkPad-P17-Gen-1> <20220823030125.GJ6159@paulmck-ThinkPad-P17-Gen-1> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 24, 2022 at 09:53:11PM +0800, Pingfan Liu wrote: > On Tue, Aug 23, 2022 at 11:01 AM Paul E. McKenney wrote: > > > > On Tue, Aug 23, 2022 at 09:50:56AM +0800, Pingfan Liu wrote: > > > On Sun, Aug 21, 2022 at 07:45:28PM -0700, Paul E. McKenney wrote: > > > > On Mon, Aug 22, 2022 at 10:15:16AM +0800, Pingfan Liu wrote: > > > > > In order to support parallel, rcu_state.n_online_cpus should be > > > > > atomic_dec() > > > > > > > > > > Signed-off-by: Pingfan Liu > > > > > > > > I have to ask... What testing have you subjected this patch to? > > > > > > > > > > This patch subjects to [1]. The series aims to enable kexec-reboot in > > > parallel on all cpu. As a result, the involved RCU part is expected to > > > support parallel. > > > > I understand (and even sympathize with) the expectation. But results > > sometimes diverge from expectations. There have been implicit assumptions > > in RCU about only one CPU going offline at a time, and I am not sure > > that all of them have been addressed. Concurrent CPU onlining has > > been looked at recently here: > > > > https://docs.google.com/document/d/1jymsaCPQ1PUDcfjIKm0UIbVdrJAaGX-6cXrmcfm0PRU/edit?usp=sharing > > > > You did us atomic_dec() to make rcu_state.n_online_cpus decrementing be > > atomic, which is good. Did you look through the rest of RCU's CPU-offline > > code paths and related code paths? > > I went through those codes at a shallow level, especially at each > cpuhp_step hook in the RCU system. And that is fine, at least as a first step. > But as you pointed out, there are implicit assumptions about only one > CPU going offline at a time, I will chew the google doc which you > share. Then I can come to a final result. Boqun Feng, Neeraj Upadhyay, Uladzislau Rezki, and I took a quick look, and rcu_boost_kthread_setaffinity() seems to need some help. As it stands, it appears that concurrent invocations of this function from the CPU-offline path will cause all but the last outgoing CPU's bit to be (incorrectly) set in the cpumask_var_t passed to set_cpus_allowed_ptr(). This should not be difficult to fix, for example, by maintaining a separate per-leaf-rcu_node-structure bitmask of the concurrently outgoing CPUs for that rcu_node structure. (Similar in structure to the ->qsmask field.) There are probably more where that one came from. ;-) > > > [1]: https://lore.kernel.org/linux-arm-kernel/20220822021520.6996-3-kernelfans@gmail.com/T/#mf62352138d7b040fdb583ba66f8cd0ed1e145feb > > > > Perhaps I am more blind than usual today, but I am not seeing anything > > in this patch describing the testing. 
There are probably more where that one came from.  ;-)

> > > [1]: https://lore.kernel.org/linux-arm-kernel/20220822021520.6996-3-kernelfans@gmail.com/T/#mf62352138d7b040fdb583ba66f8cd0ed1e145feb
> >
> > Perhaps I am more blind than usual today, but I am not seeing anything
> > in this patch describing the testing.  At this point, I am thinking
> > in terms of making rcutorture test concurrent CPU offlining in
> > parallel.
>
> Yes, testing results are more convincing in this area.
>
> After making clear the implicit assumptions, I will write some code to
> bridge my code and the rcutorture test, since the series is a little
> different from parallel CPU offlining: it happens after all devices
> have been torn down, and there is no way to roll back.

Very good, looking forward to seeing what you come up with!

> > Thoughts?
>
> Need a deeper dive into this field. Hope to bring out something soon.

Again, looking forward to seeing what you find!

							Thanx, Paul
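For reference, the decrement that the patch under discussion makes
atomic is the one in rcutree_dead_cpu() in kernel/rcu/tree.c.  A rough
sketch of the change, against the mainline code of that era rather than
the literal patch:

	/* Before: a plain store, safe only while CPUs go offline
	 * one at a time. */
	WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);

	/* After: with rcu_state.n_online_cpus declared as atomic_t,
	 * concurrent rcutree_dead_cpu() invocations cannot lose a
	 * decrement. */
	atomic_dec(&rcu_state.n_online_cpus);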