Date: Fri, 25 May 2018 11:31:47 +0100
From: Patrick Bellasi
To: Juri Lelli
Cc: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
    Ingo Molnar, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, kernel-team@fb.com, pjt@google.com,
    luto@amacapital.net, Mike Galbraith, torvalds@linux-foundation.org,
    Roman Gushchin
Subject: Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus
Message-ID: <20180525103147.GC30654@e110439-lin>
References: <1526590545-3350-1-git-send-email-longman@redhat.com>
 <1526590545-3350-5-git-send-email-longman@redhat.com>
 <20180523173453.GY30654@e110439-lin>
 <20180524090430.GZ30654@e110439-lin>
 <20180524103938.GB3948@localhost.localdomain>
MIME-Version: 1.0
Content-Type:
text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180524103938.GB3948@localhost.localdomain>
User-Agent: Mutt/1.5.24 (2015-08-30)

Hi Juri,

following are some notes I took while trying to understand what's going
on... they could be useful to check whether I have a correct view of all
the different components and how they come together.

At the end there are also a couple of possible updates and a question on
your proposal.

Cheers Patrick

On 24-May 12:39, Juri Lelli wrote:
> On 24/05/18 10:04, Patrick Bellasi wrote:
> 
> [...]
> 
> > From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> > From: Patrick Bellasi
> > Date: Wed, 23 May 2018 16:33:06 +0100
> > Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
> >  required
> > 
> > The generate_sched_domains() code already addresses the "special case
> > for 99% of systems", which requires a single full sched domain at the
> > root, spanning all the CPUs. However, the current support is based on
> > an expensive sequence of operations which destroys and recreates the
> > exact same scheduling domain configuration.
> > 
> > If we notice that:
> > 
> >  1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> >     isolcpus= kernel boot option, and will never be load balanced
> >     regardless of the value of "cpuset.sched_load_balance" in any
> >     cpuset, and
> > 
> >  2) the root cpuset has load_balance enabled by default at boot, and
> >     it's the only parameter which userspace can change at run-time,
> > 
> > then we know that, by default, every system comes up with a complete
> > and properly configured set of scheduling domains covering all the
> > CPUs.
> > 
> > Thus, on every system, unless the user explicitly disables load
> > balance for the top_cpuset, the scheduling domains already configured
> > at boot time by the scheduler/topology code, and updated in
> > consequence of hotplug events, are already properly configured for
> > cpuset too.
> > 
> > This configuration is the default one for 99% of the systems, and
> > it's also the one used by most of the Android devices, which never
> > disable load balance from the top_cpuset.
> > 
> > Thus, while load balance is enabled for the top_cpuset,
> > destroying/rebuilding the scheduling domains at every cpuset.cpus
> > reconfiguration is a useless operation which will always produce the
> > same result.
> > 
> > Let's anticipate the "special" optimization within:
> > 
> >    rebuild_sched_domains_locked()
> > 
> > thus completely skipping the expensive:
> > 
> >    generate_sched_domains()
> >    partition_sched_domains()
> > 
> > for all the cases where we know that the scheduling domains already
> > defined will not be affected by any value of cpuset.cpus.
> 
> [...]
> 
> > +	/* Special case for the 99% of systems with one, full, sched domain */
> > +	if (!top_cpuset.isolation_count &&
> > +	    is_sched_load_balance(&top_cpuset))
> > +		goto out;
> > +
> 
> Mmm, looks like we still need to destroy and recreate if there is a
> new_topology (see arch_update_cpu_topology() in
> partition_sched_domains()).

Right, so the problem seems to be that we "need" to call
arch_update_cpu_topology(), and we do that by calling
partition_sched_domains(), which was initially introduced by commit:

   029190c515f1 ("cpuset sched_load_balance flag")

back in 2007, which also explains quite well the reasons behind the
sched_load_balance flag and the idea of having "partitioned" SDs.
I also (hopefully) understood that there are at least two actors
involved:

 - A) the arch code, which creates SDs and SGs, usually to group CPUs
      depending on the memory hierarchy, to support different time
      granularities of load balancing operations

      Special cases here are HP (hotplug) and hibernation which, by
      on-/off-lining CPUs, directly affect the SDs/SGs definitions.

 - B) cpusets, which expose to userspace the possibility to define,
      _if possible_, a finer-granularity set of SGs to further restrict
      the scope of load balancing operations

Since B is a "possible finer granularity" refinement of A, we trigger
A's reconfigurations based on B's constraints.

That's why, for example, in consequence of an HP online event, we have:

--- core.c -------------------
   HP[sched:active]
     |
     sched_cpu_activate()
       |
       cpuset_cpu_active()
--- cpuset.c -----------------
         |
         cpuset_update_active_cpus()
           |
           schedule_work(&cpuset_hotplug_work)
            \.. System Kworker \
               |
               cpuset_hotplug_workfn()
                 if (cpus_updated || force_rebuild)
                   |
                   rebuild_sched_domains()
                     |
                     rebuild_sched_domains_locked()
                       |
                       generate_sched_domains()
--- topology.c ---------------
                         |
                         partition_sched_domains()
                           |
                           arch_update_cpu_topology()

IOW, we need to go via cpusets to rebuild the SDs whenever there are HP
events or whenever we "need" to do an arch_update_cpu_topology() via
the arch topology driver (drivers/base/arch_topology.c).

This last bit is also interesting: whenever we detect arch topology
information that requires an SD rebuild, we need to force a
partition_sched_domains(). For that, in:

   commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")

the support for setting the "force_rebuild" flag was introduced.
Thus, potentially, we can just extend the check I've proposed to
consider the force_rebuild flag too, with something like:

---8<---
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8f586e8bdc98..1f051fafaa3a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
 	    !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
+	/* Special case for the 99% of systems with one, full, sched domain */
+	if (!force_rebuild &&
+	    !top_cpuset.isolation_count &&
+	    is_sched_load_balance(&top_cpuset))
+		goto out;
+
+	force_rebuild = false;
+
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
---8<---

which would still allow using something like:

   cpuset_force_rebuild()
   rebuild_sched_domains()

to actually rebuild the SDs as a consequence of arch topology changes.

> 
> Maybe we could move the check you are proposing in
> update_cpumasks_hier()?

Yes, that's another option... although there we are outside of
get_online_cpus(). Could that be a problem?

However, in general, I would say that all around:

   rebuild_sched_domains()
   rebuild_sched_domains_locked()
   update_cpumask()
   update_cpumasks_hier()

a nice refactoring would be really deserved :)

-- 
#include <best/regards.h>

Patrick Bellasi