From: "Huang, Ying"
To: Gregory Price
Cc: Michal Hocko, Johannes Weiner, Gregory Price, ...
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: (Gregory Price's message of "Wed, 1 Nov 2023 23:18:59 -0400")
References: <20231031003810.4532-1-gregory.price@memverge.com>
 <20231031152142.GA3029315@cmpxchg.org>
Date: Fri, 03 Nov 2023 15:45:13 +0800
Message-ID: <87fs1nz3ee.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Gregory Price writes:

> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
>> On Wed 01-11-23 12:58:55, Gregory Price wrote:
>> > Basically consider: `numactl --interleave=all ...`
>> >
>> > If `--weights=...`: when a node hotplug event occurs, there is no
>> > recourse for adding a weight for the new node (it will default to 1).
>>
>> Correct, and this is what I was asking about in an earlier email. How
>> much do we really need to consider this setup? Is this something nice
>> to have, or does the nature of the technology require it to be fully
>> dynamic, expecting new nodes to come up at any moment?
>>
>
> Dynamic Capacity is expected to cause a NUMA node to change size (in
> number of memory blocks) rather than cause NUMA nodes to come and go,
> so maybe handling full node hotplug is a bit of an overreach.

Will node max bandwidth change with the number of memory blocks?

> Good call, I'll stop considering this problem for now.
>
>> > If the node is removed from the system, I believe (need to validate
>> > this, but IIRC) the node will be removed from any registered
>> > cpusets. As a result, that falls down to mempolicy, and the node is
>> > removed.
>>
>> I do not think we do anything like that. Userspace might decide to
>> change the NUMA mask when a node is offlined, but I do not think we
>> do anything like that automagically.
>>
>
> mpol_rebind_policy is called by update_tasks_nodemask:
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> which falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
>
> /*
>  * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
>  * Call this routine anytime after node_states[N_MEMORY] changes.
>  * See cpuset_update_active_cpus() for CPU hotplug handling.
>  */
> static int cpuset_track_online_nodes(struct notifier_block *self,
>                                      unsigned long action, void *arg)
> {
>         schedule_work(&cpuset_hotplug_work);
>         return NOTIFY_OK;
> }
>
> void __init cpuset_init_smp(void)
> {
>         ...
>         hotplug_memory_notifier(cpuset_track_online_nodes,
>                                 CPUSET_CALLBACK_PRI);
> }
>
> The rebind causes 1 of 3 situations:
> MPOL_F_STATIC_NODES: overwrite with (old & new)
> MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold + onto?)
> Default: either does a remap or replaces old with new.
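To make those three cases concrete, here is a rough userspace sketch of
the rebind semantics. This is only an approximation: it uses plain
64-bit masks instead of nodemask_t, and the kernel does the real work
in mpol_rebind_nodemask() with nodes_fold()/nodes_onto().

/* Sketch only: approximates the rebind outcomes listed above. */
#include <stdint.h>
#include <stdio.h>

/* MPOL_F_STATIC_NODES: keep only nodes present in both masks. */
static uint64_t rebind_static(uint64_t user, uint64_t new_allowed)
{
        return user & new_allowed;
}

/* MPOL_F_RELATIVE_NODES: fold the user mask into the size of the
 * new mask, then map bit p onto the p-th set bit of the new mask. */
static uint64_t rebind_relative(uint64_t user, uint64_t new_allowed)
{
        int n_new = __builtin_popcountll(new_allowed);
        uint64_t tmp = 0, out = 0;

        for (int b = 0; b < 64; b++)            /* fold */
                if (user & (1ULL << b))
                        tmp |= 1ULL << (b % n_new);

        for (int p = 0; p < n_new; p++) {       /* onto */
                if (!(tmp & (1ULL << p)))
                        continue;
                uint64_t m = new_allowed;
                for (int i = 0; i < p; i++)
                        m &= m - 1;             /* drop lowest set bit */
                out |= m & -m;                  /* take the p-th set bit */
        }
        return out;
}

int main(void)
{
        uint64_t user = 0x9;            /* nodes 0 and 3 */
        uint64_t new_allowed = 0x6;     /* node 3 gone; nodes 1,2 online */

        /* static ends up empty (0x0); relative remaps onto 1,2 (0x6) */
        printf("static:   %#jx\n", (uintmax_t)rebind_static(user, new_allowed));
        printf("relative: %#jx\n", (uintmax_t)rebind_relative(user, new_allowed));
        return 0;
}

In the default (no flags) case the kernel instead remaps the old mask
onto the new one, preserving relative positions where possible.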
> My assumption based on this is that a hot-unplugged node would be
> removed completely. It doesn't look like hot-add is handled at all, so
> I can just drop that entirely for now (except for adding a default
> weight of 1 in case it is ever added in the future).
>
> I've been pushing against the weights being in memory-tiers.c for this
> reason: a weight set per-tier is meaningless if a node disappears.
>
> Example: a tier has 2 nodes with some weight N split between them, such
> that interleave gives each node N/2 pages. If 1 node is removed, the
> remaining node gets N pages, which is twice the allocation. Presumably
> a node is an abstraction of 1 or more devices, therefore if the node is
> removed, the weight should change.

The per-tier weight can be defined as the interleave weight of each node
of the tier. A tier just groups NUMA nodes with similar performance; the
performance (including bandwidth) remains per-node in the context of the
tier. If we have multiple nodes in one tier, this makes the weight
definition easier.
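Spelling out the arithmetic of both positions with hypothetical numbers
(weighted interleave itself is still only a proposal, so this is a
sketch, not an existing interface): the "per-tier" case shows the
doubling described above, the "per-node" case shows the definition
suggested here.

/* Two tiers: tier A has two nodes, tier B has one. 600 pages are
 * interleaved proportionally to the weights. All numbers invented. */
#include <stdio.h>

static void show(const char *label, unsigned total_pages,
                 const unsigned *node_weight, int n_nodes)
{
        unsigned sum = 0;
        for (int i = 0; i < n_nodes; i++)
                sum += node_weight[i];
        printf("%s\n", label);
        for (int i = 0; i < n_nodes; i++)
                printf("  node %d: %u pages\n", i,
                       total_pages * node_weight[i] / sum);
}

int main(void)
{
        /* Tier A: weight 4 split over two nodes (2 each).
         * Tier B: one node with weight 2. */
        unsigned before[] = { 2, 2, 2 };
        show("before hot-remove (A0, A1, B0):", 600, before, 3);

        /* Node A1 removed. If the weight stays per-tier, A0 inherits
         * all of tier A's weight and its share doubles. */
        unsigned tier_kept[] = { 4, 2 };
        show("after, per-tier weight kept (A0, B0):", 600, tier_kept, 2);

        /* With per-node weights, A0 keeps its own weight of 2 and the
         * surviving nodes stay in proportion to each other. */
        unsigned per_node[] = { 2, 2 };
        show("after, per-node weights (A0, B0):", 600, per_node, 2);
        return 0;
}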
> You could handle hotplug in tiers, but if a node being hotplugged
> forcibly removes the node from cpusets and mempolicy nodemasks, then
> it's irrelevant, since the node can never be selected for allocation
> anyway.
>
> It's looking more like cgroups is the right place to put this.

Having a cgroup/task-level interface doesn't prevent us from having a
system-level interface that provides defaults for cgroups/tasks, where
performance information (e.g., from HMAT) can help define a reasonable
default automatically.

>> Moving the global policy to cgroups would make the main concern of
>> different workloads looking for different policies less problematic.
>> I didn't have much time to think it through, but the main question is
>> how to sanely define hierarchical properties of those weights. This is
>> more resource distribution than enforcement, so maybe a simple
>> inherit-or-overwrite (if you have more specific needs) semantic makes
>> sense and is sufficient.
>>
>
> As a user I would assume it would operate much the same way as other
> nested cgroups: inherit by default (with subsets), or an explicit
> overwrite that can't exceed the higher-level settings.
>
> Weights could arguably allow different settings than capacity controls,
> but that could be an extension.
>
>> This is not as much about the code as it is about the proper
>> interface, because that will be cast in stone once introduced. It
>> would be really bad to realize that we have a global policy that
>> doesn't fit well and then have a hard time working around it without
>> breaking anybody.
>
> o7 I concur now. I'll take some time to rework this into a
> cgroups+mempolicy proposal based on my earlier RFCs.

--
Best Regards,
Huang, Ying