Date: Fri, 3 Nov 2023 10:56:01 +0100
From: Michal Hocko
To: Gregory Price
Cc: Johannes Weiner, Gregory Price, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com,
	akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
	weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
	dave.hansen@intel.com, shy828301@gmail.com,
	gregkh@linuxfoundation.org, rafael@kernel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
References: <20231031003810.4532-1-gregory.price@memverge.com>
	<20231031152142.GA3029315@cmpxchg.org>
On Wed 01-11-23 23:18:59, Gregory Price wrote:
> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
> > On Wed 01-11-23 12:58:55, Gregory Price wrote:
> > > Basically consider: `numactl --interleave=all ...`
> > >
> > > If `--weights=...`: when a node hotplug event occurs, there is no
> > > recourse for adding a weight for the new node (it will default to 1).
> >
> > Correct, and this is what I was asking about in an earlier email. How
> > much do we really need to consider this setup? Is this something nice
> > to have, or does the nature of the technology require it to be fully
> > dynamic and expect new nodes coming up at any moment?
>
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.
>
> Good call, I'll stop considering this problem for now.
>
> > > If the node is removed from the system, I believe (need to validate
> > > this, but IIRC) the node will be removed from any registered cpusets.
> > > As a result, that falls down to mempolicy, and the node is removed.
> >
> > I do not think we do anything like that. Userspace might decide to
> > change the numa mask when a node is offlined, but I do not think we do
> > anything like that automagically.
>
> mpol_rebind_policy is called by update_tasks_nodemask:
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> which falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771

Ohh, I have missed that. Thanks for the reference. Quite honestly, I am
not sure this code is really a) necessary and b) ever exercised. For the
former I would argue that an offline node could be treated as a
completely depleted one. From the correctness POV it shouldn't make any
difference, and I am rather skeptical it would bring any performance
improvements. As for the latter, full node offlines are really rare in
my experience. I would be interested in actual real-life use cases which
do that regularly.

I do remember a certain HW vendor working on a hotpluggable system (both
CPUs and memory) to reduce downtimes caused by misbehaving CPUs/memory.
This has turned out to be very impractical because of movable memory
requirements and also some HW limitations (like most HW being attached
to Node0, which has turned out to be a single point of failure anyway).

[...]

[...]

> > Moving the global policy to cgroups would make the main concern of
> > different workloads looking for different policies less problematic.
> > I didn't have much time to think that through, but the main question
> > is how to sanely define hierarchical properties of those weights. This
> > is more of a resource distribution than enforcement, so maybe a simple
> > inherit or overwrite (if you have more specific needs) semantic makes
> > sense and is sufficient.
> >
>
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is inherit by default (with subsets) or an
> explicit overwrite that can't exceed the higher level settings.

This would make it rather impractical because a default (everything set
to 1) would be cast in stone. As mentioned above, this is not an
enforcement limit. So I _think_ that a simple hierarchical rule like

	cgroup_interleaving_mask(cgroup)
		interleaving_mask = cgroup->interleaving_mask ?:
			cgroup_interleaving_mask(parent_cgroup(cgroup))

would be sufficient, so child cgroups could overwrite the parent as they
wish. If there is any enforcement (like a cpuset), that would filter the
usable nodes, and the allocation policy would simply apply the weights
on those.

-- 
Michal Hocko
SUSE Labs
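
To make the fallback rule above concrete, here is a minimal user-space C
sketch of the same inherit-or-overwrite semantic. The names (toy_cgroup,
resolve_weights) and the flat per-node weight tables are invented for
illustration; none of this is an existing kernel or cgroup interface:

	#include <stdio.h>

	#define MAX_NODES 4

	/* Toy stand-in for a cgroup: a group either carries its own
	 * weight table or inherits one (weights == NULL means "not set,
	 * fall back to the parent"). */
	struct toy_cgroup {
		struct toy_cgroup *parent;
		const unsigned int *weights;	/* NULL => inherit */
	};

	/* The rule from the mail: use the group's own table if set,
	 * otherwise the closest ancestor's. The root always carries a
	 * table (the all-ones default), so the recursion terminates. */
	static const unsigned int *resolve_weights(const struct toy_cgroup *cg)
	{
		return cg->weights ? cg->weights : resolve_weights(cg->parent);
	}

	int main(void)
	{
		static const unsigned int root_w[MAX_NODES]  = { 1, 1, 1, 1 };
		static const unsigned int child_w[MAX_NODES] = { 5, 1, 0, 0 };

		struct toy_cgroup root   = { NULL,   root_w };
		struct toy_cgroup child  = { &root,  child_w }; /* overwrite */
		struct toy_cgroup gchild = { &child, NULL };    /* inherits  */

		/* gchild sets nothing, so it resolves to child's table. */
		const unsigned int *w = resolve_weights(&gchild);
		for (int i = 0; i < MAX_NODES; i++)
			printf("node %d: weight %u\n", i, w[i]);
		return 0;
	}

Note that, in line with the "not an enforcement limit" point, a child's
overwrite is not clamped by its parent's table; a cpuset, if present,
would only filter which nodes the resolved weights get applied to.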