From: "Huang, Ying"
To: Gregory Price
Cc: Michal Hocko, "tj@kernel.org", "John Groves", Gregory Price,
    "linux-kernel@vger.kernel.org", "linux-cxl@vger.kernel.org",
    "linux-mm@kvack.org", "cgroups@vger.kernel.org",
    "linux-doc@vger.kernel.org", "akpm@linux-foundation.org",
    "lizefan.x@bytedance.com", "hannes@cmpxchg.org", "corbet@lwn.net",
    "roman.gushchin@linux.dev", "shakeelb@google.com",
    "muchun.song@linux.dev", "jgroves@micron.com"
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
In-Reply-To: (Gregory Price's message of "Tue, 5 Dec 2023 09:47:51 -0500")
References: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87sf4i2xe1.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87fs0h2fb4.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 06 Dec 2023 08:50:23 +0800
Message-ID: <875y1c2lyo.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
>> >> Gregory Price writes:
>> >>
>> >> > If the structure is built as a matrix of (cpu_node,mem_nodes),
>> >> > then you can also optimize based on the node the task is running on.
>> >>
>> >> The matrix stuff makes the situation complex.  If people do need
>> >> something like that, they can just use set_mempolicy2() with user
>> >> specified weights.  I still believe in "make simple stuff simple, and
>> >> complex stuff possible".
>> >>
>> >
>> > I don't think it's particularly complex, since we already have a
>> > distance matrix for numa nodes:
>> >
>> > available: 2 nodes (0-1)
>> > ... snip ...
>> > node distances:
>> > node   0   1
>> >   0:  10  21
>> >   1:  21  10
>> >
>> > This would follow the same thing, just adjustable for bandwidth.
>>
>> We should add complexity based on real requirements, not just because
>> something similar already exists.
>>
>> > I personally find the (src,dst) matrix very important for flexibility.
>>
>> With set_mempolicy2(), I think we have the needed flexibility for the
>> users who need the complexity.
>>
>> > But if there is particular pushback against it, having a one dimensional
>> > array is better than not having it, so I will take what I can get.
>>
>> TBH, I don't think that we really need that, especially given we will
>> have set_mempolicy2().
>>
>
> From a complexity standpoint, it is exactly as complex as the hardware
> configuration itself: each socket has a different view of the memory
> topology.  If you have a non-homogeneous memory configuration (e.g. a
> different number of CXL expanders on one socket than the other), a flat
> array of weights has no way of capturing this hardware configuration.

One important task of software is to hide the complexity of the hardware
from users, or at least to provide that option.  We should only add
complexity based on real requirements.
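To make the two alternatives concrete, here is a small user-space sketch
of the difference between a flat per-node weight array and a
(source node, memory node) weight matrix.  Everything in it - the node
numbering, the weight values, and the array names - is made up purely for
illustration and is not part of any existing or proposed kernel
interface.

/*
 * Illustration only: hypothetical topology with nodes 0/1 = sockets
 * (CPUs + DRAM) and nodes 2/3 = CPU-less CXL nodes attached to each
 * socket.  Build with:  cc weights.c -o weights -lnuma
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <numa.h>    /* numa_node_of_cpu() */

#define NR_NODES 4

/* A flat array: every task sees the same weights, wherever it runs. */
static const unsigned int flat_weights[NR_NODES] = { 5, 3, 2, 1 };

/*
 * A (source node, memory node) matrix: the row is chosen by the node the
 * task is currently running on, so the effective weights change
 * automatically when the task is migrated to the other socket.
 */
static const unsigned int weight_matrix[NR_NODES][NR_NODES] = {
	/* mem node:  0  1  2  3 */
	[0] =       { 5, 3, 2, 1 },  /* task running on socket 0 */
	[1] =       { 3, 5, 1, 2 },  /* task running on socket 1 */
	/* rows 2/3 unused: CPU-less CXL nodes never run tasks */
};

int main(void)
{
	int src = numa_node_of_cpu(sched_getcpu());

	if (src < 0 || src >= NR_NODES)  /* no NUMA support: fall back */
		src = 0;

	for (int mem = 0; mem < NR_NODES; mem++)
		printf("mem node %d: flat %u, matrix[%d] %u\n",
		       mem, flat_weights[mem], src, weight_matrix[src][mem]);
	return 0;
}

A flat array can only hold one of those rows for the whole system, which
is the limitation being described above; the counter-position in this
thread is that per-task weights via the proposed set_mempolicy2() can
cover such cases without a global matrix.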
> That makes the feature significantly less useful.  In fact, it makes the
> feature equivalent to set_mempolicy2 - except that weights could be
> changed at runtime from outside a process.
>
> A matrix resolves one very specific use case: task migration
>
> set_mempolicy2 is not sufficient to solve this.  There is presently no
> way for an external task to change the mempolicy of an existing task.
> That means a task must become "migration aware" to use weighting in the
> context of containers where migrations are likely.
>
> Two things to consider: A task...
>   a) has no way of knowing a migration occurred
>   b) may not have visibility of numa nodes outside its cpusets prior to
>      a migration - making it unlikely or impossible for it to set
>      weights correctly in the event a migration occurs.
>
> If a server with 2 sockets is set up non-homogeneously (a different
> number of CXL memory expanders on each socket), then the effective
> bandwidth distribution between sockets will be different.
>
> If a container is migrated between sockets in this situation, then tasks
> with manually set weights, or global weights kept in a single array, will
> have poor memory distributions relative to the new view of the system.
>
> Requiring the global settings to be an array basically forces the global
> weights to be sub-optimal for any use case that is not a single workload
> consuming all the cores on the system.
>
> If the system provides a matrix, then the global settings can be optimal,
> and re-weighting in response to migration happens cleanly and
> transparently.

For these complex requirements, we will have process_set_mempolicy2().  I
think that it's even more flexible than the global matrix.

--
Best Regards,
Huang, Ying
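As background for the interface gap discussed above, the sketch below
shows what a task can already do with the existing set_mempolicy(2)
system call, and notes in comments where the set_mempolicy2() and
process_set_mempolicy2() calls proposed in this thread would go further.
The file name, the node numbers, and the build line are only for
illustration.

/*
 * Illustration of the current interface gap.  This uses only the
 * existing set_mempolicy(2) wrapper from libnuma; set_mempolicy2() and
 * process_set_mempolicy2() do not exist yet, so they appear only in
 * comments.  Build with:  cc policy.c -o policy -lnuma
 */
#include <stdio.h>
#include <numaif.h>   /* set_mempolicy(), MPOL_INTERLEAVE */

int main(void)
{
	/*
	 * Unweighted interleave over nodes 0 and 2 (one bit per node).
	 * Fails with EINVAL if these nodes do not exist on this system.
	 */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask)))
		perror("set_mempolicy");

	/*
	 * Two things this call cannot express today:
	 *  - per-node weights: the proposed set_mempolicy2() is meant to
	 *    carry user-specified weights as well, and
	 *  - acting on another task: it only affects the caller, so an
	 *    orchestrator cannot re-weight a container's tasks after
	 *    migrating them (hence the proposed process_set_mempolicy2()).
	 */
	return 0;
}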