Received: by 2002:a05:7412:b10a:b0:f3:1519:9f41 with SMTP id az10csp2559124rdb; Mon, 4 Dec 2023 00:21:28 -0800 (PST) X-Google-Smtp-Source: AGHT+IEWt9CtvG4vOoee7Pevcv3BKI9V39pV6ThV+qJydQ5hm8EaG2N9OVTh5fjNvk+Njk4LihOd X-Received: by 2002:a17:903:22cb:b0:1d0:8db6:17d0 with SMTP id y11-20020a17090322cb00b001d08db617d0mr860778plg.25.1701678088031; Mon, 04 Dec 2023 00:21:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701678088; cv=none; d=google.com; s=arc-20160816; b=VomQKl2CTmwrzyb4HzuwJI8Z6d8DwOQlb7S0xa2roicGq1DQ6L8bXKhRCSPF5Ny6JT 8B1sXd1ug+oLoTcYNOrh/9uxhy+5zAlBGqALfs6dNw5BBD1HHspDK3UJbHRtIYdGVOqV /GIe+UPFoRYerfEc9wzlUxzRVdOdSOJMAdzYSUBiCyN2LNmzYQwAoysy8O7s2OUy+am1 7wxoCkmP55f0hOKbthckbHCpBx/7ndnSKmNFo+oaIVfhMHnVuap90SHcjhzUuYvWV66w Z2Li12zGMC/NacI0YJfLicVVuBBQsu3Yge3igYxIkZ1LuIahser1N4AdCRwQr9+aer+r HXOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:message-id:date:references:in-reply-to:subject:cc:to :from:dkim-signature; bh=p6c8CYxr/2zb/8jhHMv/0o8ObpVQVy8G5o4hpC8nHro=; fh=Ec8+93QwXjRdnLuZ3TSm69spsYeQ+BX0TzruON84KMs=; b=IIM8av6++PwG/jkZEKGW7EOYvP784Qxm4jK3CNmAgwUYlqrFmxHk1Qv/l8d2CINORR gwfkdlHUYt1Bm+l3FwyfMueXJcvEGIkObNMXFWZt8J5uKf7/EfGKPKzOp1F2O7USKXzr jfGtpMH+aa1VGXatLFftD6/qyz+PS22xmtKBt4S/7Hh/SlcKqteMsN+zleV9lQVPxlDj 8lRKjuTVSTeQXxWRtaKbQoVfbPiKXBX8qM/iB6c5ybb67O6Xq3or1/xurCMbX+6U5q7o bviY8YB96Ixz2LvUIRJR3fd3a54P1/bbgg9sAEG6QuOlo4XW728cC+/ZVQTMJMbHe75P pFZg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=cMF4sUax; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from howler.vger.email (howler.vger.email. [23.128.96.34]) by mx.google.com with ESMTPS id jg2-20020a17090326c200b001d0b06f196asi774716plb.28.2023.12.04.00.21.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Dec 2023 00:21:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=cMF4sUax; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 97A9E80576D5; Mon, 4 Dec 2023 00:21:24 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229563AbjLDIVH (ORCPT + 99 others); Mon, 4 Dec 2023 03:21:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58704 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231408AbjLDIVC (ORCPT ); Mon, 4 Dec 2023 03:21:02 -0500 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84C42D8; Mon, 4 Dec 2023 00:21:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1701678068; x=1733214068; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version:content-transfer-encoding; bh=SctStx0/2r6BjPSkyNIBJr82NmxzrJUg7/EEBCUvaYM=; b=cMF4sUax1Cw27Kw1kE/AQuZ12AbCLPNFK4rEZ8Pp4pa+pvPRiVpwwN4A fRrfnr84T9YNan01J2gKiD45PyZHijeEffLKELxTtMIguglUvd0Lz/q6g JQRi/LrRhwffHAC6YDYlK29JRwPfB38AXfcmYoozF2nQb73x/T6YwlcNC aaAfJXUp3zp0zc1ECR8KL8naZkJ9N3q2ragK0RPvMAIk/1ZLVzOkYuouM rGNjgNQ6ls3Puk6D77DRJj8CKFrm029QpyWnBnBnAgO2eqqTrgGV6QJvq T3CClQaK1hLAGYor1tUyn71ss7mCCZZY4v9YMGrWoWa79j70UWygPbnQv w==; X-IronPort-AV: E=McAfee;i="6600,9927,10913"; a="752026" X-IronPort-AV: E=Sophos;i="6.04,249,1695711600"; d="scan'208";a="752026" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 00:21:08 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10913"; a="720242377" X-IronPort-AV: E=Sophos;i="6.04,249,1695711600"; d="scan'208";a="720242377" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orsmga003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 Dec 2023 00:21:02 -0800 From: "Huang, Ying" To: Gregory Price Cc: Michal Hocko , "tj@kernel.org" , "John Groves" , Gregory Price , "linux-kernel@vger.kernel.org" , "linux-cxl@vger.kernel.org" , "linux-mm@kvack.org" , "cgroups@vger.kernel.org" , "linux-doc@vger.kernel.org" , "akpm@linux-foundation.org" , "lizefan.x@bytedance.com" , "hannes@cmpxchg.org" , "corbet@lwn.net" , "roman.gushchin@linux.dev" , "shakeelb@google.com" , "muchun.song@linux.dev" , "jgroves@micron.com" Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control In-Reply-To: (Gregory Price's message of "Sun, 3 Dec 2023 22:33:08 -0500") References: <0100018bb64636ef-9daaf0c0-813c-4209-94e4-96ba6854f554-000000@email.amazonses.com> <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com> Date: Mon, 04 Dec 2023 16:19:02 +0800 Message-ID: <87sf4i2xe1.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Mon, 04 Dec 2023 00:21:25 -0800 (PST) Gregory Price writes: > On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote: >> Gregory Price writes: >>=20 >> Because we usually have multiple nodes in one mem-tier, I still think >> mem-tier-based interface is simpler than node-based. But, it seems more >> complex to introduce mem-tier into mempolicy. Especially if we have >> per-task weights. So, I am fine to go with node-based interface. >>=20 >> > * cgroups: "this doesn't involve dynamic resource accounting / >> > enforcement at all" and "these aren't resource >> > allocations, it's unclear what the hierarchical >> > relationship mean". >> > >> > * node: too global, explore smaller scope first then expand. >>=20 >> Why is it too global? I understand that it doesn't cover all possible >> use cases (although I don't know whether these use cases are practical >> or not). But it can provide a reasonable default per-node weight based >> on available node performance information (such as, HMAT, CDAT, etc.). >> And, quite some workloads can just use it. I think this is an useful >> feature. >> > > Have been sharing notes with more folks. Michal thinks a global set of > weights is unintuitive and not useful, and would prefer to see the > per-task weights first. > > Though this may have been in response to adding it as an attribute of > nodes directly.=20 > > Another proposal here suggested adding a new sysfs setting > https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8= 785952a > > $ tree /sys/kernel/mm/interleave_weight/ > /sys/kernel/mm/interleave_weight/ > =E2=94=9C=E2=94=80=E2=94=80 enabled [1] > =E2=94=9C=E2=94=80=E2=94=80 possible [2] > =E2=94=94=E2=94=80=E2=94=80 node > =E2=94=9C=E2=94=80=E2=94=80 node0 > =E2=94=82 =E2=94=94=E2=94=80=E2=94=80 interleave_weight [3] > =E2=94=94=E2=94=80=E2=94=80 node1 > =E2=94=94=E2=94=80=E2=94=80 interleave_weight [3] > > (this could be changed to /sys/kernel/mm/mempolicy/...) > > I think the internal representation of this can be simplified greatly, > over what the patch provides now, but maybe this solves the "it doesn't > belong in these other components" issue. > > Answer: Simply leave it as a static global kobject in mempolicy, which > also deals with many of the issues regarding race conditions. Although personally I prefer to add interleave weight as an attribute of nodes. I understand that some people think it's not appropriate to place anything node-specific there. So, some place under /sys/kernel/mm sounds reasonable too. > If a user provides weights, use those. If they do not, use globals. Yes. That is the target use case. > On a cpuset rebind event (container migration, mems_allowed changes), > manually set weights would have to remain, so in a bad case, the > weights would be very out of line with the real distribution of memory. > > Example: if your nodemask is (0,1,2) and a migration changes it to > (3,4,5), then unfortunately your weights will likely revert to [1,1,1] > > If set with global weights, they could automatically adjust. It > would not be perfect, but it would be better than the potential worst > case above. If that same migration occurs, the next allocation would > simply use whatever the target node weights are in the global config. > > So if globally you have weights [3,2,1,1,2,3], and you move from > nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to > [1,2,3]. That is nice. And I prefer to emphasize the simple use case. Users don't need to specify interleave weight always. Just use MPOL_WEIGHTED_INTERLEAVE policy, and system will provide reasonable default weight. > If the structure is built as a matrix of (cpu_node,mem_nodes), > the you can also optimize based on the node the task is running on. The matrix stuff makes the situation complex. If people do need something like that, they can just use set_memorypolicy2() with user specified weights. I still believe that "make simple stuff simple, and complex stuff possible". > That feels very intuitive, deals with many race condition issues, and > the global setting can actually be implemented without the need for > set_mempolicy2 at all - which is certainly a bonus. > > Would love more thoughts here. Will have a new RFC with set_mempolicy2, > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above. Thanks for doing all these! -- Best Regards, Huang, Ying