From: "Huang, Ying"
To: Gregory Price, Michal Hocko
Cc: tj@kernel.org, John Groves, Gregory Price, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, akpm@linux-foundation.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, corbet@lwn.net, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, jgroves@micron.com
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
In-Reply-To: (Gregory Price's message of "Tue, 14 Nov 2023 12:49:36 -0500")
References: <20231109002517.106829-1-gregory.price@memverge.com> <0100018bb64636ef-9daaf0c0-813c-4209-94e4-96ba6854f554-000000@email.amazonses.com>
Date: Wed, 15 Nov 2023 13:56:53 +0800
Message-ID: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
Gregory Price writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires the callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck, though. If
>> anything, cpuset lock optimizations would be a win also for users who do
>> not want to use the weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull model of having
> mempolicy read directly from cpusets, at least until the cpuset lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications of the pull model make me sad.
>
> I would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc.) before I go mucking around
> further.
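For reference, the weighted-interleave behaviour being argued over can be modelled in a few lines of userspace Python. This is only a sketch: the node IDs and the 3:1 DRAM/CXL weighting below are hypothetical, and the in-kernel page allocator path naturally looks nothing like this; it just shows what per-node weights mean for page placement.

```python
# Userspace model of weighted interleave: pages are handed out to NUMA
# nodes in proportion to per-node integer weights.
from itertools import cycle

def build_rotation(weights):
    """Expand {node: weight} into one repeating allocation order."""
    order = []
    for node, weight in sorted(weights.items()):
        order.extend([node] * weight)
    return order

def interleave(weights, num_pages):
    """Assign num_pages to nodes by cycling through the weighted order."""
    rotation = cycle(build_rotation(weights))
    return [next(rotation) for _ in range(num_pages)]

# Hypothetical topology: DRAM node 0 gets 3 of every 4 pages,
# a slower CXL node 1 gets 1 of every 4.
placement = interleave({0: 3, 1: 1}, 8)
# -> [0, 0, 0, 1, 0, 0, 0, 1]
```

With weights of 1 for every node this degenerates to plain MPOL_INTERLEAVE round-robin, which is why the weights are attractive as a superset of the existing behaviour.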
> So far we have:
>
> * mempolicy: Updating weights is a very complicated undertaking, and
>              there is no (good) way to do this from outside the task.
>              It would be better to have a coarser-grained control.
>
>              A new syscall is likely needed to add/set weights in the
>              per-task mempolicy, or we bite the bullet on set_mempolicy2
>              and make the syscall extensible for the future.
>
> * memtiers: tier == node when devices are already interleaved or when
>             all devices are different, so why add yet another layer of
>             complexity if other constructs already exist? Additionally,
>             you lose task-placement-relative weighting (or it becomes
>             very complex to implement).

Because we usually have multiple nodes in one memory tier, I still think
a mem-tier-based interface is simpler than a node-based one. But it
seems more complex to introduce mem-tier into mempolicy, especially if
we have per-task weights. So, I am fine with going with a node-based
interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
>            enforcement at all" and "these aren't resource allocations,
>            it's unclear what the hierarchical relationship means".
>
> * node: too global; explore smaller scope first, then expand.

Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether those use cases are practical
or not). But it can provide a reasonable default per-node weight based
on available node performance information (such as HMAT, CDAT, etc.),
and quite a few workloads could just use it. I think this is a useful
feature.

> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC with some options on that in the next
> week or so.
>
> The limitations on the first pass will be that only the task is capable
> of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying
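To make the "reasonable default per-node weight" idea concrete, weights could be reduced from advertised per-node bandwidth figures to their smallest integer ratio. This is a sketch only: the bandwidth numbers are hypothetical, and real derivation from HMAT/CDAT data would be considerably more involved (latency, initiator/target pairs, etc.).

```python
# Derive default interleave weights from per-node bandwidth by reducing
# the bandwidths to their smallest integer ratio.
from functools import reduce
from math import gcd

def default_weights(bandwidth_mbps):
    """Map {node: bandwidth} to {node: weight} in lowest integer terms."""
    g = reduce(gcd, bandwidth_mbps.values())
    return {node: bw // g for node, bw in bandwidth_mbps.items()}

# Hypothetical figures: DRAM node 0 at ~240 GB/s, CXL node 1 at ~60 GB/s,
# yielding a 4:1 default weighting.
weights = default_weights({0: 240_000, 1: 60_000})
# -> {0: 4, 1: 1}
```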