From: "Huang, Ying"
To: "Yasunori Gotou (Fujitsu)"
Cc: "Zhijian Li (Fujitsu)", Andrew Morton, Greg Kroah-Hartman,
 "rafael@kernel.org", "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
In-Reply-To: (Yasunori Gotou's message of "Wed, 31 Jan 2024 06:23:12 +0000")
References: <20231102025648.1285477-1-lizhijian@fujitsu.com>
 <20231102025648.1285477-2-lizhijian@fujitsu.com>
 <878r7g3ktj.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87fryegv9c.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 31 Jan 2024 14:52:15 +0800
Message-ID: <8734uegfkw.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

"Yasunori Gotou (Fujitsu)" writes:

> Hello,
>
>> Li Zhijian writes:
>>
>> > Hi Ying
>> >
>> > I need to pick up this thread/patch again.
>> >
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What's more need to be displayed in nodeX/demotion_nodes?
>> >>
>> >
>> > Yes, it's believed that
>> > /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
>> > are intended to show nodes in memory_tierN. But IMHO, it's not enough,
>> > especially for the preferred demotion node(s).
>> >
>> > Currently, when a demotion occurs, it will prioritize selecting a node
>> > from the preferred nodes as the destination node for the demotion. If
>> > the preferred nodes do not meet the requirements, it will try from
>> > all the lower memory tier nodes until it finds a suitable demotion
>> > destination node or ultimately fails.
>> >
>> > However, currently it only lists the nodes of each tier. If the
>> > administrators want to know all the possible demotion destinations for
>> > a given node, they need to calculate it themselves:
>> > Step 1, find the memory tier where the given node is located
>> > Step 2, list all nodes under all its lower tiers
>> >
>> > It will be even more difficult to know the preferred nodes which
>> > depend on more factors, distance etc. For the following example, we
>> > may have 6 nodes splitting into three memory tiers.
>> >
>> > For emulated hmat numa topology example:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >>   0:  10  31  21  41  21  41
>> >>   1:  31  10  41  21  41  21
>> >>   2:  21  41  10  51  21  51
>> >>   3:  31  21  51  10  51  21
>> >>   4:  21  41  21  51  10  51
>> >>   5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For the above topology, memory-tier will build the demotion path for each
>> > node like this:
>> > node[0].preferred = 2
>> > node[0].demotion_targets = 2-5
>> > node[1].preferred = 5
>> > node[1].demotion_targets = 2-5
>> > node[2].preferred = 4
>> > node[2].demotion_targets = 3-4
>> > node[3].preferred =
>> > node[3].demotion_targets =
>> > node[4].preferred =
>> > node[4].demotion_targets =
>> > node[5].preferred = 3
>> > node[5].demotion_targets = 3-4
>> >
>> > But this demotion path is not explicitly known to administrator. And
>> > with the feedback from our customers, they also think it is helpful to
>> > know demotion path built by kernel to understand the demotion
>> > behaviors.
>> >
>> > So I think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
>> >
>> > I value your opinion, and I'd like to know what you think about...
>>
>> Per my understanding, we will not expose everything inside kernel to user
>> space. For page placement in a tiered memory system, demotion is just a part
>> of the story. For example, if the DRAM of a system becomes full, new page
>> allocation will fall back to the CXL memory. Have we exposed the default page
>> allocation fallback order to user space?
>
> In extreme terms, users want to analyze all the memory behaviors of memory management
> while executing their workload, and want to trace ALL of them if possible.
> Of course, it is impossible due to the heavy load, so users want to have other ways as
> a compromise. Our request, the demotion target information, is just one of them.
>
> In my impression, users worry about the impact of the CXL memory device on their workload,
> and want to have a way to understand the impact.
> If they know there is no information to relieve their anxiety, they may avoid buying CXL memory.
>
> In addition, our support team also needs to have clues to solve users' performance problems.
> Even if new page allocation will fall back to the CXL memory, we need to explain why it would
> happen as accountability.
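
For reference, the two-step calculation quoted above (find the node's
tier, then collect every node in the lower tiers) can be sketched in
user-space C. This is only a sketch under stated assumptions: struct tier,
parse_nodelist(), demotion_allowed() and example_tiers are hypothetical
names, not kernel or library APIs; node ids are limited to 0-63; a larger
memory_tierN number is taken to mean a lower tier, as in the example
topology; and the nodelist strings are hard-coded from that example rather
than read from sysfs.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: one memory tier and its nodelist string. */
struct tier {
	int id;               /* N in memory_tierN; larger N = lower tier (assumed) */
	const char *nodelist; /* e.g. "0-1", "2,5", "3-4" */
};

/* The three tiers from the example topology quoted in this thread. */
const struct tier example_tiers[] = {
	{ 4,  "0-1" },
	{ 12, "2,5" },
	{ 54, "3-4" },
};

/*
 * Parse a nodelist such as "0-1" or "2,5" into a bitmask with bit n set
 * for node n. Only node ids 0..63 are recorded. Parsing stops at the
 * first character that is not part of the list (e.g. a trailing newline).
 */
uint64_t parse_nodelist(const char *s)
{
	uint64_t mask = 0;

	while (*s >= '0' && *s <= '9') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		/* "lo-hi" range: read the upper bound. */
		if (*end == '-' && end[1] >= '0' && end[1] <= '9')
			hi = strtol(end + 1, &end, 10);
		for (long n = lo; n <= hi && n < 64; n++)
			mask |= UINT64_C(1) << n;
		s = (*end == ',') ? end + 1 : end;
	}
	return mask;
}

/*
 * Allowed demotion targets of @node: the union of all nodes in tiers
 * below the tier containing @node. Returns 0 if @node is in no tier or
 * is already in the lowest tier.
 */
uint64_t demotion_allowed(const struct tier *tiers, int ntiers, int node)
{
	uint64_t targets = 0;
	int own = -1;

	if (node < 0 || node >= 64)
		return 0;

	/* Step 1: find the memory tier where the given node is located. */
	for (int i = 0; i < ntiers; i++)
		if (parse_nodelist(tiers[i].nodelist) & (UINT64_C(1) << node))
			own = tiers[i].id;
	if (own < 0)
		return 0;

	/* Step 2: collect all nodes under all its lower tiers. */
	for (int i = 0; i < ntiers; i++)
		if (tiers[i].id > own)
			targets |= parse_nodelist(tiers[i].nodelist);
	return targets;
}
```

With the example tiers, demotion_allowed(example_tiers, 3, 0) gives the
mask for nodes 2-5 and demotion_allowed(example_tiers, 3, 2) gives nodes
3-4, which agrees with the demotion_targets listed above. The sketch says
nothing about the preferred nodes, which depend on further factors such
as distance.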

I guess

/proc//numa_maps
/sys/fs/cgroup//memory.numa_stat

may help to understand system behavior.

--
Best Regards,
Huang, Ying

>>
>> All in all, in my opinion, we only expose as little as possible to user space
>> because we need to maintain the ABI for ever.
>
> I can understand there is a compatibility problem with our proposal, and the kernel may
> change its logic in the future. This is a tug-of-war situation between kernel developers
> and users or support engineers. I suppose it often occurs in many places...
>
> Hmm... I hope there is a new idea to solve this situation even if our proposal is rejected..
> Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian writes:
>> >>
>> >>> It shows the demotion target nodes of a node. Export this
>> >>> information to user directly.
>> >>>
>> >>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>>
>> >>> - After node3 is online as kmem
>> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>> >>> [
>> >>>   {
>> >>>     "chardev":"dax0.0",
>> >>>     "size":1054867456,
>> >>>     "target_node":3,
>> >>>     "align":2097152,
>> >>>     "mode":"system-ram",
>> >>>     "online_memblocks":0,
>> >>>     "total_memblocks":7
>> >>>   }
>> >>> ]
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What's more need to be displayed in nodeX/demotion_nodes?
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian
>> >>> ---
>> >>>  drivers/base/node.c          | 13 +++++++++++++
>> >>>  include/linux/memory-tiers.h |  6 ++++++
>> >>>  mm/memory-tiers.c            |  8 ++++++++
>> >>>  3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> >>> index 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>>  #include
>> >>>  #include
>> >>>  #include
>> >>> +#include
>> >>>  #include
>> >>>  #include
>> >>>  #include
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> >>>  }
>> >>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>> +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> +				   struct device_attribute *attr, char *buf)
>> >>> +{
>> >>> +	int ret;
>> >>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> +	return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>>  static struct attribute *node_dev_attrs[] = {
>> >>>  	&dev_attr_meminfo.attr,
>> >>>  	&dev_attr_numastat.attr,
>> >>>  	&dev_attr_distance.attr,
>> >>>  	&dev_attr_vmstat.attr,
>> >>> +	&dev_attr_demotion_nodes.attr,
>> >>>  	NULL
>> >>>  };
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>>  #ifdef CONFIG_MIGRATION
>> >>>  int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>>  bool node_is_toptier(int node);
>> >>>  #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>>  	return NUMA_NO_NODE;
>> >>>  }
>> >>> +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  {
>> >>>  	*targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  	rcu_read_unlock();
>> >>>  }
>> >>> +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	if (!node_demotion)
>> >>> +		return NODE_MASK_NONE;
>> >>> +
>> >>> +	return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>>  /**
>> >>>   * next_demotion_node() - Get the next node in the demotion path
>> >>>   * @node: The starting node to lookup the next node