Received: by 2002:a6b:fb09:0:0:0:0:0 with SMTP id h9csp4578990iog; Wed, 22 Jun 2022 01:29:21 -0700 (PDT) X-Google-Smtp-Source: AGRyM1vF1tpOTy2pK9JDq/RsghJvCEx5bZMQyBksfLW3tEbGx1DkA/s/8mKjDFhapvacloyq1nAp X-Received: by 2002:a63:3e47:0:b0:40c:f2dd:bc4 with SMTP id l68-20020a633e47000000b0040cf2dd0bc4mr1989077pga.47.1655886560826; Wed, 22 Jun 2022 01:29:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1655886560; cv=none; d=google.com; s=arc-20160816; b=UZ5X9q7AGAeeDF8RuLaeQvOdd8UZ1HrbfdCLqf4K0RwXhn6VcuhJWkFV4MgqQrseJI PFkVQwNazQWSaRyYLPIKsGqQ39perHn6IoGcQR6tPfy1E6mkt+5hGS9Udu1szov8bvIC 43QOKIhO3wf8epz3CtM2d8UMhQfHFs1L5q+9K6B3QWI469r8mPJMs7DWDs2frRUMptom /hiyW/mhjbQh6ZxQaQSSYkQZLp4G6jhpfBEIqw5GPmpvdALvv1rlfWNlidPUJUEuIYYU v0bwrT98L4+EBuEbGOp7x/kHrQ1UH/Zo3aMfvlNg5fCjQeroPxPT8qz5CMgwE2pisiG9 mamg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:content-transfer-encoding :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=VROdRSEQ/S0c3jBZGx05uszC//GrM1Yj8D2HDfideE0=; b=ZKgTUA0UAsf2CceHmmYCXycpp3c63uCVsHlPSxT8PNM8oDnAJRpfo9wMS3iiROJaOX wgzXZtB5q3ncfK7kjfbCnG6fJsMAON04Tss5EeEuCq0sHfuCo7yXs/hFSddU+raxh8l+ VPhWVqeo7rHOV9SuWQ7eAgOzghMqnLgCIasq9hwp1dnmFgqLSLJJA0r6uAIDXnUEQXfr 9wNtY1wa3Z5w2pKSVUktodaMjmpNgMmNBR4G9KOGxSd+yjzXcCpX+XP5ac2aR2WQGFTf pvyLkiGUl2Tjuj5Kgezc9oBUaQJuLYdGXteImiL5yblZsN3TMU4t1mXYN3GUGjgD1a9f qWlA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=RAC9xQsy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c18-20020a170902b69200b00163a54d2612si20516332pls.418.2022.06.22.01.29.08; Wed, 22 Jun 2022 01:29:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=RAC9xQsy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351678AbiFVI0t (ORCPT + 99 others); Wed, 22 Jun 2022 04:26:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49640 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1349929AbiFVI0T (ORCPT ); Wed, 22 Jun 2022 04:26:19 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DD2E0381BC for ; Wed, 22 Jun 2022 01:26:18 -0700 (PDT) Received: from pps.filterd (m0098410.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25M8CB7X028682; Wed, 22 Jun 2022 08:25:41 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-transfer-encoding : mime-version; s=pp1; bh=VROdRSEQ/S0c3jBZGx05uszC//GrM1Yj8D2HDfideE0=; b=RAC9xQsylRNI6X3iEGyJzr8EYa+riBaZzpNfFHvqN0r7ypGbcHE9Sb00GzrH9fyxwfeI XB7htGBgFEyEA+9wIGcNUCd5mbNtBJ583rTSfjTDAzmqsrB40IaW35cPa0bNZxRQ7zD1 x1ZWIUlFwv5Yalulzuua30S3KXAnZQas6J7re01D40U4PKYLrkbe1jgfR+w7ghpu0JDI tcibP3KI0MjBkAvJX2B7fTCFBtWKYCThVGGTn1OGXOrBQUK3bqFT+vt+3v2gVJ3QK+vB +O4uwMDsXX4HFnOfELTaEuf7LhE3yRsouIgTswAx/X54yAa4Pmg5yoqXmbZJIbdLDexw 0g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3guybyrbak-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 22 Jun 2022 08:25:41 +0000 Received: from m0098410.ppops.net (m0098410.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25M8EFSv003928; Wed, 22 Jun 2022 08:25:40 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3guybyrba7-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 22 Jun 2022 08:25:40 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25M8K44j003459; Wed, 22 Jun 2022 08:25:39 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma02dal.us.ibm.com with ESMTP id 3gt0098r4w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 22 Jun 2022 08:25:39 +0000 Received: from b03ledav001.gho.boulder.ibm.com (b03ledav001.gho.boulder.ibm.com [9.17.130.232]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25M8Pcvq34079000 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 22 Jun 2022 08:25:38 GMT Received: from b03ledav001.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2DF876E04E; Wed, 22 Jun 2022 08:25:38 +0000 (GMT) Received: from b03ledav001.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 552776E056; Wed, 22 Jun 2022 08:25:32 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.69.226]) by b03ledav001.gho.boulder.ibm.com (Postfix) with ESMTP; Wed, 22 Jun 2022 08:25:31 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , "Aneesh Kumar K.V" , Jagdish Gediya Subject: [PATCH v7 01/12] mm/demotion: Add support for explicit memory tiers Date: Wed, 22 Jun 2022 13:55:02 +0530 Message-Id: <20220622082513.467538-2-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220622082513.467538-1-aneesh.kumar@linux.ibm.com> References: <20220622082513.467538-1-aneesh.kumar@linux.ibm.com> X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: MabY7QcKO4tuzi2l6m-P4thk0oTF-59f X-Proofpoint-GUID: Tzn7vHFTX388RGnIA-yPx5-6IC5cLSo3 Content-Transfer-Encoding: 8bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-21_11,2022-06-22_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 clxscore=1015 mlxlogscore=999 suspectscore=0 impostorscore=0 malwarescore=0 adultscore=0 bulkscore=0 mlxscore=0 phishscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206220039 X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_EF,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the top tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes. This current memory tier kernel interface needs to be improved for several important use cases, The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on a virtual machine) and should be put into a higher tier. The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier. With current kernel higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that. The current kernel also don't provide any interfaces for the userspace to learn about the memory tier hierarchy in order to optimize its memory allocations. This patch series address the above by defining memory tiers explicitly. This patch introduce explicity memory tiers. The tier ID value of a memory tier is used to derive the demotion order between NUMA nodes. For example, if we have 3 memtiers: memtier100, memtier200, memiter300 then the memory tier order is: memtier300 -> memtier200 -> memtier100 where memtier300 is the highest tier and memtier100 is the lowest tier. While reclaim we migrate pages from fast(higher) tiers to slow(lower) tiers when the fast(higher) tier is under memory pressure. This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300) which are created by different kernel subsystems. The default memory tier created by the kernel is memtier200. A kernel parameter is provided to override the default memory tier. Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 15 +++++++ mm/Makefile | 1 + mm/memory-tiers.c | 78 ++++++++++++++++++++++++++++++++++++ 3 files changed, 94 insertions(+) create mode 100644 include/linux/memory-tiers.h create mode 100644 mm/memory-tiers.c diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h new file mode 100644 index 000000000000..a81dbc20e0d1 --- /dev/null +++ b/include/linux/memory-tiers.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMORY_TIERS_H +#define _LINUX_MEMORY_TIERS_H + +#ifdef CONFIG_NUMA + +#define MEMORY_TIER_HBM_GPU 300 +#define MEMORY_TIER_DRAM 200 +#define MEMORY_TIER_PMEM 100 + +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM +#define MAX_MEMORY_TIER_ID 400 + +#endif /* CONFIG_NUMA */ +#endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a..d30acebc2164 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o +obj-$(CONFIG_NUMA) += memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c new file mode 100644 index 000000000000..69a5d81c0a12 --- /dev/null +++ b/mm/memory-tiers.c @@ -0,0 +1,78 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include + +struct memory_tier { + struct list_head list; + nodemask_t nodelist; + int id; +}; + +static DEFINE_MUTEX(memory_tier_lock); +static LIST_HEAD(memory_tiers); + +static void insert_memory_tier(struct memory_tier *memtier) +{ + struct list_head *ent; + struct memory_tier *tmp_memtier; + + lockdep_assert_held_once(&memory_tier_lock); + + list_for_each(ent, &memory_tiers) { + tmp_memtier = list_entry(ent, struct memory_tier, list); + if (tmp_memtier->id < memtier->id) { + list_add_tail(&memtier->list, ent); + return; + } + } + list_add_tail(&memtier->list, &memory_tiers); +} + +static struct memory_tier *register_memory_tier(unsigned int tier) +{ + struct memory_tier *memtier; + + if (tier > MAX_MEMORY_TIER_ID) + return ERR_PTR(-EINVAL); + + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!memtier) + return ERR_PTR(-ENOMEM); + + memtier->id = tier; + + insert_memory_tier(memtier); + + return memtier; +} + +static unsigned int default_memtier = DEFAULT_MEMORY_TIER; +core_param(default_memory_tier, default_memtier, uint, 0644); + +static int __init memory_tier_init(void) +{ + struct memory_tier *memtier; + + /* + * Register only default memory tier to hide all empty + * memory tier from sysfs. Since this is early during + * boot, we could avoid holding memtory_tier_lock. But + * keep it simple by holding locks. So we can add lock + * held debug checks in other functions. + */ + mutex_lock(&memory_tier_lock); + memtier = register_memory_tier(default_memtier); + if (IS_ERR(memtier)) + panic("%s() failed to register memory tier: %ld\n", + __func__, PTR_ERR(memtier)); + + /* CPU only nodes are not part of memory tiers. */ + memtier->nodelist = node_states[N_MEMORY]; + mutex_unlock(&memory_tier_lock); + return 0; +} +subsys_initcall(memory_tier_init); -- 2.36.1