From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
	Dave Hansen, Michal Hocko, Wei Xu, Yang Shi, Zi Yan,
	David Rientjes, Dan Williams, David Hildenbrand
Subject: [PATCH -V10 9/9] mm/migrate: add sysfs interface to enable reclaim migration
Date: Thu, 15 Jul 2021 13:51:45 +0800
Message-Id: <20210715055145.195411-10-ying.huang@intel.com>
In-Reply-To: <20210715055145.195411-1-ying.huang@intel.com>
References: <20210715055145.195411-1-ying.huang@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, some workloads will benefit, such as
those with more "static" configurations where hot pages stay hot and
cold pages stay cold.  If pages come and go from the hot and cold
sets, the benefits of this approach will be more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it.  This patch
proposes adding a new sysfs file

  /sys/kernel/mm/numa/demotion_enabled

as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).

Once this is enabled, page demotion may move data to a NUMA node
that does not fall into the cpuset of the allocating process.
This could be construed to violate the guarantees of cpusets.
However, since this is an opt-in mechanism, the assumption is
that anyone enabling it is content to relax the guarantees.

Originally-by: Dave Hansen
Signed-off-by: Huang Ying
Cc: Michal Hocko
Cc: Wei Xu
Cc: Yang Shi
Cc: Zi Yan
Cc: David Rientjes
Cc: Dan Williams
Cc: David Hildenbrand

Changes since 20210618:
 * Guard next_demotion_node() with numa_demotion_enabled if necessary
   per Wei's comments.

Changes since 20210331:
 * Use sysfs interface separated from the zone_reclaim sysctl.

Changes since 20210304:
 * Add Documentation/ material about relaxing cpuset constraints

Changes since 20200122:
 * Changelog material about relaxing cpuset constraints
---
 .../ABI/testing/sysfs-kernel-mm-numa          | 24 ++++++++
 include/linux/mempolicy.h                     |  4 ++
 mm/mempolicy.c                                | 61 +++++++++++++++++++
 mm/vmscan.c                                   |  5 +-
 4 files changed, 92 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-numa

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
new file mode 100644
index 000000000000..77e559d4ed80
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -0,0 +1,24 @@
+What:		/sys/kernel/mm/numa/
+Date:		June 2021
+Contact:	Linux memory management mailing list
+Description:	Interface for NUMA
+
+What:		/sys/kernel/mm/numa/demotion_enabled
+Date:		June 2021
+Contact:	Linux memory management mailing list
+Description:	Enable/disable demoting pages during reclaim
+
+		Page migration during reclaim is intended for systems
+		with tiered memory configurations.  These systems have
+		multiple types of memory with varied performance
+		characteristics instead of plain NUMA systems where
+		the same kind of memory is found at varied distances.
+		Allowing page migration during reclaim enables these
+		systems to migrate pages from fast tiers to slow tiers
+		when the fast tier is under pressure.  This migration
+		is performed before swap.  It may move data to a NUMA
+		node that does not fall into the cpuset of the
+		allocating process which might be construed to violate
+		the guarantees of cpusets.  This should not be enabled
+		on systems which need strict cpuset location
+		guarantees.
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0aaf91b496e2..4ca025e2a77e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -184,6 +184,8 @@ extern bool vma_migratable(struct vm_area_struct *vma);
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
 extern void mpol_put_task_policy(struct task_struct *);
 
+extern bool numa_demotion_enabled;
+
 #else
 
 struct mempolicy {};
@@ -292,5 +294,7 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 {
 	return NULL;
 }
+
+#define numa_demotion_enabled	false
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 939eabcaf488..e675bfb856da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3021,3 +3021,64 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		numa_demotion_enabled = true;
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		numa_demotion_enabled = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b697f1a6108c..1afbbd7e853a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -521,6 +521,8 @@ static long add_nr_deferred(long nr, struct shrinker *shrinker,
 
 static bool can_demote_anon_pages(int nid, struct scan_control *sc)
 {
+	if (!numa_demotion_enabled)
+		return false;
 	if (sc) {
 		if (sc->no_demotion)
 			return false;
@@ -531,8 +533,7 @@ static bool can_demote_anon_pages(int nid, struct scan_control *sc)
 	if (next_demotion_node(nid) == NUMA_NO_NODE)
 		return false;
 
-	// FIXME: actually enable this later in the series
-	return false;
+	return true;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.30.2