From: Huang Ying
To: linux-kernel@vger.kernel.org
Cc: Huang Ying, Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman,
    Peter Zijlstra, Dave Hansen, Yang Shi, Zi Yan, Wei Xu, osalvador,
    Shakeel Butt, linux-mm@kvack.org
Subject: [PATCH -V9 5/6] memory tiering: rate limit NUMA migration throughput
Date: Fri, 8 Oct 2021 16:39:37 +0800
Message-Id: <20211008083938.1702663-6-ying.huang@intel.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20211008083938.1702663-1-ying.huang@intel.com>
References: <20211008083938.1702663-1-ying.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

In NUMA balancing memory tiering mode, hot pages in the slow memory
node can be promoted to the fast memory node via NUMA balancing.  But
the promotion incurs some overhead too, so it may sometimes hurt
workload performance.  To avoid disturbing the workload too much in
these situations, we should make it possible to rate limit the
promotion throughput.  So, in this patch, we implement a simple rate
limit algorithm as follows: the number of candidate pages to be
promoted to the fast memory node via NUMA balancing is counted; if
the count exceeds the limit specified by the user, NUMA balancing
promotion is stopped until the next second.

A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
the user to specify the limit.
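To illustrate the intended behavior, here is a minimal user-space C
sketch of the same one-second-window rate limit.  It is only an
illustration, not the kernel code below: check_rate_limit(), the
counter names, and the assumed 4KB page size are all stand-ins, and
the single-threaded sketch omits the cmpxchg() the kernel version uses
so that only one racing CPU re-snapshots the counter.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Illustrative stand-ins for PGPROMOTE_CANDIDATE, pgdat->numa_ts and
 * pgdat->numa_nr_candidate; names and layout are assumptions. */
static unsigned long total_candidates;   /* candidates ever seen */
static unsigned long window_ts;          /* start of current 1s window */
static unsigned long window_candidates;  /* snapshot at window start */

/* Return true if promoting nr more pages stays under the limit. */
static bool check_rate_limit(unsigned long limit_pages, int nr)
{
	unsigned long now = (unsigned long)time(NULL);

	total_candidates += nr;
	if (now > window_ts) {
		/* A second has passed: open a new window and snapshot
		 * the counter, so the delta below restarts from 0. */
		window_ts = now;
		window_candidates = total_candidates;
	}
	/* Reject once this window's candidates exceed the limit. */
	return total_candidates - window_candidates <= limit_pages;
}

int main(void)
{
	/* 65536 MB/s (the patch's default) in 4KB pages per second,
	 * mirroring rate_limit << (20 - PAGE_SHIFT) with PAGE_SHIFT=12. */
	unsigned long limit_pages = 65536UL << (20 - 12);

	printf("promote 512 pages: %s\n",
	       check_rate_limit(limit_pages, 512) ? "ok" : "rate limited");
	return 0;
}

With this scheme, candidates beyond the limit in the current window
simply fall back to the regular non-promoting path until the window
rolls over, rather than being queued.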
TODO: Add ABI document for new sysctl knob.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Dave Hansen
Cc: Yang Shi
Cc: Zi Yan
Cc: Wei Xu
Cc: osalvador
Cc: Shakeel Butt
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h       |  5 +++++
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c          | 29 +++++++++++++++++++++++++++--
 kernel/sysctl.c              |  8 ++++++++
 mm/vmstat.c                  |  1 +
 5 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 37ccd6158765..d6a0efd387bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -212,6 +212,7 @@ enum node_stat_item {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
+	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
 #endif
 	NR_VM_NODE_STAT_ITEMS
 };
@@ -887,6 +888,10 @@ typedef struct pglist_data {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_ts;
+	unsigned long numa_nr_candidate;
+#endif
 	/* Fields commonly accessed by the page reclaim scanner */
 
 	/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 0ea43b146aee..7d937adaac0f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -42,6 +42,7 @@ enum sched_tunable_scaling {
 #ifdef CONFIG_NUMA_BALANCING
 extern int sysctl_numa_balancing_mode;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
+extern unsigned int sysctl_numa_balancing_rate_limit;
 #else
 #define sysctl_numa_balancing_mode	0
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ed370c159dd..c57baeacfc1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1071,6 +1071,11 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 /* The page with hint page fault latency < threshold in ms is considered hot */
 unsigned int sysctl_numa_balancing_hot_threshold = 1000;
+/*
+ * Restrict the NUMA promotion throughput (MB/s) for each target node
+ * if there is not enough free space on the target node
+ */
+unsigned int sysctl_numa_balancing_rate_limit = 65536;
 
 struct numa_group {
 	refcount_t refcount;
@@ -1443,6 +1448,23 @@ static int numa_hint_fault_latency(struct page *page)
 	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
 }
 
+static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
+					    unsigned long rate_limit, int nr)
+{
+	unsigned long nr_candidate;
+	unsigned long now = jiffies, last_ts;
+
+	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+	nr_candidate = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+	last_ts = pgdat->numa_ts;
+	if (now > last_ts + HZ &&
+	    cmpxchg(&pgdat->numa_ts, last_ts, now) == last_ts)
+		pgdat->numa_nr_candidate = nr_candidate;
+	if (nr_candidate - pgdat->numa_nr_candidate > rate_limit)
+		return false;
+	return true;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu)
 {
@@ -1457,7 +1479,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 	    !node_is_toptier(src_nid)) {
 		struct pglist_data *pgdat;
-		unsigned long latency, th;
+		unsigned long rate_limit, latency, th;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat))
@@ -1468,7 +1490,10 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 		if (latency > th)
 			return false;
 
-		return true;
+		rate_limit =
+			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+		return numa_migration_check_rate_limit(pgdat, rate_limit,
+						       thp_nr_pages(page));
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ea105f52b646..0d89021bd66a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1818,6 +1818,14 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "numa_balancing_rate_limit_mbps",
+		.data		= &sysctl_numa_balancing_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 	{
 		.procname	= "sched_rt_period_us",
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fff0ec94d795..da2abeaf9e6c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1238,6 +1238,7 @@ const char * const vmstat_text[] = {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
+	"pgpromote_candidate",
 #endif
 
 	/* enum writeback_stat_item counters */
-- 
2.30.2