Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp702898pxu; Wed, 2 Dec 2020 00:48:23 -0800 (PST) X-Google-Smtp-Source: ABdhPJxHAkHNK29k3qgK4R9bI0xKq2tyKRcTJQanm3bNJJzYmYxA5vEtwwprYBa/WrJ3Hk8ofHGJ X-Received: by 2002:a50:fc9a:: with SMTP id f26mr1522787edq.255.1606898902871; Wed, 02 Dec 2020 00:48:22 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1606898902; cv=none; d=google.com; s=arc-20160816; b=qv4IbbkaiJ6XyT1V7E4tEnYGF3wWBsIu/23TQqRXyIZa6bnKYgmjC5u8LzN9wrctvC qLqDtXNsrWOXtHjzWdI3V+oWaLB5hmaXlR2eXZMNj1Ix6np0Jdkhcazy/gwEf5GrmcW8 KGw2xKK6+n9o1rIHEZbg+tlpbhkRZ2JtAnS5tUjKPlMJ9iLf/QI6l5lQmv6KKKD7j/ax 1UKvcHZFh4EqWlMpiN04zgB3FRB02iGV4unKZ845ws0t5ubRUs24nm7lLPuZNbe70s82 sxcByZkDXzW4aOXhuCBvZfq7YNpL0176ULyIlFfCGMcDuzNmoxJ4SabziNyquKIu6Ntz Sm7A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :ironport-sdr:ironport-sdr; bh=jwo0qdQVmTgxih2xFOvG2nW91Un9ZsPs+VxRQbHHrD4=; b=jzkcsTxpWjp3M7e9K1ReumfwIEbnfKpJjx+FTmJHrm0FSSfyODXrYVzE55zOARHJ3Y xNAUvumk36HIPS+DUWYyWprNCzHp3rdeGTg8vD6jHZAhNe1nzztKeXNX6e0VLLOPj2R0 Ld0QczzgPJ9IMGUJMbq4aIyI/kBRFyNJ4uqDwVT+8IXFpfeQvu6eR7N9kpxd8Gc2Fkdg ab4AVnzv77a8bAt2uw/tB7d3sjdJpRknJeWBZCOlKFWGHKEPWCt7SKQWnYVs75lNYln5 MiOFR+3JrJeCunJJoebilpxFNyYQrB2F5FwcwQmSmIBAmh0aQnnhao7lQbb3TZwT4oNp 1ovg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ca17si632285edb.223.2020.12.02.00.47.59; Wed, 02 Dec 2020 00:48:22 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387526AbgLBInm (ORCPT + 99 others); Wed, 2 Dec 2020 03:43:42 -0500 Received: from mga05.intel.com ([192.55.52.43]:38719 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726309AbgLBInl (ORCPT ); Wed, 2 Dec 2020 03:43:41 -0500 IronPort-SDR: 6bWkYQf4RdSofpavhEZKNMhoh+t0EvN59cmqGu98xMwpfDZLOZBRACSGAzLEuIWSX//yZm9rUd JKcY8p3vaHhQ== X-IronPort-AV: E=McAfee;i="6000,8403,9822"; a="257690463" X-IronPort-AV: E=Sophos;i="5.78,386,1599548400"; d="scan'208";a="257690463" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Dec 2020 00:43:00 -0800 IronPort-SDR: s76vDhUEHZ3RZ4RavnvNuEQYmA2wyF3aXEBp3a7AVb3VsCXjcwBR1JXHp+bYUEiptCtWoCBXVW x2P0xKMaWNTA== X-IronPort-AV: E=Sophos;i="5.78,386,1599548400"; d="scan'208";a="539590147" Received: from yhuang6-mobl1.sh.intel.com (HELO yhuang6-MOBL1.intel.com) ([10.238.5.184]) by fmsmga005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 02 Dec 2020 00:42:54 -0800 From: Huang Ying To: Peter Zijlstra , Mel Gorman Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , "Matthew Wilcox (Oracle)" , Rafael Aquini , Andrew Morton , Ingo Molnar , Rik van Riel , Johannes Weiner , Dave Hansen , Andi Kleen , Michal Hocko , David Rientjes , linux-api@vger.kernel.org Subject: [PATCH -V6 RESEND 3/3] NOT kernel/numactl: Support to enable Linux kernel NUMA balancing Date: Wed, 2 Dec 2020 16:42:34 +0800 Message-Id: <20201202084234.15797-4-ying.huang@intel.com> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20201202084234.15797-1-ying.huang@intel.com> References: <20201202084234.15797-1-ying.huang@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org A new API: numa_set_membind_balancing() is added to libnuma. It is same as numa_set_membind() except that the Linux kernel NUMA balancing will be enabled for the task if the feature is supported by the kernel. At the same time, a new option: --balancing (-b) is added to numactl. Which can be used before the memory policy options in the command line. With it, the Linux kernel NUMA balancing will be enabled for the process if the feature is supported by the kernel. Signed-off-by: "Huang, Ying" --- libnuma.c | 14 ++++++++++++++ numa.3 | 15 +++++++++++++++ numa.h | 4 ++++ numactl.8 | 9 +++++++++ numactl.c | 17 ++++++++++++++--- numaif.h | 3 +++ versions.ldscript | 8 ++++++++ 7 files changed, 67 insertions(+), 3 deletions(-) diff --git a/libnuma.c b/libnuma.c index 88f479b..f073c50 100644 --- a/libnuma.c +++ b/libnuma.c @@ -1064,6 +1064,20 @@ numa_set_membind_v2(struct bitmask *bmp) make_internal_alias(numa_set_membind_v2); +void +numa_set_membind_balancing(struct bitmask *bmp) +{ + /* MPOL_F_NUMA_BALANCING: ignore if unsupported */ + if (set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING, + bmp->maskp, bmp->size + 1) < 0) { + if (errno == EINVAL) { + errno = 0; + numa_set_membind_v2(bmp); + } else + numa_error("set_mempolicy"); + } +} + /* * copy a bitmask map body to a numa.h nodemask_t structure */ diff --git a/numa.3 b/numa.3 index 3e18098..af01c8f 100644 --- a/numa.3 +++ b/numa.3 @@ -80,6 +80,8 @@ numa \- NUMA policy library .br .BI "void numa_set_membind(struct bitmask *" nodemask ); .br +.BI "void numa_set_membind_balancing(struct bitmask *" nodemask ); +.br .B struct bitmask *numa_get_membind(void); .sp .BI "void *numa_alloc_onnode(size_t " size ", int " node ); @@ -538,6 +540,19 @@ that contains nodes other than those in the mask returned by .IR numa_get_mems_allowed () will result in an error. +.BR numa_set_membind_balancing () +sets the memory allocation mask and enable the Linux kernel NUMA +balancing for the task if the feature is supported by the kernel. +The task will only allocate memory from the nodes set in +.IR nodemask . +Passing an empty +.I nodemask +or a +.I nodemask +that contains nodes other than those in the mask returned by +.IR numa_get_mems_allowed () +will result in an error. + .BR numa_get_membind () returns the mask of nodes from which memory can currently be allocated. If the returned mask is equal to diff --git a/numa.h b/numa.h index bd1d676..5d8543a 100644 --- a/numa.h +++ b/numa.h @@ -192,6 +192,10 @@ void numa_set_localalloc(void); /* Only allocate memory from the nodes set in mask. 0 to turn off */ void numa_set_membind(struct bitmask *nodemask); +/* Only allocate memory from the nodes set in mask. Optimize page + placement with Linux kernel NUMA balancing if possible. 0 to turn off */ +void numa_set_membind_balancing(struct bitmask *bmp); + /* Return current membind */ struct bitmask *numa_get_membind(void); diff --git a/numactl.8 b/numactl.8 index f3bb22b..109dd8f 100644 --- a/numactl.8 +++ b/numactl.8 @@ -25,6 +25,8 @@ numactl \- Control NUMA policy for processes or shared memory [ .B \-\-all ] [ +.B \-\-balancing +] [ .B \-\-interleave nodes ] [ .B \-\-preferred node @@ -168,6 +170,9 @@ but if memory cannot be allocated there fall back to other nodes. This option takes only a single node number. Relative notation may be used. .TP +.B \-\-balancing, \-b +Enable Linux kernel NUMA balancing for the process if it is supported by kernel. +.TP .B \-\-show, \-s Show NUMA policy settings of the current process. .TP @@ -278,6 +283,10 @@ numactl \-\-cpunodebind=0 \-\-membind=0,1 -- process -l Run process as above, but with an option (-l) that would be confused with a numactl option. +numactl \-\-cpunodebind=0 \-\-balancing \-\-membind=0,1 process +Run process on node 0 with memory allocated on node 0 and 1. Optimize the +page placement with Linux kernel NUMA balancing mechanism if possible. + numactl \-\-cpunodebind=netdev:eth0 \-\-membind=netdev:eth0 network-server Run network-server on the node of network device eth0 with its memory also in the same node. diff --git a/numactl.c b/numactl.c index df9dbcb..5a9d2df 100644 --- a/numactl.c +++ b/numactl.c @@ -45,6 +45,7 @@ struct option opts[] = { {"membind", 1, 0, 'm'}, {"show", 0, 0, 's' }, {"localalloc", 0,0, 'l'}, + {"balancing", 0, 0, 'b'}, {"hardware", 0,0,'H' }, {"shm", 1, 0, 'S'}, @@ -65,9 +66,10 @@ struct option opts[] = { void usage(void) { fprintf(stderr, - "usage: numactl [--all | -a] [--interleave= | -i ] [--preferred= | -p ]\n" - " [--physcpubind= | -C ] [--cpunodebind= | -N ]\n" - " [--membind= | -m ] [--localalloc | -l] command args ...\n" + "usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i ]\n" + " [--preferred= | -p ] [--physcpubind= | -C ]\n" + " [--cpunodebind= | -N ] [--membind= | -m ]\n" + " [--localalloc | -l] command args ...\n" " numactl [--show | -s]\n" " numactl [--hardware | -H]\n" " numactl [--length | -l ] [--offset | -o ] [--shmmode | -M ]\n" @@ -90,6 +92,8 @@ void usage(void) "all numbers and ranges can be made cpuset-relative with +\n" "the old --cpubind argument is deprecated.\n" "use --cpunodebind or --physcpubind instead\n" + "use --balancing | -b to enable Linux kernel NUMA balancing\n" + "for the process if it is supported by kernel\n" " can have g (GB), m (MB) or k (KB) suffixes\n"); exit(1); } @@ -338,6 +342,7 @@ int do_dump = 0; int shmattached = 0; int did_node_cpu_parse = 0; int parse_all = 0; +int numa_balancing = 0; char *shmoption; void check_cpubind(int flag) @@ -431,6 +436,10 @@ int main(int ac, char **av) nopolicy(); hardware(); exit(0); + case 'b': /* --balancing */ + nopolicy(); + numa_balancing = 1; + break; case 'i': /* --interleave */ checknuma(); if (parse_all) @@ -507,6 +516,8 @@ int main(int ac, char **av) numa_set_bind_policy(1); if (shmfd >= 0) { numa_tonodemask_memory(shmptr, shmlen, mask); + } else if (numa_balancing) { + numa_set_membind_balancing(mask); } else { numa_set_membind(mask); } diff --git a/numaif.h b/numaif.h index 91aa230..32c12c3 100644 --- a/numaif.h +++ b/numaif.h @@ -29,6 +29,9 @@ extern long move_pages(int pid, unsigned long count, #define MPOL_LOCAL 4 #define MPOL_MAX 5 +/* Flags for set_mempolicy, specified in mode */ +#define MPOL_F_NUMA_BALANCING (1 << 13) /* Optimize with NUMA balancing if possible */ + /* Flags for get_mem_policy */ #define MPOL_F_NODE (1<<0) /* return next il node or node of address */ /* Warning: MPOL_F_NODE is unsupported and diff --git a/versions.ldscript b/versions.ldscript index 23074a0..358eeeb 100644 --- a/versions.ldscript +++ b/versions.ldscript @@ -146,3 +146,11 @@ libnuma_1.4 { local: *; } libnuma_1.3; + +# New interface for membind with NUMA balancing optimization +libnuma_1.5 { + global: + numa_set_membind_balancing; + local: + *; +} libnuma_1.4; -- 2.29.2