2013-04-24 23:56:30

by Andi Kleen

Subject: [PATCH] Add a sysctl for numa_balancing.

From: Andi Kleen <[email protected]>

As discussed earlier, this adds a working sysctl to enable/disable
automatic numa memory balancing at runtime.

This was possible earlier through debugfs, but only with special
debugging options set. Also fix the boot message.

Signed-off-by: Andi Kleen <[email protected]>
---
Documentation/sysctl/kernel.txt | 10 ++++++++++
include/linux/sched/sysctl.h | 4 ++++
kernel/sched/core.c | 24 +++++++++++++++++++++++-
kernel/sysctl.c | 11 +++++++++++
mm/mempolicy.c | 2 +-
5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..17a7004 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,16 @@ utilize.

==============================================================

+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+TBD someone document the other numa_balancing tunables
+
+==============================================================
+
osrelease, ostype & version:

# cat osrelease
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..e228a1b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -101,4 +101,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+extern int sched_numa_balancing(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
#endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..679be74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,29 @@ void set_numabalancing_state(bool enabled)
numabalancing_enabled = enabled;
}
#endif /* CONFIG_SCHED_DEBUG */
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_PROC_SYSCTL
+int sched_numa_balancing(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ int err;
+ int state = numabalancing_enabled;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ t = *table;
+ t.data = &state;
+ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+ if (write)
+ set_numabalancing_state(state);
+ return err;
+}
+#endif
+#endif

/*
* fork()/clone()-time setup:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..94164ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "numa_balancing",
+ .data = NULL, /* filled in by handler */
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_numa_balancing,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+
+
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..7eee646 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,7 +2531,7 @@ static void __init check_numabalancing_enable(void)

if (nr_node_ids > 1 && !numabalancing_override) {
printk(KERN_INFO "Enabling automatic NUMA balancing. "
- "Configure with numa_balancing= or sysctl");
+ "Configure with numa_balancing= or the kernel.numa_balancing sysctl");
set_numabalancing_state(numabalancing_default);
}
}
--
1.7.7.6
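
The handler in the patch above works on a local copy: it points a temporary ctl_table at a stack variable, lets proc_dointvec_minmax() bound the value by extra1/extra2 (zero and one), and only then commits the result through set_numabalancing_state(). A minimal user-space sketch of that validate-then-commit pattern follows. The function name model_numa_balancing_write() and the `privileged` flag are hypothetical stand-ins for illustration, not kernel API, and the parsing is simplified compared with what proc_dointvec_minmax() really accepts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the kernel's numabalancing_enabled flag (user-space model only). */
static bool numabalancing_enabled = true;

/*
 * Hypothetical model of a write to the sysctl: parse into a local value,
 * range-check it against [0, 1] (the role extra1/extra2 play for
 * proc_dointvec_minmax), and commit to the global flag only on success.
 */
static int model_numa_balancing_write(const char *input, bool privileged)
{
	char *end;
	long state;

	assert(input != NULL);

	if (!privileged)
		return -EPERM;	/* mirrors the capable(CAP_SYS_ADMIN) check */

	errno = 0;
	state = strtol(input, &end, 10);
	if (errno != 0 || end == input)
		return -EINVAL;	/* not a number, or out of range for long */
	if (*end != '\0' && *end != '\n')
		return -EINVAL;	/* trailing garbage (simplified rule) */
	if (state < 0 || state > 1)
		return -EINVAL;	/* outside the zero..one bounds */

	numabalancing_enabled = (state != 0);	/* commit only after validation */
	return 0;
}
```

Because the global flag is written only after validation succeeds, a rejected write such as "2" leaves the previous state untouched, which is the point of copying the table entry in the real handler.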


2013-04-25 00:14:14

by Will Huck

Subject: Re: [PATCH] Add a sysctl for numa_balancing.

On 04/25/2013 07:56 AM, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> As discussed earlier, this adds a working sysctl to enable/disable
> automatic numa memory balancing at runtime.
>
> This was possible earlier through debugfs, but only with special
> debugging options set. Also fix the boot message.

One off-topic question:

If I configure a UMA machine to use fake NUMA, is there any benefit or downside?


2013-04-29 08:41:27

by Mel Gorman

Subject: Re: [PATCH] Add a sysctl for numa_balancing.

On Wed, Apr 24, 2013 at 04:56:24PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> As discussed earlier, this adds a working sysctl to enable/disable
> automatic numa memory balancing at runtime.
>
> This was possible earlier through debugfs, but only with special
> debugging options set. Also fix the boot message.
>
> Signed-off-by: Andi Kleen <[email protected]>

Acked-by: Mel Gorman <[email protected]>

Would you like to merge the following patch with it to remove the TBD?

---8<---
mm: numa: Document remaining automatic NUMA balancing sysctls

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 62 +++++++++++++++++++++++++++++++++++++----
1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 17a7004..4d56060 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -356,11 +356,63 @@ utilize.

numa_balancing

-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
-
-TBD someone document the other numa_balancing tunables
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes, which minimises the performance impact of remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.

==============================================================
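
The relationship described in the documentation above (scan rate rises as "scan delay" falls or "scan size" grows, with the delay bounded by the min and max period sysctls) can be sketched as simple arithmetic. This is an illustrative model only. The helper names are hypothetical, the figures in the usage note are made-up inputs rather than kernel defaults, and the kernel's actual adaptation logic is not shown.

```c
#include <assert.h>

/*
 * Bound a task's adaptive scan delay to the window set by
 * numa_balancing_scan_period_min_ms and numa_balancing_scan_period_max_ms.
 */
static unsigned int clamp_scan_delay_ms(unsigned int delay_ms,
					unsigned int min_ms,
					unsigned int max_ms)
{
	assert(min_ms > 0 && min_ms <= max_ms);

	if (delay_ms < min_ms)
		return min_ms;
	if (delay_ms > max_ms)
		return max_ms;
	return delay_ms;
}

/*
 * Approximate amount of address space scanned per second for one task,
 * in MB (integer division, so the result is rounded down).
 */
static unsigned int scan_rate_mb_per_sec(unsigned int scan_size_mb,
					 unsigned int delay_ms)
{
	assert(delay_ms > 0);

	return scan_size_mb * 1000u / delay_ms;
}
```

For example, with an assumed scan size of 256 MB, a minimum period of 100 ms caps the rate at scan_rate_mb_per_sec(256, 100), i.e. 2560 MB/s per task, while a maximum period of 5000 ms puts the floor at roughly 51 MB/s.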

2013-04-29 20:32:21

by David Rientjes

Subject: Re: [PATCH] Add a sysctl for numa_balancing.

On Mon, 29 Apr 2013, Mel Gorman wrote:

> On Wed, Apr 24, 2013 at 04:56:24PM -0700, Andi Kleen wrote:
> > From: Andi Kleen <[email protected]>
> >
> > As discussed earlier, this adds a working sysctl to enable/disable
> > automatic numa memory balancing at runtime.
> >
> > This was possible earlier through debugfs, but only with special
> > debugging options set. Also fix the boot message.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
>

Acked-by: David Rientjes <[email protected]>

> Would you like to merge the following patch with it to remove the TBD?
>
> ---8<---
> mm: numa: Document remaining automatic NUMA balancing sysctls
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: David Rientjes <[email protected]>