2014-07-17 23:09:37

by Nishanth Aravamudan

Subject: [RFC 0/2] Memoryless nodes and kworker

[Apologies for the large Cc list, but I believe we have the following
interested parties:

x86 (recently posted memoryless node support)
ia64 (existing memoryless node support)
ppc (existing memoryless node support)
previous discussion of how to solve Anton's issue with slab usage
workqueue contributors/maintainers]

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

While testing memoryless nodes on PowerKVM guests with the patches in
this series, with a guest topology of

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab: 932416 kB
SUnreclaim: 902336 kB

to

Slab: 395264 kB
SUnreclaim: 359424 kB

And we see a corresponding increase in the slab efficiency from

slab                  mem      objs    slabs
                      used     active  active
------------------------------------------------------------
kmalloc-16384         337 MB   11.28%  100.00%
task_struct           288 MB    9.93%  100.00%

to

slab                  mem      objs     slabs
                      used     active   active
------------------------------------------------------------
kmalloc-16384         37 MB    100.00%  100.00%
task_struct           31 MB    100.00%  100.00%

It turns out we see this large slab usage due to using the wrong NUMA
information when creating kthreads.

Two changes are required, one of which is in the workqueue code and one
of which is in the powerpc initialization. Note that ia64 may want to
consider something similar.


2014-07-17 23:10:12

by Nishanth Aravamudan

Subject: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one

In the presence of memoryless nodes, the workqueue code incorrectly uses
cpu_to_node() to determine what node to prefer memory allocations come
from. cpu_to_mem() should be used instead, which will use the nearest
NUMA node with memory.

Signed-off-by: Nishanth Aravamudan <[email protected]>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35974ac..0bba022 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 		for_each_node(node) {
 			if (cpumask_subset(pool->attrs->cpumask,
 					   wq_numa_possible_cpumask[node])) {
-				pool->node = node;
+				/*
+				 * We could use local_memory_node(node) here,
+				 * but it is expensive and the following caches
+				 * the same value.
+				 */
+				pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
 				break;
 			}
 		}
@@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
 			pool->cpu = cpu;
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 			pool->attrs->nice = std_nice[i++];
-			pool->node = cpu_to_node(cpu);
+			pool->node = cpu_to_mem(cpu);

 			/* alloc pool ID */
 			mutex_lock(&wq_pool_mutex);


2014-07-17 23:15:27

by Nishanth Aravamudan

Subject: [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On powerpc, a non-boot CPU's cpu_to_node()/cpu_to_mem() is only accurate
after start_secondary() runs (invoked via smp_init()), similar to the
situation on ia64.

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab: 932416 kB
SUnreclaim: 902336 kB

to

Slab: 395264 kB
SUnreclaim: 359424 kB

And we see a corresponding increase in the slab efficiency from

slab                  mem      objs    slabs
                      used     active  active
------------------------------------------------------------
kmalloc-16384         337 MB   11.28%  100.00%
task_struct           288 MB    9.93%  100.00%

to

slab                  mem      objs     slabs
                      used     active   active
------------------------------------------------------------
kmalloc-16384         37 MB    100.00%  100.00%
task_struct           31 MB    100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption in these kinds of environments.

Signed-off-by: Nishanth Aravamudan <[email protected]>

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 51a3ff7..91ff531 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu]));
 	}

 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +728,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);

-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3b181b2..b1f0b86 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)

 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;

 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)

 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all CPUs,
+	 * even before we online them, so that we can use cpu_to_{node,mem}
+	 * early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)cpu);
+	}
 }

void __init paging_init(void)

2014-07-18 08:11:49

by Lai Jiangshan

Subject: Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one

Hi,

I'm curious what will happen when alloc_pages_node(memoryless_node) is
called.

If the memory is allocated from the most preferable node for
@memoryless_node, why do we need to bother with cpu_to_mem() at the
call site?

If not, why does the memory allocation subsystem refuse to find a
preferable node for @memoryless_node in this case? Is that intentional,
or can it simply not find one in some cases?

Thanks,
Lai

Added CC to Tejun (workqueue maintainer).

On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
> In the presence of memoryless nodes, the workqueue code incorrectly uses
> cpu_to_node() to determine what node to prefer memory allocations come
> from. cpu_to_mem() should be used instead, which will use the nearest
> NUMA node with memory.
>
> Signed-off-by: Nishanth Aravamudan <[email protected]>
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 35974ac..0bba022 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  		for_each_node(node) {
>  			if (cpumask_subset(pool->attrs->cpumask,
>  					   wq_numa_possible_cpumask[node])) {
> -				pool->node = node;
> +				/*
> +				 * We could use local_memory_node(node) here,
> +				 * but it is expensive and the following caches
> +				 * the same value.
> +				 */
> +				pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
>  				break;
>  			}
>  		}
> @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
>  			pool->cpu = cpu;
>  			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
>  			pool->attrs->nice = std_nice[i++];
> -			pool->node = cpu_to_node(cpu);
> +			pool->node = cpu_to_mem(cpu);
>
>  			/* alloc pool ID */
>  			mutex_lock(&wq_pool_mutex);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-07-18 11:20:44

by Tejun Heo

Subject: Re: [RFC 0/2] Memoryless nodes and kworker

On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote:
> [Apologies for the large Cc list, but I believe we have the following
> interested parties:
>
> x86 (recently posted memoryless node support)
> ia64 (existing memoryless node support)
> ppc (existing memoryless node support)
> previous discussion of how to solve Anton's issue with slab usage
> workqueue contributors/maintainers]

Well, you forgot to cc me.

...
> It turns out we see this large slab usage due to using the wrong NUMA
> information when creating kthreads.
>
> Two changes are required, one of which is in the workqueue code and one
> of which is in the powerpc initialization. Note that ia64 may want to
> consider something similar.

Wasn't there a thread on this exact subject a few weeks ago? Was that
someone else? Memory-less node detail leaking out of allocator proper
isn't a good idea. Please allow allocator users to specify the nodes
they're on and let the allocator layer deal with mapping that to
whatever is appropriate. Please don't push that to everybody.

Thanks.

--
tejun

2014-07-18 18:00:15

by Tejun Heo

Subject: Re: [RFC 0/2] Memoryless nodes and kworker

Hello,

On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote:
> So, to be clear, this is not *necessarily* about memoryless nodes. It's
> about the semantics intended. The workqueue code currently calls
> cpu_to_node() in a few places, and passes that node into the core MM as a
> hint about where the memory should come from. However, when memoryless
> nodes are present, that hint is guaranteed to be wrong, as it's the nearest
> NUMA node to the CPU (which happens to be the one its on), not the nearest
> NUMA node with memory. The hint is correctly specified as cpu_to_mem(),

It's telling the allocator the node the CPU is on. Choosing and
falling back the actual allocation is the allocator's job.

> which does the right thing in the presence or absence of memoryless nodes.
> And I think encapsulates the hint's semantics correctly -- please give me
> memory from where I expect it, which is the closest NUMA node.

I don't think it does. It loses information at too high a layer.
Workqueue here doesn't care how memory subsystem is structured, it's
just telling the allocator where it's at and expecting it to do the
right thing. Please consider the following scenario.

A - B - C - D - E

Let's say C is a memory-less node. If we map from C to either B or D
from individual users and that node can't serve that memory request,
the allocator would fall back to A or E respectively when the right
thing to do would be falling back to D or B respectively, right?

This isn't a huge issue but it shows that this is the wrong layer to
deal with this issue. Let the allocators express where they are.
Choosing and falling back belong to the memory allocator. That's the
only place which has all the information that's necessary and those
details must be contained there. Please don't leak it to memory
allocator users.

Thanks.

--
tejun

2014-07-18 18:01:16

by Tejun Heo

Subject: Re: [RFC 0/2] Memoryless nodes and kworker

On Fri, Jul 18, 2014 at 02:00:08PM -0400, Tejun Heo wrote:
> This isn't a huge issue but it shows that this is the wrong layer to
> deal with this issue. Let the allocators express where they are.
>                                 ^
>                                 allocator users
> Choosing and falling back belong to the memory allocator. That's the
> only place which has all the information that's necessary and those
> details must be contained there. Please don't leak it to memory
> allocator users.

--
tejun

2014-07-18 18:19:53

by Tejun Heo

Subject: Re: [RFC 0/2] Memoryless nodes and kworker

Hello,

On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:
> why aren't these callers using kthread_create_on_cpu()? That API was

It is using that. There just are other data structures too.

> already change to use cpu_to_mem() [so one change, rather than of all over
> the kernel source]. We could change it back to cpu_to_node and push down
> the knowledge about the fallback.

And once it's properly solved, please convert back kthread to use
cpu_to_node() too. We really shouldn't be sprinkling the new subtly
different variant across the kernel. It's wrong and confusing.

> Yes, this is a good point. But honestly, we're not really even to the point
> of talking about fallback here, at least in my testing, going off-node at
> all causes SLUB-configured slabs to deactivate, which then leads to an
> explosion in the unreclaimable slab.

I don't think moving the logic inside allocator proper is a huge
amount of work and this isn't the first spillage of this subtlety out
of allocator proper. Fortunately, it hasn't spread too much yet.
Let's please stop it here. I'm not saying you shouldn't or can't fix
the off-node allocation.

Thanks.

--
tejun

2014-07-18 18:58:35

by Tejun Heo

Subject: Re: [RFC 0/2] Memoryless nodes and kworker

Hello,

On Fri, Jul 18, 2014 at 11:47:08AM -0700, Nish Aravamudan wrote:
> Why are any callers of the format kthread_create_on_node(...,
> cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

Ah, okay, that's because unbound workers are NUMA node affine, not
CPU.

> It seems like an additional reasonable approach would be to provide a
> suitable _cpu() API for the allocators. I'm not sure why saying that
> callers should know about NUMA (in order to call cpu_to_node() in every
> caller) is any better than saying that callers should know about memoryless
> nodes (in order to call cpu_to_mem() in every caller instead) -- when at

It is better because that's what they want to express - "I'm on this
memory node, please allocate memory on or close to this one". That's
what the caller cares about. Calling with cpu could be an option but
you'll eventually run into cases where you end up having to map back
NUMA node id to a CPU on it, which will probably feel at least a bit
silly. There are things which really are per-NUMA node.

So, let's please express what needs to be expressed. Massaging around
it can be useful at times but that doesn't seem to be the case here.

Thanks.

--
tejun