2008-03-05 22:42:47

by Joe Korty

Subject: [PATCH] NUMA slab allocator migration bugfix

NUMA slab allocator cpu migration bugfix

The NUMA slab allocator (specifically, cache_alloc_refill)
does not refresh its local copies of which cpu and which
numa node it is on when it drops and reacquires the irq
block that it inherited from its caller. As a result,
those values become stale if the process was migrated
to another numa node while the irq block was dropped.

The solution is to make cache_alloc_refill reload these
variables whenever it drops and reacquires the irq block.
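
For context, here is a simplified sketch of the refill slow path
(illustrative only, not a verbatim excerpt from mm/slab.c;
cache_grow() re-enables local interrupts around a page allocation
that may sleep):

/*
 * Illustrative sketch of the pre-patch refill slow path.  cache_grow()
 * may do local_irq_enable()/local_irq_disable() around a sleeping page
 * allocation, so the task can be migrated to another cpu -- possibly
 * on another numa node -- inside that window.
 */
static void *refill_sketch(struct kmem_cache *cachep, gfp_t flags)
{
	int node = numa_node_id();	/* sampled once, with irqs off */
	struct array_cache *ac = cpu_cache_get(cachep);
retry:
	/* ... take cachep->nodelists[node]->list_lock, pull objects ... */

	if (!ac->avail) {
		cache_grow(cachep, flags, node, NULL);	/* may reenable irqs */
		ac = cpu_cache_get(cachep);	/* ac is re-read here ... */
		goto retry;			/* ... but 'node' is not, so a stale
						 * node's list_lock is taken on the
						 * next pass and
						 * check_spinlock_acquired() BUGs */
	}
	return NULL;
}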

The error is very difficult to hit. When it does occur,
one gets the following oops + stack traceback bits in
check_spinlock_acquired:

kernel BUG at mm/slab.c:2417
cache_alloc_refill+0xe6
kmem_cache_alloc+0xd0
...
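
For reference, the check that fires is essentially the following
(paraphrased from that era's mm/slab.c, not a verbatim excerpt): the
per-node list_lock we hold must belong to the node we are currently
running on.

static void check_spinlock_acquired(struct kmem_cache *cachep)
{
#ifdef CONFIG_SMP
	check_irq_off();
	/* BUGs if we hold some other node's list_lock */
	assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock);
#endif
}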

This patch was developed against 2.6.23, then ported to,
and compile-tested only against, 2.6.25-rc4.

Signed-off-by: Joe Korty <[email protected]>

Index: 2.6.25-rc4/mm/slab.c
===================================================================
--- 2.6.25-rc4.orig/mm/slab.c	2008-03-05 16:07:56.000000000 -0500
+++ 2.6.25-rc4/mm/slab.c	2008-03-05 16:17:47.000000000 -0500
@@ -2964,11 +2964,10 @@
 	struct array_cache *ac;
 	int node;
 
-	node = numa_node_id();
-
+retry:
 	check_irq_off();
+	node = numa_node_id();
 	ac = cpu_cache_get(cachep);
-retry:
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
 		/*


2008-03-05 23:08:31

by Christoph Lameter

Subject: Re: [PATCH] NUMA slab allocator migration bugfix

On Wed, 5 Mar 2008, Joe Korty wrote:

> The NUMA slab allocator (specifically, cache_alloc_refill)
> does not refresh its local copies of which cpu and which
> numa node it is on when it drops and reacquires the irq
> block that it inherited from its caller. As a result,
> those values become stale if the process was migrated
> to another numa node while the irq block was dropped.

The new slab is allocated for the node that was determined earlier and
entered into the slab queues for that node. However, during the alloc we
were rescheduled.

Then we find ourselves on another processor and recalculate the ac
pointer. If we now retry, there is the danger of getting off-node
objects into the per-cpu queue, which may cause the wrong lock to be
taken when draining queues. Sucks because it can cause data corruption.
Same as the other issues resolved by GFP_THISNODE.
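
To make the drain hazard concrete, here is an illustrative sketch of
the flush path (again not verbatim mm/slab.c; names such as
cache_flusharray, kmem_list3, list_lock and free_block follow that
era's slab code, but the details below are from memory):

/*
 * Illustrative sketch: the per-cpu array is drained back under the
 * list_lock of the node we are on *now*, on the assumption that every
 * object in it came from that node's slabs.
 */
static void cache_flusharray_sketch(struct kmem_cache *cachep,
				    struct array_cache *ac)
{
	int node = numa_node_id();		/* node we are on now */
	struct kmem_list3 *l3 = cachep->nodelists[node];

	spin_lock(&l3->list_lock);
	/*
	 * free_block() returns each object to its slab and moves that
	 * slab between the node's partial/full/free lists.  If the
	 * refill path queued off-node objects into ac (the bug fixed
	 * above), those slabs hang off a different node's kmem_list3,
	 * so they get manipulated under the wrong list_lock -- the
	 * data corruption described above.
	 */
	free_block(cachep, ac->entry, ac->batchcount, node);
	spin_unlock(&l3->list_lock);
}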

Acked-by: Christoph Lameter <[email protected]>

Will queue it.