2006-02-03 20:53:50

by Ravikiran G Thirumalai

Subject: [patch 0/3] NUMA slab locking fixes

There were some locking oddities in the NUMA slab allocator which broke cpu
hotplug. Here is a patchset to correct the locking and also fix the cpu
hotplug breakage Sonny Rao noticed on POWER5. We hit the bug on POWER, but
not on other arches, because other arches failed to update the node cpumasks
when a cpu went down.

Following patches are included in the series:

1. slab-colour-next fix: Moves the slab colouring index from kmem_cache_t
to kmem_list3. Per-node cache colouring makes sense, and it also lets us
avoid taking the cachep->spinlock in the alloc path (cache_grow). Alloc and
free now need not take cachep->spinlock; they just take the respective
kmem_list3 locks.

2. slab-locking-irq-optimization: With patch 1, we don't need to disable
on-chip interrupts while taking the cachep->spinlock. (We needed to disable
interrupts while taking cachep->spinlock earlier because alloc can happen
from interrupt context, and we had to take the cachep->spinlock to protect
colour_next -- which is not the case now.)

3. slab-hotplug-fix: Fix the cpu_down and cpu_up slab locking. When the last
cpu of a node goes down, the cpu_down path used to free the shared
array_cache, alien array_cache and l3 with the kmem_list3 lock held, which
resulted in badness. This patch fixes the locking so that no longer happens.
We also no longer free l3 on cpu_down (because another cpu could still
request memory from the node going down while its last cpu is going down).
This fixes the problem reported by Sonny Rao.

2.6.16-rc1mm4 with the above patches ran successfully overnight on a 4-cpu,
2-node AMD Tyan box, with 4 dbench processes running all the time and cpus
offlined and onlined in an infinite loop.

Thanks,
Kiran


2006-02-03 20:55:08

by Ravikiran G Thirumalai

Subject: [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix

colour_next is used as an index to add a colouring offset to a new slab
in the cache (colour_off * colour_next). Now with the NUMA aware slab
allocator, it makes sense to colour slabs added on the same node
sequentially with colour_next.

This patch moves the colouring index "colour_next" per-node by placing it
on kmem_list3 rather than kmem_cache.

This also avoids taking the cachep->spinlock on the alloc path at
cache_grow.
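
For illustration, here is a small userspace sketch (not kernel code; the
names only mirror mm/slab.c) of the per-node colouring cycle this gives us --
each node advances its own colour_next, so slabs added on one node get
sequential colour offsets regardless of what the other nodes are doing:

#include <stdio.h>

#define NR_NODES 2

struct cache { unsigned int colour; unsigned int colour_off; };
struct list3 { unsigned int colour_next; };	/* per-node, as in the patch */

static unsigned int next_offset(struct cache *c, struct list3 *l3)
{
	unsigned int offset = l3->colour_next++;
	if (l3->colour_next >= c->colour)
		l3->colour_next = 0;
	return offset * c->colour_off;
}

int main(void)
{
	struct cache c = { .colour = 3, .colour_off = 64 };
	struct list3 nodes[NR_NODES] = { { 0 }, { 0 } };
	int i;

	/* Grow three slabs on node 0, then two on node 1. */
	for (i = 0; i < 3; i++)
		printf("node 0 slab %d: offset %u\n", i, next_offset(&c, &nodes[0]));
	for (i = 0; i < 2; i++)
		printf("node 1 slab %d: offset %u\n", i, next_offset(&c, &nodes[1]));
	return 0;
}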

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c 2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c 2006-01-29 20:23:28.000000000 -0800
@@ -294,6 +294,7 @@ struct kmem_list3 {
unsigned long next_reap;
int free_touched;
unsigned int free_limit;
+ unsigned int colour_next; /* Per-node cache coloring */
spinlock_t list_lock;
struct array_cache *shared; /* shared per node */
struct array_cache **alien; /* on other nodes */
@@ -344,6 +345,7 @@ static void kmem_list3_init(struct kmem_
INIT_LIST_HEAD(&parent->slabs_free);
parent->shared = NULL;
parent->alien = NULL;
+ parent->colour_next = 0;
spin_lock_init(&parent->list_lock);
parent->free_objects = 0;
parent->free_touched = 0;
@@ -390,7 +392,6 @@ struct kmem_cache {

size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
- unsigned int colour_next; /* cache colouring */
struct kmem_cache *slabp_cache;
unsigned int slab_size;
unsigned int dflags; /* dynamic flags */
@@ -1127,7 +1128,6 @@ void __init kmem_cache_init(void)
BUG();

cache_cache.colour = left_over / cache_cache.colour_off;
- cache_cache.colour_next = 0;
cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
sizeof(struct slab), cache_line_size());

@@ -2334,18 +2334,19 @@ static int cache_grow(struct kmem_cache
*/
ctor_flags |= SLAB_CTOR_ATOMIC;

- /* About to mess with non-constant members - lock. */
+ /* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
- spin_lock(&cachep->spinlock);
+ l3 = cachep->nodelists[nodeid];
+ spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
- offset = cachep->colour_next;
- cachep->colour_next++;
- if (cachep->colour_next >= cachep->colour)
- cachep->colour_next = 0;
- offset *= cachep->colour_off;
+ offset = l3->colour_next;
+ l3->colour_next++;
+ if (l3->colour_next >= cachep->colour)
+ l3->colour_next = 0;
+ spin_unlock(&l3->list_lock);

- spin_unlock(&cachep->spinlock);
+ offset *= cachep->colour_off;

check_irq_off();
if (local_flags & __GFP_WAIT)
@@ -2377,7 +2378,6 @@ static int cache_grow(struct kmem_cache
if (local_flags & __GFP_WAIT)
local_irq_disable();
check_irq_off();
- l3 = cachep->nodelists[nodeid];
spin_lock(&l3->list_lock);

/* Make slab active. */

2006-02-03 20:56:08

by Ravikiran G Thirumalai

Subject: [patch 2/3] NUMA slab locking fixes -- slab locking irq optimizations

Earlier, we had to disable on-chip interrupts while taking the
cachep->spinlock because, at cache_grow, on every addition of a slab to a
slab cache, we incremented colour_next, which was protected by the
cachep->spinlock, and cache_grow can occur in interrupt context. Since we now
protect the per-node colour_next with the node's list_lock, we no longer need
to disable interrupts while taking the per-cache spinlock; we just need to
disable interrupts when taking the per-node kmem_list3 list_lock.
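
To put it another way, the rule after this patch is roughly the following
(an illustrative kernel-style sketch only -- it assumes the structure
definitions local to mm/slab.c and is not part of the patch):

static void locking_rule_sketch(struct kmem_cache *cachep,
				struct kmem_list3 *l3)
{
	/* Cache-wide state: interrupts may stay enabled. */
	spin_lock(&cachep->spinlock);
	/* ... touch cache-wide fields ... */
	spin_unlock(&cachep->spinlock);

	/*
	 * The per-node slab lists and array_caches are reachable from
	 * interrupt context (alloc/free can happen there), so interrupts
	 * must be off while list_lock is held.
	 */
	spin_lock_irq(&l3->list_lock);
	/* ... touch l3->slabs_partial, l3->shared, ... */
	spin_unlock_irq(&l3->list_lock);
}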

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c 2006-01-29 20:23:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c 2006-01-29 23:13:14.000000000 -0800
@@ -995,7 +995,7 @@ static int __devinit cpuup_callback(stru
cpumask_t mask;

mask = node_to_cpumask(node);
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
/* cpu is dead; no one can alloc from it. */
nc = cachep->array[cpu];
cachep->array[cpu] = NULL;
@@ -1004,7 +1004,7 @@ static int __devinit cpuup_callback(stru
if (!l3)
goto unlock_cache;

- spin_lock(&l3->list_lock);
+ spin_lock_irq(&l3->list_lock);

/* Free limit for this kmem_list3 */
l3->free_limit -= cachep->batchcount;
@@ -1012,7 +1012,7 @@ static int __devinit cpuup_callback(stru
free_block(cachep, nc->entry, nc->avail, node);

if (!cpus_empty(mask)) {
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
goto unlock_cache;
}

@@ -1031,13 +1031,13 @@ static int __devinit cpuup_callback(stru
/* free slabs belonging to this node */
if (__node_shrink(cachep, node)) {
cachep->nodelists[node] = NULL;
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
kfree(l3);
} else {
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
}
unlock_cache:
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
kfree(nc);
}
mutex_unlock(&cache_chain_mutex);
@@ -2021,18 +2021,18 @@ static void drain_cpu_caches(struct kmem

smp_call_function_all_cpus(do_drain, cachep);
check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
for_each_online_node(node) {
l3 = cachep->nodelists[node];
if (l3) {
- spin_lock(&l3->list_lock);
+ spin_lock_irq(&l3->list_lock);
drain_array_locked(cachep, l3->shared, 1, node);
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
if (l3->alien)
drain_alien_cache(cachep, l3);
}
}
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
}

static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -2348,7 +2348,6 @@ static int cache_grow(struct kmem_cache

offset *= cachep->colour_off;

- check_irq_off();
if (local_flags & __GFP_WAIT)
local_irq_enable();

@@ -2744,6 +2743,7 @@ static void *__cache_alloc_node(struct k
BUG_ON(!l3);

retry:
+ check_irq_off();
spin_lock(&l3->list_lock);
entry = l3->slabs_partial.next;
if (entry == &l3->slabs_partial) {
@@ -3323,11 +3323,11 @@ static int do_tune_cpucache(struct kmem_
smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
cachep->shared = shared;
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);

for_each_online_cpu(i) {
struct array_cache *ccold = new.new[i];
@@ -3584,8 +3584,7 @@ static int s_show(struct seq_file *m, vo
int node;
struct kmem_list3 *l3;

- check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
active_objs = 0;
num_slabs = 0;
for_each_online_node(node) {
@@ -3593,7 +3592,8 @@ static int s_show(struct seq_file *m, vo
if (!l3)
continue;

- spin_lock(&l3->list_lock);
+ check_irq_on();
+ spin_lock_irq(&l3->list_lock);

list_for_each(q, &l3->slabs_full) {
slabp = list_entry(q, struct slab, list);
@@ -3620,7 +3620,7 @@ static int s_show(struct seq_file *m, vo
free_objects += l3->free_objects;
shared_avail += l3->shared->avail;

- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
}
num_slabs += active_slabs;
num_objs = num_slabs * cachep->num;
@@ -3670,7 +3670,7 @@ static int s_show(struct seq_file *m, vo
shrinker_stat_read(cachep->shrinker, nr_freed));
}
seq_putc(m, '\n');
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
return 0;
}

@@ -3702,10 +3702,10 @@ static void do_dump_slabp(kmem_cache_t *
int node;

check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
for_each_online_node(node) {
struct kmem_list3 *rl3 = cachep->nodelists[node];
- spin_lock(&rl3->list_lock);
+ spin_lock_irq(&rl3->list_lock);

list_for_each(q, &rl3->slabs_full) {
int i;
@@ -3719,9 +3719,9 @@ static void do_dump_slabp(kmem_cache_t *
printk("\n");
}
}
- spin_unlock(&rl3->list_lock);
+ spin_unlock_irq(&rl3->list_lock);
}
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
#endif
}

2006-02-03 20:57:44

by Ravikiran G Thirumalai

Subject: [patch 3/3] NUMA slab locking fixes -- slab cpu hotplug fix

This fixes locking and bugs in the cpu_down and cpu_up paths of the NUMA slab
allocator. Sonny Rao <[email protected]> reported problems some time back
on POWER5 boxes when the last cpu on a node was being offlined. We could
not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down. Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem earlier was that on CPU_DOWN, if it was the last cpu on the node
to go down, the array_caches (shared, alien) and the kmem_list3 of the node
were being freed (kfree) with the kmem_list3 lock held. If the l3 or the
array_caches came from the same cache being cleared, we hit badness.

This patch cleans up the locking in the cpu_up and cpu_down paths.
We cannot really free l3 on cpu down because there is no node offlining yet,
and even though a cpu is not yet up, node-local memory can be allocated
for it. So l3s are usually allocated at kmem_cache_create and destroyed at
kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get
to cachep->nodelists[nodeid] either.
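
The shape of the fix, as a small userspace analogy (pthreads instead of the
kernel's spinlocks and free() standing in for kfree(); purely illustrative):
detach the pointers while holding the node lock, and free them only after
the lock has been dropped, so the free path can never run under the lock of
the very cache it may recurse into.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

struct node_list {
	pthread_mutex_t lock;
	void *shared;
	void *alien;
};

static void cpu_dead_cleanup(struct node_list *l3)
{
	void *shared, *alien;

	pthread_mutex_lock(&l3->lock);
	shared = l3->shared;		/* detach under the lock ...        */
	l3->shared = NULL;
	alien = l3->alien;
	l3->alien = NULL;
	pthread_mutex_unlock(&l3->lock);

	free(shared);			/* ... free only after dropping it. */
	free(alien);
}

int main(void)
{
	struct node_list l3;

	pthread_mutex_init(&l3.lock, NULL);
	l3.shared = malloc(32);
	l3.alien = malloc(32);
	cpu_dead_cleanup(&l3);
	printf("shared=%p alien=%p\n", l3.shared, l3.alien);
	pthread_mutex_destroy(&l3.lock);
	return 0;
}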

The patch survived onlining and offlining on a 4-core, 2-node Tyan box with
4 dbench processes running all the time.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c 2006-02-01 18:04:21.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c 2006-02-01 23:44:25.000000000 -0800
@@ -892,14 +892,14 @@ static void __drain_alien_cache(struct k
}
}

-static void drain_alien_cache(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien)
{
int i = 0;
struct array_cache *ac;
unsigned long flags;

for_each_online_node(i) {
- ac = l3->alien[i];
+ ac = alien[i];
if (ac) {
spin_lock_irqsave(&ac->lock, flags);
__drain_alien_cache(cachep, ac, i);
@@ -910,7 +910,7 @@ static void drain_alien_cache(struct kme
#else
#define alloc_alien_cache(node, limit) do { } while (0)
#define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
#endif

static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -944,6 +944,9 @@ static int __devinit cpuup_callback(stru
l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;

+ /* The l3s don't come and go as cpus come and
+ go. cache_chain_mutex is sufficient
+ protection here */
cachep->nodelists[node] = l3;
}

@@ -957,27 +960,39 @@ static int __devinit cpuup_callback(stru
/* Now we can go ahead with allocating the shared array's
& array cache's */
list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc;
+ struct array_cache *nc, *shared, **alien;

- nc = alloc_arraycache(node, cachep->limit,
- cachep->batchcount);
- if (!nc)
+ if (!(nc = alloc_arraycache(node,
+ cachep->limit, cachep->batchcount)))
goto bad;
+ if (!(shared = alloc_arraycache(node,
+ cachep->shared*cachep->batchcount, 0xbaadf00d)))
+ goto bad;
+#ifdef CONFIG_NUMA
+ if (!(alien = alloc_alien_cache(node,
+ cachep->limit)))
+ goto bad;
+#endif
cachep->array[cpu] = nc;

l3 = cachep->nodelists[node];
BUG_ON(!l3);
- if (!l3->shared) {
- if (!(nc = alloc_arraycache(node,
- cachep->shared *
- cachep->batchcount,
- 0xbaadf00d)))
- goto bad;

+ spin_lock_irq(&l3->list_lock);
+ if (!l3->shared) {
/* we are serialised from CPU_DEAD or
CPU_UP_CANCELLED by the cpucontrol lock */
- l3->shared = nc;
+ l3->shared = shared;
+ shared = NULL;
+ }
+ if (!l3->alien) {
+ l3->alien = alien;
+ alien = NULL;
}
+ spin_unlock_irq(&l3->list_lock);
+
+ kfree(shared);
+ free_alien_cache(alien);
}
mutex_unlock(&cache_chain_mutex);
break;
@@ -986,23 +1001,29 @@ static int __devinit cpuup_callback(stru
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_DEAD:
+ /* Even if all the cpus of a node are down, we don't
+ * free the kmem_list3 of any cache. This to avoid a race
+ * between cpu_down, and a kmalloc allocation from another
+ * cpu for memory from the node of the cpu going down.
+ * The list3 structure is usually allocated from
+ * kmem_cache_create and gets destroyed at kmem_cache_destroy
+ */
/* fall thru */
case CPU_UP_CANCELED:
mutex_lock(&cache_chain_mutex);

list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc;
+ struct array_cache *nc, *shared, **alien;
cpumask_t mask;

mask = node_to_cpumask(node);
- spin_lock(&cachep->spinlock);
/* cpu is dead; no one can alloc from it. */
nc = cachep->array[cpu];
cachep->array[cpu] = NULL;
l3 = cachep->nodelists[node];

if (!l3)
- goto unlock_cache;
+ goto free_array_cache;

spin_lock_irq(&l3->list_lock);

@@ -1013,33 +1034,43 @@ static int __devinit cpuup_callback(stru

if (!cpus_empty(mask)) {
spin_unlock_irq(&l3->list_lock);
- goto unlock_cache;
+ goto free_array_cache;
}

- if (l3->shared) {
+ if ((shared = l3->shared)) {
free_block(cachep, l3->shared->entry,
l3->shared->avail, node);
- kfree(l3->shared);
l3->shared = NULL;
}
- if (l3->alien) {
- drain_alien_cache(cachep, l3);
- free_alien_cache(l3->alien);
- l3->alien = NULL;
- }

- /* free slabs belonging to this node */
- if (__node_shrink(cachep, node)) {
- cachep->nodelists[node] = NULL;
- spin_unlock_irq(&l3->list_lock);
- kfree(l3);
- } else {
- spin_unlock_irq(&l3->list_lock);
+ alien = l3->alien;
+ l3->alien = NULL;
+
+ spin_unlock_irq(&l3->list_lock);
+
+ kfree(shared);
+ if (alien) {
+ drain_alien_cache(cachep, alien);
+ free_alien_cache(alien);
}
- unlock_cache:
- spin_unlock(&cachep->spinlock);
+ free_array_cache:
kfree(nc);
}
+ /*
+ * In the previous loop, all the objects were freed to
+ * the respective cache's slabs, now we can go ahead and
+ * shrink each nodelist to its limit.
+ */
+ list_for_each_entry(cachep, &cache_chain, next) {
+
+ l3 = cachep->nodelists[node];
+ if (!l3)
+ continue;
+ spin_lock_irq(&l3->list_lock);
+ /* free slabs belonging to this node */
+ __node_shrink(cachep, node);
+ spin_unlock_irq(&l3->list_lock);
+ }
mutex_unlock(&cache_chain_mutex);
break;
#endif
@@ -2021,7 +2052,6 @@ static void drain_cpu_caches(struct kmem

smp_call_function_all_cpus(do_drain, cachep);
check_irq_on();
- spin_lock(&cachep->spinlock);
for_each_online_node(node) {
l3 = cachep->nodelists[node];
if (l3) {
@@ -2029,10 +2059,9 @@ static void drain_cpu_caches(struct kmem
drain_array_locked(cachep, l3->shared, 1, node);
spin_unlock_irq(&l3->list_lock);
if (l3->alien)
- drain_alien_cache(cachep, l3);
+ drain_alien_cache(cachep, l3->alien);
}
}
- spin_unlock(&cachep->spinlock);
}

static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -3459,7 +3488,7 @@ static void cache_reap(void *unused)

l3 = searchp->nodelists[numa_node_id()];
if (l3->alien)
- drain_alien_cache(searchp, l3);
+ drain_alien_cache(searchp, l3->alien);
spin_lock_irq(&l3->list_lock);

drain_array_locked(searchp, cpu_cache_get(searchp), 0,
@@ -3618,7 +3647,8 @@ static int s_show(struct seq_file *m, vo
num_slabs++;
}
free_objects += l3->free_objects;
- shared_avail += l3->shared->avail;
+ if (l3->shared)
+ shared_avail += l3->shared->avail;

spin_unlock_irq(&l3->list_lock);
}
@@ -3702,7 +3732,6 @@ static void do_dump_slabp(kmem_cache_t *
int node;

check_irq_on();
- spin_lock(&cachep->spinlock);
for_each_online_node(node) {
struct kmem_list3 *rl3 = cachep->nodelists[node];
spin_lock_irq(&rl3->list_lock);
@@ -3721,7 +3750,6 @@ static void do_dump_slabp(kmem_cache_t *
}
spin_unlock_irq(&rl3->list_lock);
}
- spin_unlock(&cachep->spinlock);
#endif
}

2006-02-03 22:06:15

by Andrew Morton

Subject: Re: [patch 0/3] NUMA slab locking fixes

Ravikiran G Thirumalai <[email protected]> wrote:
>
> There were some locking oddities in the NUMA slab allocator which broke cpu
> hotplug. Here is a patchset to correct locking and also fix the
> breakage with cpu hotplug Sonny Rao noticed on POWER5.

These patches are on rc1-mm4, which unfortunately contains quite a lot of
-mm-only debugging stuff in slab.c.

Could you please redo/retest these patches so that they apply on
2.6.16-rc2. Also, please arrange for any bugfixes to be staged ahead of
any optimisations - the optimisations we can defer until 2.6.17.

2006-02-03 23:06:40

by Christoph Lameter

Subject: Re: [patch 0/3] NUMA slab locking fixes

On Fri, 3 Feb 2006, Andrew Morton wrote:

> Could you please redo/retest these patches so that they apply on
> 2.6.16-rc2. Also, please arrange for any bugfixes to be staged ahead of
> any optimisations - the optimisations we can defer until 2.6.17.

It seems that these two things are tightly connected. By changing the
locking he was able to address the hotplug breakage.

2006-02-04 01:09:15

by Ravikiran G Thirumalai

Subject: Re: [patch 0/3] NUMA slab locking fixes

On Fri, Feb 03, 2006 at 03:06:16PM -0800, Christoph Lameter wrote:
> On Fri, 3 Feb 2006, Andrew Morton wrote:
>
> > Could you please redo/retest these patches so that they apply on
> > 2.6.16-rc2. Also, please arrange for any bugfixes to be staged ahead of
> > any optimisations - the optimisations we can defer until 2.6.17.
>
> It seems that these two things are tightly connected. By changing the
> locking he was able to address the hotplug breakage.

Yes, they are. The cachep->spinlock goes away in many places due to the
colour_next patch and keeping the l3 around (we should not have called that
an optimization, I guess :)), and the irq disabling had to be moved to the
l3 lock anyway. Maybe I could have done away with patch 2 (merged it with
patch 3), but I thought keeping them separate made it more readable...

Patchset against rc2 follows.

Thanks,
Kiran

PS: Many hotplug fixes are yet to be applied upstream I think. I know
these slab patches work well with mm4 (with the x86_64 subtree and hotplug
fixes in there) but I cannot really test cpu_up after cpu_down on rc2
because it is broken as of now (pending merges, I guess).

2006-02-04 01:15:31

by Ravikiran G Thirumalai

Subject: [patch 1/3] NUMA slab locking fixes -- move color_next to l3

colour_next is used as an index to add a colouring offset to a new slab
in the cache (colour_off * colour_next). Now with the NUMA aware slab
allocator, it makes sense to colour slabs added on the same node
sequentially with colour_next.

This patch moves the colouring index "colour_next" per-node by placing it
on kmem_list3 rather than kmem_cache.

This also helps simplify locking for the cpu up and down paths.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c 2006-02-03 14:31:01.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c 2006-02-03 14:35:21.000000000 -0800
@@ -294,6 +294,7 @@ struct kmem_list3 {
unsigned long next_reap;
int free_touched;
unsigned int free_limit;
+ unsigned int colour_next; /* Per-node cache coloring */
spinlock_t list_lock;
struct array_cache *shared; /* shared per node */
struct array_cache **alien; /* on other nodes */
@@ -344,6 +345,7 @@ static void kmem_list3_init(struct kmem_
INIT_LIST_HEAD(&parent->slabs_free);
parent->shared = NULL;
parent->alien = NULL;
+ parent->colour_next = 0;
spin_lock_init(&parent->list_lock);
parent->free_objects = 0;
parent->free_touched = 0;
@@ -390,7 +392,6 @@ struct kmem_cache {

size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
- unsigned int colour_next; /* cache colouring */
struct kmem_cache *slabp_cache;
unsigned int slab_size;
unsigned int dflags; /* dynamic flags */
@@ -1119,7 +1120,6 @@ void __init kmem_cache_init(void)
BUG();

cache_cache.colour = left_over / cache_cache.colour_off;
- cache_cache.colour_next = 0;
cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
sizeof(struct slab), cache_line_size());

@@ -2324,18 +2324,19 @@ static int cache_grow(struct kmem_cache
*/
ctor_flags |= SLAB_CTOR_ATOMIC;

- /* About to mess with non-constant members - lock. */
+ /* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
- spin_lock(&cachep->spinlock);
+ l3 = cachep->nodelists[nodeid];
+ spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
- offset = cachep->colour_next;
- cachep->colour_next++;
- if (cachep->colour_next >= cachep->colour)
- cachep->colour_next = 0;
- offset *= cachep->colour_off;
+ offset = l3->colour_next;
+ l3->colour_next++;
+ if (l3->colour_next >= cachep->colour)
+ l3->colour_next = 0;
+ spin_unlock(&l3->list_lock);

- spin_unlock(&cachep->spinlock);
+ offset *= cachep->colour_off;

check_irq_off();
if (local_flags & __GFP_WAIT)
@@ -2367,7 +2368,6 @@ static int cache_grow(struct kmem_cache
if (local_flags & __GFP_WAIT)
local_irq_disable();
check_irq_off();
- l3 = cachep->nodelists[nodeid];
spin_lock(&l3->list_lock);

/* Make slab active. */

2006-02-04 01:23:03

by Andrew Morton

Subject: Re: [patch 0/3] NUMA slab locking fixes

Ravikiran G Thirumalai <[email protected]> wrote:
>
> PS: Many hotplug fixes are yet to be applied upstream I think. I know
> these slab patches work well with mm4 (with the x86_64 subtree and hotplug
> fixes in there) but I cannot really test cpu_up after cpu_down on rc2
> because it is broken as of now (pending merges, I guess).

Yes, -rc2 was a shade too early for me to complete merging stuff. Current
-linus should be better.

2006-02-04 01:27:50

by Ravikiran G Thirumalai

Subject: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

Earlier, we had to disable on-chip interrupts while taking the
cachep->spinlock because, at cache_grow, on every addition of a slab to a
slab cache, we incremented colour_next, which was protected by the
cachep->spinlock, and cache_grow can occur in interrupt context. Since we now
protect the per-node colour_next with the node's list_lock, we no longer need
to disable interrupts while taking the per-cache spinlock; we just need to
disable interrupts when taking the per-node kmem_list3 list_lock.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c 2006-02-03 15:07:56.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c 2006-02-03 15:10:04.000000000 -0800
@@ -987,7 +987,7 @@ static int __devinit cpuup_callback(stru
cpumask_t mask;

mask = node_to_cpumask(node);
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
/* cpu is dead; no one can alloc from it. */
nc = cachep->array[cpu];
cachep->array[cpu] = NULL;
@@ -996,7 +996,7 @@ static int __devinit cpuup_callback(stru
if (!l3)
goto unlock_cache;

- spin_lock(&l3->list_lock);
+ spin_lock_irq(&l3->list_lock);

/* Free limit for this kmem_list3 */
l3->free_limit -= cachep->batchcount;
@@ -1004,7 +1004,7 @@ static int __devinit cpuup_callback(stru
free_block(cachep, nc->entry, nc->avail, node);

if (!cpus_empty(mask)) {
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
goto unlock_cache;
}

@@ -1023,13 +1023,13 @@ static int __devinit cpuup_callback(stru
/* free slabs belonging to this node */
if (__node_shrink(cachep, node)) {
cachep->nodelists[node] = NULL;
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
kfree(l3);
} else {
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
}
unlock_cache:
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
kfree(nc);
}
mutex_unlock(&cache_chain_mutex);
@@ -2011,18 +2011,18 @@ static void drain_cpu_caches(struct kmem

smp_call_function_all_cpus(do_drain, cachep);
check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
for_each_online_node(node) {
l3 = cachep->nodelists[node];
if (l3) {
- spin_lock(&l3->list_lock);
+ spin_lock_irq(&l3->list_lock);
drain_array_locked(cachep, l3->shared, 1, node);
- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
if (l3->alien)
drain_alien_cache(cachep, l3);
}
}
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
}

static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -2338,7 +2338,6 @@ static int cache_grow(struct kmem_cache

offset *= cachep->colour_off;

- check_irq_off();
if (local_flags & __GFP_WAIT)
local_irq_enable();

@@ -2725,6 +2724,7 @@ static void *__cache_alloc_node(struct k
BUG_ON(!l3);

retry:
+ check_irq_off();
spin_lock(&l3->list_lock);
entry = l3->slabs_partial.next;
if (entry == &l3->slabs_partial) {
@@ -3304,11 +3304,11 @@ static int do_tune_cpucache(struct kmem_
smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
cachep->shared = shared;
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);

for_each_online_cpu(i) {
struct array_cache *ccold = new.new[i];
@@ -3564,8 +3564,7 @@ static int s_show(struct seq_file *m, vo
int node;
struct kmem_list3 *l3;

- check_irq_on();
- spin_lock_irq(&cachep->spinlock);
+ spin_lock(&cachep->spinlock);
active_objs = 0;
num_slabs = 0;
for_each_online_node(node) {
@@ -3573,7 +3572,8 @@ static int s_show(struct seq_file *m, vo
if (!l3)
continue;

- spin_lock(&l3->list_lock);
+ check_irq_on();
+ spin_lock_irq(&l3->list_lock);

list_for_each(q, &l3->slabs_full) {
slabp = list_entry(q, struct slab, list);
@@ -3600,7 +3600,7 @@ static int s_show(struct seq_file *m, vo
free_objects += l3->free_objects;
shared_avail += l3->shared->avail;

- spin_unlock(&l3->list_lock);
+ spin_unlock_irq(&l3->list_lock);
}
num_slabs += active_slabs;
num_objs = num_slabs * cachep->num;
@@ -3644,7 +3644,7 @@ static int s_show(struct seq_file *m, vo
}
#endif
seq_putc(m, '\n');
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
return 0;
}

2006-02-04 01:29:48

by Ravikiran G Thirumalai

Subject: [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking

This fixes locking and bugs in the cpu_down and cpu_up paths of the NUMA slab
allocator. Sonny Rao <[email protected]> reported problems some time back
on POWER5 boxes when the last cpu on a node was being offlined. We could
not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down. Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem earlier was that on CPU_DOWN, if it was the last cpu on the node
to go down, the array_caches (shared, alien) and the kmem_list3 of the node
were being freed (kfree) with the kmem_list3 lock held. If the l3 or the
array_caches came from the same cache being cleared, we hit badness.

This patch cleans up the locking in the cpu_up and cpu_down paths.
We cannot really free l3 on cpu down because there is no node offlining yet,
and even though a cpu is not yet up, node-local memory can be allocated
for it. So l3s are usually allocated at kmem_cache_create and destroyed at
kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get
to cachep->nodelists[nodeid] either.

The patch survived onlining and offlining on a 4-core, 2-node Tyan box with
4 dbench processes running all the time.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c 2006-02-03 15:10:04.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c 2006-02-03 15:18:51.000000000 -0800
@@ -884,14 +884,14 @@ static void __drain_alien_cache(struct k
}
}

-static void drain_alien_cache(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien)
{
int i = 0;
struct array_cache *ac;
unsigned long flags;

for_each_online_node(i) {
- ac = l3->alien[i];
+ ac = alien[i];
if (ac) {
spin_lock_irqsave(&ac->lock, flags);
__drain_alien_cache(cachep, ac, i);
@@ -902,7 +902,7 @@ static void drain_alien_cache(struct kme
#else
#define alloc_alien_cache(node, limit) do { } while (0)
#define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
#endif

static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -936,6 +936,9 @@ static int __devinit cpuup_callback(stru
l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;

+ /* The l3s don't come and go as cpus come and
+ go. cache_chain_mutex is sufficient
+ protection here */
cachep->nodelists[node] = l3;
}

@@ -949,27 +952,39 @@ static int __devinit cpuup_callback(stru
/* Now we can go ahead with allocating the shared array's
& array cache's */
list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc;
+ struct array_cache *nc, *shared, **alien;

- nc = alloc_arraycache(node, cachep->limit,
- cachep->batchcount);
- if (!nc)
+ if (!(nc = alloc_arraycache(node,
+ cachep->limit, cachep->batchcount)))
goto bad;
+ if (!(shared = alloc_arraycache(node,
+ cachep->shared*cachep->batchcount, 0xbaadf00d)))
+ goto bad;
+#ifdef CONFIG_NUMA
+ if (!(alien = alloc_alien_cache(node,
+ cachep->limit)))
+ goto bad;
+#endif
cachep->array[cpu] = nc;

l3 = cachep->nodelists[node];
BUG_ON(!l3);
- if (!l3->shared) {
- if (!(nc = alloc_arraycache(node,
- cachep->shared *
- cachep->batchcount,
- 0xbaadf00d)))
- goto bad;

+ spin_lock_irq(&l3->list_lock);
+ if (!l3->shared) {
/* we are serialised from CPU_DEAD or
CPU_UP_CANCELLED by the cpucontrol lock */
- l3->shared = nc;
+ l3->shared = shared;
+ shared = NULL;
+ }
+ if (!l3->alien) {
+ l3->alien = alien;
+ alien = NULL;
}
+ spin_unlock_irq(&l3->list_lock);
+
+ kfree(shared);
+ free_alien_cache(alien);
}
mutex_unlock(&cache_chain_mutex);
break;
@@ -978,23 +993,29 @@ static int __devinit cpuup_callback(stru
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_DEAD:
+ /* Even if all the cpus of a node are down, we don't
+ * free the kmem_list3 of any cache. This to avoid a race
+ * between cpu_down, and a kmalloc allocation from another
+ * cpu for memory from the node of the cpu going down.
+ * The list3 structure is usually allocated from
+ * kmem_cache_create and gets destroyed at kmem_cache_destroy
+ */
/* fall thru */
case CPU_UP_CANCELED:
mutex_lock(&cache_chain_mutex);

list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc;
+ struct array_cache *nc, *shared, **alien;
cpumask_t mask;

mask = node_to_cpumask(node);
- spin_lock(&cachep->spinlock);
/* cpu is dead; no one can alloc from it. */
nc = cachep->array[cpu];
cachep->array[cpu] = NULL;
l3 = cachep->nodelists[node];

if (!l3)
- goto unlock_cache;
+ goto free_array_cache;

spin_lock_irq(&l3->list_lock);

@@ -1005,33 +1026,43 @@ static int __devinit cpuup_callback(stru

if (!cpus_empty(mask)) {
spin_unlock_irq(&l3->list_lock);
- goto unlock_cache;
+ goto free_array_cache;
}

- if (l3->shared) {
+ if ((shared = l3->shared)) {
free_block(cachep, l3->shared->entry,
l3->shared->avail, node);
- kfree(l3->shared);
l3->shared = NULL;
}
- if (l3->alien) {
- drain_alien_cache(cachep, l3);
- free_alien_cache(l3->alien);
- l3->alien = NULL;
- }

- /* free slabs belonging to this node */
- if (__node_shrink(cachep, node)) {
- cachep->nodelists[node] = NULL;
- spin_unlock_irq(&l3->list_lock);
- kfree(l3);
- } else {
- spin_unlock_irq(&l3->list_lock);
+ alien = l3->alien;
+ l3->alien = NULL;
+
+ spin_unlock_irq(&l3->list_lock);
+
+ kfree(shared);
+ if (alien) {
+ drain_alien_cache(cachep, alien);
+ free_alien_cache(alien);
}
- unlock_cache:
- spin_unlock(&cachep->spinlock);
+ free_array_cache:
kfree(nc);
}
+ /*
+ * In the previous loop, all the objects were freed to
+ * the respective cache's slabs, now we can go ahead and
+ * shrink each nodelist to its limit.
+ */
+ list_for_each_entry(cachep, &cache_chain, next) {
+
+ l3 = cachep->nodelists[node];
+ if (!l3)
+ continue;
+ spin_lock_irq(&l3->list_lock);
+ /* free slabs belonging to this node */
+ __node_shrink(cachep, node);
+ spin_unlock_irq(&l3->list_lock);
+ }
mutex_unlock(&cache_chain_mutex);
break;
#endif
@@ -2011,7 +2042,6 @@ static void drain_cpu_caches(struct kmem

smp_call_function_all_cpus(do_drain, cachep);
check_irq_on();
- spin_lock(&cachep->spinlock);
for_each_online_node(node) {
l3 = cachep->nodelists[node];
if (l3) {
@@ -2019,10 +2049,9 @@ static void drain_cpu_caches(struct kmem
drain_array_locked(cachep, l3->shared, 1, node);
spin_unlock_irq(&l3->list_lock);
if (l3->alien)
- drain_alien_cache(cachep, l3);
+ drain_alien_cache(cachep, l3->alien);
}
}
- spin_unlock(&cachep->spinlock);
}

static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -3440,7 +3469,7 @@ static void cache_reap(void *unused)

l3 = searchp->nodelists[numa_node_id()];
if (l3->alien)
- drain_alien_cache(searchp, l3);
+ drain_alien_cache(searchp, l3->alien);
spin_lock_irq(&l3->list_lock);

drain_array_locked(searchp, cpu_cache_get(searchp), 0,
@@ -3598,7 +3627,8 @@ static int s_show(struct seq_file *m, vo
num_slabs++;
}
free_objects += l3->free_objects;
- shared_avail += l3->shared->avail;
+ if (l3->shared)
+ shared_avail += l3->shared->avail;

spin_unlock_irq(&l3->list_lock);
}

2006-02-04 09:49:08

by Andrew Morton

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

Ravikiran G Thirumalai <[email protected]> wrote:
>
> Earlier, we had to disable on chip interrupts while taking the cachep->spinlock
> because, at cache_grow, on every addition of a slab to a slab cache, we
> incremented colour_next which was protected by the cachep->spinlock, and
> cache_grow could occur at interrupt context. Since, now we protect the
> per-node colour_next with the node's list_lock, we do not need to disable
> on chip interrupts while taking the per-cache spinlock, but we
> just need to disable interrupts when taking the per-node kmem_list3 list_lock.

It'd be nice to have some comments describing what cachep->spinlock
actually protects.

Does __cache_shrink() need some locking to prevent nodes from going offline?

2006-02-04 10:04:22

by Andrew Morton

Subject: Re: [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking

Ravikiran G Thirumalai <[email protected]> wrote:
>
> This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab
> allocator. Sonny Rao <[email protected]> reported problems sometime back
> on POWER5 boxes, when the last cpu on the nodes were being offlined. We could
> not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
> being updated on cpu down. Since that issue is now fixed, we can reproduce
> Sonny's problems on x86_64 NUMA, and here is the fix.
>
> The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go
> down, the array_caches (shared, alien) and the kmem_list3 of the node were
> being freed (kfree) with the kmem_list3 lock held. If the l3 or the
> array_caches were to come from the same cache being cleared, we hit on badness.
>
> This patch cleans up the locking in cpu_up and cpu_down path.
> We cannot really free l3 on cpu down because, there is no node offlining yet
> and even though a cpu is not yet up, node local memory can be allocated
> for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get
> to the cachep->nodelist[nodeid] either.
>
> Patch survived onlining and offlining on a 4 core 2 node Tyan box with a
> 4 dbench process running all the time.
>
> ...
>
> + if (!(nc = alloc_arraycache(node,
> + cachep->limit, cachep->batchcount)))
> goto bad;
> + if (!(shared = alloc_arraycache(node,
> + cachep->shared*cachep->batchcount, 0xbaadf00d)))
> + goto bad;

Please don't do things like that - it's quite hard to read and we avoid it.
Cleanup patch below.

--- devel/mm/slab.c~numa-slab-locking-fixes-fix-cpu-down-and-up-locking-tidy 2006-02-04 02:01:33.000000000 -0800
+++ devel-akpm/mm/slab.c 2006-02-04 02:01:33.000000000 -0800
@@ -936,9 +936,11 @@ static int __devinit cpuup_callback(stru
l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;

- /* The l3s don't come and go as cpus come and
- go. cache_chain_mutex is sufficient
- protection here */
+ /*
+ * The l3s don't come and go as CPUs come and
+ * go. cache_chain_mutex is sufficient
+ * protection here.
+ */
cachep->nodelists[node] = l3;
}

@@ -952,17 +954,22 @@ static int __devinit cpuup_callback(stru
/* Now we can go ahead with allocating the shared array's
& array cache's */
list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc, *shared, **alien;
-
- if (!(nc = alloc_arraycache(node,
- cachep->limit, cachep->batchcount)))
+ struct array_cache *nc;
+ struct array_cache *shared;
+ struct array_cache **alien;
+
+ nc = alloc_arraycache(node, cachep->limit,
+ cachep->batchcount);
+ if (!nc)
goto bad;
- if (!(shared = alloc_arraycache(node,
- cachep->shared*cachep->batchcount, 0xbaadf00d)))
+ shared = alloc_arraycache(node,
+ cachep->shared * cachep->batchcount,
+ 0xbaadf00d);
+ if (!shared)
goto bad;
#ifdef CONFIG_NUMA
- if (!(alien = alloc_alien_cache(node,
- cachep->limit)))
+ alien = alloc_alien_cache(node, cachep->limit);
+ if (!alien)
goto bad;
#endif
cachep->array[cpu] = nc;
@@ -972,8 +979,10 @@ static int __devinit cpuup_callback(stru

spin_lock_irq(&l3->list_lock);
if (!l3->shared) {
- /* we are serialised from CPU_DEAD or
- CPU_UP_CANCELLED by the cpucontrol lock */
+ /*
+ * We are serialised from CPU_DEAD or
+ * CPU_UP_CANCELLED by the cpucontrol lock
+ */
l3->shared = shared;
shared = NULL;
}
@@ -993,19 +1002,22 @@ static int __devinit cpuup_callback(stru
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_DEAD:
- /* Even if all the cpus of a node are down, we don't
- * free the kmem_list3 of any cache. This to avoid a race
- * between cpu_down, and a kmalloc allocation from another
- * cpu for memory from the node of the cpu going down.
- * The list3 structure is usually allocated from
- * kmem_cache_create and gets destroyed at kmem_cache_destroy
+ /*
+ * Even if all the cpus of a node are down, we don't free the
+ * kmem_list3 of any cache. This to avoid a race between
+ * cpu_down, and a kmalloc allocation from another cpu for
+ * memory from the node of the cpu going down. The list3
+ * structure is usually allocated from kmem_cache_create() and
+ * gets destroyed at kmem_cache_destroy().
*/
/* fall thru */
case CPU_UP_CANCELED:
mutex_lock(&cache_chain_mutex);

list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc, *shared, **alien;
+ struct array_cache *nc;
+ struct array_cache *shared;
+ struct array_cache **alien;
cpumask_t mask;

mask = node_to_cpumask(node);
@@ -1029,7 +1041,8 @@ static int __devinit cpuup_callback(stru
goto free_array_cache;
}

- if ((shared = l3->shared)) {
+ shared = l3->shared;
+ if (shared) {
free_block(cachep, l3->shared->entry,
l3->shared->avail, node);
l3->shared = NULL;
@@ -1045,7 +1058,7 @@ static int __devinit cpuup_callback(stru
drain_alien_cache(cachep, alien);
free_alien_cache(alien);
}
- free_array_cache:
+free_array_cache:
kfree(nc);
}
/*
@@ -1054,7 +1067,6 @@ static int __devinit cpuup_callback(stru
* shrink each nodelist to its limit.
*/
list_for_each_entry(cachep, &cache_chain, next) {
-
l3 = cachep->nodelists[node];
if (!l3)
continue;
_

2006-02-04 10:06:00

by Andrew Morton

Subject: Re: [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking

Andrew Morton <[email protected]> wrote:
>
> Cleanup patch below.

After which we need to address this:

mm/slab.c: In function `cpuup_callback':
mm/slab.c:959: warning: `alien' might be used uninitialized in this function

I guess it won't cause crashes, but it makes me wonder if this was tested
on non-NUMA?

Is this right?

--- devel/mm/slab.c~numa-slab-locking-fixes-fix-cpu-down-and-up-locking-fix 2006-02-04 02:01:44.000000000 -0800
+++ devel-akpm/mm/slab.c 2006-02-04 02:01:44.000000000 -0800
@@ -901,8 +901,11 @@ static void drain_alien_cache(struct kme
}
#else
#define alloc_alien_cache(node, limit) do { } while (0)
-#define free_alien_cache(ac_ptr) do { } while (0)
#define drain_alien_cache(cachep, alien) do { } while (0)
+
+static inline void free_alien_cache(struct array_cache **ac_ptr)
+{
+}
#endif

static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -986,10 +989,12 @@ static int __devinit cpuup_callback(stru
l3->shared = shared;
shared = NULL;
}
+#ifdef CONFIG_NUMA
if (!l3->alien) {
l3->alien = alien;
alien = NULL;
}
+#endif
spin_unlock_irq(&l3->list_lock);

kfree(shared);
_

2006-02-06 22:51:11

by Ravikiran G Thirumalai

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

On Sat, Feb 04, 2006 at 01:48:28AM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > Earlier, we had to disable on chip interrupts while taking the cachep->spinlock
> > because, at cache_grow, on every addition of a slab to a slab cache, we
> > incremented colour_next which was protected by the cachep->spinlock, and
> > cache_grow could occur at interrupt context. Since, now we protect the
> > per-node colour_next with the node's list_lock, we do not need to disable
> > on chip interrupts while taking the per-cache spinlock, but we
> > just need to disable interrupts when taking the per-node kmem_list3 list_lock.
>
> It'd be nice to have some comments describing what cachep->spinlock
> actually protects.

Actually, it does not protect much anymore. Here's a cleanup+comments
patch (against current mainline).

The non-atomic stats on the cachep structure now need serialization, though;
would a spinlock be better there instead of changing them over to atomic_t's?
I wonder...

>
> Does __cache_shrink() need some locking to prevent nodes from going offline?

Looks like the l3 list locking (which is already done) is sufficient, although
only one of the two paths calling __cache_shrink takes the cpu hotplug lock
(kmem_cache_destroy()). The other (kmem_cache_shrink()) does not seem to be
used much.

Thanks,
Kiran


Remove cachep->spinlock. Locking has moved to the kmem_list3, and most of
the structures protected earlier by cachep->spinlock are now protected by
the l3->list_lock. Slab cache tunables like batchcount are always accessed
with the cache_chain_mutex held.
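
For example, the callers now look roughly like this (an illustrative sketch
using the function and mutex names from mm/slab.c, not the literal patched
code; it would only make sense inside mm/slab.c, since do_tune_cpucache()
is static there):

static int set_tunables_sketch(struct kmem_cache *cachep, int limit,
			       int batchcount, int shared)
{
	int err;

	/*
	 * Writers of the tunables serialise on cache_chain_mutex, so the
	 * plain stores inside do_tune_cpucache() no longer need
	 * cachep->spinlock.
	 */
	mutex_lock(&cache_chain_mutex);
	err = do_tune_cpucache(cachep, limit, batchcount, shared);
	mutex_unlock(&cache_chain_mutex);
	return err;
}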

Patch tested on SMP and NUMA kernels with dbench processes
running, constant onlining/offlining, and constant cache tuning,
all at the same time.

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2006-02-05 14:38:37.000000000 -0800
+++ linux-2.6/mm/slab.c 2006-02-06 12:05:26.000000000 -0800
@@ -373,17 +373,19 @@ static void kmem_list3_init(struct kmem_
struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
struct array_cache *array[NR_CPUS];
+/* 2) Cache tunables. Protected by cache_chain_mutex */
unsigned int batchcount;
unsigned int limit;
unsigned int shared;
+
unsigned int buffer_size;
-/* 2) touched by every alloc & free from the backend */
+/* 3) touched by every alloc & free from the backend */
struct kmem_list3 *nodelists[MAX_NUMNODES];
+
unsigned int flags; /* constant flags */
unsigned int num; /* # of objs per slab */
- spinlock_t spinlock;

-/* 3) cache_grow/shrink */
+/* 4) cache_grow/shrink */
/* order of pgs per slab (2^n) */
unsigned int gfporder;

@@ -402,11 +404,11 @@ struct kmem_cache {
/* de-constructor func */
void (*dtor) (void *, struct kmem_cache *, unsigned long);

-/* 4) cache creation/removal */
+/* 5) cache creation/removal */
const char *name;
struct list_head next;

-/* 5) statistics */
+/* 6) statistics */
#if STATS
unsigned long num_active;
unsigned long num_allocations;
@@ -643,7 +645,6 @@ static struct kmem_cache cache_cache = {
.shared = 1,
.buffer_size = sizeof(struct kmem_cache),
.flags = SLAB_NO_REAP,
- .spinlock = SPIN_LOCK_UNLOCKED,
.name = "kmem_cache",
#if DEBUG
.obj_size = sizeof(struct kmem_cache),
@@ -1909,7 +1910,6 @@ kmem_cache_create (const char *name, siz
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
- spin_lock_init(&cachep->spinlock);
cachep->buffer_size = size;

if (flags & CFLGS_OFF_SLAB)
@@ -3334,6 +3334,7 @@ static void do_ccupdate_local(void *info
new->new[smp_processor_id()] = old;
}

+/* Always called with the cache_chain_mutex held */
static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount,
int shared)
{
@@ -3355,11 +3356,9 @@ static int do_tune_cpucache(struct kmem_
smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

check_irq_on();
- spin_lock(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
cachep->shared = shared;
- spin_unlock(&cachep->spinlock);

for_each_online_cpu(i) {
struct array_cache *ccold = new.new[i];
@@ -3380,6 +3379,7 @@ static int do_tune_cpucache(struct kmem_
return 0;
}

+/* Called with cache_chain_mutex held always */
static void enable_cpucache(struct kmem_cache *cachep)
{
int err;
@@ -3615,7 +3615,6 @@ static int s_show(struct seq_file *m, vo
int node;
struct kmem_list3 *l3;

- spin_lock(&cachep->spinlock);
active_objs = 0;
num_slabs = 0;
for_each_online_node(node) {
@@ -3696,7 +3695,6 @@ static int s_show(struct seq_file *m, vo
}
#endif
seq_putc(m, '\n');
- spin_unlock(&cachep->spinlock);
return 0;
}

2006-02-06 23:28:51

by Andrew Morton

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Sat, Feb 04, 2006 at 01:48:28AM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > Earlier, we had to disable on chip interrupts while taking the cachep->spinlock
> > > because, at cache_grow, on every addition of a slab to a slab cache, we
> > > incremented colour_next which was protected by the cachep->spinlock, and
> > > cache_grow could occur at interrupt context. Since, now we protect the
> > > per-node colour_next with the node's list_lock, we do not need to disable
> > > on chip interrupts while taking the per-cache spinlock, but we
> > > just need to disable interrupts when taking the per-node kmem_list3 list_lock.
> >
> > It'd be nice to have some comments describing what cachep->spinlock
> > actually protects.
>
> Actually, it does not protect much anymore. Here's a cleanup+comments
> patch (against current mainline).

This is getting scary. Manfred, Christoph, Pekka: have you guys taken a
close look at what's going on in here?

> The non atomic stats on the cachep structure now needs serialization though,
> would a spinlock be better there instead of changing them over to atomic_t s
> ? I wonder...

It's common for the kernel to use non-atomic ops for stats such as this
(networking especially). We accept the potential for small inaccuracies as
a tradeoff for improved performance.

But given that a) these stats are only enabled with CONFIG_DEBUG_SLAB and
b) CONFIG_DEBUG_SLAB causes performance to suck the big one anyway and c)
the slab.c stats are wrapped in handy macros, yes, I guess it wouldn't hurt
to make these guys atomic_t. Sometime. My slab.c is looking rather
overpatched again at present.
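
For illustration only, a hypothetical sketch of the atomic_t route (untested,
and not part of any posted patch) -- switch the counter type and the
corresponding STATS_* helper in mm/slab.c:

#if STATS
	atomic_t num_allocations;	/* was: unsigned long num_allocations; */
#endif

#define STATS_INC_ALLOCED(x)	atomic_inc(&(x)->num_allocations)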

A problem right now is the -mm-only slab debug patches
(slab-leak-detector.patch, slab-cache-shrinker-statistics.patch and the
currently-dropped
periodically-scan-redzone-entries-and-slab-control-structures.patch). It
would be good if we could finish these things off and get them into mainline
so things get a bit sane again.


2006-02-07 00:21:48

by Christoph Lameter

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

On Mon, 6 Feb 2006, Andrew Morton wrote:

> > Actually, it does not protect much anymore. Here's a cleanup+comments
> > patch (against current mainline).
>
> This is getting scary. Manfred, Christoph, Pekka: have you guys taken a
> close look at what's going on in here?

I looked at his patch and he seems to be right. Most of the kmem_cache
structure is established at slab creation. Updates are to the debug
counters and to nodelists[] during node online/offline and to array[]
during cpu online/offline. The chain mutex is used to protect the
setting of the tuning parameters. I still need to have a look at the
details though.


2006-02-07 07:36:57

by Pekka Enberg

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

On Mon, 6 Feb 2006, Andrew Morton wrote:
> > This is getting scary. Manfred, Christoph, Pekka: have you guys taken a
> > close look at what's going on in here?

On Mon, 6 Feb 2006, Christoph Lameter wrote:
> I looked at his patch and he seems to be right. Most of the kmem_cache
> structure is established at slab creation. Updates are to the debug
> counters and to nodelists[] during node online/offline and to array[]
> during cpu online/offline. The chain mutex is used to protect the
> setting of the tuning parameters. I still need to have a look at the
> details though.

The patch looks correct, but I am wondering if we should keep the spinlock
around for clarity? The chain mutex doesn't really have anything to do
with the tunables; it's there to protect the cache chain. I am worried
that this patch makes code restructuring harder. Hmm?

Pekka

2006-02-07 07:50:44

by Ravikiran G Thirumalai

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

On Tue, Feb 07, 2006 at 09:36:40AM +0200, Pekka J Enberg wrote:
> On Mon, 6 Feb 2006, Andrew Morton wrote:
> > > This is getting scary. Manfred, Christoph, Pekka: have you guys taken a
> > > close look at what's going on in here?
>
> On Mon, 6 Feb 2006, Christoph Lameter wrote:
> > I looked at his patch and he seems to be right. Most of the kmem_cache
> > structure is established at slab creation. Updates are to the debug
> > counters and to nodelists[] during node online/offline and to array[]
> > during cpu online/offline. The chain mutex is used to protect the
> > setting of the tuning parameters. I still need to have a look at the
> > details though.
>
> The patch looks correct but I am wondering if we should keep the spinlock
> around for clarity? The chain mutex doesn't really have anything to do
> with the tunables, it's there to protect the cache chain. I am worried
> that this patch makes code restructuring harder. Hmm?

IMHO, if you keep something around which is not needed, it might later get
abused/misused. And what would you add as a comment for the
cachep->spinlock?

Instead, bold comments on the cachep structure stating which members are
protected by which lock/mutex should be sufficient, no?
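
Something along these lines, for instance (an illustrative sketch of the kind
of annotation meant, loosely following the sections introduced by the cleanup
patch above, not the literal mm/slab.c layout):

struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
	struct array_cache *array[NR_CPUS];
/* 2) cache tunables -- writers must hold cache_chain_mutex */
	unsigned int batchcount;
	unsigned int limit;
	unsigned int shared;
/* 3) per-node lists -- everything hanging off nodelists[n] is protected
 *    by nodelists[n]->list_lock, taken with interrupts disabled */
	struct kmem_list3 *nodelists[MAX_NUMNODES];
/* 4) set at kmem_cache_create() time and effectively read-only after that:
 *    num, gfporder, colour, colour_off, ctor/dtor, name, ... */
	/* ... */
};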

Thanks,
Kiran

2006-02-07 07:55:57

by Pekka Enberg

Subject: Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock

On Mon, 6 Feb 2006, Ravikiran G Thirumalai wrote:
> IMHO, if you keep something around which is not needed, it might later get
> abused/misused. And what would you add in as comments for the
> cachep->spinlock?
>
> Instead, bold comments on cachep structure stating what all members are
> protected by which lock/mutex should be sufficient no?

Yeah, I guess we can put the spinlock back if we ever need it.

Pekka