2005-12-16 11:36:15

by Gunter Ohrner

Subject: 2.6.15-rc5-rt2 slowness

Hi!

Thanks to Steven's Kconfig fixes I was able to compile 2.6.15-rc5 with
Ingo's rt2-patch just fine.

I have two small problems with it, however. One is the Oops I just posted,
the other is a high system load and bad responsiveness of the system as a
whole. I didn't even bother to measure timer/sleep jitters as the system
was dog slow and the fans started to run at full speed almost immediately.

We observed this on two different systems, one with the config attached to
my mail with the Oops/backtrace.

I'll try to recompile the kernel with some debug options; if anyone knows of
something I should specifically look for, that would be helpful.

Greetings,

Gunter


2005-12-16 11:46:01

by Gunter Ohrner

Subject: Re: 2.6.15-rc5-rt2 slowness

Gunter Ohrner wrote:
> the other is a high system load and bad responsiveness of the system as a
> whole. I didn't even bother to measure timer/sleep jitters as the system
> was dog slow and the fans started to run at full speed almost immediately.

Ok, I recompiled the kernel with some debug options and attached
a /proc/latency_trace output, hoping it will be helpful to track down the
problem...

Please tell me if there's anything else I should do.

Greetings,

Gunter


Attachments:
lat_trace.log (6.49 kB)

2005-12-16 12:10:43

by Gunter Ohrner

Subject: Re: 2.6.15-rc5-rt2 slowness

Hi!

Some further info:

,----[ cat rcuctrs rcugp rcuptrs rcustats schedstat ]
| CPU last cur
| 0 0 0
| ggp = 36340
| oldggp=36343 newggp=36354
| nl=c07fd218/c96abee8 nt=c94eae1c
| wl=c07fd220/00000000 wt=c07fd220 dl=c07fd228/00000000 dt=c07fd228
| ggp=36368 lgp=36368 rcc=36368
| na=135275 nl=3 wa=135272 wl=0 da=135272 dl=0 dr=135272 di=135272
| rtf1=36624 rtf2=36624 rtf3=36368 rtfe1=0 rtfe2=0 rtfe3=256
| version 12
| timestamp 80698
| cpu0 0 0 4 4 736 1123111 85725 0 0 92271 332649 1037386
`----

.config of the currently running kernel is attached.

The system I'm trying this on is a 1.5 GHz Celeron-M notebook with an i865
chipset. The CPU was forced into the ACPI C0 state:

# cat /sys/module/processor/parameters/max_cstate
0

cpufreq scaling governor is set to "performance".

Greetings,

Gunter


Attachments:
config-2.6.15-rc5-rt2.zb.20051216.2 (38.61 kB)

2005-12-16 12:32:20

by Steven Rostedt

Subject: Re: 2.6.15-rc5-rt2 slowness

On Fri, 2005-12-16 at 12:30 +0100, Gunter Ohrner wrote:
> Hi!
>
> Thanks to Steven's Kconfig fixes I was able to compile 2.6.15-rc5 with
> Ingo's rt2-patch just fine.
>
> I have two small problems with it, however. One is the Oops I just posted,
> the other is a high system load and bad responsiveness of the system as a
> whole. I didn't even bother to measure timer/sleep jitters as the system
> was dog slow and the fans started to run at full speed almost immediately.
>
> We observed this on two different systems, one with the config attached to
> my mail with the Oops/backtrace.
>
> I'll try to recompile the kernel with some debug options; if anyone knows of
> something I should specifically look for, that would be helpful.

I'll look into your oops later (or maybe Ingo has some time), but I've
also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
now.

Thanks,

-- Steve

2005-12-16 12:34:19

by Steven Rostedt

Subject: Re: 2.6.15-rc5-rt2 slowness

On Fri, 2005-12-16 at 12:42 +0100, Gunter Ohrner wrote:
> Gunter Ohrner wrote:
> > the other is a high system load and bad responsiveness of the system as a
> > whole. I didn't even bother to measure timer/sleep jitters as the system
> > was dog slow and the fans started to run at full speed almost immediately.
>
> Ok, I recompiled the kernel with some debug options and attached
> a /proc/latency_trace output, hoping it will be helpful to track down the
> problem...
>
> Please tell me if there's anything else I should do.

Sorry, your latency trace doesn't help. The 45 usec it shows is not really
a latency (well, not a bad one).

-- Steve


2005-12-16 22:58:26

by john stultz

Subject: Re: 2.6.15-rc5-rt2 slowness

On Fri, 2005-12-16 at 07:32 -0500, Steven Rostedt wrote:
> I'll look into your oops later (or maybe Ingo has some time), but I've
> also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
> now.

Hey Steven,
Do check that the slowness you're seeing isn't related to the
CONFIG_PARANIOD_GENERIC_TIME option being enabled. It is expected that
the extra checks made by that config option would slow things down a
bit.

thanks
-john

2005-12-17 00:24:40

by Gunter Ohrner

Subject: Re: 2.6.15-rc5-rt2 slowness

john stultz wrote:
> Do check that the slowness you're seeing isn't related to the
> CONFIG_PARANIOD_GENERIC_TIME option being enabled. It is expected that
> the extra checks made by that config option would slow things down a
> bit.

The first kernel I built which showed this behaviour had no debugging
options enabled.

It happens when the system is mostly idle: in this state "top" will show a
kernel usage of 20%-50%, and as soon as something CPU-intensive is started,
the whole system becomes extremely unresponsive.

Greetings,

Gunter

2005-12-17 03:33:35

by Steven Rostedt

Subject: Re: 2.6.15-rc5-rt2 slowness

Ingo,

After searching all over to find out where the slowness is, I finally
discovered it. It's the SLOB!

I noticed that a make install of the kernel over NFS took ~26 seconds to
complete on 2.6.15-rc5, while on 2.6.15-rc5-rt2 the same operation on the
same machine took almost 2 minutes.

I added my logdev device to record lots of output, and it found the
place that is taking the longest:

In 2.6.15-rc5-rt2:

[ 789.171773] cpu:0 kfree_skbmem:291 in
[ 789.171873] cpu:0 kfree_skbmem:295 1
[ 789.172357] cpu:0 kfree_skbmem:320 out

in 2.6.15-rc5:

[ 343.253988] cpu:0 kfree_skbmem:291 in
[ 343.253990] cpu:0 kfree_skbmem:295 1
[ 343.253991] cpu:0 kfree_skbmem:320 out

Here's the code for both systems (they are identical here):

void kfree_skbmem(struct sk_buff *skb)
{
	struct sk_buff *other;
	atomic_t *fclone_ref;

	edprint("in");
	skb_release_data(skb);
	switch (skb->fclone) {
	case SKB_FCLONE_UNAVAILABLE:
		edprint("1");
		kmem_cache_free(skbuff_head_cache, skb);
		break;

	case SKB_FCLONE_ORIG:
		edprint("2");
		fclone_ref = (atomic_t *) (skb + 2);
		if (atomic_dec_and_test(fclone_ref))
			kmem_cache_free(skbuff_fclone_cache, skb);
		break;

	case SKB_FCLONE_CLONE:
		edprint("3");
		fclone_ref = (atomic_t *) (skb + 1);
		other = skb - 1;

		/* The clone portion is available for
		 * fast-cloning again.
		 */
		skb->fclone = SKB_FCLONE_UNAVAILABLE;

		if (atomic_dec_and_test(fclone_ref))
			kmem_cache_free(skbuff_fclone_cache, other);
		break;
	};
	edprint("out");
}

My edprint records into a ring buffer (much like relayfs), and produces
the above output. The time in brackets is in seconds. We see that the time
between "1" and "out" differs greatly. (Note, I have an edprint in all
interrupts, so I would know if one was taken, and these times are not a
one-time deal, but show up like this every time.)
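
Roughly, the idea is something like this -- a minimal userspace sketch of
logging timestamped markers into a fixed-size ring buffer, not the actual
logdev implementation (that's in the patches linked below):

/*
 * Minimal sketch of the edprint idea (not the real logdev code): store
 * timestamped markers in a fixed-size ring buffer and only format them
 * afterwards, so logging stays cheap and doesn't distort the timing.
 * A real version would also record the CPU and use per-CPU buffers.
 */
#include <stdio.h>
#include <time.h>

#define LOG_ENTRIES 1024

struct log_entry {
	struct timespec ts;
	const char *func;
	int line;
	const char *msg;
};

static struct log_entry log_buf[LOG_ENTRIES];
static unsigned int log_idx;	/* would need to be atomic or per-CPU */

#define edprint(m) do {							\
	struct log_entry *e = &log_buf[log_idx++ % LOG_ENTRIES];	\
	clock_gettime(CLOCK_MONOTONIC, &e->ts);				\
	e->func = __func__;						\
	e->line = __LINE__;						\
	e->msg = (m);							\
} while (0)

/* dump in buffer order; ignores wrap-around ordering for simplicity */
static void log_dump(void)
{
	unsigned int i;

	for (i = 0; i < log_idx && i < LOG_ENTRIES; i++)
		printf("[%5ld.%06ld] %s:%d %s\n",
		       (long)log_buf[i].ts.tv_sec,
		       log_buf[i].ts.tv_nsec / 1000,
		       log_buf[i].func, log_buf[i].line, log_buf[i].msg);
}

int main(void)
{
	edprint("in");
	/* ... the code being timed would run here ... */
	edprint("out");
	log_dump();
	return 0;
}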

So for 2.6.15-rc5 the time difference is 343.253991 - 343.253990, or
1 usec, whereas for 2.6.15-rc5-rt2 it is 789.172357 - 789.171873, or
484 usecs! We're talking about a 48,400% increase here!
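
Spelled out (just a throwaway userspace check of those numbers, not part of
any patch):

#include <stdio.h>

int main(void)
{
	/* timestamps (in seconds) from the two kfree_skbmem traces above */
	double vanilla_usec = (343.253991 - 343.253990) * 1e6;	/* 2.6.15-rc5     */
	double rt_usec      = (789.172357 - 789.171873) * 1e6;	/* 2.6.15-rc5-rt2 */

	printf("vanilla: ~%.0f usec, -rt2: ~%.0f usec, ratio: ~%.0fx\n",
	       vanilla_usec, rt_usec, rt_usec / vanilla_usec);
	return 0;
}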

The difference here is that kmem_cache_free in -rt uses the SLOB, whereas
the vanilla kernel still uses the SLAB. What's the rationale for the SLOB
now?

The patches used are here:

For the logdev device:
http://www.kihontech.com/logdev/logdev-2.6.15-rc5-rt2.patch
http://www.kihontech.com/logdev/logdev-2.6.15-rc5.patch

For debugging (on top of logdev):
http://www.kihontech.com/logdev/debug-2.6.15-rc5-rt2.patch
http://www.kihontech.com/logdev/debug-2.6.15-rc5.patch

(I also added my previously posted patches to get it to compile and to
handle the softirq hrtimer problems.)

Thanks,

-- Steve


2005-12-17 03:51:54

by Steven Rostedt

Subject: Re: 2.6.15-rc5-rt2 slowness

On Fri, 2005-12-16 at 14:58 -0800, john stultz wrote:
> On Fri, 2005-12-16 at 07:32 -0500, Steven Rostedt wrote:
> > I'll look into your oops later (or maybe Ingo has some time), but I've
> > also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
> > now.
>
> Hey Steven,
> Do check that the slowness you're seeing isn't related to the
> CONFIG_PARANIOD_GENERIC_TIME option being enabled. It is expected that
> the extra checks made by that config option would slow things down a
> bit.
>

Hi John,

Thanks for the suggestion, but I've been running my tests with that
turned off. Actually, I've turned off pretty much all debugging and
added my own logdev device. As mentioned in my previous email, which I
CC'd you on, I found the culprit.

Seems the SLOB is not as fast as the SLAB. I'll look into this more
tomorrow. (My wife is taking my daughter out of town for gymnastics, so
I get to stay home and work :-/ )

-- Steve


2005-12-17 22:57:58

by Steven Rostedt

Subject: Re: 2.6.15-rc5-rt2 slowness

Ingo,

I ported your old 2.6.14-rt22 changes to mm/slab.c over to 2.6.15-rc5-rt2
and tried them out. I believe this confirms that the SLOB _is_ the cause of
the slowness. Booting with this slab patch gives the old speeds that we
used to have.

Now, is the solution to bring the SLOB up to par with the SLAB, or to
make the SLAB as close as possible to the mainline one (why remove NUMA?)
and keep it for PREEMPT_RT?

Below is the port of the slab changes, in case anyone else would like to
see whether it speeds things up for them.

-- Steve

Index: linux-2.6.15-rc5-rt2/init/Kconfig
===================================================================
--- linux-2.6.15-rc5-rt2.orig/init/Kconfig 2005-12-17 14:09:22.000000000 -0500
+++ linux-2.6.15-rc5-rt2/init/Kconfig 2005-12-17 14:09:41.000000000 -0500
@@ -402,7 +402,7 @@
default y
bool "Use full SLAB allocator" if EMBEDDED
# we switch to the SLOB on PREEMPT_RT
- depends on !PREEMPT_RT
+# depends on !PREEMPT_RT
help
Disabling this replaces the advanced SLAB allocator and
kmalloc support with the drastically simpler SLOB allocator.
Index: linux-2.6.15-rc5-rt2/mm/slab.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slab.c 2005-12-17 16:44:10.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slab.c 2005-12-17 17:27:30.000000000 -0500
@@ -75,15 +75,6 @@
*
* At present, each engine can be growing a cache. This should be blocked.
*
- * 15 March 2005. NUMA slab allocator.
- * Shai Fultheim <[email protected]>.
- * Shobhit Dayal <[email protected]>
- * Alok N Kataria <[email protected]>
- * Christoph Lameter <[email protected]>
- *
- * Modified the slab allocator to be node aware on NUMA systems.
- * Each node has its own list of partial, free and full slabs.
- * All object allocations for a node occur from node specific slab lists.
*/

#include <linux/config.h>
@@ -102,7 +93,6 @@
#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/string.h>
-#include <linux/nodemask.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -222,7 +212,6 @@
void *s_mem; /* including colour offset */
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
- unsigned short nodeid;
};

/*
@@ -250,6 +239,7 @@
/*
* struct array_cache
*
+ * Per cpu structures
* Purpose:
* - LIFO ordering, to hand out cache-warm objects from _alloc
* - reduce the number of linked list operations
@@ -264,13 +254,6 @@
unsigned int limit;
unsigned int batchcount;
unsigned int touched;
- spinlock_t lock;
- void *entry[0]; /*
- * Must have this definition in here for the proper
- * alignment of array_cache. Also simplifies accessing
- * the entries.
- * [0] is for gcc 2.95. It should really be [].
- */
};

/* bootstrap: The caches do not work without cpuarrays anymore,
@@ -283,84 +266,34 @@
};

/*
- * The slab lists for all objects.
+ * The slab lists of all objects.
+ * Hopefully reduce the internal fragmentation
+ * NUMA: The spinlock could be moved from the kmem_cache_t
+ * into this structure, too. Figure out what causes
+ * fewer cross-node spinlock operations.
*/
struct kmem_list3 {
struct list_head slabs_partial; /* partial list first, better asm code */
struct list_head slabs_full;
struct list_head slabs_free;
unsigned long free_objects;
- unsigned long next_reap;
int free_touched;
- unsigned int free_limit;
- spinlock_t list_lock;
- struct array_cache *shared; /* shared per node */
- struct array_cache **alien; /* on other nodes */
+ unsigned long next_reap;
+ struct array_cache *shared;
};

-/*
- * Need this for bootstrapping a per node allocator.
- */
-#define NUM_INIT_LISTS (2 * MAX_NUMNODES + 1)
-struct kmem_list3 __initdata initkmem_list3[NUM_INIT_LISTS];
-#define CACHE_CACHE 0
-#define SIZE_AC 1
-#define SIZE_L3 (1 + MAX_NUMNODES)
-
-/*
- * This function must be completely optimized away if
- * a constant is passed to it. Mostly the same as
- * what is in linux/slab.h except it returns an
- * index.
- */
-static __always_inline int index_of(const size_t size)
-{
- if (__builtin_constant_p(size)) {
- int i = 0;
-
-#define CACHE(x) \
- if (size <=x) \
- return i; \
- else \
- i++;
-#include "linux/kmalloc_sizes.h"
-#undef CACHE
- {
- extern void __bad_size(void);
- __bad_size();
- }
- } else
- BUG();
- return 0;
-}
-
-#define INDEX_AC index_of(sizeof(struct arraycache_init))
-#define INDEX_L3 index_of(sizeof(struct kmem_list3))
-
-static inline void kmem_list3_init(struct kmem_list3 *parent)
-{
- INIT_LIST_HEAD(&parent->slabs_full);
- INIT_LIST_HEAD(&parent->slabs_partial);
- INIT_LIST_HEAD(&parent->slabs_free);
- parent->shared = NULL;
- parent->alien = NULL;
- spin_lock_init(&parent->list_lock);
- parent->free_objects = 0;
- parent->free_touched = 0;
-}
-
-#define MAKE_LIST(cachep, listp, slab, nodeid) \
- do { \
- INIT_LIST_HEAD(listp); \
- list_splice(&(cachep->nodelists[nodeid]->slab), listp); \
- } while (0)
-
-#define MAKE_ALL_LISTS(cachep, ptr, nodeid) \
- do { \
- MAKE_LIST((cachep), (&(ptr)->slabs_full), slabs_full, nodeid); \
- MAKE_LIST((cachep), (&(ptr)->slabs_partial), slabs_partial, nodeid); \
- MAKE_LIST((cachep), (&(ptr)->slabs_free), slabs_free, nodeid); \
- } while (0)
+#define LIST3_INIT(parent) \
+ { \
+ .slabs_full = LIST_HEAD_INIT(parent.slabs_full), \
+ .slabs_partial = LIST_HEAD_INIT(parent.slabs_partial), \
+ .slabs_free = LIST_HEAD_INIT(parent.slabs_free) \
+ }
+#define list3_data(cachep) \
+ (&(cachep)->lists)
+
+/* NUMA: per-node */
+#define list3_data_ptr(cachep, ptr) \
+ list3_data(cachep)

/*
* kmem_cache_t
@@ -373,12 +306,13 @@
struct array_cache *array[NR_CPUS];
unsigned int batchcount;
unsigned int limit;
- unsigned int shared;
- unsigned int objsize;
/* 2) touched by every alloc & free from the backend */
- struct kmem_list3 *nodelists[MAX_NUMNODES];
+ struct kmem_list3 lists;
+ /* NUMA: kmem_3list_t *nodelists[MAX_NUMNODES] */
+ unsigned int objsize;
unsigned int flags; /* constant flags */
unsigned int num; /* # of objs per slab */
+ unsigned int free_limit; /* upper limit of objects in the lists */
spinlock_t spinlock;

/* 3) cache_grow/shrink */
@@ -415,7 +349,6 @@
unsigned long errors;
unsigned long max_freeable;
unsigned long node_allocs;
- unsigned long node_frees;
atomic_t allochit;
atomic_t allocmiss;
atomic_t freehit;
@@ -451,7 +384,6 @@
} while (0)
#define STATS_INC_ERR(x) ((x)->errors++)
#define STATS_INC_NODEALLOCS(x) ((x)->node_allocs++)
-#define STATS_INC_NODEFREES(x) ((x)->node_frees++)
#define STATS_SET_FREEABLE(x, i) \
do { if ((x)->max_freeable < i) \
(x)->max_freeable = i; \
@@ -470,7 +402,6 @@
#define STATS_SET_HIGH(x) do { } while (0)
#define STATS_INC_ERR(x) do { } while (0)
#define STATS_INC_NODEALLOCS(x) do { } while (0)
-#define STATS_INC_NODEFREES(x) do { } while (0)
#define STATS_SET_FREEABLE(x, i) \
do { } while (0)

@@ -618,9 +549,9 @@

/* internal cache of cache description objs */
static kmem_cache_t cache_cache = {
+ .lists = LIST3_INIT(cache_cache.lists),
.batchcount = 1,
.limit = BOOT_CPUCACHE_ENTRIES,
- .shared = 1,
.objsize = sizeof(kmem_cache_t),
.flags = SLAB_NO_REAP,
.spinlock = SPIN_LOCK_UNLOCKED(cache_cache.spinlock),
@@ -641,6 +572,7 @@
* SLAB_RECLAIM_ACCOUNT turns this on per-slab
*/
atomic_t slab_reclaim_pages;
+EXPORT_SYMBOL(slab_reclaim_pages);

/*
* chicken and egg problem: delay the per-cpu array allocation
@@ -648,24 +580,28 @@
*/
static enum {
NONE,
- PARTIAL_AC,
- PARTIAL_L3,
+ PARTIAL,
FULL
} g_cpucache_up;

static DEFINE_PER_CPU(struct work_struct, reap_work);

-static void free_block(kmem_cache_t* cachep, void** objpp, int len, int node);
+static void free_block(kmem_cache_t* cachep, void** objpp, int len);
static void enable_cpucache (kmem_cache_t *cachep);
static void cache_reap (void *unused);
-static int __node_shrink(kmem_cache_t *cachep, int node);

-static inline struct array_cache *ac_data(kmem_cache_t *cachep)
+static inline void **ac_entry(struct array_cache *ac)
{
- return cachep->array[smp_processor_id()];
+ return (void**)(ac+1);
}

-static inline kmem_cache_t *__find_general_cachep(size_t size, gfp_t gfpflags)
+static inline struct array_cache *ac_data(kmem_cache_t *cachep, int cpu)
+{
+ return cachep->array[cpu];
+}
+
+static inline kmem_cache_t *__find_general_cachep(size_t size,
+ unsigned int __nocast gfpflags)
{
struct cache_sizes *csizep = malloc_sizes;

@@ -674,13 +610,13 @@
* kmem_cache_create(), or __kmalloc(), before
* the generic caches are initialized.
*/
- BUG_ON(malloc_sizes[INDEX_AC].cs_cachep == NULL);
+ BUG_ON(csizep->cs_cachep == NULL);
#endif
while (size > csizep->cs_size)
csizep++;

/*
- * Really subtle: The last entry with cs->cs_size==ULONG_MAX
+ * Really subtile: The last entry with cs->cs_size==ULONG_MAX
* has cs_{dma,}cachep==NULL. Thus no special case
* for large kmalloc calls required.
*/
@@ -689,7 +625,8 @@
return csizep->cs_cachep;
}

-kmem_cache_t *kmem_find_general_cachep(size_t size, gfp_t gfpflags)
+kmem_cache_t *kmem_find_general_cachep(size_t size,
+ unsigned int __nocast gfpflags)
{
return __find_general_cachep(size, gfpflags);
}
@@ -754,160 +691,48 @@
}
}

-static struct array_cache *alloc_arraycache(int node, int entries,
+static struct array_cache *alloc_arraycache(int cpu, int entries,
int batchcount)
{
int memsize = sizeof(void*)*entries+sizeof(struct array_cache);
struct array_cache *nc = NULL;

- nc = kmalloc_node(memsize, GFP_KERNEL, node);
+ if (cpu == -1)
+ nc = kmalloc(memsize, GFP_KERNEL);
+ else
+ nc = kmalloc_node(memsize, GFP_KERNEL, cpu_to_node(cpu));
+
if (nc) {
nc->avail = 0;
nc->limit = entries;
nc->batchcount = batchcount;
nc->touched = 0;
- spin_lock_init(&nc->lock);
}
return nc;
}

-#ifdef CONFIG_NUMA
-static inline struct array_cache **alloc_alien_cache(int node, int limit)
-{
- struct array_cache **ac_ptr;
- int memsize = sizeof(void*)*MAX_NUMNODES;
- int i;
-
- if (limit > 1)
- limit = 12;
- ac_ptr = kmalloc_node(memsize, GFP_KERNEL, node);
- if (ac_ptr) {
- for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
- continue;
- }
- ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d);
- if (!ac_ptr[i]) {
- for (i--; i <=0; i--)
- kfree(ac_ptr[i]);
- kfree(ac_ptr);
- return NULL;
- }
- }
- }
- return ac_ptr;
-}
-
-static inline void free_alien_cache(struct array_cache **ac_ptr)
-{
- int i;
-
- if (!ac_ptr)
- return;
-
- for_each_node(i)
- kfree(ac_ptr[i]);
-
- kfree(ac_ptr);
-}
-
-static inline void __drain_alien_cache(kmem_cache_t *cachep, struct array_cache *ac, int node)
-{
- struct kmem_list3 *rl3 = cachep->nodelists[node];
-
- if (ac->avail) {
- spin_lock(&rl3->list_lock);
- free_block(cachep, ac->entry, ac->avail, node);
- ac->avail = 0;
- spin_unlock(&rl3->list_lock);
- }
-}
-
-static void drain_alien_cache(kmem_cache_t *cachep, struct kmem_list3 *l3)
-{
- int i=0;
- struct array_cache *ac;
- unsigned long flags;
-
- for_each_online_node(i) {
- ac = l3->alien[i];
- if (ac) {
- spin_lock_irqsave(&ac->lock, flags);
- __drain_alien_cache(cachep, ac, i);
- spin_unlock_irqrestore(&ac->lock, flags);
- }
- }
-}
-#else
-#define alloc_alien_cache(node, limit) do { } while (0)
-#define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
-#endif
-
static int __devinit cpuup_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
kmem_cache_t* cachep;
- struct kmem_list3 *l3 = NULL;
- int node = cpu_to_node(cpu);
- int memsize = sizeof(struct kmem_list3);
- struct array_cache *nc = NULL;

switch (action) {
case CPU_UP_PREPARE:
down(&cache_chain_sem);
- /* we need to do this right in the beginning since
- * alloc_arraycache's are going to use this list.
- * kmalloc_node allows us to add the slab to the right
- * kmem_list3 and not this cpu's kmem_list3
- */
-
list_for_each_entry(cachep, &cache_chain, next) {
- /* setup the size64 kmemlist for cpu before we can
- * begin anything. Make sure some other cpu on this
- * node has not already allocated this
- */
- if (!cachep->nodelists[node]) {
- if (!(l3 = kmalloc_node(memsize,
- GFP_KERNEL, node)))
- goto bad;
- kmem_list3_init(l3);
- l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
- ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-
- cachep->nodelists[node] = l3;
- }
-
- spin_lock_irq(&cachep->nodelists[node]->list_lock);
- cachep->nodelists[node]->free_limit =
- (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&cachep->nodelists[node]->list_lock);
- }
+ struct array_cache *nc;

- /* Now we can go ahead with allocating the shared array's
- & array cache's */
- list_for_each_entry(cachep, &cache_chain, next) {
- nc = alloc_arraycache(node, cachep->limit,
- cachep->batchcount);
+ nc = alloc_arraycache(cpu, cachep->limit, cachep->batchcount);
if (!nc)
goto bad;
+
+ spin_lock_irq(&cachep->spinlock);
cachep->array[cpu] = nc;
+ cachep->free_limit = (1+num_online_cpus())*cachep->batchcount
+ + cachep->num;
+ spin_unlock_irq(&cachep->spinlock);

- l3 = cachep->nodelists[node];
- BUG_ON(!l3);
- if (!l3->shared) {
- if (!(nc = alloc_arraycache(node,
- cachep->shared*cachep->batchcount,
- 0xbaadf00d)))
- goto bad;
-
- /* we are serialised from CPU_DEAD or
- CPU_UP_CANCELLED by the cpucontrol lock */
- l3->shared = nc;
- }
}
up(&cache_chain_sem);
break;
@@ -922,51 +747,13 @@

list_for_each_entry(cachep, &cache_chain, next) {
struct array_cache *nc;
- cpumask_t mask;

- mask = node_to_cpumask(node);
spin_lock_irq(&cachep->spinlock);
/* cpu is dead; no one can alloc from it. */
nc = cachep->array[cpu];
cachep->array[cpu] = NULL;
- l3 = cachep->nodelists[node];
-
- if (!l3)
- goto unlock_cache;
-
- spin_lock(&l3->list_lock);
-
- /* Free limit for this kmem_list3 */
- l3->free_limit -= cachep->batchcount;
- if (nc)
- free_block(cachep, nc->entry, nc->avail, node);
-
- if (!cpus_empty(mask)) {
- spin_unlock(&l3->list_lock);
- goto unlock_cache;
- }
-
- if (l3->shared) {
- free_block(cachep, l3->shared->entry,
- l3->shared->avail, node);
- kfree(l3->shared);
- l3->shared = NULL;
- }
- if (l3->alien) {
- drain_alien_cache(cachep, l3);
- free_alien_cache(l3->alien);
- l3->alien = NULL;
- }
-
- /* free slabs belonging to this node */
- if (__node_shrink(cachep, node)) {
- cachep->nodelists[node] = NULL;
- spin_unlock(&l3->list_lock);
- kfree(l3);
- } else {
- spin_unlock(&l3->list_lock);
- }
-unlock_cache:
+ cachep->free_limit -= cachep->batchcount;
+ free_block(cachep, ac_entry(nc), nc->avail);
spin_unlock_irq(&cachep->spinlock);
kfree(nc);
}
@@ -982,25 +769,6 @@

static struct notifier_block cpucache_notifier = { &cpuup_callback, NULL, 0 };

-/*
- * swap the static kmem_list3 with kmalloced memory
- */
-static void init_list(kmem_cache_t *cachep, struct kmem_list3 *list,
- int nodeid)
-{
- struct kmem_list3 *ptr;
-
- BUG_ON(cachep->nodelists[nodeid] != list);
- ptr = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, nodeid);
- BUG_ON(!ptr);
-
- local_irq_disable();
- memcpy(ptr, list, sizeof(struct kmem_list3));
- MAKE_ALL_LISTS(cachep, ptr, nodeid);
- cachep->nodelists[nodeid] = ptr;
- local_irq_enable();
-}
-
/* Initialisation.
* Called after the gfp() functions have been enabled, and before smp_init().
*/
@@ -1009,13 +777,6 @@
size_t left_over;
struct cache_sizes *sizes;
struct cache_names *names;
- int i;
-
- for (i = 0; i < NUM_INIT_LISTS; i++) {
- kmem_list3_init(&initkmem_list3[i]);
- if (i < MAX_NUMNODES)
- cache_cache.nodelists[i] = NULL;
- }

/*
* Fragmentation resistance on low memory - only use bigger
@@ -1024,24 +785,21 @@
if (num_physpages > (32 << 20) >> PAGE_SHIFT)
slab_break_gfp_order = BREAK_GFP_ORDER_HI;

+
/* Bootstrap is tricky, because several objects are allocated
* from caches that do not exist yet:
* 1) initialize the cache_cache cache: it contains the kmem_cache_t
* structures of all caches, except cache_cache itself: cache_cache
* is statically allocated.
- * Initially an __init data area is used for the head array and the
- * kmem_list3 structures, it's replaced with a kmalloc allocated
- * array at the end of the bootstrap.
+ * Initially an __init data area is used for the head array, it's
+ * replaced with a kmalloc allocated array at the end of the bootstrap.
* 2) Create the first kmalloc cache.
- * The kmem_cache_t for the new cache is allocated normally.
- * An __init data area is used for the head array.
- * 3) Create the remaining kmalloc caches, with minimally sized
- * head arrays.
+ * The kmem_cache_t for the new cache is allocated normally. An __init
+ * data area is used for the head array.
+ * 3) Create the remaining kmalloc caches, with minimally sized head arrays.
* 4) Replace the __init data head arrays for cache_cache and the first
* kmalloc cache with kmalloc allocated arrays.
- * 5) Replace the __init data for kmem_list3 for cache_cache and
- * the other cache's with kmalloc allocated memory.
- * 6) Resize the head arrays of the kmalloc caches to their final sizes.
+ * 5) Resize the head arrays of the kmalloc caches to their final sizes.
*/

/* 1) create the cache_cache */
@@ -1050,7 +808,6 @@
list_add(&cache_cache.next, &cache_chain);
cache_cache.colour_off = cache_line_size();
cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
- cache_cache.nodelists[numa_node_id()] = &initkmem_list3[CACHE_CACHE];

cache_cache.objsize = ALIGN(cache_cache.objsize, cache_line_size());

@@ -1068,33 +825,15 @@
sizes = malloc_sizes;
names = cache_names;

- /* Initialize the caches that provide memory for the array cache
- * and the kmem_list3 structures first.
- * Without this, further allocations will bug
- */
-
- sizes[INDEX_AC].cs_cachep = kmem_cache_create(names[INDEX_AC].name,
- sizes[INDEX_AC].cs_size, ARCH_KMALLOC_MINALIGN,
- (ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
-
- if (INDEX_AC != INDEX_L3)
- sizes[INDEX_L3].cs_cachep =
- kmem_cache_create(names[INDEX_L3].name,
- sizes[INDEX_L3].cs_size, ARCH_KMALLOC_MINALIGN,
- (ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
-
while (sizes->cs_size != ULONG_MAX) {
- /*
- * For performance, all the general caches are L1 aligned.
+ /* For performance, all the general caches are L1 aligned.
* This should be particularly beneficial on SMP boxes, as it
* eliminates "false sharing".
* Note for systems short on memory removing the alignment will
- * allow tighter packing of the smaller caches.
- */
- if(!sizes->cs_cachep)
- sizes->cs_cachep = kmem_cache_create(names->name,
- sizes->cs_size, ARCH_KMALLOC_MINALIGN,
- (ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
+ * allow tighter packing of the smaller caches. */
+ sizes->cs_cachep = kmem_cache_create(names->name,
+ sizes->cs_size, ARCH_KMALLOC_MINALIGN,
+ (ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);

/* Inc off-slab bufctl limit until the ceiling is hit. */
if (!(OFF_SLAB(sizes->cs_cachep))) {
@@ -1113,47 +852,25 @@
/* 4) Replace the bootstrap head arrays */
{
void * ptr;
+ int cpu = smp_processor_id();

ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
-
- local_irq_disable();
- BUG_ON(ac_data(&cache_cache) != &initarray_cache.cache);
- memcpy(ptr, ac_data(&cache_cache),
- sizeof(struct arraycache_init));
- cache_cache.array[smp_processor_id()] = ptr;
- local_irq_enable();
+ local_irq_disable_nort();
+ BUG_ON(ac_data(&cache_cache, cpu) != &initarray_cache.cache);
+ memcpy(ptr, ac_data(&cache_cache, cpu), sizeof(struct arraycache_init));
+ cache_cache.array[cpu] = ptr;
+ local_irq_enable_nort();

ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
-
- local_irq_disable();
- BUG_ON(ac_data(malloc_sizes[INDEX_AC].cs_cachep)
- != &initarray_generic.cache);
- memcpy(ptr, ac_data(malloc_sizes[INDEX_AC].cs_cachep),
+ local_irq_disable_nort();
+ BUG_ON(ac_data(malloc_sizes[0].cs_cachep, cpu) != &initarray_generic.cache);
+ memcpy(ptr, ac_data(malloc_sizes[0].cs_cachep, cpu),
sizeof(struct arraycache_init));
- malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] =
- ptr;
- local_irq_enable();
- }
- /* 5) Replace the bootstrap kmem_list3's */
- {
- int node;
- /* Replace the static kmem_list3 structures for the boot cpu */
- init_list(&cache_cache, &initkmem_list3[CACHE_CACHE],
- numa_node_id());
-
- for_each_online_node(node) {
- init_list(malloc_sizes[INDEX_AC].cs_cachep,
- &initkmem_list3[SIZE_AC+node], node);
-
- if (INDEX_AC != INDEX_L3) {
- init_list(malloc_sizes[INDEX_L3].cs_cachep,
- &initkmem_list3[SIZE_L3+node],
- node);
- }
- }
+ malloc_sizes[0].cs_cachep->array[cpu] = ptr;
+ local_irq_enable_nort();
}

- /* 6) resize the head arrays to their final sizes */
+ /* 5) resize the head arrays to their final sizes */
{
kmem_cache_t *cachep;
down(&cache_chain_sem);
@@ -1170,6 +887,7 @@
*/
register_cpu_notifier(&cpucache_notifier);

+
/* The reap timers are started later, with a module init call:
* That part of the kernel is not yet operational.
*/
@@ -1183,8 +901,10 @@
* Register the timers that return unneeded
* pages to gfp.
*/
- for_each_online_cpu(cpu)
- start_cpu_timer(cpu);
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (cpu_online(cpu))
+ start_cpu_timer(cpu);
+ }

return 0;
}
@@ -1198,7 +918,7 @@
* did not request dmaable memory, we might get it, but that
* would be relatively rare and ignorable.
*/
-static void *kmem_getpages(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
{
struct page *page;
void *addr;
@@ -1268,7 +988,7 @@

*addr++=0x12345678;
*addr++=caller;
- *addr++=smp_processor_id();
+ *addr++=raw_smp_processor_id();
size -= 3*sizeof(unsigned long);
{
unsigned long *sptr = &caller;
@@ -1459,20 +1179,6 @@
}
}

-/* For setting up all the kmem_list3s for cache whose objsize is same
- as size of kmem_list3. */
-static inline void set_up_list3s(kmem_cache_t *cachep, int index)
-{
- int node;
-
- for_each_online_node(node) {
- cachep->nodelists[node] = &initkmem_list3[index+node];
- cachep->nodelists[node]->next_reap = jiffies +
- REAPTIMEOUT_LIST3 +
- ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
- }
-}
-
/**
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
@@ -1514,6 +1220,7 @@
size_t left_over, slab_size, ralign;
kmem_cache_t *cachep = NULL;
struct list_head *p;
+ int cpu = raw_smp_processor_id();

/*
* Sanity checks... these are all serious usage bugs.
@@ -1656,7 +1363,7 @@
size += BYTES_PER_WORD;
}
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
- if (size >= malloc_sizes[INDEX_L3+1].cs_size && cachep->reallen > cache_line_size() && size < PAGE_SIZE) {
+ if (size > 128 && cachep->reallen > cache_line_size() && size < PAGE_SIZE) {
cachep->dbghead += PAGE_SIZE - size;
size = PAGE_SIZE;
}
@@ -1758,9 +1465,13 @@
cachep->gfpflags |= GFP_DMA;
spin_lock_init(&cachep->spinlock);
cachep->objsize = size;
+ /* NUMA */
+ INIT_LIST_HEAD(&cachep->lists.slabs_full);
+ INIT_LIST_HEAD(&cachep->lists.slabs_partial);
+ INIT_LIST_HEAD(&cachep->lists.slabs_free);

if (flags & CFLGS_OFF_SLAB)
- cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
+ cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);
cachep->ctor = ctor;
cachep->dtor = dtor;
cachep->name = name;
@@ -1776,52 +1487,25 @@
* the cache that's used by kmalloc(24), otherwise
* the creation of further caches will BUG().
*/
- cachep->array[smp_processor_id()] =
- &initarray_generic.cache;
-
- /* If the cache that's used by
- * kmalloc(sizeof(kmem_list3)) is the first cache,
- * then we need to set up all its list3s, otherwise
- * the creation of further caches will BUG().
- */
- set_up_list3s(cachep, SIZE_AC);
- if (INDEX_AC == INDEX_L3)
- g_cpucache_up = PARTIAL_L3;
- else
- g_cpucache_up = PARTIAL_AC;
+ cachep->array[cpu] = &initarray_generic.cache;
+ g_cpucache_up = PARTIAL;
} else {
- cachep->array[smp_processor_id()] =
- kmalloc(sizeof(struct arraycache_init),
- GFP_KERNEL);
-
- if (g_cpucache_up == PARTIAL_AC) {
- set_up_list3s(cachep, SIZE_L3);
- g_cpucache_up = PARTIAL_L3;
- } else {
- int node;
- for_each_online_node(node) {
-
- cachep->nodelists[node] =
- kmalloc_node(sizeof(struct kmem_list3),
- GFP_KERNEL, node);
- BUG_ON(!cachep->nodelists[node]);
- kmem_list3_init(cachep->nodelists[node]);
- }
- }
+ cachep->array[cpu] = kmalloc(sizeof(struct arraycache_init),GFP_KERNEL);
}
- cachep->nodelists[numa_node_id()]->next_reap =
- jiffies + REAPTIMEOUT_LIST3 +
- ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-
- BUG_ON(!ac_data(cachep));
- ac_data(cachep)->avail = 0;
- ac_data(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
- ac_data(cachep)->batchcount = 1;
- ac_data(cachep)->touched = 0;
+ BUG_ON(!ac_data(cachep, cpu));
+ ac_data(cachep, cpu)->avail = 0;
+ ac_data(cachep, cpu)->limit = BOOT_CPUCACHE_ENTRIES;
+ ac_data(cachep, cpu)->batchcount = 1;
+ ac_data(cachep, cpu)->touched = 0;
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
+ cachep->free_limit = (1+num_online_cpus())*cachep->batchcount
+ + cachep->num;
}

+ cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
+ ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
+
/* cache setup completed, link it into the list */
list_add(&cachep->next, &cache_chain);
unlock_cpu_hotplug();
@@ -1837,35 +1521,27 @@
#if DEBUG
static void check_irq_off(void)
{
- BUG_ON(!irqs_disabled());
+#ifndef CONFIG_PREEMPT_RT
+ BUG_ON(!raw_irqs_disabled());
+#endif
}

static void check_irq_on(void)
{
- BUG_ON(irqs_disabled());
+ BUG_ON(raw_irqs_disabled());
}

static void check_spinlock_acquired(kmem_cache_t *cachep)
{
#ifdef CONFIG_SMP
check_irq_off();
- assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock);
-#endif
-}
-
-static inline void check_spinlock_acquired_node(kmem_cache_t *cachep, int node)
-{
-#ifdef CONFIG_SMP
- check_irq_off();
- assert_spin_locked(&cachep->nodelists[node]->list_lock);
+ BUG_ON(spin_trylock(&cachep->spinlock));
#endif
}
-
#else
#define check_irq_off() do { } while(0)
#define check_irq_on() do { } while(0)
#define check_spinlock_acquired(x) do { } while(0)
-#define check_spinlock_acquired_node(x, y) do { } while(0)
#endif

/*
@@ -1876,9 +1552,9 @@
check_irq_on();
preempt_disable();

- local_irq_disable();
+ raw_local_irq_disable();
func(arg);
- local_irq_enable();
+ raw_local_irq_enable();

if (smp_call_function(func, arg, 1, 1))
BUG();
@@ -1887,92 +1563,85 @@
}

static void drain_array_locked(kmem_cache_t* cachep,
- struct array_cache *ac, int force, int node);
+ struct array_cache *ac, int force);

-static void do_drain(void *arg)
+static void do_drain_cpu(kmem_cache_t *cachep, int cpu)
{
- kmem_cache_t *cachep = (kmem_cache_t*)arg;
struct array_cache *ac;
- int node = numa_node_id();

check_irq_off();
- ac = ac_data(cachep);
- spin_lock(&cachep->nodelists[node]->list_lock);
- free_block(cachep, ac->entry, ac->avail, node);
- spin_unlock(&cachep->nodelists[node]->list_lock);
+
+ spin_lock(&cachep->spinlock);
+ ac = ac_data(cachep, cpu);
+ free_block(cachep, &ac_entry(ac)[0], ac->avail);
ac->avail = 0;
+ spin_unlock(&cachep->spinlock);
}

-static void drain_cpu_caches(kmem_cache_t *cachep)
+#ifndef CONFIG_PREEMPT_RT
+/*
+ * Executes in an IRQ context:
+ */
+static void do_drain(void *arg)
{
- struct kmem_list3 *l3;
- int node;
+ do_drain_cpu((kmem_cache_t*)arg, smp_processor_id());
+}
+#endif

+static void drain_cpu_caches(kmem_cache_t *cachep)
+{
+#ifndef CONFIG_PREEMPT_RT
smp_call_function_all_cpus(do_drain, cachep);
+#else
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ do_drain_cpu(cachep, cpu);
+#endif
check_irq_on();
spin_lock_irq(&cachep->spinlock);
- for_each_online_node(node) {
- l3 = cachep->nodelists[node];
- if (l3) {
- spin_lock(&l3->list_lock);
- drain_array_locked(cachep, l3->shared, 1, node);
- spin_unlock(&l3->list_lock);
- if (l3->alien)
- drain_alien_cache(cachep, l3);
- }
- }
+ if (cachep->lists.shared)
+ drain_array_locked(cachep, cachep->lists.shared, 1);
spin_unlock_irq(&cachep->spinlock);
}

-static int __node_shrink(kmem_cache_t *cachep, int node)
+
+/* NUMA shrink all list3s */
+static int __cache_shrink(kmem_cache_t *cachep)
{
struct slab *slabp;
- struct kmem_list3 *l3 = cachep->nodelists[node];
int ret;

- for (;;) {
+ drain_cpu_caches(cachep);
+
+ check_irq_on();
+ spin_lock_irq(&cachep->spinlock);
+
+ for(;;) {
struct list_head *p;

- p = l3->slabs_free.prev;
- if (p == &l3->slabs_free)
+ p = cachep->lists.slabs_free.prev;
+ if (p == &cachep->lists.slabs_free)
break;

- slabp = list_entry(l3->slabs_free.prev, struct slab, list);
+ slabp = list_entry(cachep->lists.slabs_free.prev, struct slab, list);
#if DEBUG
if (slabp->inuse)
BUG();
#endif
list_del(&slabp->list);

- l3->free_objects -= cachep->num;
- spin_unlock_irq(&l3->list_lock);
+ cachep->lists.free_objects -= cachep->num;
+ spin_unlock_irq(&cachep->spinlock);
slab_destroy(cachep, slabp);
- spin_lock_irq(&l3->list_lock);
+ spin_lock_irq(&cachep->spinlock);
}
- ret = !list_empty(&l3->slabs_full) ||
- !list_empty(&l3->slabs_partial);
+ ret = !list_empty(&cachep->lists.slabs_full) ||
+ !list_empty(&cachep->lists.slabs_partial);
+ spin_unlock_irq(&cachep->spinlock);
return ret;
}

-static int __cache_shrink(kmem_cache_t *cachep)
-{
- int ret = 0, i = 0;
- struct kmem_list3 *l3;
-
- drain_cpu_caches(cachep);
-
- check_irq_on();
- for_each_online_node(i) {
- l3 = cachep->nodelists[i];
- if (l3) {
- spin_lock_irq(&l3->list_lock);
- ret += __node_shrink(cachep, i);
- spin_unlock_irq(&l3->list_lock);
- }
- }
- return (ret ? 1 : 0);
-}
-
/**
* kmem_cache_shrink - Shrink a cache.
* @cachep: The cache to shrink.
@@ -2009,7 +1678,6 @@
int kmem_cache_destroy(kmem_cache_t * cachep)
{
int i;
- struct kmem_list3 *l3;

if (!cachep || in_interrupt())
BUG();
@@ -2037,17 +1705,15 @@
if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
synchronize_rcu();

- for_each_online_cpu(i)
+ /* no cpu_online check required here since we clear the percpu
+ * array on cpu offline and set this to NULL.
+ */
+ for (i = 0; i < NR_CPUS; i++)
kfree(cachep->array[i]);

/* NUMA: free the list3 structures */
- for_each_online_node(i) {
- if ((l3 = cachep->nodelists[i])) {
- kfree(l3->shared);
- free_alien_cache(l3->alien);
- kfree(l3);
- }
- }
+ kfree(cachep->lists.shared);
+ cachep->lists.shared = NULL;
kmem_cache_free(&cache_cache, cachep);

unlock_cpu_hotplug();
@@ -2057,8 +1723,8 @@
EXPORT_SYMBOL(kmem_cache_destroy);

/* Get the memory for a slab management obj. */
-static struct slab* alloc_slabmgmt(kmem_cache_t *cachep, void *objp,
- int colour_off, gfp_t local_flags)
+static struct slab* alloc_slabmgmt(kmem_cache_t *cachep,
+ void *objp, int colour_off, unsigned int __nocast local_flags)
{
struct slab *slabp;

@@ -2089,7 +1755,7 @@
int i;

for (i = 0; i < cachep->num; i++) {
- void *objp = slabp->s_mem+cachep->objsize*i;
+ void* objp = slabp->s_mem+cachep->objsize*i;
#if DEBUG
/* need to poison the objs? */
if (cachep->flags & SLAB_POISON)
@@ -2159,14 +1825,13 @@
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+static int cache_grow(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
{
struct slab *slabp;
void *objp;
size_t offset;
gfp_t local_flags;
unsigned long ctor_flags;
- struct kmem_list3 *l3;

/* Be lazy and only check for valid flags here,
* keeping it out of the critical path in kmem_cache_alloc().
@@ -2198,9 +1863,8 @@

spin_unlock(&cachep->spinlock);

- check_irq_off();
if (local_flags & __GFP_WAIT)
- local_irq_enable();
+ local_irq_enable_nort();

/*
* The test for missing atomic flag is performed here, rather than
@@ -2210,9 +1874,8 @@
*/
kmem_flagcheck(cachep, flags);

- /* Get mem for the objs.
- * Attempt to allocate a physical page from 'nodeid',
- */
+
+ /* Get mem for the objs. */
if (!(objp = kmem_getpages(cachep, flags, nodeid)))
goto failed;

@@ -2220,28 +1883,26 @@
if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
goto opps1;

- slabp->nodeid = nodeid;
set_slab_attr(cachep, slabp, objp);

cache_init_objs(cachep, slabp, ctor_flags);

if (local_flags & __GFP_WAIT)
- local_irq_disable();
+ local_irq_disable_nort();
check_irq_off();
- l3 = cachep->nodelists[nodeid];
- spin_lock(&l3->list_lock);
+ spin_lock(&cachep->spinlock);

/* Make slab active. */
- list_add_tail(&slabp->list, &(l3->slabs_free));
+ list_add_tail(&slabp->list, &(list3_data(cachep)->slabs_free));
STATS_INC_GROWN(cachep);
- l3->free_objects += cachep->num;
- spin_unlock(&l3->list_lock);
+ list3_data(cachep)->free_objects += cachep->num;
+ spin_unlock(&cachep->spinlock);
return 1;
opps1:
kmem_freepages(cachep, objp);
failed:
if (local_flags & __GFP_WAIT)
- local_irq_disable();
+ local_irq_disable_nort();
return 0;
}

@@ -2341,6 +2002,7 @@
kmem_bufctl_t i;
int entries = 0;

+ check_spinlock_acquired(cachep);
/* Check slab's freelist to see if this obj is there. */
for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
entries++;
@@ -2366,14 +2028,14 @@
#define check_slabp(x,y) do { } while(0)
#endif

-static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
+static void *cache_alloc_refill(kmem_cache_t *cachep, unsigned int __nocast flags, int cpu)
{
int batchcount;
struct kmem_list3 *l3;
struct array_cache *ac;

check_irq_off();
- ac = ac_data(cachep);
+ ac = ac_data(cachep, cpu);
retry:
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2383,11 +2045,10 @@
*/
batchcount = BATCHREFILL_LIMIT;
}
- l3 = cachep->nodelists[numa_node_id()];
-
- BUG_ON(ac->avail > 0 || !l3);
- spin_lock(&l3->list_lock);
+ l3 = list3_data(cachep);

+ BUG_ON(ac->avail > 0);
+ spin_lock_nort(&cachep->spinlock);
if (l3->shared) {
struct array_cache *shared_array = l3->shared;
if (shared_array->avail) {
@@ -2395,9 +2056,8 @@
batchcount = shared_array->avail;
shared_array->avail -= batchcount;
ac->avail = batchcount;
- memcpy(ac->entry,
- &(shared_array->entry[shared_array->avail]),
- sizeof(void*)*batchcount);
+ memcpy(ac_entry(ac), &ac_entry(shared_array)[shared_array->avail],
+ sizeof(void*)*batchcount);
shared_array->touched = 1;
goto alloc_done;
}
@@ -2424,8 +2084,7 @@
STATS_SET_HIGH(cachep);

/* get obj pointer */
- ac->entry[ac->avail++] = slabp->s_mem +
- slabp->free*cachep->objsize;
+ ac_entry(ac)[ac->avail++] = slabp->s_mem + slabp->free*cachep->objsize;

slabp->inuse++;
next = slab_bufctl(slabp)[slabp->free];
@@ -2448,14 +2107,17 @@
must_grow:
l3->free_objects -= ac->avail;
alloc_done:
- spin_unlock(&l3->list_lock);
+ spin_unlock_nort(&cachep->spinlock);

if (unlikely(!ac->avail)) {
int x;
- x = cache_grow(cachep, flags, numa_node_id());
+ spin_unlock_rt(&cachep->spinlock);
+ x = cache_grow(cachep, flags, -1);

+ spin_lock_rt(&cachep->spinlock);
// cache_grow can reenable interrupts, then ac could change.
- ac = ac_data(cachep);
+ cpu = smp_processor_id_rt(cpu);
+ ac = ac_data(cachep, cpu);
if (!x && ac->avail == 0) // no objects in sight? abort
return NULL;

@@ -2463,11 +2125,11 @@
goto retry;
}
ac->touched = 1;
- return ac->entry[--ac->avail];
+ return ac_entry(ac)[--ac->avail];
}

static inline void
-cache_alloc_debugcheck_before(kmem_cache_t *cachep, gfp_t flags)
+cache_alloc_debugcheck_before(kmem_cache_t *cachep, unsigned int __nocast flags)
{
might_sleep_if(flags & __GFP_WAIT);
#if DEBUG
@@ -2478,7 +2140,7 @@
#if DEBUG
static void *
cache_alloc_debugcheck_after(kmem_cache_t *cachep,
- gfp_t flags, void *objp, void *caller)
+ unsigned int __nocast flags, void *objp, void *caller)
{
if (!objp)
return objp;
@@ -2521,118 +2183,47 @@
#define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
#endif

-static inline void *____cache_alloc(kmem_cache_t *cachep, gfp_t flags)
+
+static inline void *__cache_alloc(kmem_cache_t *cachep, unsigned int __nocast flags)
{
+ int cpu;
+ unsigned long save_flags;
void* objp;
struct array_cache *ac;

- check_irq_off();
- ac = ac_data(cachep);
+ cache_alloc_debugcheck_before(cachep, flags);
+
+ local_irq_save_nort(save_flags);
+ spin_lock_rt(&cachep->spinlock);
+ cpu = raw_smp_processor_id();
+ ac = ac_data(cachep, cpu);
if (likely(ac->avail)) {
STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
- objp = ac->entry[--ac->avail];
+ objp = ac_entry(ac)[--ac->avail];
} else {
STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ objp = cache_alloc_refill(cachep, flags, cpu);
}
+ spin_unlock_rt(&cachep->spinlock);
+ local_irq_restore_nort(save_flags);
+ objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0));
return objp;
}

-static inline void *__cache_alloc(kmem_cache_t *cachep, gfp_t flags)
-{
- unsigned long save_flags;
- void* objp;
-
- cache_alloc_debugcheck_before(cachep, flags);
-
- local_irq_save(save_flags);
- objp = ____cache_alloc(cachep, flags);
- local_irq_restore(save_flags);
- objp = cache_alloc_debugcheck_after(cachep, flags, objp,
- __builtin_return_address(0));
- prefetchw(objp);
- return objp;
-}
-
-#ifdef CONFIG_NUMA
/*
- * A interface to enable slab creation on nodeid
+ * NUMA: different approach needed if the spinlock is moved into
+ * the l3 structure
*/
-static void *__cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int nodeid)
-{
- struct list_head *entry;
- struct slab *slabp;
- struct kmem_list3 *l3;
- void *obj;
- kmem_bufctl_t next;
- int x;

- l3 = cachep->nodelists[nodeid];
- BUG_ON(!l3);
-
-retry:
- spin_lock(&l3->list_lock);
- entry = l3->slabs_partial.next;
- if (entry == &l3->slabs_partial) {
- l3->free_touched = 1;
- entry = l3->slabs_free.next;
- if (entry == &l3->slabs_free)
- goto must_grow;
- }
-
- slabp = list_entry(entry, struct slab, list);
- check_spinlock_acquired_node(cachep, nodeid);
- check_slabp(cachep, slabp);
-
- STATS_INC_NODEALLOCS(cachep);
- STATS_INC_ACTIVE(cachep);
- STATS_SET_HIGH(cachep);
-
- BUG_ON(slabp->inuse == cachep->num);
-
- /* get obj pointer */
- obj = slabp->s_mem + slabp->free*cachep->objsize;
- slabp->inuse++;
- next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
- slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-#endif
- slabp->free = next;
- check_slabp(cachep, slabp);
- l3->free_objects--;
- /* move slabp to correct slabp list: */
- list_del(&slabp->list);
-
- if (slabp->free == BUFCTL_END) {
- list_add(&slabp->list, &l3->slabs_full);
- } else {
- list_add(&slabp->list, &l3->slabs_partial);
- }
-
- spin_unlock(&l3->list_lock);
- goto done;
-
-must_grow:
- spin_unlock(&l3->list_lock);
- x = cache_grow(cachep, flags, nodeid);
-
- if (!x)
- return NULL;
-
- goto retry;
-done:
- return obj;
-}
-#endif
-
-/*
- * Caller needs to acquire correct kmem_list's list_lock
- */
-static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects, int node)
+static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects)
{
int i;
- struct kmem_list3 *l3;
+
+ check_spinlock_acquired(cachep);
+
+ /* NUMA: move add into loop */
+ cachep->lists.free_objects += nr_objects;

for (i = 0; i < nr_objects; i++) {
void *objp = objpp[i];
@@ -2640,19 +2231,16 @@
unsigned int objnr;

slabp = page_get_slab(virt_to_page(objp));
- l3 = cachep->nodelists[node];
list_del(&slabp->list);
objnr = (objp - slabp->s_mem) / cachep->objsize;
- check_spinlock_acquired_node(cachep, node);
check_slabp(cachep, slabp);
-
#if DEBUG
/* Verify that the slab belongs to the intended node */
WARN_ON(slabp->nodeid != node);

if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
- printk(KERN_ERR "slab: double free detected in cache "
- "'%s', objp %p\n", cachep->name, objp);
+ printk(KERN_ERR "slab: double free detected in cache '%s', objp %p.\n",
+ cachep->name, objp);
BUG();
}
#endif
@@ -2660,23 +2248,24 @@
slabp->free = objnr;
STATS_DEC_ACTIVE(cachep);
slabp->inuse--;
- l3->free_objects++;
check_slabp(cachep, slabp);

/* fixup slab chains */
if (slabp->inuse == 0) {
- if (l3->free_objects > l3->free_limit) {
- l3->free_objects -= cachep->num;
+ if (cachep->lists.free_objects > cachep->free_limit) {
+ cachep->lists.free_objects -= cachep->num;
slab_destroy(cachep, slabp);
} else {
- list_add(&slabp->list, &l3->slabs_free);
+ list_add(&slabp->list,
+ &list3_data_ptr(cachep, objp)->slabs_free);
}
} else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
*/
- list_add_tail(&slabp->list, &l3->slabs_partial);
+ list_add_tail(&slabp->list,
+ &list3_data_ptr(cachep, objp)->slabs_partial);
}
}
}
@@ -2684,39 +2273,36 @@
static void cache_flusharray(kmem_cache_t *cachep, struct array_cache *ac)
{
int batchcount;
- struct kmem_list3 *l3;
- int node = numa_node_id();

batchcount = ac->batchcount;
#if DEBUG
BUG_ON(!batchcount || batchcount > ac->avail);
#endif
check_irq_off();
- l3 = cachep->nodelists[node];
- spin_lock(&l3->list_lock);
- if (l3->shared) {
- struct array_cache *shared_array = l3->shared;
+ spin_lock_nort(&cachep->spinlock);
+ if (cachep->lists.shared) {
+ struct array_cache *shared_array = cachep->lists.shared;
int max = shared_array->limit-shared_array->avail;
if (max) {
if (batchcount > max)
batchcount = max;
- memcpy(&(shared_array->entry[shared_array->avail]),
- ac->entry,
+ memcpy(&ac_entry(shared_array)[shared_array->avail],
+ &ac_entry(ac)[0],
sizeof(void*)*batchcount);
shared_array->avail += batchcount;
goto free_done;
}
}

- free_block(cachep, ac->entry, batchcount, node);
+ free_block(cachep, &ac_entry(ac)[0], batchcount);
free_done:
#if STATS
{
int i = 0;
struct list_head *p;

- p = l3->slabs_free.next;
- while (p != &(l3->slabs_free)) {
+ p = list3_data(cachep)->slabs_free.next;
+ while (p != &(list3_data(cachep)->slabs_free)) {
struct slab *slabp;

slabp = list_entry(p, struct slab, list);
@@ -2728,13 +2314,12 @@
STATS_SET_FREEABLE(cachep, i);
}
#endif
- spin_unlock(&l3->list_lock);
+ spin_unlock_nort(&cachep->spinlock);
ac->avail -= batchcount;
- memmove(ac->entry, &(ac->entry[batchcount]),
+ memmove(&ac_entry(ac)[0], &ac_entry(ac)[batchcount],
sizeof(void*)*ac->avail);
}

-
/*
* __cache_free
* Release an obj back to its cache. If the obj has a constructed
@@ -2744,52 +2329,24 @@
*/
static inline void __cache_free(kmem_cache_t *cachep, void *objp)
{
- struct array_cache *ac = ac_data(cachep);
+ int cpu;
+ struct array_cache *ac;

check_irq_off();
objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));

- /* Make sure we are not freeing a object from another
- * node to the array cache on this cpu.
- */
-#ifdef CONFIG_NUMA
- {
- struct slab *slabp;
- slabp = page_get_slab(virt_to_page(objp));
- if (unlikely(slabp->nodeid != numa_node_id())) {
- struct array_cache *alien = NULL;
- int nodeid = slabp->nodeid;
- struct kmem_list3 *l3 = cachep->nodelists[numa_node_id()];
-
- STATS_INC_NODEFREES(cachep);
- if (l3->alien && l3->alien[nodeid]) {
- alien = l3->alien[nodeid];
- spin_lock(&alien->lock);
- if (unlikely(alien->avail == alien->limit))
- __drain_alien_cache(cachep,
- alien, nodeid);
- alien->entry[alien->avail++] = objp;
- spin_unlock(&alien->lock);
- } else {
- spin_lock(&(cachep->nodelists[nodeid])->
- list_lock);
- free_block(cachep, &objp, 1, nodeid);
- spin_unlock(&(cachep->nodelists[nodeid])->
- list_lock);
- }
- return;
- }
- }
-#endif
+ spin_lock_rt(&cachep->spinlock);
+ cpu = raw_smp_processor_id();
+ ac = ac_data(cachep, cpu);
if (likely(ac->avail < ac->limit)) {
STATS_INC_FREEHIT(cachep);
- ac->entry[ac->avail++] = objp;
- return;
+ ac_entry(ac)[ac->avail++] = objp;
} else {
STATS_INC_FREEMISS(cachep);
cache_flusharray(cachep, ac);
- ac->entry[ac->avail++] = objp;
+ ac_entry(ac)[ac->avail++] = objp;
}
+ spin_unlock_rt(&cachep->spinlock);
}

/**
@@ -2800,7 +2357,7 @@
* Allocate an object from this cache. The flags are only relevant
* if the cache has no available objects.
*/
-void *kmem_cache_alloc(kmem_cache_t *cachep, gfp_t flags)
+void *kmem_cache_alloc(kmem_cache_t *cachep, unsigned int __nocast flags)
{
return __cache_alloc(cachep, flags);
}
@@ -2858,37 +2415,85 @@
* Identical to kmem_cache_alloc, except that this function is slow
* and can sleep. And it will allocate memory on the given node, which
* can improve the performance for cpu bound structures.
- * New and improved: it will now make sure that the object gets
- * put on the correct node list so that there is no false sharing.
*/
-void *kmem_cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+void *kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
{
- unsigned long save_flags;
- void *ptr;
+ int loop;
+ void *objp;
+ struct slab *slabp;
+ kmem_bufctl_t next;

if (nodeid == -1)
- return __cache_alloc(cachep, flags);
+ return kmem_cache_alloc(cachep, flags);

- if (unlikely(!cachep->nodelists[nodeid])) {
- /* Fall back to __cache_alloc if we run into trouble */
- printk(KERN_WARNING "slab: not allocating in inactive node %d for cache %s\n", nodeid, cachep->name);
- return __cache_alloc(cachep,flags);
+ for (loop = 0;;loop++) {
+ struct list_head *q;
+
+ objp = NULL;
+ check_irq_on();
+ spin_lock_irq(&cachep->spinlock);
+ /* walk through all partial and empty slab and find one
+ * from the right node */
+ list_for_each(q,&cachep->lists.slabs_partial) {
+ slabp = list_entry(q, struct slab, list);
+
+ if (page_to_nid(virt_to_page(slabp->s_mem)) == nodeid ||
+ loop > 2)
+ goto got_slabp;
+ }
+ list_for_each(q, &cachep->lists.slabs_free) {
+ slabp = list_entry(q, struct slab, list);
+
+ if (page_to_nid(virt_to_page(slabp->s_mem)) == nodeid ||
+ loop > 2)
+ goto got_slabp;
+ }
+ spin_unlock_irq(&cachep->spinlock);
+
+ local_irq_disable_nort();
+ if (!cache_grow(cachep, flags, nodeid)) {
+ local_irq_enable_nort();
+ return NULL;
+ }
+ local_irq_enable_nort();
}
+got_slabp:
+ /* found one: allocate object */
+ check_slabp(cachep, slabp);
+ check_spinlock_acquired(cachep);

- cache_alloc_debugcheck_before(cachep, flags);
- local_irq_save(save_flags);
- if (nodeid == numa_node_id())
- ptr = ____cache_alloc(cachep, flags);
+ STATS_INC_ALLOCED(cachep);
+ STATS_INC_ACTIVE(cachep);
+ STATS_SET_HIGH(cachep);
+ STATS_INC_NODEALLOCS(cachep);
+
+ objp = slabp->s_mem + slabp->free*cachep->objsize;
+
+ slabp->inuse++;
+ next = slab_bufctl(slabp)[slabp->free];
+#if DEBUG
+ slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
+#endif
+ slabp->free = next;
+ check_slabp(cachep, slabp);
+
+ /* move slabp to correct slabp list: */
+ list_del(&slabp->list);
+ if (slabp->free == BUFCTL_END)
+ list_add(&slabp->list, &cachep->lists.slabs_full);
else
- ptr = __cache_alloc_node(cachep, flags, nodeid);
- local_irq_restore(save_flags);
- ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, __builtin_return_address(0));
+ list_add(&slabp->list, &cachep->lists.slabs_partial);
+
+ list3_data(cachep)->free_objects--;
+ spin_unlock_irq(&cachep->spinlock);

- return ptr;
+ objp = cache_alloc_debugcheck_after(cachep, GFP_KERNEL, objp,
+ __builtin_return_address(0));
+ return objp;
}
EXPORT_SYMBOL(kmem_cache_alloc_node);

-void *kmalloc_node(size_t size, gfp_t flags, int node)
+void *kmalloc_node(size_t size, unsigned int __nocast flags, int node)
{
kmem_cache_t *cachep;

@@ -2921,7 +2526,7 @@
* platforms. For example, on i386, it means that the memory must come
* from the first 16MB.
*/
-void *__kmalloc(size_t size, gfp_t flags)
+void *__kmalloc(size_t size, unsigned int __nocast flags)
{
kmem_cache_t *cachep;

@@ -2954,18 +2559,11 @@
if (!pdata)
return NULL;

- /*
- * Cannot use for_each_online_cpu since a cpu may come online
- * and we have no way of figuring out how to fix the array
- * that we have allocated then....
- */
- for_each_cpu(i) {
- int node = cpu_to_node(i);
-
- if (node_online(node))
- pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL, node);
- else
- pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL,
+ cpu_to_node(i));

if (!pdata->ptrs[i])
goto unwind_oom;
@@ -2999,18 +2597,31 @@
{
unsigned long flags;

- local_irq_save(flags);
+ local_irq_save_nort(flags);
__cache_free(cachep, objp);
- local_irq_restore(flags);
+ local_irq_restore_nort(flags);
}
EXPORT_SYMBOL(kmem_cache_free);

+#ifdef CONFIG_DEBUG_DEADLOCKS
+static size_t cache_size(kmem_cache_t *c)
+{
+ struct cache_sizes *csizep = malloc_sizes;
+
+ for ( ; csizep->cs_size; csizep++) {
+ if (csizep->cs_cachep == c)
+ return csizep->cs_size;
+ if (csizep->cs_dmacachep == c)
+ return csizep->cs_size;
+ }
+ return 0;
+}
+#endif
+
/**
* kfree - free previously allocated memory
* @objp: pointer returned by kmalloc.
*
- * If @objp is NULL, no operation is performed.
- *
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
@@ -3021,11 +2632,16 @@

if (unlikely(!objp))
return;
- local_irq_save(flags);
+ local_irq_save_nort(flags);
kfree_debugcheck(objp);
c = page_get_cache(virt_to_page(objp));
+#ifdef CONFIG_DEBUG_DEADLOCKS
+ if (check_no_locks_freed(objp, objp+cache_size(c)))
+ printk("slab %s[%p] (%d), obj: %p\n",
+ c->name, c, c->objsize, objp);
+#endif
__cache_free(c, (void*)objp);
- local_irq_restore(flags);
+ local_irq_restore_nort(flags);
}
EXPORT_SYMBOL(kfree);

@@ -3043,11 +2659,11 @@
int i;
struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);

- /*
- * We allocate for all cpus so we cannot use for online cpu here.
- */
- for_each_cpu(i)
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
kfree(p->ptrs[i]);
+ }
kfree(p);
}
EXPORT_SYMBOL(free_percpu);
@@ -3065,76 +2681,21 @@
}
EXPORT_SYMBOL_GPL(kmem_cache_name);

-/*
- * This initializes kmem_list3 for all nodes.
- */
-static int alloc_kmemlist(kmem_cache_t *cachep)
-{
- int node;
- struct kmem_list3 *l3;
- int err = 0;
-
- for_each_online_node(node) {
- struct array_cache *nc = NULL, *new;
- struct array_cache **new_alien = NULL;
-#ifdef CONFIG_NUMA
- if (!(new_alien = alloc_alien_cache(node, cachep->limit)))
- goto fail;
-#endif
- if (!(new = alloc_arraycache(node, (cachep->shared*
- cachep->batchcount), 0xbaadf00d)))
- goto fail;
- if ((l3 = cachep->nodelists[node])) {
-
- spin_lock_irq(&l3->list_lock);
-
- if ((nc = cachep->nodelists[node]->shared))
- free_block(cachep, nc->entry,
- nc->avail, node);
-
- l3->shared = new;
- if (!cachep->nodelists[node]->alien) {
- l3->alien = new_alien;
- new_alien = NULL;
- }
- l3->free_limit = (1 + nr_cpus_node(node))*
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&l3->list_lock);
- kfree(nc);
- free_alien_cache(new_alien);
- continue;
- }
- if (!(l3 = kmalloc_node(sizeof(struct kmem_list3),
- GFP_KERNEL, node)))
- goto fail;
-
- kmem_list3_init(l3);
- l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
- ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
- l3->shared = new;
- l3->alien = new_alien;
- l3->free_limit = (1 + nr_cpus_node(node))*
- cachep->batchcount + cachep->num;
- cachep->nodelists[node] = l3;
- }
- return err;
-fail:
- err = -ENOMEM;
- return err;
-}
-
struct ccupdate_struct {
kmem_cache_t *cachep;
struct array_cache *new[NR_CPUS];
};

+/*
+ * Executes in IRQ context:
+ */
static void do_ccupdate_local(void *info)
{
struct ccupdate_struct *new = (struct ccupdate_struct *)info;
struct array_cache *old;

check_irq_off();
- old = ac_data(new->cachep);
+ old = ac_data(new->cachep, smp_processor_id());

new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()];
new->new[smp_processor_id()] = old;
@@ -3145,14 +2706,19 @@
int shared)
{
struct ccupdate_struct new;
- int i, err;
+ struct array_cache *new_shared;
+ int i;

memset(&new.new,0,sizeof(new.new));
- for_each_online_cpu(i) {
- new.new[i] = alloc_arraycache(cpu_to_node(i), limit, batchcount);
- if (!new.new[i]) {
- for (i--; i >= 0; i--) kfree(new.new[i]);
- return -ENOMEM;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (cpu_online(i)) {
+ new.new[i] = alloc_arraycache(i, limit, batchcount);
+ if (!new.new[i]) {
+ for (i--; i >= 0; i--) kfree(new.new[i]);
+ return -ENOMEM;
+ }
+ } else {
+ new.new[i] = NULL;
}
}
new.cachep = cachep;
@@ -3163,25 +2729,31 @@
spin_lock_irq(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
- cachep->shared = shared;
+ cachep->free_limit = (1+num_online_cpus())*cachep->batchcount + cachep->num;
spin_unlock_irq(&cachep->spinlock);

- for_each_online_cpu(i) {
+ for (i = 0; i < NR_CPUS; i++) {
struct array_cache *ccold = new.new[i];
if (!ccold)
continue;
- spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock);
- free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i));
- spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock);
+ spin_lock_irq(&cachep->spinlock);
+ free_block(cachep, ac_entry(ccold), ccold->avail);
+ spin_unlock_irq(&cachep->spinlock);
kfree(ccold);
}
-
- err = alloc_kmemlist(cachep);
- if (err) {
- printk(KERN_ERR "alloc_kmemlist failed for %s, error %d.\n",
- cachep->name, -err);
- BUG();
+ new_shared = alloc_arraycache(-1, batchcount*shared, 0xbaadf00d);
+ if (new_shared) {
+ struct array_cache *old;
+
+ spin_lock_irq(&cachep->spinlock);
+ old = cachep->lists.shared;
+ cachep->lists.shared = new_shared;
+ if (old)
+ free_block(cachep, ac_entry(old), old->avail);
+ spin_unlock_irq(&cachep->spinlock);
+ kfree(old);
}
+
return 0;
}

@@ -3232,6 +2804,10 @@
if (limit > 32)
limit = 32;
#endif
+#ifdef CONFIG_PREEMPT
+ if (limit > 16)
+ limit = 16;
+#endif
err = do_tune_cpucache(cachep, limit, (limit+1)/2, shared);
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
@@ -3239,11 +2815,11 @@
}

static void drain_array_locked(kmem_cache_t *cachep,
- struct array_cache *ac, int force, int node)
+ struct array_cache *ac, int force)
{
int tofree;

- check_spinlock_acquired_node(cachep, node);
+ check_spinlock_acquired(cachep);
if (ac->touched && !force) {
ac->touched = 0;
} else if (ac->avail) {
@@ -3251,9 +2827,9 @@
if (tofree > ac->avail) {
tofree = (ac->avail+1)/2;
}
- free_block(cachep, ac->entry, tofree, node);
+ free_block(cachep, ac_entry(ac), tofree);
ac->avail -= tofree;
- memmove(ac->entry, &(ac->entry[tofree]),
+ memmove(&ac_entry(ac)[0], &ac_entry(ac)[tofree],
sizeof(void*)*ac->avail);
}
}
@@ -3272,12 +2848,14 @@
*/
static void cache_reap(void *unused)
{
+ int cpu;
struct list_head *walk;
- struct kmem_list3 *l3;

if (down_trylock(&cache_chain_sem)) {
/* Give up. Setup the next iteration. */
- schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
+next_iteration:
+ cpu = raw_smp_processor_id();
+ schedule_delayed_work(&per_cpu(reap_work, cpu), REAPTIMEOUT_CPUC + cpu);
return;
}

@@ -3294,32 +2872,28 @@

check_irq_on();

- l3 = searchp->nodelists[numa_node_id()];
- if (l3->alien)
- drain_alien_cache(searchp, l3);
- spin_lock_irq(&l3->list_lock);
+ spin_lock_irq(&searchp->spinlock);
+ cpu = raw_smp_processor_id();

- drain_array_locked(searchp, ac_data(searchp), 0,
- numa_node_id());
+ drain_array_locked(searchp, ac_data(searchp, cpu), 0);

- if (time_after(l3->next_reap, jiffies))
+ if(time_after(searchp->lists.next_reap, jiffies))
goto next_unlock;

- l3->next_reap = jiffies + REAPTIMEOUT_LIST3;
+ searchp->lists.next_reap = jiffies + REAPTIMEOUT_LIST3;

- if (l3->shared)
- drain_array_locked(searchp, l3->shared, 0,
- numa_node_id());
+ if (searchp->lists.shared)
+ drain_array_locked(searchp, searchp->lists.shared, 0);

- if (l3->free_touched) {
- l3->free_touched = 0;
+ if (searchp->lists.free_touched) {
+ searchp->lists.free_touched = 0;
goto next_unlock;
}

- tofree = (l3->free_limit+5*searchp->num-1)/(5*searchp->num);
+ tofree = (searchp->free_limit+5*searchp->num-1)/(5*searchp->num);
do {
- p = l3->slabs_free.next;
- if (p == &(l3->slabs_free))
+ p = list3_data(searchp)->slabs_free.next;
+ if (p == &(list3_data(searchp)->slabs_free))
break;

slabp = list_entry(p, struct slab, list);
@@ -3332,13 +2906,13 @@
* searchp cannot disappear, we hold
* cache_chain_lock
*/
- l3->free_objects -= searchp->num;
- spin_unlock_irq(&l3->list_lock);
+ searchp->lists.free_objects -= searchp->num;
+ spin_unlock_irq(&searchp->spinlock);
slab_destroy(searchp, slabp);
- spin_lock_irq(&l3->list_lock);
+ spin_lock_irq(&searchp->spinlock);
} while(--tofree > 0);
next_unlock:
- spin_unlock_irq(&l3->list_lock);
+ spin_unlock_irq(&searchp->spinlock);
next:
cond_resched();
}
@@ -3346,7 +2920,7 @@
up(&cache_chain_sem);
drain_remote_pages();
/* Setup the next iteration */
- schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
+ goto next_iteration;
}

#ifdef CONFIG_PROC_FS
@@ -3372,7 +2946,7 @@
seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
#if STATS
seq_puts(m, " : globalstat <listallocs> <maxobjs> <grown> <reaped>"
- " <error> <maxfreeable> <nodeallocs> <remotefrees>");
+ " <error> <maxfreeable> <freelimit> <nodeallocs>");
seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
#endif
seq_putc(m, '\n');
@@ -3407,53 +2981,39 @@
unsigned long active_objs;
unsigned long num_objs;
unsigned long active_slabs = 0;
- unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+ unsigned long num_slabs;
const char *name;
char *error = NULL;
- int node;
- struct kmem_list3 *l3;

check_irq_on();
spin_lock_irq(&cachep->spinlock);
active_objs = 0;
num_slabs = 0;
- for_each_online_node(node) {
- l3 = cachep->nodelists[node];
- if (!l3)
- continue;
-
- spin_lock(&l3->list_lock);
-
- list_for_each(q,&l3->slabs_full) {
- slabp = list_entry(q, struct slab, list);
- if (slabp->inuse != cachep->num && !error)
- error = "slabs_full accounting error";
- active_objs += cachep->num;
- active_slabs++;
- }
- list_for_each(q,&l3->slabs_partial) {
- slabp = list_entry(q, struct slab, list);
- if (slabp->inuse == cachep->num && !error)
- error = "slabs_partial inuse accounting error";
- if (!slabp->inuse && !error)
- error = "slabs_partial/inuse accounting error";
- active_objs += slabp->inuse;
- active_slabs++;
- }
- list_for_each(q,&l3->slabs_free) {
- slabp = list_entry(q, struct slab, list);
- if (slabp->inuse && !error)
- error = "slabs_free/inuse accounting error";
- num_slabs++;
- }
- free_objects += l3->free_objects;
- shared_avail += l3->shared->avail;
-
- spin_unlock(&l3->list_lock);
+ list_for_each(q,&cachep->lists.slabs_full) {
+ slabp = list_entry(q, struct slab, list);
+ if (slabp->inuse != cachep->num && !error)
+ error = "slabs_full accounting error";
+ active_objs += cachep->num;
+ active_slabs++;
+ }
+ list_for_each(q,&cachep->lists.slabs_partial) {
+ slabp = list_entry(q, struct slab, list);
+ if (slabp->inuse == cachep->num && !error)
+ error = "slabs_partial inuse accounting error";
+ if (!slabp->inuse && !error)
+ error = "slabs_partial/inuse accounting error";
+ active_objs += slabp->inuse;
+ active_slabs++;
+ }
+ list_for_each(q,&cachep->lists.slabs_free) {
+ slabp = list_entry(q, struct slab, list);
+ if (slabp->inuse && !error)
+ error = "slabs_free/inuse accounting error";
+ num_slabs++;
}
num_slabs+=active_slabs;
num_objs = num_slabs*cachep->num;
- if (num_objs - active_objs != free_objects && !error)
+ if (num_objs - active_objs != cachep->lists.free_objects && !error)
error = "free_objects accounting error";

name = cachep->name;
@@ -3465,9 +3025,9 @@
cachep->num, (1<<cachep->gfporder));
seq_printf(m, " : tunables %4u %4u %4u",
cachep->limit, cachep->batchcount,
- cachep->shared);
- seq_printf(m, " : slabdata %6lu %6lu %6lu",
- active_slabs, num_slabs, shared_avail);
+ cachep->lists.shared->limit/cachep->batchcount);
+ seq_printf(m, " : slabdata %6lu %6lu %6u",
+ active_slabs, num_slabs, cachep->lists.shared->avail);
#if STATS
{ /* list3 stats */
unsigned long high = cachep->high_mark;
@@ -3476,13 +3036,12 @@
unsigned long reaped = cachep->reaped;
unsigned long errors = cachep->errors;
unsigned long max_freeable = cachep->max_freeable;
+ unsigned long free_limit = cachep->free_limit;
unsigned long node_allocs = cachep->node_allocs;
- unsigned long node_frees = cachep->node_frees;

- seq_printf(m, " : globalstat %7lu %6lu %5lu %4lu \
- %4lu %4lu %4lu %4lu",
+ seq_printf(m, " : globalstat %7lu %6lu %5lu %4lu %4lu %4lu %4lu %4lu",
allocs, high, grown, reaped, errors,
- max_freeable, node_allocs, node_frees);
+ max_freeable, free_limit, node_allocs);
}
/* cpu stats */
{
@@ -3561,10 +3120,9 @@
batchcount < 1 ||
batchcount > limit ||
shared < 0) {
- res = 0;
+ res = -EINVAL;
} else {
- res = do_tune_cpucache(cachep, limit,
- batchcount, shared);
+ res = do_tune_cpucache(cachep, limit, batchcount, shared);
}
break;
}
@@ -3576,22 +3134,18 @@
}
#endif

-/**
- * ksize - get the actual amount of memory allocated for a given object
- * @objp: Pointer to the object
- *
- * kmalloc may internally round up allocations and return more memory
- * than requested. ksize() can be used to determine the actual amount of
- * memory allocated. The caller may use this additional memory, even though
- * a smaller amount of memory was initially specified with the kmalloc call.
- * The caller must guarantee that objp points to a valid object previously
- * allocated with either kmalloc() or kmem_cache_alloc(). The object
- * must not be freed during the duration of the call.
- */
unsigned int ksize(const void *objp)
{
- if (unlikely(objp == NULL))
- return 0;
+ kmem_cache_t *c;
+ unsigned long flags;
+ unsigned int size = 0;
+
+ if (likely(objp != NULL)) {
+ local_irq_save_nort(flags);
+ c = page_get_cache(virt_to_page(objp));
+ size = kmem_cache_size(c);
+ local_irq_restore_nort(flags);
+ }

- return obj_reallen(page_get_cache(virt_to_page(objp)));
+ return size;
}


2005-12-18 16:08:01

by K.R. Foley

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness

Steven Rostedt wrote:
> Ingo,
>
> I ported your old changes of 2.6.14-rt22 of mm/slab.c to 2.6.15-rc5-rt2
> and tried it out. I believe that this confirms that the SLOB _is_ the
> problem in the slowness. Booting with this slab patch, gives the old
> speeds that we use to have.
>
> Now, is the solution to bring the SLOB up to par with the SLAB, or to
> make the SLAB as close to possible to the mainline (why remove NUMA?)
> and keep it for PREEMPT_RT?
>
> Below is the port of the slab changes if anyone else would like to see
> if this speeds things up for them.
>
> -- Steve
>

This drastically improves performance on my slower uniprocessor system.
2.6.15-rc5-rt2 still doesn't boot on my dual 933 box, with or without
this patch. I will try to dig into that a bit more today.

--
kr

2005-12-20 13:33:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


* Steven Rostedt <[email protected]> wrote:

> I ported your old changes of 2.6.14-rt22 of mm/slab.c to
> 2.6.15-rc5-rt2 and tried it out. I believe that this confirms that
> the SLOB _is_ the problem in the slowness. Booting with this slab
> patch, gives the old speeds that we use to have.
>
> Now, is the solution to bring the SLOB up to par with the SLAB, or to
> make the SLAB as close to possible to the mainline (why remove NUMA?)
> and keep it for PREEMPT_RT?
>
> Below is the port of the slab changes if anyone else would like to see
> if this speeds things up for them.

ok, i've added this back in - but we really need a cleaner port of SLAB
...

Ingo

2005-12-20 13:38:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> > I ported your old changes of 2.6.14-rt22 of mm/slab.c to
> > 2.6.15-rc5-rt2 and tried it out. I believe that this confirms that
> > the SLOB _is_ the problem in the slowness. Booting with this slab
> > patch, gives the old speeds that we use to have.
> >
> > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > make the SLAB as close to possible to the mainline (why remove NUMA?)
> > and keep it for PREEMPT_RT?
> >
> > Below is the port of the slab changes if anyone else would like to see
> > if this speeds things up for them.
>
> ok, i've added this back in - but we really need a cleaner port of SLAB
> ...
>

Actually, how much do you want that SLOB code? For the last couple of
days I've been working on different approaches that can speed it up.
Right now I have one that takes advantage of the different caches. But
unfortunately, I'm dealing with a bad pointer somewhere that keeps
making it bug. Argh!

-- Steve

2005-12-20 14:04:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness

On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> > > > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > > > make the SLAB as close to possible to the mainline (why remove NUMA?)
> > > > and keep it for PREEMPT_RT?
> > > >
> > > > Below is the port of the slab changes if anyone else would like to see
> > > > if this speeds things up for them.
> > >
> > > ok, i've added this back in - but we really need a cleaner port of SLAB
> > > ...
> > >
> >
> > Actually, how much do you want that SLOB code? For the last couple of
> > days I've been working on different approaches that can speed it up.
> > Right now I have one that takes advantage of the different caches.
> > But unfortunately, I'm dealing with a bad pointer some where that
> > keeps making it bug. Argh!
>
> well, the SLOB is mainly about being simple and small. So as long as
> those speedups are SMP-only, they ought to be fine. The problems are
> mainly SMP related, correct?

Actually, no. My test is to do a make install over NFS of a kernel that
has already been built.

The time I'm getting for the SLAB is ~26 seconds, the time for the SLOB
is 1 minute 32 seconds. So you're looking at a >300% slowdown here. The test
bed is a UP. (I do that first before looking into SMP.)

I'm still trying to keep the SLOB simple. It's the lack of sleep that is
making it hard ;)

-- Steve

2005-12-20 13:58:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


* Steven Rostedt <[email protected]> wrote:

> > > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > > make the SLAB as close to possible to the mainline (why remove NUMA?)
> > > and keep it for PREEMPT_RT?
> > >
> > > Below is the port of the slab changes if anyone else would like to see
> > > if this speeds things up for them.
> >
> > ok, i've added this back in - but we really need a cleaner port of SLAB
> > ...
> >
>
> Actually, how much do you want that SLOB code? For the last couple of
> days I've been working on different approaches that can speed it up.
> Right now I have one that takes advantage of the different caches.
> But unfortunately, I'm dealing with a bad pointer some where that
> keeps making it bug. Argh!

well, the SLOB is mainly about being simple and small. So as long as
those speedups are SMP-only, they ought to be fine. The problems are
mainly SMP related, correct?

Ingo

2005-12-20 14:08:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> well, the SLOB is mainly about being simple and small. So as long as
> those speedups are SMP-only, they ought to be fine. The problems are
> mainly SMP related, correct?

Oh, but if this does work out, it _will_ improve SMP performance greatly!

-- Steve

2005-12-20 14:33:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness

On Tue, 2005-12-20 at 09:04 -0500, Steven Rostedt wrote:

>
> Actually, no. My test is to do a make install over NFS of a kernel that
> has already been built.
>
> The times I'm getting for the SLAB is ~26 seconds, the time for the SLOB
> is 1 minute 32 seconds. So your looking at >300% slowness here. The test
> bed is a UP. (I do that first before looking into SMP).
>
> I'm still trying to keep the SLOB simple. It's the lack of sleep that is
> making it hard ;)
>

Amazing what you see after a few hours of sleep. The bug I was looking
for all yesterday happened to be a < when it should have been a <=.

OK, it boots and runs. Now I just need to clean it up and prepare to
ship! My test is simply to run make install on a prebuilt kernel
over NFS. The test box is:

bert:/home/rostedt/work/ernie/linux-2.6.15-rc5-rt2# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 736.045
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips : 1472.74

The box that the kernel is NFS-mounted from is running 2.6.15-rc4:

rostedt@gandalf:~/work/ernie/linux-2.6.15-rc5-rt2$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 43
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
stepping : 1
cpu MHz : 2210.221
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4424.61
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 43
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
stepping : 1
cpu MHz : 2210.221
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4419.74
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp



Here's the times for "time make install": Three runs for each.

rt with slab:

run 1:
real 0m27.327s
user 0m15.151s
sys 0m3.149s

run 2:
real 0m26.952s
user 0m15.171s
sys 0m3.178s

run 3:
real 0m27.269s
user 0m15.175s
sys 0m3.226s

rt with slob (plain):

run 1:
real 1m26.845s
user 0m16.173s
sys 0m29.558s

run 2:
real 1m27.895s
user 0m16.532s
sys 0m30.460s

run 3:
real 1m25.645s
user 0m16.468s
sys 0m30.973s

rt with slob (new):

run 1:
real 0m28.740s
user 0m15.364s
sys 0m3.866s

run 2:
real 0m27.782s
user 0m15.409s
sys 0m3.885s

run 3:
real 0m27.576s
user 0m15.193s
sys 0m3.933s

As you see, the new SLOB code runs almost as fast as the SLAB code.
With some more improvements, I'm sure it can get even faster.

I'll send out the patch real soon (after I wash and dry it).

Note: After I send out the patch, I'll give it a try on SMP.

-- Steve


2005-12-20 15:07:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


* Steven Rostedt <[email protected]> wrote:

> As you see, the new SLOB code runs almost as fast as the SLAB code.
> With some more improvements, I'm sure it can get even faster.

cool, the numbers are really impressive! I'm wondering where the biggest
hit comes from - perhaps the SLOB does linear list walking when
allocating?

Ingo

2005-12-20 15:16:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> > As you see, the new SLOB code runs almost as fast as the SLAB code.
> > With some more improvements, I'm sure it can get even faster.
>
> cool, the numbers are really impressive! I'm wondering where the biggest
> hit comes from - perhaps the SLOB does linear list walking when
> allocating?
>

Yeah, I think that is the biggest hit. The SLOB does the old K&R memory
management, basically straight from the book. But it is slow and can
fragment very easily.
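
For anyone following along, the hot path boils down to a first-fit walk
of a single circular free list -- roughly like this stripped-down sketch
(not the actual slob.c code: no alignment handling, no arena growing,
no locking):

struct unit {
	int units;		/* size of this free block, in units */
	struct unit *next;	/* circular singly-linked free list */
};

static struct unit *freelist;	/* roving "slobfree"-style pointer */

static void *first_fit(int units)
{
	struct unit *prev = freelist, *cur;

	for (cur = prev->next; ; prev = cur, cur = cur->next) {
		if (cur->units >= units) {
			if (cur->units == units) {	/* exact fit: unlink */
				prev->next = cur->next;
			} else {			/* split off the remainder */
				prev->next = cur + units;
				prev->next->units = cur->units - units;
				prev->next->next = cur->next;
				cur->units = units;
			}
			freelist = prev;	/* next search resumes here */
			return cur;
		}
		if (cur == freelist)		/* wrapped around: nothing fits */
			return NULL;		/* (the real code grows the arena) */
	}
}

And every kfree() has to walk the same list again to find the
address-ordered insertion point, which is the other half of the cost.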

I have a little more to do on this patch (I don't perform the correct
cleanup in kmem_cache_destroy), but I'll send it to you anyway within
the next couple of minutes, just so you can take a look and try it out.

-- Steve

2005-12-20 15:30:13

by K.R. Foley

[permalink] [raw]
Subject: Re: 2.6.15-rc5-rt2 slowness

Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
>>>> Now, is the solution to bring the SLOB up to par with the SLAB, or to
>>>> make the SLAB as close to possible to the mainline (why remove NUMA?)
>>>> and keep it for PREEMPT_RT?
>>>>
>>>> Below is the port of the slab changes if anyone else would like to see
>>>> if this speeds things up for them.
>>> ok, i've added this back in - but we really need a cleaner port of SLAB
>>> ...
>>>
>> Actually, how much do you want that SLOB code? For the last couple of
>> days I've been working on different approaches that can speed it up.
>> Right now I have one that takes advantage of the different caches.
>> But unfortunately, I'm dealing with a bad pointer some where that
>> keeps making it bug. Argh!
>
> well, the SLOB is mainly about being simple and small. So as long as
> those speedups are SMP-only, they ought to be fine. The problems are
> mainly SMP related, correct?
>
> Ingo

No. I experienced horrible performance running the original patch with
the SLOB on my uniprocessor system vs. the patch with Steven's SLAB
patch applied on the same system. In fact I am currently running the
latter on that system now. With the original patch the system is really
unusable.

--
kr

2005-12-20 15:44:46

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH RT 02/02] SLOB - break SLOB up by caches

This patch breaks up the SLOB by caches, and also uses the mem_map
pages to find the cache descriptor.


Once again:

TODO: IMPORTANT!!!

1) I haven't cleaned up the kmem_cache_destroy yet, so every time that
happens, there's a memory leak.

2) I need to test on SMP.

-- Steve

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-2.6.15-rc5-rt2/mm/slob.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slob.c 2005-12-19 18:00:01.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slob.c 2005-12-20 10:16:39.000000000 -0500
@@ -27,6 +27,20 @@
* are allocated by calling __get_free_pages. As SLAB objects know
* their size, no separate size bookkeeping is necessary and there is
* essentially no allocation space overhead.
+ *
+ * Modified by: Steven Rostedt <[email protected]> 12/20/05
+ *
+ * Now we take advantage of the kmem_cache usage. I've removed
+ * the global slobfree, and created one for every cache.
+ *
+ * For kmalloc/kfree I've reintroduced the usage of cache_sizes,
+ * but only for sizes 32 through PAGE_SIZE >> 1 by order of 2.
+ *
+ * Having the SLOB alloc per size of the cache should speed things up
+ * greatly, not only by making the search paths smaller, but also by
+ * keeping all the caches of similar units. This way the fragmentation
+ * should not be as big of a problem.
+ *
*/

#include <linux/config.h>
@@ -37,6 +51,8 @@
#include <linux/module.h>
#include <linux/timer.h>

+#undef DEBUG_CACHE
+
struct slob_block {
int units;
struct slob_block *next;
@@ -53,17 +69,66 @@
};
typedef struct bigblock bigblock_t;

-static slob_t arena = { .next = &arena, .units = 1 };
-static slob_t *slobfree = &arena;
-static DEFINE_SPINLOCK(slob_lock);
+struct kmem_cache {
+ unsigned int size, align;
+ const char *name;
+ slob_t *slobfree;
+ slob_t arena;
+ spinlock_t lock;
+ void (*ctor)(void *, struct kmem_cache *, unsigned long);
+ void (*dtor)(void *, struct kmem_cache *, unsigned long);
+ atomic_t items;
+ unsigned int free;
+ struct list_head list;
+};

-#define __get_slob_block(b) ((unsigned long)(b) & ~(PAGE_SIZE-1))
+#define NR_SLOB_CACHES ((PAGE_SHIFT) - 5) /* 32 to PAGE_SIZE-1 by order of 2 */
+#define MAX_SLOB_CACHE_SIZE (PAGE_SIZE >> 1)

-static inline struct page *get_slob_page(const void *mem)
+static struct kmem_cache *cache_sizes[NR_SLOB_CACHES];
+static struct kmem_cache *bb_cache;
+
+static struct semaphore cache_chain_sem;
+static struct list_head cache_chain;
+
+#ifdef DEBUG_CACHE
+static void test_cache(kmem_cache_t *c)
{
- void *virt = (void*)__get_slob_block(mem);
+ slob_t *cur = c->slobfree;
+ unsigned int x = -1 >> 2;

- return virt_to_page(virt);
+ do {
+ BUG_ON(!cur->next);
+ cur = cur->next;
+ } while (cur != c->slobfree && --x);
+ BUG_ON(!x);
+}
+#else
+#define test_cache(x) do {} while(0)
+#endif
+
+/*
+ * Here we take advantage of the lru field of the pages that
+ * map to the pages we use in the SLOB. This is done similar
+ * to what is done with SLAB.
+ *
+ * The lru.next field is used to get the bigblock descriptor
+ * for large blocks larger than PAGE_SIZE >> 1.
+ *
+ * Set and retrieved by set_slob_block and get_slob_block
+ * respectively.
+ *
+ * The lru.prev field is used to find the cache descriptor
+ * for small blocks smaller than or equal to PAGE_SIZE >> 1.
+ *
+ * Set and retrieved by set_slob_ptr and get_slob_ptr
+ * respectively.
+ *
+ * The use of lru.next tells us in kmalloc that the page is large.
+ */
+static inline struct page *get_slob_page(const void *mem)
+{
+ return virt_to_page(mem);
}

static inline void zero_slob_block(const void *b)
@@ -87,18 +152,39 @@
page->lru.next = data;
}

-static void slob_free(void *b, int size);
+static inline void *get_slob_ptr(const void *b)
+{
+ struct page *page;
+ page = get_slob_page(b);
+ return page->lru.prev;
+}
+
+static inline void set_slob_ptr(const void *b, void *data)
+{
+ struct page *page;
+ page = get_slob_page(b);
+ page->lru.prev = data;
+}
+
+static void slob_free(kmem_cache_t *cachep, void *b, int size);

-static void *slob_alloc(size_t size, gfp_t gfp, int align)
+static void *slob_alloc(kmem_cache_t *cachep, gfp_t gfp, int align)
{
+ size_t size;
slob_t *prev, *cur, *aligned = 0;
- int delta = 0, units = SLOB_UNITS(size);
+ int delta = 0, units;
unsigned long flags;

- spin_lock_irqsave(&slob_lock, flags);
- prev = slobfree;
+ size = cachep->size;
+ units = SLOB_UNITS(size);
+ BUG_ON(!units);
+
+ spin_lock_irqsave(&cachep->lock, flags);
+ prev = cachep->slobfree;
for (cur = prev->next; ; prev = cur, cur = cur->next) {
if (align) {
+ while (align < SLOB_UNIT)
+ align <<= 1;
aligned = (slob_t *)ALIGN((unsigned long)cur, align);
delta = aligned - cur;
}
@@ -121,12 +207,16 @@
cur->units = units;
}

- slobfree = prev;
- spin_unlock_irqrestore(&slob_lock, flags);
+ cachep->slobfree = prev;
+ test_cache(cachep);
+ if (prev < prev->next)
+ BUG_ON(cur + cur->units > prev->next);
+ spin_unlock_irqrestore(&cachep->lock, flags);
return cur;
}
- if (cur == slobfree) {
- spin_unlock_irqrestore(&slob_lock, flags);
+ if (cur == cachep->slobfree) {
+ test_cache(cachep);
+ spin_unlock_irqrestore(&cachep->lock, flags);

if (size == PAGE_SIZE) /* trying to shrink arena? */
return 0;
@@ -136,14 +226,15 @@
return 0;

zero_slob_block(cur);
- slob_free(cur, PAGE_SIZE);
- spin_lock_irqsave(&slob_lock, flags);
- cur = slobfree;
+ set_slob_ptr(cur, cachep);
+ slob_free(cachep, cur, PAGE_SIZE);
+ spin_lock_irqsave(&cachep->lock, flags);
+ cur = cachep->slobfree;
}
}
}

-static void slob_free(void *block, int size)
+static void slob_free(kmem_cache_t *cachep, void *block, int size)
{
slob_t *cur, *b = (slob_t *)block;
unsigned long flags;
@@ -155,26 +246,29 @@
b->units = SLOB_UNITS(size);

/* Find reinsertion point */
- spin_lock_irqsave(&slob_lock, flags);
- for (cur = slobfree; !(b > cur && b < cur->next); cur = cur->next)
+ spin_lock_irqsave(&cachep->lock, flags);
+ for (cur = cachep->slobfree; !(b > cur && b < cur->next); cur = cur->next)
if (cur >= cur->next && (b > cur || b < cur->next))
break;

if (b + b->units == cur->next) {
b->units += cur->next->units;
b->next = cur->next->next;
+ BUG_ON(cur->next == &cachep->arena);
} else
b->next = cur->next;

if (cur + cur->units == b) {
cur->units += b->units;
cur->next = b->next;
+ BUG_ON(b == &cachep->arena);
} else
cur->next = b;

- slobfree = cur;
+ cachep->slobfree = cur;

- spin_unlock_irqrestore(&slob_lock, flags);
+ test_cache(cachep);
+ spin_unlock_irqrestore(&cachep->lock, flags);
}

static int FASTCALL(find_order(int size));
@@ -188,15 +282,24 @@

void *kmalloc(size_t size, gfp_t gfp)
{
- slob_t *m;
bigblock_t *bb;

- if (size < PAGE_SIZE - SLOB_UNIT) {
- m = slob_alloc(size + SLOB_UNIT, gfp, 0);
- return m ? (void *)(m + 1) : 0;
+ /*
+ * If the size is less than PAGE_SIZE >> 1 then
+ * we use the generic caches. Otherwise, we
+ * just allocate the necessary pages.
+ */
+ if (size <= MAX_SLOB_CACHE_SIZE) {
+ int i;
+ int order;
+ for (i=0, order=32; i < NR_SLOB_CACHES; i++, order <<= 1)
+ if (size <= order)
+ break;
+ BUG_ON(i == NR_SLOB_CACHES);
+ return kmem_cache_alloc(cache_sizes[i], gfp);
}

- bb = slob_alloc(sizeof(bigblock_t), gfp, 0);
+ bb = slob_alloc(bb_cache, gfp, 0);
if (!bb)
return 0;

@@ -208,7 +311,7 @@
return bb->pages;
}

- slob_free(bb, sizeof(bigblock_t));
+ slob_free(bb_cache, bb, sizeof(bigblock_t));
return 0;
}

@@ -216,19 +319,26 @@

void kfree(const void *block)
{
+ kmem_cache_t *c;
bigblock_t *bb;

if (!block)
return;

+ /*
+ * look into the page of the allocated block to
+ * see if this is a big allocation or not.
+ */
bb = get_slob_block(block);
if (bb) {
free_pages((unsigned long)block, bb->order);
- slob_free(bb, sizeof(bigblock_t));
+ slob_free(bb_cache, bb, sizeof(bigblock_t));
return;
}

- slob_free((slob_t *)block - 1, 0);
+ c = get_slob_ptr(block);
+ kmem_cache_free(c, (void *)block);
+
return;
}

@@ -237,6 +347,7 @@
unsigned int ksize(const void *block)
{
bigblock_t *bb;
+ kmem_cache_t *c;

if (!block)
return 0;
@@ -245,14 +356,16 @@
if (bb)
return PAGE_SIZE << bb->order;

- return ((slob_t *)block - 1)->units * SLOB_UNIT;
+ c = get_slob_ptr(block);
+ return c->size;
}

-struct kmem_cache {
- unsigned int size, align;
- const char *name;
- void (*ctor)(void *, struct kmem_cache *, unsigned long);
- void (*dtor)(void *, struct kmem_cache *, unsigned long);
+static slob_t cache_arena = { .next = &cache_arena, .units = 0 };
+struct kmem_cache cache_cache = {
+ .name = "cache",
+ .slobfree = &cache_cache.arena,
+ .arena = { .next = &cache_cache.arena, .units = 0 },
+ .lock = SPIN_LOCK_UNLOCKED(cache_cache.lock)
};

struct kmem_cache *kmem_cache_create(const char *name, size_t size,
@@ -261,8 +374,22 @@
void (*dtor)(void*, struct kmem_cache *, unsigned long))
{
struct kmem_cache *c;
+ void *p;
+
+ c = slob_alloc(&cache_cache, flags, 0);
+
+ memset(c, 0, sizeof(*c));

- c = slob_alloc(sizeof(struct kmem_cache), flags, 0);
+ c->size = PAGE_SIZE;
+ c->arena.next = &c->arena;
+ c->arena.units = 0;
+ c->slobfree = &c->arena;
+ atomic_set(&c->items, 0);
+ spin_lock_init(&c->lock);
+
+ p = slob_alloc(c, 0, PAGE_SIZE-1);
+ if (p)
+ free_page((unsigned long)p);

if (c) {
c->name = name;
@@ -274,6 +401,9 @@
if (c->align < align)
c->align = align;
}
+ down(&cache_chain_sem);
+ list_add_tail(&c->list, &cache_chain);
+ up(&cache_chain_sem);

return c;
}
@@ -281,7 +411,17 @@

int kmem_cache_destroy(struct kmem_cache *c)
{
- slob_free(c, sizeof(struct kmem_cache));
+ down(&cache_chain_sem);
+ list_del(&c->list);
+ up(&cache_chain_sem);
+
+ BUG_ON(atomic_read(&c->items));
+
+ /*
+ * WARNING!!! Memory leak!
+ */
+ printk("FIX ME: need to free memory\n");
+ slob_free(&cache_cache, c, sizeof(struct kmem_cache));
return 0;
}
EXPORT_SYMBOL(kmem_cache_destroy);
@@ -290,11 +430,16 @@
{
void *b;

- if (c->size < PAGE_SIZE)
- b = slob_alloc(c->size, flags, c->align);
+ atomic_inc(&c->items);
+
+ if (c->size <= MAX_SLOB_CACHE_SIZE)
+ b = slob_alloc(c, flags, c->align);
else
b = (void *)__get_free_pages(flags, find_order(c->size));

+ if (!b)
+ return 0;
+
if (c->ctor)
c->ctor(b, c, SLAB_CTOR_CONSTRUCTOR);

@@ -304,11 +449,13 @@

void kmem_cache_free(struct kmem_cache *c, void *b)
{
+ atomic_dec(&c->items);
+
if (c->dtor)
c->dtor(b, c, 0);

- if (c->size < PAGE_SIZE)
- slob_free(b, c->size);
+ if (c->size <= MAX_SLOB_CACHE_SIZE)
+ slob_free(c, b, c->size);
else
free_pages((unsigned long)b, find_order(c->size));
}
@@ -326,22 +473,62 @@
}
EXPORT_SYMBOL(kmem_cache_name);

-static struct timer_list slob_timer = TIMER_INITIALIZER(
- (void (*)(unsigned long))kmem_cache_init, 0, 0);
+static char cache_names[NR_SLOB_CACHES][15];

void kmem_cache_init(void)
{
- void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1);
+ static int done;
+ void *p;

- if (p)
- free_page((unsigned long)p);
-
- mod_timer(&slob_timer, jiffies + HZ);
+ if (!done) {
+ int i;
+ int size = 32;
+ done = 1;
+
+ init_MUTEX(&cache_chain_sem);
+ INIT_LIST_HEAD(&cache_chain);
+
+ cache_cache.size = PAGE_SIZE;
+ p = slob_alloc(&cache_cache, 0, PAGE_SIZE-1);
+ if (p)
+ free_page((unsigned long)p);
+ cache_cache.size = sizeof(struct kmem_cache);
+
+ bb_cache = kmem_cache_create("bb_cache",sizeof(bigblock_t), 0,
+ GFP_KERNEL, NULL, NULL);
+ for (i=0; i < NR_SLOB_CACHES; i++, size <<= 1)
+ cache_sizes[i] = kmem_cache_create(cache_names[i], size, 0,
+ GFP_KERNEL, NULL, NULL);
+ }
}

atomic_t slab_reclaim_pages = ATOMIC_INIT(0);
EXPORT_SYMBOL(slab_reclaim_pages);

+static void test_slob(slob_t *s)
+{
+ slob_t *p;
+ long x = 0;
+
+ for (p=s->next; p != s && x < 10000; p = p->next, x++)
+ printk(".");
+}
+
+void print_slobs(void)
+{
+ struct list_head *curr;
+
+ list_for_each(curr, &cache_chain) {
+ kmem_cache_t *c = list_entry(curr, struct kmem_cache, list);
+
+ printk("%s items:%d",
+ c->name?:"<none>",
+ atomic_read(&c->items));
+ test_slob(&c->arena);
+ printk("\n");
+ }
+}
+
#ifdef CONFIG_SMP

void *__alloc_percpu(size_t size, size_t align)


2005-12-20 15:45:14

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH RT 01/02] SLOB - remove bigblock list

This patch uses the mem_map pages to find the bigblock descriptor for
large allocations.

-- Steve

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-2.6.15-rc5-rt2/mm/slob.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slob.c 2005-12-19 10:45:55.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slob.c 2005-12-19 14:12:08.000000000 -0500
@@ -50,15 +50,42 @@
struct bigblock {
int order;
void *pages;
- struct bigblock *next;
};
typedef struct bigblock bigblock_t;

static slob_t arena = { .next = &arena, .units = 1 };
static slob_t *slobfree = &arena;
-static bigblock_t *bigblocks;
static DEFINE_SPINLOCK(slob_lock);
-static DEFINE_SPINLOCK(block_lock);
+
+#define __get_slob_block(b) ((unsigned long)(b) & ~(PAGE_SIZE-1))
+
+static inline struct page *get_slob_page(const void *mem)
+{
+ void *virt = (void*)__get_slob_block(mem);
+
+ return virt_to_page(virt);
+}
+
+static inline void zero_slob_block(const void *b)
+{
+ struct page *page;
+ page = get_slob_page(b);
+ memset(&page->lru, 0, sizeof(page->lru));
+}
+
+static inline void *get_slob_block(const void *b)
+{
+ struct page *page;
+ page = get_slob_page(b);
+ return page->lru.next;
+}
+
+static inline void set_slob_block(const void *b, void *data)
+{
+ struct page *page;
+ page = get_slob_page(b);
+ page->lru.next = data;
+}

static void slob_free(void *b, int size);

@@ -108,6 +135,7 @@
if (!cur)
return 0;

+ zero_slob_block(cur);
slob_free(cur, PAGE_SIZE);
spin_lock_irqsave(&slob_lock, flags);
cur = slobfree;
@@ -162,7 +190,6 @@
{
slob_t *m;
bigblock_t *bb;
- unsigned long flags;

if (size < PAGE_SIZE - SLOB_UNIT) {
m = slob_alloc(size + SLOB_UNIT, gfp, 0);
@@ -177,10 +204,7 @@
bb->pages = (void *)__get_free_pages(gfp, bb->order);

if (bb->pages) {
- spin_lock_irqsave(&block_lock, flags);
- bb->next = bigblocks;
- bigblocks = bb;
- spin_unlock_irqrestore(&block_lock, flags);
+ set_slob_block(bb->pages, bb);
return bb->pages;
}

@@ -192,25 +216,16 @@

void kfree(const void *block)
{
- bigblock_t *bb, **last = &bigblocks;
- unsigned long flags;
+ bigblock_t *bb;

if (!block)
return;

- if (!((unsigned long)block & (PAGE_SIZE-1))) {
- /* might be on the big block list */
- spin_lock_irqsave(&block_lock, flags);
- for (bb = bigblocks; bb; last = &bb->next, bb = bb->next) {
- if (bb->pages == block) {
- *last = bb->next;
- spin_unlock_irqrestore(&block_lock, flags);
- free_pages((unsigned long)block, bb->order);
- slob_free(bb, sizeof(bigblock_t));
- return;
- }
- }
- spin_unlock_irqrestore(&block_lock, flags);
+ bb = get_slob_block(block);
+ if (bb) {
+ free_pages((unsigned long)block, bb->order);
+ slob_free(bb, sizeof(bigblock_t));
+ return;
}

slob_free((slob_t *)block - 1, 0);
@@ -222,20 +237,13 @@
unsigned int ksize(const void *block)
{
bigblock_t *bb;
- unsigned long flags;

if (!block)
return 0;

- if (!((unsigned long)block & (PAGE_SIZE-1))) {
- spin_lock_irqsave(&block_lock, flags);
- for (bb = bigblocks; bb; bb = bb->next)
- if (bb->pages == block) {
- spin_unlock_irqrestore(&slob_lock, flags);
- return PAGE_SIZE << bb->order;
- }
- spin_unlock_irqrestore(&block_lock, flags);
- }
+ bb = get_slob_block(block);
+ if (bb)
+ return PAGE_SIZE << bb->order;

return ((slob_t *)block - 1)->units * SLOB_UNIT;
}


2005-12-20 15:44:47

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH RT 00/02] SLOB optimizations

(Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)

OK Ingo, here it is.

The old SLOB did the old K&R memory allocations.

It had a global linked list "slobfree". When it needed memory it would
search this list linearly to find the first spot that fit and then
return it. Memory was broken up into SLOB_UNITs, a unit being the number
of bytes needed to hold a slob_t.

Since the sizes of the allocations fluctuate greatly, the chance of
fragmentation was very high. This also made the search for a free
location longer, since fragmentation keeps increasing the number of
free blocks that have to be walked.

It also had one global spinlock for ALL allocations. This would
obviously kill SMP performance.

For large blocks greater than PAGE_SIZE it would just use the buddy
system. This I didn't change; in fact, I made it use the buddy system
for blocks greater than PAGE_SIZE >> 1, but I'll explain that below.

The problem with this scheme was that it kept another global linked list
to hold all of these large blocks, which meant it needed another global
spinlock to manage that list.

When any block was freed via kfree, it would first search all the big
blocks to see if it was a large allocation, and if not, it would then
search the slobfree list to find where it goes -- taking both global
spinlocks!


Here's what I've done to solve this.

First things first: the first patch gets rid of the bigblock list.
I simply used the SLAB method of using the lru field of the
corresponding page to store the pointer to the bigblock descriptor,
which has the information needed to free it. This got rid of the
bigblock linked list and its global spinlock.
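
Stripped of the helpers, the trick amounts to this (stash_bigblock()
and lookup_bigblock() are only illustrative names here -- the patch
itself uses set_slob_block() and get_slob_block()):

static inline void stash_bigblock(void *pages, bigblock_t *bb)
{
	/* remember the descriptor in the page struct of the allocation */
	virt_to_page(pages)->lru.next = (void *)bb;
}

static inline bigblock_t *lookup_bigblock(const void *block)
{
	/* kfree()/ksize() can now find it again in O(1) */
	return (bigblock_t *)virt_to_page(block)->lru.next;
}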

The next patch is the big improvement, with the largest changes. I
took advantage of the kmem_cache usage that SLAB also takes
advantage of: I created a memory pool like the global one, but for
every cache with a size less than PAGE_SIZE >> 1.

[ Note: I picked PAGE_SIZE >> 1 since it didn't seem to make much
difference for anything greater -- it would use a full page for
just one allocation anyway. I can play with this more, but it still
seems to be a waste. ]

I used lru.next of the page that the allocation comes from for the
bigblock descriptors (as described above), and now I use lru.prev to
point to the cache that the items in the pool belong to. So I removed
the need to hold a descriptor in the pool itself.

I also created general caches, like SLAB's, for kmalloc and kfree, for
sizes 32 through PAGE_SIZE >> 1. All larger allocations use the buddy
system backend.
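
The kmalloc() side then reduces to a simple size-to-bucket lookup,
roughly like the sketch below (kmalloc_cache() is only an illustrative
name, not a function in the patch):

static kmem_cache_t *kmalloc_cache(size_t size)
{
	int i;
	size_t bucket = 32;

	/* buckets are 32, 64, ..., PAGE_SIZE >> 1 by powers of two */
	for (i = 0; i < NR_SLOB_CACHES; i++, bucket <<= 1)
		if (size <= bucket)
			return cache_sizes[i];

	/* bigger than PAGE_SIZE >> 1: caller uses __get_free_pages() */
	return NULL;
}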

Tests:
=====

To test this, I used what showed the problem the most: doing a make
install over NFS. So on my 733MHz UP machine, I ran "time make install"
on the -rt kernel with the old SLAB, on the -rt kernel with the
default (old) SLOB, and then with the SLOB with these patches (three
runs each). Here are the results:

rt with slab:

run 1:
real 0m27.327s
user 0m15.151s
sys 0m3.149s

run 2:
real 0m26.952s
user 0m15.171s
sys 0m3.178s

run 3:
real 0m27.269s
user 0m15.175s
sys 0m3.226s

rt with slob (plain):

run 1:
real 1m26.845s
user 0m16.173s
sys 0m29.558s

run 2:
real 1m27.895s
user 0m16.532s
sys 0m30.460s

run 3:
real 1m25.645s
user 0m16.468s
sys 0m30.973s

rt with slob (new):

run 1:
real 0m28.740s
user 0m15.364s
sys 0m3.866s

run 2:
real 0m27.782s
user 0m15.409s
sys 0m3.885s

run 3:
real 0m27.576s
user 0m15.193s
sys 0m3.933s


So I have improved the speed of SLOB to almost that of SLAB!

TODO: IMPORTANT!!!

1) I haven't cleaned up the kmem_cache_destroy yet, so every time that
happens, there's a memory leak.

2) I need to test on SMP.

-- Steve


2005-12-20 15:56:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 10:44 -0500, Steven Rostedt wrote:
> (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
>
> OK Ingo, here it is.

I just tested it out on SMP (2x), and it boots. Ingo, do you have a good
memory test that I can do benchmarks with? Something better than my
"make install" test.

Thanks,

-- Steve


2005-12-20 15:59:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Steven Rostedt <[email protected]> wrote:

> On Tue, 2005-12-20 at 10:44 -0500, Steven Rostedt wrote:
> > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> >
> > OK Ingo, here it is.
>
> I just tested it out on SMP (2x), and it boots. Ingo, do you have a
> good memory test that I can do benchmarks with? Something better that
> my "make install" test.

networking is the most SLAB-intensive, so your test over NFS ought to be
pretty good already.

Ingo

2005-12-20 16:14:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Steven Rostedt <[email protected]> wrote:

> Tests:
> =====

could you also post the output of 'size mm/slob.o', with and without
these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE
enabled? (and with all debugging options disabled) Both the UP and the
SMP overhead would be interesting to see.

Ingo

2005-12-20 16:30:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 20 Dec 2005, Ingo Molnar wrote:
> > Tests:
> > =====
>
> could you also post the output of 'size mm/slob.o', with and without
> these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE
> enabled? (and with all debugging options disabled) Both the UP and the
> SMP overhead would be interesting to see.
>

Well, there is definitely a hit there:

rt (slob new):
size mm/slob.o
text data bss dec hex filename
2051 112 233 2396 95c mm/slob.o

without
size mm/slob.o
text data bss dec hex filename
1331 120 8 1459 5b3 mm/slob.o

rt smp (slob new)
size mm/slob.o
text data bss dec hex filename
2297 120 233 2650 a5a mm/slob.o

without
size mm/slob.o
text data bss dec hex filename
1573 140 8 1721 6b9 mm/slob.o


So, should this be a third memory management system? A fast_slob?


Just for kicks here's slab.o:

rt:
size mm/slab.o
text data bss dec hex filename
8896 556 144 9596 257c mm/slab.o

rt smp:
size mm/slab.o
text data bss dec hex filename
9679 640 84 10403 28a3 mm/slab.o

So there's still a great improvement on that (maybe not the bss though).

-- Steve

2005-12-20 16:40:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 11:29 -0500, Steven Rostedt wrote:
> On Tue, 20 Dec 2005, Ingo Molnar wrote:
> > > Tests:
> > > =====
> >
> > could you also post the output of 'size mm/slob.o', with and without
> > these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE
> > enabled? (and with all debugging options disabled) Both the UP and the
> > SMP overhead would be interesting to see.
> >

> Well, there is definitely a hit there:

There's also crap that can be removed from my patch, like the
cache_chain algorithm I started. For example, just by adding:

#if 0
static void test_slob(slob_t *s)
{

[...]

}

void print_slobs(void)
{

[...]

}
#endif

I get:
rt:
size mm/slob.o
text data bss dec hex filename
1889 112 233 2234 8ba mm/slob.o

rt smp:
size mm/slob.o
text data bss dec hex filename
2135 120 233 2488 9b8 mm/slob.o

So, I probably need to add stuff for CONFIG_CC_OPTIMIZE_FOR_SIZE.

-- Steve

>
> rt (slob new):
> size mm/slob.o
> text data bss dec hex filename
> 2051 112 233 2396 95c mm/slob.o
>
> without
> size mm/slob.o
> text data bss dec hex filename
> 1331 120 8 1459 5b3 mm/slob.o
>
> rt smp (slob new)
> size mm/slob.o
> text data bss dec hex filename
> 2297 120 233 2650 a5a mm/slob.o
>
> without
> size mm/slob.o
> text data bss dec hex filename
> 1573 140 8 1721 6b9 mm/slob.o
>
>
> So, should this be a third memory managment system? A fast_slob?
>
>
> Just for kicks here's slab.o:
>
> rt:
> size mm/slab.o
> text data bss dec hex filename
> 8896 556 144 9596 257c mm/slab.o
>
> rt smp:
> size mm/slab.o
> text data bss dec hex filename
> 9679 640 84 10403 28a3 mm/slab.o
>
> So there's still a great improvement on that (maybe not the bss though).
>
> -- Steve
>
--
Steven Rostedt
Senior Programmer
Kihon Technologies
(607)786-4830

2005-12-20 18:20:51

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
>
> OK Ingo, here it is.
>
> The old SLOB did the old K&R memory allocations.
>
> It had a global link list "slobfree". When it needed memory it would
> search this list linearly to find the first spot that fit and then
> return it. It was broken up into SLOB_UNITS which was the number of
> bytes to hold slob_t.
>
> Since the sizes of the allocations would greatly fluctuate, the chances
> for fragmentation was very high. This would also cause the looking for
> free locations to increase, since the number of free blocks would also
> increase due to the fragmentation.

On the target systems for the original SLOB design, we have less than
16MB of memory, so the linked list walking is pretty well bounded.

> It also had one global spinlock for ALL allocations. This would
> obviously kill SMP performance.

And again, the locking primarily exists for PREEMPT and small dual-core.
So I'm still a bit amused that you guys are using it for -RT.

> When any block was freed via kfree, it would first search all the big
> blocks to see if it was a large allocation, and if not, then it would
> search the slobfree list to find where it goes. Both taking two global
> spinlocks!

I don't think this is correct, or else it indicates a bug. We should only
scan the big block list when the freed block is page-aligned.

> First things first, the first patch was to get rid of the bigblock list.
> I'm simple used the method of SLAB to use the lru list field of the
> corresponding page to store the pointer to the bigblock descriptor which
> has the information to free it. This got rid of the bigblock link list
> and global spinlock.

This I like a lot. I'd like to see a size/performance measurement of
this by itself. I suspect it's an unambiguous win in both categories.

> The next patch was the big improvement, with the largest changes. I
> took advantage of how the kmem_cache usage that SLAB also takes
> advantage of. I created a memory pool like the global one, but for
> every cache with a size less then PAGE_SIZE >> 1.

Hmm. By every size, I assume you mean powers of two, which negates
some of the fine-grained allocation savings that the current SLOB provides.

[...]
> So I have improved the speed of SLOB to almost that of SLAB!

Nice.

For what it's worth, I think we really ought to consider a generalized
allocator approach like Sun's VMEM, with various removable pieces.

Currently we've got something like this:

get_free_pages boot_mem idr resource_* vmalloc ...
|
slab
|
per_cpu/node
|
kmem_cache_alloc
|
kmalloc

We could take it in a direction like this:

generic range allocator (completely agnostic)
|
optional size buckets (reduced fragmentation, O(1))
|
optional slab (cache-friendly, pre-initialized)
|
optional per cpu/node caches (cache-hot and lockless)
|
kmalloc / kmem_cache_alloc / boot_mem / idr / resource_* / vmalloc / ...

(You read that right, the top level allocator can replace all the
different allocators that hand back integers or non-overlapping ranges.)

Each user of, say, kmem_create() could then pass in flags to specify
which caching layers ought to be bypassed. IDR, for example, would
probably disable all the layers and specify a first-fit policy.

And then depending on our global size and performance requirements, we
could globally disable some layers like SLAB, buckets, or per_cpu
caches. With all the optional layers disabled, we'd end up with
something much like SLOB (but underneath get_free_page!).
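
Strawman only -- neither these flags nor this kmem_create() signature
exist anywhere today -- but the idea would read something like:

/* hypothetical layer-bypass flags, purely illustrative */
#define KM_NO_PERCPU	0x01	/* skip per-cpu/node caches */
#define KM_NO_SLAB	0x02	/* skip slab caching */
#define KM_NO_BUCKETS	0x04	/* skip the size buckets */
#define KM_FIRST_FIT	0x08	/* ask the range allocator for first-fit */

/* an IDR-style user wanting just the bare range allocator: */
pool = kmem_create("idr", size, align,
		   KM_NO_PERCPU | KM_NO_SLAB | KM_NO_BUCKETS | KM_FIRST_FIT);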

--
Mathematics is the supreme nostalgia of our time.

2005-12-20 19:15:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 12:19 -0600, Matt Mackall wrote:
> On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> >
> > OK Ingo, here it is.
> >
> > The old SLOB did the old K&R memory allocations.
> >
> > It had a global link list "slobfree". When it needed memory it would
> > search this list linearly to find the first spot that fit and then
> > return it. It was broken up into SLOB_UNITS which was the number of
> > bytes to hold slob_t.
> >
> > Since the sizes of the allocations would greatly fluctuate, the chances
> > for fragmentation was very high. This would also cause the looking for
> > free locations to increase, since the number of free blocks would also
> > increase due to the fragmentation.
>
> On the target systems for the original SLOB design, we have less than
> 16MB of memory, so the linked list walking is pretty well bounded.

I bet after a while of running, your performance will still suffer due
to fragmentation. The more fragmented it is, the more space you lose
and the more steps you need to walk.

Remember, because of the small stack, kmalloc and kfree are used an
awful lot. And if you slow those down, you will start to take a big hit
in performance.

>
> > It also had one global spinlock for ALL allocations. This would
> > obviously kill SMP performance.
>
> And again, the locking primarily exists for PREEMPT and small dual-core.
> So I'm still a bit amused that you guys are using it for -RT.

I think this is due to the complexity of the current SLAB. With slab.c
unmodified, the RT kernel doesn't boot. And it's getting more complex,
so making the proper changes to have it run under a fully preemptible
kernel takes knowing all the details of the SLAB.

Ingo can answer this better himself, but I have a feeling he jumped to
your SLOB system just because of the simplicity.

>
> > When any block was freed via kfree, it would first search all the big
> > blocks to see if it was a large allocation, and if not, then it would
> > search the slobfree list to find where it goes. Both taking two global
> > spinlocks!
>
> I don't think this is correct, or else indicates a bug. We should only
> scan the big block list when the freed block was page-aligned.

Yep, you're right here. I forgot about that, since updating the bigblock
list was the first thing I did and I didn't need that check anymore.
So I was wrong here: yours does _only_ scan the bigblock list if the
block is page-aligned.

>
> > First things first, the first patch was to get rid of the bigblock list.
> > I'm simple used the method of SLAB to use the lru list field of the
> > corresponding page to store the pointer to the bigblock descriptor which
> > has the information to free it. This got rid of the bigblock link list
> > and global spinlock.
>
> This I like a lot. I'd like to see a size/performance measurement of
> this by itself. I suspect it's an unambiguous win in both categories.

Actually, the performance gain was disappointingly small. It was a
separate patch and I thought it would gain a lot, but IIRC it only
increased the speed by a second or two (of the 1 minute 27 seconds).
That's why I spent so much time on the next approach.

>
> > The next patch was the big improvement, with the largest changes. I
> > took advantage of how the kmem_cache usage that SLAB also takes
> > advantage of. I created a memory pool like the global one, but for
> > every cache with a size less then PAGE_SIZE >> 1.
>
> Hmm. By every size, I assume you mean powers of two. Which negates
> some of the fine-grained allocation savings that current SLOB provides.

Yeah, it's the same as what the slabs use. But I would like to take
measurements of a running system with the two approaches: after a
day of heavy network traffic, see what the fragmentation is like and how
much is wasted. This would require me to finish my cache_chain work,
and to add something similar to your SLOB.

But the powers of two are only for kmalloc, which is a known
behavior of the current system. So it <should> only be used for things
that would alloc and free within a quick time (like for things you would
like to put on a stack but can't), or whose size is close to (less than or
equal to) a power of two. Otherwise a kmem_cache is made which is the size
of the expected object (off by UNIT_SIZE).

Oh, this reminds me, I probably still need to add a shrink cache
algorithm, which would be very hard to do in the current SLOB.

Also note, I don't need to save the descriptor with each kmalloc as
the current SLOB does. Since each memory pool is of a fixed size, I
again use the mem_map pages to store the location of the descriptor. So
I save on memory that way.
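
(As a concrete illustration of the power-of-two behaviour mentioned above;
round_up_pow2() is a hypothetical helper, not part of the patches:)

#include <stddef.h>

/* kmalloc-style bucketing: a request is served from the next
 * power-of-two pool, and the difference is wasted. */
static size_t round_up_pow2(size_t size)
{
        size_t bucket = 32;             /* assume 32 bytes is the smallest pool */

        while (bucket < size)
                bucket <<= 1;
        return bucket;
}

/* e.g. a 260-byte request lands in the 512-byte pool and wastes 252 bytes,
 * which is why an exact-size kmem_cache is the better fit for long-lived
 * objects of a known size. */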

>
> [...]
> > So I have improved the speed of SLOB to almost that of SLAB!
>
> Nice.
>
> For what it's worth, I think we really ought to consider a generalized
> allocator approach like Sun's VMEM, with various removable pieces.

Interesting, I don't know how Sun's VMEM works. Do you have links to
some documentation?

>
> Currently we've got something like this:
>
> get_free_pages boot_mem idr resource_* vmalloc ...
> |
> slab
> |
> per_cpu/node
> |
> kmem_cache_alloc
> |
> kmalloc
>
> We could take it in a direction like this:
>
> generic range allocator (completely agnostic)
> |
> optional size buckets (reduced fragmentation, O(1))
> |
> optional slab (cache-friendly, pre-initialized)
> |
> optional per cpu/node caches (cache-hot and lockless)
> |
> kmalloc / kmem_cache_alloc / boot_mem / idr / resource_* / vmalloc / ...
>
> (You read that right, the top level allocator can replace all the
> different allocators that hand back integers or non-overlapping ranges.)
>
> Each user of, say, kmem_create() could then pass in flags to specify
> which caching layers ought to be bypassed. IDR, for example, would
> probably disable all the layers and specify a first-fit policy.
>
> And then depending on our global size and performance requirements, we
> could globally disable some layers like SLAB, buckets, or per_cpu
> caches. With all the optional layers disabled, we'd end up with
> something much like SLOB (but underneath get_free_page!).

That looks like quite an undertaking, but may be well worth it. I think
Linux's memory management is starting to show its age. It's been
through a few transformations, and maybe it's time to go through
another. The work being done by the NUMA folks should be taken into
account, and maybe we can come up with a way that can make things easier
and less complex without losing performance.

BTW, the NUMA code in the slabs was the main killer for the RT
conversion.


Thanks for the input,

-- Steve


2005-12-20 19:45:37

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, Dec 20, 2005 at 02:15:24PM -0500, Steven Rostedt wrote:
> On Tue, 2005-12-20 at 12:19 -0600, Matt Mackall wrote:
> > On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> > > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> > >
> > > OK Ingo, here it is.
> > >
> > > The old SLOB did the old K&R memory allocations.
> > >
> > > It had a global link list "slobfree". When it needed memory it would
> > > search this list linearly to find the first spot that fit and then
> > > return it. It was broken up into SLOB_UNITS which was the number of
> > > bytes to hold slob_t.
> > >
> > > Since the sizes of the allocations would greatly fluctuate, the chances
> > > for fragmentation was very high. This would also cause the looking for
> > > free locations to increase, since the number of free blocks would also
> > > increase due to the fragmentation.
> >
> > On the target systems for the original SLOB design, we have less than
> > 16MB of memory, so the linked list walking is pretty well bounded.
>
> I bet after a while of running, your performance will still suffer due
> to fragmentation. The more fragmented it is, the more space you lose
> and the more steps you need to walk.
>
> Remember, because of the small stack, kmalloc and kfree are used an
> awful lot. And if you slow those down, you will start to take a big hit
> in performance.

True, with the exception that the improved packing may be the
difference between fitting the working set in memory and
thrashing/OOMing for some applications. Not running at all =
infinitely bad performance.

And the fragmentation is really not all that bad. Remember, Linux and
other legacy systems used similar allocators for ages.

> Ingo can answer this better himself, but I have a feeling he jumped to
> your SLOB system just because of the simplicity.

And only a config switch away..

> > This I like a lot. I'd like to see a size/performance measurement of
> > this by itself. I suspect it's an unambiguous win in both categories.
>
> Actually the performance gain was disappointingly small. As it was a
> separate patch and I though it would gain a lot. But if IIRC, it only
> increased the speed by a second or two (of the 1 minute 27 seconds).
> That's why I spent so much time in the next approach.

Still, if it's a size win, it definitely makes sense to merge.
Removing the big block list lock is also a good thing and might make a
bigger difference on SMP.

> > > The next patch was the big improvement, with the largest changes. I
> > > took advantage of how the kmem_cache usage that SLAB also takes
> > > advantage of. I created a memory pool like the global one, but for
> > > every cache with a size less then PAGE_SIZE >> 1.
> >
> > Hmm. By every size, I assume you mean powers of two. Which negates
> > some of the fine-grained allocation savings that current SLOB provides.
>
> Yeah its the same as what the slabs use. But I would like to take
> measurements of a running system between the two approaches. After a
> day of heavy network traffic, see what the fragmentation is like and how
> much is wasted. This would require me finishing my cache_chain work,
> and adding something similar to your SLOB.
>
> But the powers of two is only for the kmalloc, which this is a know
> behavior of the current system. So it <should> only be used for things
> that would alloc and free within a quick time (like for things you would
> like to put on a stack but cant), or the size is close to (less than or
> equal) a power of two. Otherwise a kmem_cache is made which is the size
> of expected object (off by UNIT_SIZE).

There are a fair number of long-lived kmalloc objects. You might try
playing with the kmalloc accounting patch in -tiny to see what's out
there.

http://www.selenic.com/repo/tiny?f=bbcd48f1d9c1;file=kmalloc-accounting.patch;style=raw

> Oh, this reminds me, I probably still need to add a shrink cache
> algorithm. Which would be very hard to do in the current SLOB.

Hmmm? It already has one.

> > For what it's worth, I think we really ought to consider a generalized
> > allocator approach like Sun's VMEM, with various removable pieces.
>
> Interesting, I don't know how Sun's VMEM works. Do you have links to
> some documentation?

http://citeseer.ist.psu.edu/bonwick01magazines.html

> That looks like quite an undertaking, but may be well worth it. I think
> Linux's memory management is starting to show it's age. It's been
> through a few transformations, and maybe it's time to go through
> another. The work being done by the NUMA folks, should be taking into
> account, and maybe we can come up with a way that can make things easier
> and less complex without losing performance.

Fortunately, it can be done completely piecemeal.

> BTW, the NUMA code in the slabs was the main killer for the RT
> conversion.

I think the VMEM scheme avoids that problem to some degree, but I
might be wrong.

--
Mathematics is the supreme nostalgia of our time.

2005-12-20 20:06:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 13:43 -0600, Matt Mackall wrote:
> >
> > I bet after a while of running, your performance will still suffer due
> > to fragmentation. The more fragmented it is, the more space you lose
> > and the more steps you need to walk.
> >
> > Remember, because of the small stack, kmalloc and kfree are used an
> > awful lot. And if you slow those down, you will start to take a big hit
> > in performance.
>
> True, with the exception that the improved packing may be the
> difference between fitting the working set in memory and
> thrashing/OOMing for some applications. Not running at all =
> infinitely bad performance.

Well, the best way to see is to try it out with real applications on
small machines. I guess I need to pull out my IBM Thinkpad 75c (32
megs, I'll need to only allocate half) and try out the two and see how
far I can push it. Unfortunately, this test may need to wait, since I
have a ton of other things to push out first.

If someone else (perhaps yourself) would like to give my patches a try,
I would really appreciate it. :)

>
> And the fragmentation is really not all that bad. Remember, Linux and
> other legacy systems used similar allocators for ages.

But the performance was greatly reduced, and the system just booted up.

>
> > Ingo can answer this better himself, but I have a feeling he jumped to
> > your SLOB system just because of the simplicity.
>
> And only a config switch away..
>
> > > This I like a lot. I'd like to see a size/performance measurement of
> > > this by itself. I suspect it's an unambiguous win in both categories.
> >
> > Actually the performance gain was disappointingly small. As it was a
> > separate patch and I though it would gain a lot. But if IIRC, it only
> > increased the speed by a second or two (of the 1 minute 27 seconds).
> > That's why I spent so much time in the next approach.
>
> Still, if it's a size win, it definitely makes sense to merge.
> Removing the big block list lock is also a good thing and might make a
> bigger difference on SMP.

Well, I guess I can check out the -mm branch and at least port the first
patch over.

>
> > > > The next patch was the big improvement, with the largest changes. I
> > > > took advantage of how the kmem_cache usage that SLAB also takes
> > > > advantage of. I created a memory pool like the global one, but for
> > > > every cache with a size less then PAGE_SIZE >> 1.
> > >
> > > Hmm. By every size, I assume you mean powers of two. Which negates
> > > some of the fine-grained allocation savings that current SLOB provides.
> >
> > Yeah its the same as what the slabs use. But I would like to take
> > measurements of a running system between the two approaches. After a
> > day of heavy network traffic, see what the fragmentation is like and how
> > much is wasted. This would require me finishing my cache_chain work,
> > and adding something similar to your SLOB.
> >
> > But the powers of two is only for the kmalloc, which this is a know
> > behavior of the current system. So it <should> only be used for things
> > that would alloc and free within a quick time (like for things you would
> > like to put on a stack but cant), or the size is close to (less than or
> > equal) a power of two. Otherwise a kmem_cache is made which is the size
> > of expected object (off by UNIT_SIZE).
>
> There are a fair number of long-lived kmalloc objects. You might try
> playing with the kmalloc accounting patch in -tiny to see what's out
> there.
>
> http://www.selenic.com/repo/tiny?f=bbcd48f1d9c1;file=kmalloc-accounting.patch;style=raw

I'll have to try this out too. Thanks for the link.
>
> > Oh, this reminds me, I probably still need to add a shrink cache
> > algorithm. Which would be very hard to do in the current SLOB.
>
> Hmmm? It already has one.

The current version in Ingo's 2.6.15-rc5-rt2 didn't have one.

>
> > > For what it's worth, I think we really ought to consider a generalized
> > > allocator approach like Sun's VMEM, with various removable pieces.
> >
> > Interesting, I don't know how Sun's VMEM works. Do you have links to
> > some documentation?
>
> http://citeseer.ist.psu.edu/bonwick01magazines.html

Thanks, I'll read up on this.

>
> > That looks like quite an undertaking, but may be well worth it. I think
> > Linux's memory management is starting to show it's age. It's been
> > through a few transformations, and maybe it's time to go through
> > another. The work being done by the NUMA folks, should be taking into
> > account, and maybe we can come up with a way that can make things easier
> > and less complex without losing performance.
>
> Fortunately, it can be done completely piecemeal.

If you would like me to test any code, I'd be happy to when I have time.
And maybe even add a few patches myself.

-- Steve


2005-12-20 20:15:53

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Hi Steve and Matt,

On 12/20/05, Steven Rostedt <[email protected]> wrote:
> That looks like quite an undertaking, but may be well worth it. I think
> Linux's memory management is starting to show it's age. It's been
> through a few transformations, and maybe it's time to go through
> another. The work being done by the NUMA folks, should be taking into
> account, and maybe we can come up with a way that can make things easier
> and less complex without losing performance.

The slab allocator is indeed complex, messy, and hard to understand.
In case you're interested, I have included a replacement I started
a while ago. It follows the design of the magazine allocator described
by Bonwick. It's not a complete replacement but should boot (well, did
anyway at some point). I have also included a user space test harness
I am using to smoke-test it.
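
(For readers who have not seen the Bonwick paper, a rough sketch of the
magazine idea the patch follows; the types below are illustrative only,
not the actual code in the attachment:)

#include <stddef.h>

#define MAG_ROUNDS 16

struct magazine {
        unsigned int rounds;            /* objects currently loaded */
        void *objs[MAG_ROUNDS];
        struct magazine *next;          /* linkage on the depot's lists */
};

struct cpu_cache {
        struct magazine *loaded;        /* per-CPU, touched without locks */
        struct magazine *previous;      /* swapped in when 'loaded' runs dry */
};

/* Fast path: a plain array pop.  Only when the loaded magazine is empty
 * (or full, on the free side) does the allocator take the depot lock to
 * exchange magazines, which is what keeps the common case cheap. */
static void *magazine_alloc(struct cpu_cache *cc)
{
        if (cc->loaded && cc->loaded->rounds)
                return cc->loaded->objs[--cc->loaded->rounds];
        return NULL;                    /* slow path: go to the depot (omitted) */
}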

If there's enough interest, I would be more than glad to help write a
replacement for mm/slab.c :-)

Pekka


Attachments:
(No filename) (0.98 kB)
magazine-slab.patch (80.51 kB)

2005-12-20 21:43:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 22:15 +0200, Pekka Enberg wrote:
> Hi Steve and Matt,
>
> On 12/20/05, Steven Rostedt <[email protected]> wrote:
> > That looks like quite an undertaking, but may be well worth it. I think
> > Linux's memory management is starting to show it's age. It's been
> > through a few transformations, and maybe it's time to go through
> > another. The work being done by the NUMA folks, should be taking into
> > account, and maybe we can come up with a way that can make things easier
> > and less complex without losing performance.
>
> The slab allocator is indeed complex, messy, and hard to understand.
> In case you're interested, I have included a replacement I started out
> a while a go. It follows the design of a magazine allocator described
> by Bonwick. It's not a complete replacement but should boot (well, did
> anyway at some point). I have also included a user space test harness
> I am using to smoke it.
>
> If there's enough interest, I would be more than glad to help write a
> replacement for mm/slab.c :-)

Hi Pekka,

What other interest have you pulled up on this? I mean, have others
shown interest in pushing something like this? Today's slab system is
starting to become like the IDE, where nobody but a select few
sado-masochists dare to venture in. (I've CC'd them ;) Perhaps it would
make the addition of NUMA easier.

Maybe putting this into RT might be a way to get it tested, and help us
with memory management under a fully preemptible kernel.

-- Steve

For those just coming in, Pekka posted this:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2


2005-12-20 21:53:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 20 Dec 2005, Steven Rostedt wrote:

> What other interest have you pulled up on this? I mean, have others
> shown interest in pushing something like this. Today's slab system is
> starting to become like the IDE where nobody, but a select few
> sado-masochis, dare to venture in. (I've CC'd them ;) Perhaps it would
> make the addition of NUMA easier.

Hmm. The basics of the SLAB allocator are rather simple.

I'd be interested in seeing an alternate approach. There is the danger
that you will end up with the same complexity as before.

> http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2

Quite a long list of unsupported features. These academic papers
usually only focus on one thing. The SLAB allocator has to work
for a variety of situations though.

It would help to explain what ultimately will be better in the new slab
allocator. The complexity could be taken care of by reorganizing the code.
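
(Christoph's opening point, that the basics are rather simple, in
miniature: stripped of constructors, colouring, per-CPU arrays, debugging
and NUMA, which is where the real complexity lives, a slab cache is just
pages carved into equal-sized objects. A hedged sketch, not the slab.c
data structures:)

#include <stddef.h>

struct mini_slab {
        struct mini_slab *next;         /* list of slabs owned by the cache */
        unsigned int inuse;             /* objects handed out from this slab */
        unsigned int free;              /* index of the first free object */
        /* followed by a free-object index array, then the objects */
};

struct mini_cache {
        size_t objsize;                 /* fixed object size of this cache */
        unsigned int num;               /* objects per slab */
        struct mini_slab *partial;      /* slabs with free objects left */
};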

2005-12-20 22:12:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Tue, 2005-12-20 at 13:52 -0800, Christoph Lameter wrote:
> On Tue, 20 Dec 2005, Steven Rostedt wrote:
>
> > What other interest have you pulled up on this? I mean, have others
> > shown interest in pushing something like this. Today's slab system is
> > starting to become like the IDE where nobody, but a select few
> > sado-masochis, dare to venture in. (I've CC'd them ;) Perhaps it would
> > make the addition of NUMA easier.
>
> Hmm. The basics of the SLAB allocator are rather simple.
>
> I'd be interested in seeing an alternate approach. There is the danger
> that you will end up with the same complexity as before.

True. I understand the basics of the SLAB allocator very well, but I
stumble over the slab.c code quite a bit. This topic came up when Ingo
replaced slab with slob in the rt patch and it killed the performance.
I introduced a cross between the slab and the slob that sped up the
system almost to that of the current slab.

Matt Mackall needs a memory management system that uses the smallest
amount of memory to handle embedded systems, and brought up the approach
referenced in the paper by Bonwick.

>
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
>
> Quite a long list of unsupported features. These academic papers
> usually only focus on one thing. The SLAB allocator has to work
> for a variety of situations though.
>
> It would help to explain what ultimately will be better in the new slab
> allocator. The complexity could be taken care of by reorganizing the code.
>

Honestly, what I would like is a simpler solution, whether we go with a
new approach or reorganize the current slab. I need to get -rt working,
and the code in slab is stretching my resources further than they can
extend. I'm capable of converting slab today as it is for RT, but it will
probably take longer than I can afford.

Yes, if we go with a new system, it would not have all the features that
the slab has today, but I can live with that, and if I'm involved in the
work as it grows, I'll understand it better. The problem is, I wasn't
involved in the evolution of slab, and I have to deal with what it grew
into, without being there to see why it does what it does today.

-- Steve


2005-12-21 02:30:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Steven Rostedt wrote:

> That looks like quite an undertaking, but may be well worth it. I think
> Linux's memory management is starting to show it's age. It's been

What do you mean by this? I.e., what parts of it are a problem, and why?

I think that replacing the buddy allocator probably wouldn't be a good
idea because it is really fast and simple for page sized allocations which
are the most common, and it is good at naturally avoiding external
fragmentation. Internal fragmentation is not much of a problem because it
is handled by slab.

I can't see how replacing the buddy allocator with a completely agnostic
range allocator could be a win at all.

Perhaps it would make more sense for bootmem, resources, vmalloc, etc. and
I guess that is what Matt is suggesting.

--
SUSE Labs, Novell Inc.


2005-12-21 02:41:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


On Wed, 21 Dec 2005, Nick Piggin wrote:

> Steven Rostedt wrote:
>
> > That looks like quite an undertaking, but may be well worth it. I think
> > Linux's memory management is starting to show it's age. It's been
>
> What do you mean by this? ie. what parts of it are a problem, and why?
>
> I think that replacing the buddy allocator probably wouldn't be a good
> idea because it is really fast and simple for page sized allocations which
> are the most common, and it is good at naturally avoiding external
> fragmentation. Internal fragmentation is not much of a problem because it
> is handled by slab.

Actually, I wasn't talking about the buddy allocator, since it is probably
the best backend allocator to have. I actually like it a lot and it
doesn't seem to have a problem.

But the slab code has gotten more complex, and probably too feature-full.
And I'm afraid that Christoph Lameter may be right, in that we could go to
another allocation scheme and, after adding all the features that the slab
has, end up just as complex.


>
> I can't see how replacing the buddy allocator with a completely agnostic
> range allocator could be a win at all.

That part I didn't agree with (replacing the buddy system I mean).

>
> Perhaps it would make more sense for bootmem, resources, vmalloc, etc. and
> I guess that is what Matt is suggesting.

I'd still add slab there, but as I said above, anything else may become
too complex. Although, playing with this Magazine thingy is starting to
look interesting!

-- Steve

2005-12-21 06:37:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Steven Rostedt <[email protected]> wrote:

> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
> >
> > Quite a long list of unsupported features. These academic papers
> > usually only focus on one thing. The SLAB allocator has to work
> > for a variety of situations though.
> >
> > It would help to explain what ultimately will be better in the new slab
> > allocator. The complexity could be taken care of by reorganizing the code.
>
> Honestly, what I would like is a simpler solution, whether we go with
> a new approach or reorganize the current slab. I need to get -rt
> working, and the code in slab is pulling my resources more than they
> can extend. I'm capable to convert slab today as it is for RT but it
> will probably take longer than I can afford.

please, let's leave the -rt tree out of the equation. The SLAB code is fine
upstream, and it was a purely practical maintenance decision to go for
SLOB in the -rt tree. Yes, the SLAB code is complex, but i could hardly
list any complexity in it that isn't justified by a performance
argument. _Maybe_ the SLAB code could be further cleaned up, maybe it
could even be replaced, but we'd have to see the patches first. In any
case, the -rt tree is not an argument that matters.

Ingo

2005-12-21 06:57:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Steven Rostedt <[email protected]> wrote:

> [...] Today's slab system is starting to become like the IDE where
> nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> them ;) [...]

while it could possibly be cleaned up a bit, it's one of the
best-optimized subsystems Linux has. Most of the "unnecessary
complexity" in SLAB is related to a performance or a debugging feature.
Many times i have looked at the SLAB code in a disassembler, right next
to profile output from some hot workload, and have concluded: 'I couldnt
do this any better even with hand-coded assembly'.

SLAB-bashing has become somewhat fashionable, but i really challenge
everyone to improve the SLAB code first (to make it more modular, easier
to read, etc.), before suggesting replacements.

the SLOB is nice because it gives us a simple option at the other end of
the complexity spectrum. The SLOB should remain there. (but it certainly
makes sense to make it faster, within certain limits, so i'm not
opposing your SLOB patches per se.)

Ingo

2005-12-21 07:17:08

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Hi Ingo,

Steven Rostedt <[email protected]> wrote:
> > [...] Today's slab system is starting to become like the IDE where
> > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> > them ;) [...]

On Wed, 21 Dec 2005, Ingo Molnar wrote:
> while it could possibly be cleaned up a bit, it's one of the
> best-optimized subsystems Linux has. Most of the "unnecessary
> complexity" in SLAB is related to a performance or a debugging feature.
> Many times i have looked at the SLAB code in a disassembler, right next
> to profile output from some hot workload, and have concluded: 'I couldnt
> do this any better even with hand-coded assembly'.
>
> SLAB-bashing has become somewhat fashionable, but i really challenge
> everyone to improve the SLAB code first (to make it more modular, easier
> to read, etc.), before suggesting replacements.

I dropped working on the replacement because I wanted to do just that. I
sent my patch only because Matt and Steve talked about writing a
replacement and thought they would be interested to see it.

I am all for gradual improvements, but after taking a stab at it, I am
starting to think rewriting would be easier, simply because the slab
allocator has been clean-up resistant for so long.

Pekka

2005-12-21 07:20:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
>
>>[...] Today's slab system is starting to become like the IDE where
>>nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
>>them ;) [...]
>
>
> while it could possibly be cleaned up a bit, it's one of the
> best-optimized subsystems Linux has. Most of the "unnecessary
> complexity" in SLAB is related to a performance or a debugging feature.
> Many times i have looked at the SLAB code in a disassembler, right next
> to profile output from some hot workload, and have concluded: 'I couldnt
> do this any better even with hand-coded assembly'.

Well, I miss a version of kmem_cache_alloc()/kmem_cache_free() that won't
play with IRQ masking.

The local_irq_save()/local_irq_restore() pair is quite expensive and could be
avoided for several caches that are exclusively used in process context.

(Not speaking of general caches of course, but caches like dentry_cache, filp,
...)
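
(A sketch of what such a process-context-only fast path could look like.
This is hypothetical: pc_pool, filp_pool and pc_alloc() are invented
names, and the per-CPU stack here merely stands in for slab's internal
array cache:)

#include <linux/percpu.h>
#include <linux/slab.h>

struct pc_pool {
        unsigned int avail;
        void *obj[16];
};

static DEFINE_PER_CPU(struct pc_pool, filp_pool);

static void *pc_alloc(kmem_cache_t *cachep)
{
        struct pc_pool *pool = &get_cpu_var(filp_pool); /* preempt_disable() only */
        void *obj = NULL;

        if (pool->avail)
                obj = pool->obj[--pool->avail];         /* lockless, no cli/sti */
        put_cpu_var(filp_pool);                         /* preempt_enable() */

        if (!obj)
                obj = kmem_cache_alloc(cachep, GFP_KERNEL); /* refill from the real slab */
        return obj;
}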

Eric

2005-12-21 07:44:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Eric Dumazet <[email protected]> wrote:

> >while it could possibly be cleaned up a bit, it's one of the
> >best-optimized subsystems Linux has. Most of the "unnecessary
> >complexity" in SLAB is related to a performance or a debugging feature.
> >Many times i have looked at the SLAB code in a disassembler, right next
> >to profile output from some hot workload, and have concluded: 'I couldnt
> >do this any better even with hand-coded assembly'.
>
> Well, I miss a version of kmem_cache_alloc()/kmem_cache_free() that
> wont play with IRQ masking.

sure, but adding this sure won't reduce complexity ;)

> The local_irq_save()/local_irq_restore() pair is quite expensive and
> could be avoided for several caches that are exclusively used in
> process context.

in any case, on sane platforms (i386, x86_64) an irq-disable is
well-optimized in hardware, and is just as fast as a preempt_disable().

Combined with the fact that CLI/STI has no register side-effects, it can
even be faster/cheaper, on x86 at least.

Ingo

2005-12-21 07:50:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Pekka J Enberg <[email protected]> wrote:

> > SLAB-bashing has become somewhat fashionable, but i really challenge
> > everyone to improve the SLAB code first (to make it more modular, easier
> > to read, etc.), before suggesting replacements.
>
> I dropped working on the replacement because I wanted to do just that.
> I sent my patch only because Matt and Steve talked about writing a
> replacement and thought they would be interested to see it.
>
> I am all for gradual improvements but after taking a stab at it, I
> starting to think rewriting would be easier, simply because the slab
> allocator has been clean-up resistant for so long.

i'd suggest to try harder, unless you think the _fundamentals_ of the
SLAB allocator are wrong. (which you are entitled to believe, but we
also have to admit that the SLAB has been around for many years, and
works pretty well)

most of the ugliness in slab.c comes from:

1) debugging. There's no easy solutions here, but it could be improved.

2) bootstrapping. Bootstrapping an allocator in a generic way is hard.
E.g. what if cache_cache gets larger than 1 page?

3) cache-footprint tricks and lockless fastpath. SLAB does things all
the right way, even that ugly memmove is the right thing. Maybe it
could be cleaned up, but the fundamental complexity will likely
remain.

Ingo

2005-12-21 08:02:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
>
>>>while it could possibly be cleaned up a bit, it's one of the
>>>best-optimized subsystems Linux has. Most of the "unnecessary
>>>complexity" in SLAB is related to a performance or a debugging feature.
>>>Many times i have looked at the SLAB code in a disassembler, right next
>>>to profile output from some hot workload, and have concluded: 'I couldnt
>>>do this any better even with hand-coded assembly'.
>>
>>Well, I miss a version of kmem_cache_alloc()/kmem_cache_free() that
>>wont play with IRQ masking.
>
>
> sure, but adding this sure wont reduce complexity ;)
>
>
>>The local_irq_save()/local_irq_restore() pair is quite expensive and
>>could be avoided for several caches that are exclusively used in
>>process context.
>
>
> in any case, on sane platforms (i386, x86_64) an irq-disable is
> well-optimized in hardware, and is just as fast as a preempt_disable().
>

I'm afraid it's not the case on current hardware.

The irq enable/disable pair accounts for more than 50% of the cpu time spent
in kmem_cache_alloc()/kmem_cache_free()/kfree().

oprofile results on a dual Opteron 246:

You can see the high profile numbers right after cli and popf(sti)
instructions, popf being VERY expensive.

CPU: Hammer, speed 1993.39 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 50000

29993 1.9317 kfree
18654 1.2014 kmem_cache_alloc
12962 0.8348 kmem_cache_free

ffffffff8015c370 <kfree>: /* kfree total: 30334 1.9335 */
770 0.0491 :ffffffff8015c370: push %rbp
2477 0.1579 :ffffffff8015c371: mov %rdi,%rbp
:ffffffff8015c374: push %rbx
:ffffffff8015c375: sub $0x8,%rsp
1792 0.1142 :ffffffff8015c379: test %rdi,%rdi
:ffffffff8015c37c: je ffffffff8015c452 <kfree+0xe2>
122 0.0078 :ffffffff8015c382: pushfq
1001 0.0638 :ffffffff8015c383: popq (%rsp)
1456 0.0928 :ffffffff8015c386: cli
2489 0.1586 :ffffffff8015c387: mov $0xffffffff7fffffff,%rax <<

...
72 0.0046 :ffffffff8015c44e: pushq (%rsp)
1080 0.0688 :ffffffff8015c451: popfq
13934 0.8882 :ffffffff8015c452: add $0x8,%rsp << HERE >>
290 0.0185 :ffffffff8015c456: pop %rbx
:ffffffff8015c457: pop %rbp
124 0.0079 :ffffffff8015c458: retq


ffffffff8015c460 <kmem_cache_free>: /* kmem_cache_free total: 13084 0.8340 */
388 0.0247 :ffffffff8015c460: sub $0x18,%rsp
365 0.0233 :ffffffff8015c464: mov %rbp,0x10(%rsp)
:ffffffff8015c469: mov %rbx,0x8(%rsp)
121 0.0077 :ffffffff8015c46e: mov %rsi,%rbp
262 0.0167 :ffffffff8015c471: pushfq
549 0.0350 :ffffffff8015c472: popq (%rsp)
351 0.0224 :ffffffff8015c475: cli
2478 0.1579 :ffffffff8015c476: mov %gs:0x34,%eax
592 0.0377 :ffffffff8015c47e: cltq
:ffffffff8015c480: mov (%rdi,%rax,8),%rbx
7 4.5e-04 :ffffffff8015c484: mov (%rbx),%eax
200 0.0127 :ffffffff8015c486: cmp 0x4(%rbx),%eax
:ffffffff8015c489: jae ffffffff8015c48f
<kmem_cache_free+0x2f>
:ffffffff8015c48b: mov %eax,%eax
766 0.0488 :ffffffff8015c48d: jmp ffffffff8015c4a0
<kmem_cache_free+0x40>
:ffffffff8015c48f: mov %rbx,%rsi
71 0.0045 :ffffffff8015c492: callq ffffffff8015c810
<cache_flusharray>
:ffffffff8015c497: mov (%rbx),%eax
1 6.4e-05 :ffffffff8015c499: data16
:ffffffff8015c49a: data16
:ffffffff8015c49b: data16
:ffffffff8015c49c: nop
:ffffffff8015c49d: data16
:ffffffff8015c49e: data16
:ffffffff8015c49f: nop
:ffffffff8015c4a0: mov %rbp,0x10(%rbx,%rax,8)
20 0.0013 :ffffffff8015c4a5: incl (%rbx)
176 0.0112 :ffffffff8015c4a7: pushq (%rsp)
7 4.5e-04 :ffffffff8015c4aa: popfq
6187 0.3944 :ffffffff8015c4ab: mov 0x8(%rsp),%rbx << HERE>>
543 0.0346 :ffffffff8015c4b0: mov 0x10(%rsp),%rbp
:ffffffff8015c4b5: add $0x18,%rsp
:ffffffff8015c4b9: retq


ffffffff8015bd70 <kmem_cache_alloc>: /* kmem_cache_alloc total: 18803 1.1985 */
549 0.0350 :ffffffff8015bd70: sub $0x8,%rsp
700 0.0446 :ffffffff8015bd74: pushfq
1427 0.0910 :ffffffff8015bd75: popq (%rsp)
226 0.0144 :ffffffff8015bd78: cli
2399 0.1529 :ffffffff8015bd79: mov %gs:0x34,%eax <<HERE>>
416 0.0265 :ffffffff8015bd81: cltq
:ffffffff8015bd83: mov (%rdi,%rax,8),%rdx
21 0.0013 :ffffffff8015bd87: mov (%rdx),%eax
172 0.0110 :ffffffff8015bd89: test %eax,%eax
:ffffffff8015bd8b: je ffffffff8015bda1
<kmem_cache_alloc+0x31>
8 5.1e-04 :ffffffff8015bd8d: dec %eax
1338 0.0853 :ffffffff8015bd8f: movl $0x1,0xc(%rdx)
9 5.7e-04 :ffffffff8015bd96: mov %eax,(%rdx)
9 5.7e-04 :ffffffff8015bd98: mov %eax,%eax
1146 0.0730 :ffffffff8015bd9a: mov 0x10(%rdx,%rax,8),%rax
4 2.5e-04 :ffffffff8015bd9f: jmp ffffffff8015bda6
<kmem_cache_alloc+0x36>
:ffffffff8015bda1: callq ffffffff8015c160
<cache_alloc_refill>
154 0.0098 :ffffffff8015bda6: pushq (%rsp)
241 0.0154 :ffffffff8015bda9: popfq
9222 0.5878 :ffffffff8015bdaa: prefetchw (%rax) <<HERE>>
758 0.0483 :ffffffff8015bdad: add $0x8,%rsp
4 2.5e-04 :ffffffff8015bdb1: retq

Eric

2005-12-21 12:51:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


On Wed, 21 Dec 2005, Ingo Molnar wrote:

>
> * Steven Rostedt <[email protected]> wrote:
>
> > > > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
> > >
> > > Quite a long list of unsupported features. These academic papers
> > > usually only focus on one thing. The SLAB allocator has to work
> > > for a variety of situations though.
> > >
> > > It would help to explain what ultimately will be better in the new slab
> > > allocator. The complexity could be taken care of by reorganizing the code.
> >
> > Honestly, what I would like is a simpler solution, whether we go with
> > a new approach or reorganize the current slab. I need to get -rt
> > working, and the code in slab is pulling my resources more than they
> > can extend. I'm capable to convert slab today as it is for RT but it
> > will probably take longer than I can afford.
>
> please, lets let the -rt tree out of the equation. The SLAB code is fine
> on upstream, and it was a pure practical maintainance decision to go for
> SLOB in the -rt tree. Yes, the SLAB code is complex, but i could hardly
> list any complexity in it tht isnt justified with a performance
> argument. _Maybe_ the SLAB code could be further cleaned up, maybe it
> could even be replaced, but we'd have to see the patches first. In any
> case, the -rt tree is not an argument that matters.

You're right about the -rt tree not being an argument for upstream. I
used it as an example of the complexities. This is not limited to -rt;
it applies to any other changes as well. Years ago I tried changing the slab to
run on a small embedded device with very little memory, and I was pretty
much overwhelmed.

Now I see that people are converting it for NUMA. I give them a lot of
credit, since they must be smarter than I ;)

-- Steve

2005-12-21 13:02:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


On Wed, 21 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> > [...] Today's slab system is starting to become like the IDE where
> > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> > them ;) [...]
>
> while it could possibly be cleaned up a bit, it's one of the
> best-optimized subsystems Linux has. Most of the "unnecessary
> complexity" in SLAB is related to a performance or a debugging feature.
> Many times i have looked at the SLAB code in a disassembler, right next
> to profile output from some hot workload, and have concluded: 'I couldnt
> do this any better even with hand-coded assembly'.

Exactly my point! The complexity of SLAB keeps it in the "I could not do
it better myself" category. This wasn't supposed to be a bash, it was
actually a compliment. But things in the "I could not do it better
myself" category are usually very hard to modify, because, unless you are
at the level of genius of those that wrote it, you may easily break it,
or drag it down to the level of "Ha, I can do this better".

>
> SLAB-bashing has become somewhat fashionable, but i really challenge
> everyone to improve the SLAB code first (to make it more modular, easier
> to read, etc.), before suggesting replacements.

I perfectly agree with this statement. As I mentioned earlier, it may
have been different if I had been a part of the changes that were made. But I
wasn't, and that leaves me the task of figuring out why things were done the
way they were done. Before changes can be made, one must have a full
understanding of why things exist as they do.

Don't get me wrong, my comments are more of a frustration with myself that
I'm having trouble understanding all that's in SLAB. I understand the
SLAB concept, but I'm having trouble understanding the current
implementation. That's _my_ problem. But I will continue to work at it,
and maybe I will be able to produce some cleanup patches once I do
understand.

>
> the SLOB is nice because it gives us a simple option at the other end of
> the complexity spectrum. The SLOB should remain there. (but it certainly
> makes sense to make it faster, within certain limits, so i'm not
> opposing your SLOB patches per se.)
>

I like the SLOB code, because it was simple enough for my mortal mind. I
actually started to play with it to get a better understanding of the way
the SLAB works. It has actually helped in that category.

-- Steve

2005-12-21 13:13:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


On Wed, 21 Dec 2005, Pekka J Enberg wrote:
> Hi Ingo,
>
> Steven Rostedt <[email protected]> wrote:
> > > [...] Today's slab system is starting to become like the IDE where
> > > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> > > them ;) [...]
>
> On Wed, 21 Dec 2005, Ingo Molnar wrote:
> > while it could possibly be cleaned up a bit, it's one of the
> > best-optimized subsystems Linux has. Most of the "unnecessary
> > complexity" in SLAB is related to a performance or a debugging feature.
> > Many times i have looked at the SLAB code in a disassembler, right next
> > to profile output from some hot workload, and have concluded: 'I couldnt
> > do this any better even with hand-coded assembly'.
> >
> > SLAB-bashing has become somewhat fashionable, but i really challenge
> > everyone to improve the SLAB code first (to make it more modular, easier
> > to read, etc.), before suggesting replacements.
>
> I dropped working on the replacement because I wanted to do just that. I
> sent my patch only because Matt and Steve talked about writing a
> replacement and thought they would be interested to see it.
>
> I am all for gradual improvements but after taking a stab at it, I
> starting to think rewriting would be easier, simply because the slab
> allocator has been clean-up resistant for so long.

And I think that what was done to SLAB is excellent. But like code I've
written, I've often thought, if I rewrote it again, I would do it cleaner
since I learned so much in doing it.

So the only way that I can feel that I can actually improve the current
system, is to write one from scratch (or start with one that is simple)
and try to make it as good as the current system. But by the time I got
it there, it would be just as complex as it is today. So only then, I
could rewrite it to be better, since I learned why things were done the
way they were, and can have that in my mind as I rewrite. So that means
writing it twice!

Unfortunately, it is probably the case that those that wrote slab.c are
too busy doing other things (or probably just don't want to), to rewite
the slab.c with the prior knowledge of what they wrote.

For the short time, I could just force myself to study the code and play
with it to see what I break, and figure out "Oh, that's whay that was
done!".

-- Steve

2005-12-21 15:34:48

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH] SLAB - have index_of bug at compile time.

Hi, after all the talk over SLAB and SLOBs I decided to make myself
useful, and I'm trying very hard to understand the Linux implementation
of SLAB. So I'm going through every line of code and examining it
thoroughly. When I find something that could be improved, whether
performance wise (highly doubtful), clean up wise, documentation wise,
or enhancement wise, or when I just have a question, I'll make myself known.

This email is enhancement wise. ;)

I noticed the code for index_of is a creative way of finding the cache
index using the compiler to optimize to a single hard coded number. But
I couldn't help noticing that it uses two methods to let you know that
someone used it wrong. One is at compile time (the correct way), and
the other is at run time (not good).

OK, this isn't really an enhancement since the code already works. But
this change can help those later who do real enhancements to SLAB.

-- Steve

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-2.6.15-rc6/mm/slab.c
===================================================================
--- linux-2.6.15-rc6.orig/mm/slab.c 2005-12-20 16:47:05.000000000 -0500
+++ linux-2.6.15-rc6/mm/slab.c 2005-12-21 10:20:03.000000000 -0500
@@ -315,6 +315,8 @@
*/
static __always_inline int index_of(const size_t size)
{
+ extern void __bad_size(void);
+
if (__builtin_constant_p(size)) {
int i = 0;

@@ -325,12 +327,9 @@
i++;
#include "linux/kmalloc_sizes.h"
#undef CACHE
- {
- extern void __bad_size(void);
- __bad_size();
- }
+ __bad_size();
} else
- BUG();
+ __bad_size();
return 0;
}
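
(The trick in a standalone form, for anyone who has not seen it before;
the names here are made up, only the technique matches the patch above.
It relies on the function being inlined with optimization enabled, which
is how the kernel is built:)

#include <stddef.h>

extern void __bad_index_size(void);     /* deliberately never defined */

static inline int lookup_index(size_t size)
{
        if (__builtin_constant_p(size)) {
                if (size <= 32)
                        return 0;
                if (size <= 64)
                        return 1;
                if (size <= 128)
                        return 2;
                __bad_index_size();     /* constant, but no matching cache */
        } else {
                __bad_index_size();     /* non-constant size: always an error */
        }
        return 0;
}

/* A correct call like lookup_index(64) folds to a constant and the call
 * to __bad_index_size() is optimized away; any misuse leaves the call in
 * place and the build fails at link time instead of BUG()ing at runtime. */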



2005-12-22 17:56:51

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Wed, 21 Dec 2005, Eric Dumazet wrote:

> Ingo Molnar wrote:
> > * Eric Dumazet <[email protected]> wrote:
> >
> > in any case, on sane platforms (i386, x86_64) an irq-disable is
> > well-optimized in hardware, and is just as fast as a preempt_disable().
> >
>
> I'm afraid its not the case on current hardware.
>
> The irq enable/disable pair count for more than 50% the cpu time spent in
> kmem_cache_alloc()/kmem_cache_free()/kfree()
>
> oprofile results on a dual Opteron 246 :
>
> You can see the high profile numbers right after cli and popf(sti)
> instructions, popf being VERY expensive.
>
> CPU: Hammer, speed 1993.39 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
> mask of 0x00 (No unit mask) count 50000
>
> 29993 1.9317 kfree
> 18654 1.2014 kmem_cache_alloc
> 12962 0.8348 kmem_cache_free
>
> ffffffff8015c370 <kfree>: /* kfree total: 30334 1.9335 */
> 770 0.0491 :ffffffff8015c370: push %rbp
> 2477 0.1579 :ffffffff8015c371: mov %rdi,%rbp
> :ffffffff8015c374: push %rbx
> :ffffffff8015c375: sub $0x8,%rsp
> 1792 0.1142 :ffffffff8015c379: test %rdi,%rdi
> :ffffffff8015c37c: je ffffffff8015c452 <kfree+0xe2>
> 122 0.0078 :ffffffff8015c382: pushfq
> 1001 0.0638 :ffffffff8015c383: popq (%rsp)
> 1456 0.0928 :ffffffff8015c386: cli
> 2489 0.1586 :ffffffff8015c387: mov $0xffffffff7fffffff,%rax <<
>
> ...
> 72 0.0046 :ffffffff8015c44e: pushq (%rsp)
> 1080 0.0688 :ffffffff8015c451: popfq
> 13934 0.8882 :ffffffff8015c452: add $0x8,%rsp << HERE >>

Isn't that due to taking an interrupt?

2005-12-22 21:12:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations


* Eric Dumazet <[email protected]> wrote:

> >in any case, on sane platforms (i386, x86_64) an irq-disable is
> >well-optimized in hardware, and is just as fast as a preempt_disable().
>
> I'm afraid its not the case on current hardware.
>
> The irq enable/disable pair count for more than 50% the cpu time spent
> in kmem_cache_alloc()/kmem_cache_free()/kfree()

because you are not using NMI based profiling?

> oprofile results on a dual Opteron 246 :
>
> You can see the high profile numbers right after cli and popf(sti)
> instructions, popf being VERY expensive.

that's just the profiling interrupt hitting them. You should not analyze
irq-safe code with a non-NMI profiling interrupt.

CLI/STI is extremely fast. (In fact in the -rt tree i'm using them
within mutexes instead of preempt_enable()/preempt_disable(), because
they are faster and generate less register side-effect.)

Ingo

2005-12-22 21:39:27

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
>>> in any case, on sane platforms (i386, x86_64) an irq-disable is
>>> well-optimized in hardware, and is just as fast as a preempt_disable().
>> I'm afraid its not the case on current hardware.
>>
>> The irq enable/disable pair count for more than 50% the cpu time spent
>> in kmem_cache_alloc()/kmem_cache_free()/kfree()
>
> because you are not using NMI based profiling?
>
>> oprofile results on a dual Opteron 246 :
>>
>> You can see the high profile numbers right after cli and popf(sti)
>> instructions, popf being VERY expensive.
>
> that's just the profiling interrupt hitting them. You should not analyze
> irq-safe code with a non-NMI profiling interrupt.
>

I'm using oprofile on Opteron, and AFAIK it's NMI based.

# grep NMI /proc/interrupts ; sleep 1 ; grep NMI /proc/interrupts
NMI: 391352095 2867983903
NMI: 391359678 2867998498

that's 7583 and 14595 NMIs / second on cpu0 and cpu1 respectively in this sample.

> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them
> within mutexes instead of preempt_enable()/preempt_disable(), because
> they are faster and generate less register side-effect.)
>
> Ingo
>
>

2005-12-22 21:47:41

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations



> that's just the profiling interrupt hitting them. You should not analyze
> irq-safe code with a non-NMI profiling interrupt.
>
> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them
> within mutexes instead of preempt_enable()/preempt_disable(), because
> they are faster and generate less register side-effect.)
>
Hm... I rather thought that the cli would cause a rather large hit on
the pipeline and certainly on OOE. Is your observation based on any
particular instruction stream? Sti, on the other hand, should be fast...
--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-12-22 22:00:20

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

George Anzinger wrote:
>
>
>> that's just the profiling interrupt hitting them. You should not
>> analyze irq-safe code with a non-NMI profiling interrupt.
>>
>> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them
>> within mutexes instead of preempt_enable()/preempt_disable(), because
>> they are faster and generate less register side-effect.)
>>
> Hm... I rather thought that the cli would cause a rather large hit on
> the pipeline and certainly on OOE. Is your observation based on any
> particular instruction stream? Sti, on the otherhand should be fast...

Just to be exact, the 'cli' is coded as 3 instructions:

pushfq
popq (%rsp)
cli

and the 'sti' is coded as 2 instructions:
pushq (%rsp)
popfq

And 'popfq' seems to be expensive, at least on Opteron machines and if
oprofile is not completely wrong...
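
(What is being described is the expansion of local_irq_save() and
local_irq_restore() on x86-64; roughly, and simplified rather than copied
from the real headers:

  #define local_irq_save(flags) \
          asm volatile("pushfq ; popq %0 ; cli" : "=g" (flags) : : "memory")
  #define local_irq_restore(flags) \
          asm volatile("pushq %0 ; popfq" : : "g" (flags) : "memory", "cc")

The popfq rewrites the whole flags register, which is why the restore side
stands out in the profile.)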

Eric

2005-12-22 22:08:19

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

Ingo Molnar wrote:
>
> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them
> within mutexes instead of preempt_enable()/preempt_disable(), because
> they are faster and generate less register side-effect.)
>

Yes, but most of my machines have a !CONFIG_PREEMPT kernel, so
preempt_enable()/preempt_disable() are empty, thus faster than CLI/STI for sure :)

Then, maybe the patch that moves 'current' into a dedicated x86_64 register may
help to lower the cost of preempt_enable()/preempt_disable() on a
CONFIG_PREEMPT kernel?

Eric

2005-12-23 19:16:49

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH RT 00/02] SLOB optimizations

On Thu, 22 Dec 2005, Eric Dumazet wrote:

> Ingo Molnar wrote:
> >
> > CLI/STI is extremely fast. (In fact in the -rt tree i'm using them within
> > mutexes instead of preempt_enable()/preempt_disable(), because they are
> > faster and generate less register side-effect.)
> >
>
> Yes, but most of my machines have a ! CONFIG_PREEMPT kernel, so
> preempt_enable()/preempt_disable() are empty, thus faster than CLI/STI for
> sure :)
>
> Then, maybe the patch that moves 'current' in a dedicated x86_64 register may
> help to lower the cost of preempt_enable()/preempt_disable() on a
> CONFIG_PREEMPT kernel ?

I'm not sure if it'll make much of a difference over:

mov %gs:offset,%reg

So 'current' already is fairly fast on x86_64.

2005-12-26 16:36:10

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] SLAB - have index_of bug at compile time.

Steven wrote:

>So I'm going through ever line of code and examining it
>thoroughly, when I find something that could be improved
>
Try to find a kfree implementation that doesn't need virt_to_page().
This is a big restriction of the current implementation: it's in the hot
path, but the operation is only fast on simple systems, not on NUMA
systems without a big memory map.
It also makes it impossible to use vmalloc memory as the basis for slabs,
or to use slab for I/O memory.

--
Manfred

2005-12-26 17:25:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] SLAB - have index_of bug at compile time.


On Mon, 26 Dec 2005, Manfred Spraul wrote:

> Steven wrote:
>
> >So I'm going through ever line of code and examining it
> >thoroughly, when I find something that could be improved
> >
> Try to find a kfree implementation that doesn't need virt_to_page().
> This is a big restriction of the current implementation: It's in the hot
> path, but the operation is only fast on simple systems, not on numa
> systems without a big memory map.
> And it makes it impossible to use vmalloc memory as the basis for slabs,
> or to use slab for io memory.
>

Unfortunately, that virt_to_page helps make kmalloc and kfree more
efficient, since it allows the mem_map pages that map to the used memory
to be used for storing information about the slab.

So removing virt_to_page means you need to store that information
alongside the memory that is mapped. Thus you need to allocate more
than is needed.
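
(The trade-off in a nutshell; a hedged sketch rather than the slab.c code,
though the 2.6.15 slab really does keep the cache pointer in page->lru:)

#include <linux/mm.h>
#include <linux/slab.h>

/* With virt_to_page(): kfree() recovers the owning cache from the
 * struct page, at zero per-object cost. */
static kmem_cache_t *cache_from_obj(const void *obj)
{
        return (kmem_cache_t *)virt_to_page(obj)->lru.next;
}

/* Without it, the information has to travel with the object itself,
 * i.e. a header paid for on every single allocation: */
struct obj_header {
        kmem_cache_t *cache;
};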

Your question refers only to kfree, where removing the virt_to_page would
not really be that difficult, since that would just need the extra
memory _for each allocation_. But then you mention vmalloc and NUMA,
where it is the generic use of virt_to_page throughout slab.c that is the
issue. I counted 14 direct uses of virt_to_page in slab.c (2.6.15-rc5).
Now you need to find a way to store the information for the off-slab
descriptors and for slabs that are more than one page.

Changing the use of virt_to_page would probably hurt those that can use
it, and the changes would not be accepted because of that, unless you can
keep the same speed and memory efficiency on those "simple systems".

Now, maybe NUMA and vmalloc might be a good reason to start a new
allocation system alongside slab?

-- Steve

2005-12-26 18:34:50

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] SLAB - have index_of bug at compile time.

Hi Steven,

On 12/26/05, Steven Rostedt <[email protected]> wrote:
> Now, maybe NUMA and vmalloc might be a good reason to start a new
> allocation system along side of slab?

A better approach would probably be to introduce a vmem layer similar
to what Solaris has, to solve the I/O memory and vmalloc issues. What NUMA
issue are you referring to, btw? I don't see any problem with the
current design (in fact, I stole it for my magazine allocator too).
It's just that the current implementation is a bit hard to understand.

Pekka

2005-12-26 23:22:04

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] SLAB - have index_of bug at compile time.

Pekka Enberg wrote:

>Hi Steven,
>
>On 12/26/05, Steven Rostedt <[email protected]> wrote:
>
>
>>Now, maybe NUMA and vmalloc might be a good reason to start a new
>>allocation system along side of slab?
>>
>>
>
>A better approach would probably be to introduce a vmem layer similar
>to what Solaris has to solve I/O memory and vmalloc issue. What NUMA
>issue are you referring to btw? I don't see any problem with the
>current design (in fact, I stole it for my magazine allocator too).
>It's just that the current implementation is bit hard to understand.
>
>
>
This is virt_to_page() on i386: the object address is in %esi
lea 0x40000000(%esi),%eax
mov 0x0,%edx [0x0 is actually mem_map]
shr $0xc,%eax
shl $0x5,%eax
Just read the mem_map pointer and a few calculations.
And now retrieve the cachep pointer:
mov 0x18(%eax,%edx,1),%edx

With NUMA on i386 (GENERIC_ARCH)
lea 0x40000000(%edi),%eax
mov %eax,%ebx
shr $0x1c,%eax
movsbl 0x0(%eax),%eax [ 0x0 is physnode_map]
shr $0xc,%ebx
mov 0x0(,%eax,4),%ecx [0x0 is node_data]
mov %ebx,%eax
mov 0xaa0(%ecx),%edx
sub %edx,%eax
mov 0xa98(%ecx),%edx
shl $0x5,%eax
4 memory accesses.
mov 0x18(%eax,%edx,1),%ebp

--
Manfred