I was assuming that if you create a slab cache with SLAB_HWCACHE_ALIGN,
objects are guaranteed to be aligned to L1 cacheline. But this piece
of code in kmem_cache_create has raised doubts.
----------------------------------------------------------------------------
if (flags & SLAB_HWCACHE_ALIGN) {
/* Need to adjust size so that objs are cache aligned. */
/* Small obj size, can get at least two per cache line. */
while (size < align/2)
align /= 2;
size = (size+align-1)&(~(align-1));
}
----------------------------------------------------------------------------
Am I missing something or can there really be two objects on the same
cacheline even when SLAB_HWCACHE_ALIGN is specified?
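To make the concern concrete, here is a minimal user-space walk-through of that
loop, assuming a 32-byte L1 line and a hypothetical 12-byte object:
----------------------------------------------------------------------------
#include <stdio.h>

/* Stand-alone illustration of the SLAB_HWCACHE_ALIGN sizing above,
 * with made-up numbers: L1_CACHE_BYTES == 32, object size == 12. */
int main(void)
{
        unsigned long align = 32;       /* L1 cache line size */
        unsigned long size = 12;        /* object size */

        while (size < align/2)          /* 12 < 16: halve; 12 < 8: stop */
                align /= 2;
        size = (size+align-1)&(~(align-1));

        /* prints "align=16 size=16": two objects per 32-byte line */
        printf("align=%lu size=%lu\n", align, size);
        return 0;
}
----------------------------------------------------------------------------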
Thanks,
Kiran
On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:
> Am I missing something or can there really be two objects on the same
> cacheline even when SLAB_HWCACHE_ALIGN is specified?
No, you are right.
If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.
But note that this will result in cache aligning objects even if it ends
up wasting a _lot_ of space, such as when debugging is turned on.
I think we only use it for the task_struct_cachep.
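For illustration, creating such a cache with the current kmem_cache_create()
signature looks roughly like the sketch below (the struct and cache names are
made up, not from the tree):
----------------------------------------------------------------------------
static kmem_cache_t *foo_cachep;

void __init foo_cache_init(void)
{
        /* Illustrative only: request strict L1 alignment for the objects. */
        foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
                                       0,       /* colour offset */
                                       SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
                                       NULL, NULL);     /* no ctor/dtor */
        if (!foo_cachep)
                panic("foo_cache_init: cache creation failed");
}
----------------------------------------------------------------------------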
Robert Love
On Wed, Sep 10, 2003 at 11:41:04AM -0400, Robert Love wrote:
> On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:
>
> > Am I missing something or can there really be two objects on the same
> > cacheline even when SLAB_HWCACHE_ALIGN is specified?
>
> No, you are right.
>
> If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.
>
WOW!!!
Looking at the code though, SLAB_MUST_HWCACHE_ALIGN is never considered
by kmem_cache_create or kmem_cache_alloc. So, right now, there is no way of
getting objects which are _guaranteed_ to be cacheline aligned!!!(?)
Kiran
On Thu, Sep 11, 2003 at 11:24:30AM +0530, Ravikiran G Thirumalai wrote:
> On Wed, Sep 10, 2003 at 11:41:04AM -0400, Robert Love wrote:
> > On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:
> >
> > > Am I missing something or can there really be two objects on the same
> > > cacheline even when SLAB_HWCACHE_ALIGN is specified?
> >
> > No, you are right.
> >
> > If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.
> >
>
> WOW!!!
> Looking at the code though, SLAB_MUST_HWCACHE_ALIGN is never considered
> by kmem_cache_create or kmem_cache_alloc. So, right now, there is no way of
> getting objects which are _guaranteed_ to be cacheline aligned!!!(?)
>
Hi Andrew, Manfred
Looks like SLAB_HWCACHE_ALIGN does not guarantee cacheline alignment
and SLAB_MUST_HWCACHE_ALIGN is not at all recognised by the slab code.
(Right now, SLAB_MUST_HWCACHE_ALIGN caches are aligned to sizeof (void *)!!)
The following patch fixes the slab code for this. Note that I have also
replaced the alignment arithmetic with the ALIGN macro. I have done some basic
testing on a 4-way PIII with a 32-byte cacheline -- but the patch probably
needs testing on other archs. Please give it a whirl on the mm* trees.
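(For readers unfamiliar with it, ALIGN() is the usual round-up-to-a-power-of-two
boundary helper; roughly:)
----------------------------------------------------------------------------
#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))

/* e.g. ALIGN(12, 32) == 32, ALIGN(33, 32) == 64 */
----------------------------------------------------------------------------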
This should help the task_struct and pte_chain caches (the other users of
SLAB_MUST_HWCACHE_ALIGN also pass SLAB_HWCACHE_ALIGN -- smart, huh?)
I also propose to rename SLAB_HWCACHE_ALIGN to SLAB_MAY_HWCACHE_ALIGN since
SLAB_HWCACHE_ALIGN seems very misleading -- if everyone agrees.
Andrew, this also means alloc_percpu in its current form is not guaranteed
to work on machines with cacheline sizes greater than 32 bytes. I am
working on a fix for alloc_percpu.
Thanks,
Kiran
diff -ruN -X dontdiff.1 linux-2.6.0-test5-mm1/mm/slab.c slabfix-2.6.0-test5-mm1/mm/slab.c
--- linux-2.6.0-test5-mm1/mm/slab.c Thu Sep 11 15:08:37 2003
+++ slabfix-2.6.0-test5-mm1/mm/slab.c Thu Sep 11 15:23:28 2003
@@ -1101,7 +1101,7 @@
#endif
#endif
align = BYTES_PER_WORD;
- if (flags & SLAB_HWCACHE_ALIGN)
+ if ((flags & SLAB_HWCACHE_ALIGN) | (flags & SLAB_MUST_HWCACHE_ALIGN))
align = L1_CACHE_BYTES;
/* Determine if the slab management is 'on' or 'off' slab. */
@@ -1112,12 +1112,14 @@
*/
flags |= CFLGS_OFF_SLAB;
- if (flags & SLAB_HWCACHE_ALIGN) {
+ if (flags & SLAB_MUST_HWCACHE_ALIGN) {
/* Need to adjust size so that objs are cache aligned. */
+ size = ALIGN(size, align);
+ } else if (flags & SLAB_HWCACHE_ALIGN) {
/* Small obj size, can get at least two per cache line. */
while (size <= align/2)
align /= 2;
- size = (size+align-1)&(~(align-1));
+ size = ALIGN(size, align);
}
/* Cal size (in pages) of slabs, and the num of objs per slab.
@@ -1175,8 +1177,7 @@
}
/* Offset must be a multiple of the alignment. */
- offset += (align-1);
- offset &= ~(align-1);
+ offset = ALIGN(offset, align);
if (!offset)
offset = L1_CACHE_BYTES;
cachep->colour_off = offset;
Ravikiran G Thirumalai wrote:
>Hi Andrew, Manfred
>Looks like SLAB_HWCACHE_ALIGN does not guarantee cacheline alignment
>and SLAB_MUST_HWCACHE_ALIGN is not at all recognised by the slab code.
>(Right now, SLAB_MUST_HWCACHE_ALIGN caches are aligned to sizeof (void *)!!)
>
>
Correct.
SLAB_HWCACHE_ALIGN is a hint, which is always honoured except with
debugging turned on. With debugging off, it's equivalent to
MUST_HWCACHE_ALIGN. The reason why debugging turns it off is to break
drivers that use kmalloc+virt_to_phys instead of pci_pool. IMHO that's a
good cause, thus I would like to leave that unchanged.
SLAB_MUST_HWCACHE_ALIGN guarantees alignment to the smaller of the L1
cache line size and the object size. It was added for PAE support: the
3rd level page tables must be 32-byte aligned. It's only intended for
the PAE buffers. No one else is supposed to use it, i.e. the right
change for the pte cache and the task cache is s/_MUST_//.
But there are problems with the current implementation:
- some drivers/archs are too difficult to fix. James Bottomley asked me
to add a switch for archs that cannot transfer to pci_dma completely.
Basically guarantee that all cachelines that are touched by an object
are exclusive to the object. Left over bytes must not be used, they
could be overwritten randomly by the hardware.
- Russell King once asked me for the ability to get 1024-byte aligned
objects. IIRC ARM needs that for its page tables.
- If I understand you correctly you want to avoid false sharing of the
per-cpu buffers that back alloc_percpu, correct?
I have two concerns:
- what about users that compile a kernel for a pIII and then run it on a
p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.
- what about NUMA systems? IMHO the per-cpu buffers should definitely
be from the nearest node to the target cpu. Unfortunately slab has no
concept of nodes right now. There was a patch, but it's quite intrusive
and never fine-tuned. Thus we must either ignore NUMA, or alloc_percpu
would have to use its own allocator. And: Which cacheline size is
relevant for NUMA? Is L1==L2==Ln_CACHE_BYTES on all archs?
IMHO the right solution is
- remove SLAB_MUST_HWCACHE_ALIGN and SLAB_HWCACHE_ALIGN. The API is
broken. Align everything always to L1_CACHE_BYTES.
- rename the "offset" parameter to "align", and use that to support
explicit alignment on a byte boundary. I have a patch somewhere, it's
not very large.
- alloc_percpu should set align to the real cache line size from cpu_data.
- add a minimal NUMA support: a very inefficient
"kmem_cache_alloc_forcpu(cachep, flags, target_cpu)". Not that difficult
if I sacrifice performance [no lockless per-cpu buffers, etc.]
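For illustration, a caller under that proposal would then request alignment
directly; the cache and struct names below are made up:
----------------------------------------------------------------------------
/* Hypothetical caller with the proposed "align" third parameter:
 * explicit 64-byte alignment, no alignment flags. Names are made up. */
foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
                               64,      /* align */
                               0,       /* flags */
                               NULL, NULL);
----------------------------------------------------------------------------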
What do you think? Possible now, or too intrusive before 2.6.0? The
align patch is not much larger than the patch Ravikiran attached. The
minimal numa patch wouldn't contain any core changes either.
--
Manfred
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 0
// EXTRAVERSION = -test5-mm1
--- 2.6/mm/slab.c 2003-09-11 22:30:47.000000000 +0200
+++ build-2.6/mm/slab.c 2003-09-11 23:42:16.000000000 +0200
@@ -268,6 +268,7 @@
unsigned int colour_off; /* colour offset */
unsigned int colour_next; /* cache colouring */
kmem_cache_t *slabp_cache;
+ unsigned int slab_size;
unsigned int dflags; /* dynamic flags */
/* constructor func */
@@ -488,7 +489,7 @@
.objsize = sizeof(kmem_cache_t),
.flags = SLAB_NO_REAP,
.spinlock = SPIN_LOCK_UNLOCKED,
- .colour_off = L1_CACHE_BYTES,
+ .colour_off = SMP_CACHE_BYTES,
.name = "kmem_cache",
};
@@ -523,7 +524,7 @@
static void enable_cpucache (kmem_cache_t *cachep);
/* Cal the num objs, wastage, and bytes left over for a given slab size. */
-static void cache_estimate (unsigned long gfporder, size_t size,
+static void cache_estimate (unsigned long gfporder, size_t size, size_t align,
int flags, size_t *left_over, unsigned int *num)
{
int i;
@@ -536,7 +537,7 @@
extra = sizeof(kmem_bufctl_t);
}
i = 0;
- while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
+ while (i*size + ALIGN(base+i*extra, align) <= wastage)
i++;
if (i > 0)
i--;
@@ -546,7 +547,7 @@
*num = i;
wastage -= i*size;
- wastage -= L1_CACHE_ALIGN(base+i*extra);
+ wastage -= ALIGN(base+i*extra, align);
*left_over = wastage;
}
@@ -690,14 +691,15 @@
list_add(&cache_cache.next, &cache_chain);
cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
- cache_estimate(0, cache_cache.objsize, 0,
- &left_over, &cache_cache.num);
+ cache_estimate(0, cache_cache.objsize, SMP_CACHE_BYTES, 0,
+ &left_over, &cache_cache.num);
if (!cache_cache.num)
BUG();
cache_cache.colour = left_over/cache_cache.colour_off;
cache_cache.colour_next = 0;
-
+ cache_cache.slab_size = ALIGN(cache_cache.num*sizeof(kmem_bufctl_t) + sizeof(struct slab),
+ SMP_CACHE_BYTES);
/* 2+3) create the kmalloc caches */
sizes = malloc_sizes;
@@ -993,7 +995,7 @@
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
- * @offset: The offset to use within the page.
+ * @align: The required alignment for the objects.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
@@ -1018,17 +1020,15 @@
* %SLAB_NO_REAP - Don't automatically reap this cache when we're under
* memory pressure.
*
- * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
- * cacheline. This can be beneficial if you're counting cycles as closely
- * as davem.
+ * %SLAB_HWCACHE_ALIGN - This flag has no effect and will be removed soon.
*/
kmem_cache_t *
-kmem_cache_create (const char *name, size_t size, size_t offset,
+kmem_cache_create (const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
const char *func_nm = KERN_ERR "kmem_create: ";
- size_t left_over, align, slab_size;
+ size_t left_over, slab_size;
kmem_cache_t *cachep = NULL;
/*
@@ -1039,7 +1039,7 @@
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor) ||
- (offset < 0 || offset > size))
+ (align < 0))
BUG();
#if DEBUG
@@ -1051,22 +1051,16 @@
#if FORCED_DEBUG
/*
- * Enable redzoning and last user accounting, except
- * - for caches with forced alignment: redzoning would violate the
- * alignment
- * - for caches with large objects, if the increased size would
- * increase the object size above the next power of two: caches
- * with object sizes just above a power of two have a significant
- * amount of internal fragmentation
+ * Enable redzoning and last user accounting, except for caches with
+ * large objects, if the increased size would increase the object size
+ * above the next power of two: caches with object sizes just above a
+ * power of two have a significant amount of internal fragmentation.
*/
- if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD))
- && !(flags & SLAB_MUST_HWCACHE_ALIGN)) {
+ if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD)))
flags |= SLAB_RED_ZONE|SLAB_STORE_USER;
- }
flags |= SLAB_POISON;
#endif
#endif
-
/*
* Always checks flags, a caller might be expecting debug
* support which isn't available.
@@ -1074,15 +1068,23 @@
if (flags & ~CREATE_MASK)
BUG();
+ if (align) {
+ /* minimum supported alignment: */
+ if (align < BYTES_PER_WORD)
+ align = BYTES_PER_WORD;
+
+ /* combinations of forced alignment and advanced debugging is
+ * not yet implemented.
+ */
+ flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
+ }
+
/* Get cache's description obj. */
cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto opps;
memset(cachep, 0, sizeof(kmem_cache_t));
-#if DEBUG
- cachep->reallen = size;
-#endif
/* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
@@ -1094,20 +1096,25 @@
}
#if DEBUG
+ cachep->reallen = size;
+
if (flags & SLAB_RED_ZONE) {
- /*
- * There is no point trying to honour cache alignment
- * when redzoning.
- */
- flags &= ~SLAB_HWCACHE_ALIGN;
+ /* redzoning only works with word aligned caches */
+ align = BYTES_PER_WORD;
+
/* add space for red zone words */
cachep->dbghead += BYTES_PER_WORD;
size += 2*BYTES_PER_WORD;
}
if (flags & SLAB_STORE_USER) {
- flags &= ~SLAB_HWCACHE_ALIGN;
- size += BYTES_PER_WORD; /* add space */
+ /* user store requires word alignment and
+ * one word storage behind the end of the real
+ * object.
+ */
+ align = BYTES_PER_WORD;
+ size += BYTES_PER_WORD;
}
+
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
if (size > 128 && cachep->reallen > L1_CACHE_BYTES && size < PAGE_SIZE) {
cachep->dbghead += PAGE_SIZE - size;
@@ -1115,9 +1122,6 @@
}
#endif
#endif
- align = BYTES_PER_WORD;
- if (flags & SLAB_HWCACHE_ALIGN)
- align = L1_CACHE_BYTES;
/* Determine if the slab management is 'on' or 'off' slab. */
if (size >= (PAGE_SIZE>>3))
@@ -1127,13 +1131,16 @@
*/
flags |= CFLGS_OFF_SLAB;
- if (flags & SLAB_HWCACHE_ALIGN) {
- /* Need to adjust size so that objs are cache aligned. */
- /* Small obj size, can get at least two per cache line. */
+ if (!align) {
+ /* Default alignment: compile time specified l1 cache size.
+ * But if an object is really small, then squeeze multiple
+ * into one cacheline.
+ */
+ align = L1_CACHE_BYTES;
while (size <= align/2)
align /= 2;
- size = (size+align-1)&(~(align-1));
}
+ size = ALIGN(size, align);
/* Cal size (in pages) of slabs, and the num of objs per slab.
* This could be made much more intelligent. For now, try to avoid
@@ -1143,7 +1150,7 @@
do {
unsigned int break_flag = 0;
cal_wastage:
- cache_estimate(cachep->gfporder, size, flags,
+ cache_estimate(cachep->gfporder, size, align, flags,
&left_over, &cachep->num);
if (break_flag)
break;
@@ -1177,8 +1184,8 @@
cachep = NULL;
goto opps;
}
- slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t) +
- sizeof(struct slab));
+ slab_size = ALIGN(cachep->num*sizeof(kmem_bufctl_t) + sizeof(struct slab),
+ align);
/*
* If the slab has been placed off-slab, and we have enough space then
@@ -1189,14 +1196,17 @@
left_over -= slab_size;
}
+ if (flags & CFLGS_OFF_SLAB) {
+ /* really off slab. No need for manual alignment */
+ slab_size = cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab);
+ }
+
+ cachep->colour_off = L1_CACHE_BYTES;
/* Offset must be a multiple of the alignment. */
- offset += (align-1);
- offset &= ~(align-1);
- if (!offset)
- offset = L1_CACHE_BYTES;
- cachep->colour_off = offset;
- cachep->colour = left_over/offset;
-
+ if (cachep->colour_off < align)
+ cachep->colour_off = align;
+ cachep->colour = left_over/cachep->colour_off;
+ cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
@@ -1470,8 +1480,7 @@
return NULL;
} else {
slabp = objp+colour_off;
- colour_off += L1_CACHE_ALIGN(cachep->num *
- sizeof(kmem_bufctl_t) + sizeof(struct slab));
+ colour_off += cachep->slab_size;
}
slabp->inuse = 0;
slabp->colouroff = colour_off;
--- 2.6/arch/i386/mm/init.c 2003-09-11 22:30:47.000000000 +0200
+++ build-2.6/arch/i386/mm/init.c 2003-09-11 22:38:57.000000000 +0200
@@ -509,8 +509,8 @@
if (PTRS_PER_PMD > 1) {
pmd_cache = kmem_cache_create("pmd",
PTRS_PER_PMD*sizeof(pmd_t),
+ PTRS_PER_PMD*sizeof(pmd_t),
0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
pmd_ctor,
NULL);
if (!pmd_cache)
@@ -519,8 +519,8 @@
if (TASK_SIZE > PAGE_OFFSET) {
kpmd_cache = kmem_cache_create("kpmd",
PTRS_PER_PMD*sizeof(pmd_t),
+ PTRS_PER_PMD*sizeof(pmd_t),
0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
kpmd_ctor,
NULL);
if (!kpmd_cache)
@@ -541,8 +541,8 @@
pgd_cache = kmem_cache_create("pgd",
PTRS_PER_PGD*sizeof(pgd_t),
+ PTRS_PER_PGD*sizeof(pgd_t),
0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
ctor,
dtor);
if (!pgd_cache)
--- 2.6/mm/rmap.c 2003-09-11 19:33:57.000000000 +0200
+++ build-2.6/mm/rmap.c 2003-09-11 23:41:55.000000000 +0200
@@ -521,8 +521,8 @@
{
pte_chain_cache = kmem_cache_create( "pte_chain",
sizeof(struct pte_chain),
+ sizeof(struct pte_chain),
0,
- SLAB_MUST_HWCACHE_ALIGN,
pte_chain_ctor,
NULL);
On Thu, Sep 11, 2003 at 06:19:22PM +0200, Manfred Spraul wrote:
> ...
> But there are problems with the current implementation:
> - some drivers/archs are too difficult to fix. James Bottomley asked me
> to add a switch for archs that cannot transfer to pci_dma completely.
> Basically guarantee that all cachelines that are touched by an object
> are exclusive to the object. Left over bytes must not be used, they
> could be overwritten randomly by the hardware.
> - Russell King once asked me for the ability to get 1024-byte aligned
> objects. IIRC ARM needs that for its page tables.
> - If I understand you correctly you want to avoid false sharing of the
> per-cpu buffers that back alloc_percpu, correct?
Correct. We were assuming kmalloc always returns cacheline-aligned
objects, and kinda assumed that cachelines touched by the object
are exclusive to that object. With SLAB_HWCACHE_ALIGN not always being
honoured, this assumption turned out to be false.
> I have two concerns:
> - what about users that compile a kernel for a pIII and then run it on a
> p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.
If the target cpu is chosen properly during compile time, L1_CACHE_BYTES
is set accordingly (CONFIG_X86_L1_CACHE_SHIFT on x86) so this should
not be an issue right?
> - what about NUMA systems? IMHO the per-cpu buffers should definitely
> be from the nearest node to the target cpu. Unfortunately slab has no
> concept of nodes right now. There was a patch, but it's quite intrusive
> and never fine-tuned. Thus we must either ignore NUMA, or alloc_percpu
> would have to use its own allocator. And: Which cacheline size is
> relevant for NUMA? Is L1==L2==Ln_CACHE_BYTES on all archs?
I am working on a simplistic allocator for alloc_percpu which
1. Minimises cache footprint (simple pointer arithmetic to get to each cpu's
version)
2. Does NUMA-aware allocation
3. Does not fragment
4. Is simple and extends the simple pointer arithmetic to get to each cpu's
offset (a rough sketch follows below)
I wouldn't be using the slab at all because using slabs would mean using
NR_CPUs pointers and one extra dereference which is bad as we had found out
earlier. But I guess slab will have to do node local allocations for
other applications.
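Roughly, the idea is that each cpu's copy lives at a fixed stride from a single
base pointer, so reaching it is plain arithmetic rather than an NR_CPUS
pointer-table lookup; a simplified sketch with made-up names and sizes:
----------------------------------------------------------------------------
/* Sketch only: names and the stride value are made up. Each cpu's
 * copy of an object sits at base + cpu * PCPU_BLOCK_SIZE, so there is
 * no NR_CPUS pointer array and no extra dereference. */
#define PCPU_BLOCK_SIZE 32768   /* illustrative per-cpu region size */

static inline void *percpu_ptr(void *base, int cpu)
{
        return (char *)base + cpu * PCPU_BLOCK_SIZE;
}
----------------------------------------------------------------------------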
>
> IMHO the right solution is
> - remove SLAB_MUST_HWCACHE_ALIGN and SLAB_HWCACHE_ALIGN. The API is
> broken. Align everything always to L1_CACHE_BYTES.
> - rename the "offset" parameter to "align", and use that to support
> explicit alignment on a byte boundary. I have a patch somewhere, it's
> not very large.
You mean colour slabs at L1_CACHE_BYTES and align objects to the "align"
parameter? Sounds great. I wonder why that wasn't done in the first
place -- even Bonwick's paper talks of passing an align parameter and
calculating the colour offset based on wastage etc. Maybe calculating the
colour offset in terms of slab wastage can reduce fragmentation instead of
hard-coding it to L1_CACHE_BYTES (when you think of large cachelines and
small objects)?
> - alloc_percpu should set align to the real cache line size from cpu_data.
> - add a minimal NUMA support: a very inefficient
> "kmem_cache_alloc_forcpu(cachep, flags, target_cpu)". Not that difficult
> if I sacrifice performance [no lockless per-cpu buffers, etc.]
As I said, NR_CPUS pointer placeholders will cause performance problems, so
IMHO staying away from the slab for alloc_percpu is best.
Thanks,
Kiran
On Fri, 2003-09-12 at 10:59, Ravikiran G Thirumalai wrote:
> > I have two concerns:
> > - what about users that compile a kernel for a pIII and then run it on a
> > p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.
>
> If the target cpu is chosen properly during compile time, L1_CACHE_BYTES
> is set accordingly (CONFIG_X86_L1_CACHE_SHIFT on x86) so this should
> not be an issue right?
In my opinion there is no excuse to make such an assumption at compile
time if it's basically free to do it at runtime instead...
For structure padding, using the config setting is obviously the only
way to go; but for kmalloc you can use the runtime, actual, cacheline
size.
Ravikiran G Thirumalai wrote:
>I am working on a simplistic allocator for alloc_percpu which
>1. Minimises cache footprint (simple pointer arithmetic to get to each cpu's
> version)
>2. Does NUMA-aware allocation
>3. Does not fragment
>4. Is simple and extends the simple pointer arithmetic to get to each cpu's offset
>
>I wouldn't be using the slab at all because using slabs would mean using
>NR_CPUs pointers and one extra dereference which is bad as we had found out
>earlier. But I guess slab will have to do node local allocations for
>other applications.
>
>
Interesting. Slab internally uses lots of large per-cpu arrays.
Altogether something like around 40 kB/cpu. Right now implemented with
NR_CPUS pointers. In the long run I'll try to switch to your allocator.
But back to the patch that started this thread: Do you still need the
ability to set an explicit alignment for slab allocations? If yes, then
I'd polish my patch, double check all kmem_cache_create callers and then
send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.
--
Manfred
On Sat, Sep 13, 2003 at 10:06:08PM +0200, Manfred Spraul wrote:
> Ravikiran G Thirumalai wrote:
> >I wouldn't be using the slab at all because using slabs would mean using
> >NR_CPUs pointers and one extra dereference which is bad as we had found out
> >earlier. But I guess slab will have to do node local allocations for
> >other applications.
> >
> >
> Interesting. Slab internally uses lots of large per-cpu arrays.
> Altogether something like around 40 kB/cpu. Right now implemented with
> NR_CPUS pointers. In the long run I'll try to switch to your allocator.
>
> But back to the patch that started this thread: Do you still need the
> ability to set an explicit alignment for slab allocations? If yes, then
> I'd polish my patch, double check all kmem_cache_create callers and then
> send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.
This is the problem - the current dynamic per-cpu allocator
(alloc_percpu()) is broken. It uses kmalloc() to allocate each
CPU's data, but kmalloc() doesn't guarantee cache line alignment
(only SLAB_HWCACHE_ALIGN). This may result in some per-CPU statistics
bouncing between CPUs, especially on the ones with large L1 cache lines.
We have a number of options -
1. Force kmalloc() to strictly align on a cache line boundary (with your
strict-align patch). This will waste space elsewhere, but alloc_percpu()
will never result in cache line sharing.
2. Make alloc_percpu() use its own caches for various sizes with your
strict-align patch. The rest of the kernel is not affected.
3. Let alloc_percpu() use its own allocator which supports NUMA
and does not use an offset table.
#2 and #3 have less impact on the kernel and we should consider those,
IMO.
Thanks
Dipankar
Manfred Spraul wrote:
> But back to the patch that started this thread: Do you still need the
> ability to set an explicit alignment for slab allocations? If yes, then
> I'd polish my patch, double check all kmem_cache_create callers and then
> send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.
The explicit alignment would be my preferred way to fix slab debugging
on s390. We still have the problem that practically all s390 device drivers
need 64-bit alignment on slab allocations.
Our current hack is to redefine BYTES_PER_WORD in slab.c to 8, but what
I'd like to see is a per-architecture alignment in kmem_cache_init
that would default to 8 on s390 and to sizeof(long) otherwise.
Using the current SLAB_MUST_HWCACHE_ALIGN is not an option since
it wastes a lot of memory with our 2048-bit L1 cache lines.
I'd also like to avoid creating special slab caches for anything
that needs 8-byte alignment.
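Something along these lines, i.e. an arch-overridable minimum consulted by the
slab code; the constant name here is made up for illustration:
----------------------------------------------------------------------------
/* Sketch only: the constant name is made up. */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN      sizeof(long)    /* s390 would define 8 */
#endif

        /* ...and in kmem_cache_create(): */
        if (align < ARCH_SLAB_MINALIGN)
                align = ARCH_SLAB_MINALIGN;
----------------------------------------------------------------------------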
Arnd <><
On Sat, Sep 13, 2003 at 10:06:08PM +0200, Manfred Spraul wrote:
> Ravikiran G Thirumalai wrote:
> > ...
> >
> Interesting. Slab internally uses lots of large per-cpu arrays.
> Altogether something like around 40 kB/cpu. Right now implemented with
> NR_CPUS pointers. In the long run I'll try to switch to your allocator.
>
> But back to the patch that started this thread: Do you still need the
> ability to set an explicit alignment for slab allocations? If yes, then
> I'd polish my patch, double check all kmem_cache_create callers and then
> send the patch to akpm.
No, even the arbitrary-alignment patch might not fix the current brokenness
of alloc_percpu, because then we would have to change the kmalloc caches
(malloc_sizes[]) to explicitly align objects on cacheline boundaries
even for small objects. And that would be a major change at this stage.
Having different alloc_percpu caches for different objects is probably not
such a good thing right now -- that is, if we tolerate the extra dereference.
So, IMO, going ahead with a new allocator is a good choice for alloc_percpu.
> Otherwise I'd wait - the patch is not a bugfix.
That said, the reasons I see for the arbitrary-align patch in 2.6 are:
1. SLAB_MUST_HWCACHE_ALIGN is broken. If I understand the code right,
it will result in BYTES_PER_WORD alignment. This means there can be false
sharing between two task_structs depending on the slab calculations,
and there is an atomic_t at the beginning of the task_struct.
This has to be fixed.
2. The fix should not change the way kmalloc caches currently work, i.e.
by sharing cachelines for small objects. Later benchmarks might
prove that it is not good to share cachelines for kmalloc, but
for now we don't know.
3. IMHO it's not right that SLAB_HWCACHE_ALIGN is taken as a hint and not
as a directive. It is misleading. I don't like the fact that the
allocator can reduce the alignment. IMHO, alignment decisions should be left
to the user.
4. There are already requirements from users for arbitrary alignment, as you'd
mentioned in earlier mails in this thread. I guess such requirements will
go up as 2.6 matures, so it is a good idea to provide alignment up to
page alignment.
The patch I posted earlier takes care of 1 & 2, but arbitrary alignment takes
care of all 4, and as you mentioned the patch is small enough. If interface
change is a concern for 2.6, then IMHO the interface is broken anyway... who
uses the offset parameter to specify colouring offsets? IMHO, the slab
colour offset should be based on the slab calculations and, if needed, on the
arch. Added to that, the SLAB_HWCACHE_ALIGN name itself is misleading anyway.
Clearly we should not wait for 2.7 for these fixes, so now is a good time,
right? This is just my opinion. It is left to akpm and you to decide.
Thanks,
Kiran
On Sun, Sep 14, 2003 at 01:39:42PM +0530, Ravikiran G Thirumalai wrote:
> The patch I posted earlier takes care of 1 & 2, but arbitrary alignment takes
> care of all 4, and as you mentioned the patch is small enough. If interface
While we are at it, we should also probably look up the cache line
size for a cpu family from a table, store it in some per-cpu data
(cpuinfo_x86 ?) and provide an l1_cache_bytes() api to
return it at run time.
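A sketch of what that could look like on x86, assuming the line size is
recorded in cpuinfo_x86 at boot; the field and helper names below are
illustrative, not existing kernel API:
----------------------------------------------------------------------------
/* Illustrative only: field and helper names are made up. */
static inline int l1_cache_bytes(void)
{
        /* fall back to the compile-time value if the boot-time
         * detection did not fill anything in */
        if (boot_cpu_data.x86_cache_line_size)
                return boot_cpu_data.x86_cache_line_size;
        return L1_CACHE_BYTES;
}
----------------------------------------------------------------------------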
Thanks
Dipankar
On Sun, Sep 14, 2003 at 06:30:37PM +0530, Dipankar Sarma wrote:
>
> While we are at it, we should also probably look up the cache line
> size for a cpu family from a table, store it in some per-cpu data
> (cpuinfo_x86 ?) and provide an l1_cache_bytes() api to
> return it at run time.
If we are going to solve the cache line size mismatch due to compile-time
arch versus run-time arch (compiling on a PIII and running on a P4),
we might end up with kernel code which sometimes refers to the static line
size and sometimes to the run-time size, which might cause correctness issues.
So maybe it is a good idea to have just one macro for the L1 size? We can't
do away with the static one, so maybe we should keep it and expect users
to compile with the target cpu properly set (otherwise they lose out on
performance -- correctness won't be a problem). I could be wrong... just
thinking out loud...
Thanks,
Kiran