/mirror/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mc4/broken-out/slab-alignment-rework.patch
I don't think this is right:
> -	if (flags & SLAB_HWCACHE_ALIGN) {
> -		/* Need to adjust size so that objs are cache aligned. */
> -		/* Small obj size, can get at least two per cache line. */
> +	if (!align) {
> +		/* Default alignment: compile time specified l1 cache size.
> +		 * Except if an object is really small, then squeeze multiple
> +		 * into one cacheline.
> +		 */
> +		align = cache_line_size();
> 		while (size <= align/2)
> 			align /= 2;
> -		size = (size+align-1)&(~(align-1));
> 	}
> +	size = ALIGN(size, align);
>
I want anon-vma to really use only 12 bytes, period. No best-guess must
be made automatically by the slab code, rounding it to 16 bytes.
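Just to make the waste concrete: assuming cache_line_size() returns 32, the quoted hunk computes this for the 12-byte anon-vma:

	align = 32;			/* assumed cache_line_size() */
	while (size <= align/2)		/* 12 <= 16: align becomes 16; 12 <= 8: stop */
		align /= 2;
	size = ALIGN(size, align);	/* ALIGN(12, 16) == 16 */

i.e. 4 bytes thrown away per object.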
This is not just for anon-vma though. For this specific case it's
absolutely useless to throw 4 more bytes of ram at every anon-vma: it's
not hw aligned anyway, so there will be false sharing regardless, and at
the very least we must try to save ram since we cannot avoid the sharing
anyway. Other cases for small objects are similar.
Overall I think it's definitely wrong to remove the hw alignment request
from the caller: the hw alignment must be used only where it's worthwhile.
For vmas and anon-vmas the hw alignment definitely isn't worthwhile, and
we must not waste ram like the above patch does.
The other changes, like using cache_line_size(), are an _obvious_
improvement of course; I'm not commenting on that part. But I'd ask you
to please rewrite that patch so it no longer drops the hw-alignment
request dictated by the kmem_cache_create caller.
Andrea Arcangeli wrote:
>/mirror/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mc4/broken-out/slab-alignment-rework.patch
>
>I don't think this is right:
>
>>-	if (flags & SLAB_HWCACHE_ALIGN) {
>>-		/* Need to adjust size so that objs are cache aligned. */
>>-		/* Small obj size, can get at least two per cache line. */
>>+	if (!align) {
>>+		/* Default alignment: compile time specified l1 cache size.
>>+		 * Except if an object is really small, then squeeze multiple
>>+		 * into one cacheline.
>>+		 */
>>+		align = cache_line_size();
>> 		while (size <= align/2)
>> 			align /= 2;
>>-		size = (size+align-1)&(~(align-1));
>> 	}
>>+	size = ALIGN(size, align);
>
>I want anon-vma to really use only 12 bytes, period.
>
Then pass "4" as the align parameter to kmem_cache_create. That's the
main point of the patch: it's now possible to explicitely specify the
requested alignment. 32 for the 3rd level page tables, the optimal
number for the pte_chains, etc.
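An illustrative call site (the names here are made up, just to show the
API):

	anon_vma_cachep = kmem_cache_create("anon_vma",
				sizeof(struct anon_vma),
				4,	/* explicit align: 12 bytes stay 12 */
				0,	/* no SLAB_HWCACHE_ALIGN */
				NULL, NULL);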
> No best-guess must
>be made automatically by the slab code, rounding it to 16 bytes.
>
If you pass 0 as align to kmem_cache_create, then it's rounded to L2
size. It's questionable if that's really the best thing - on
uniprocessor, 16-byte might result in better performance - there is no
risk of false sharing.
--
Manfred
On Mon, Apr 19, 2004 at 06:42:36PM +0200, Manfred Spraul wrote:
> Then pass "4" as the align parameter to kmem_cache_create. That's the
> main point of the patch: it's now possible to explicitly specify the
> requested alignment - 32 for the 3rd level page tables, the optimal
> number for the pte_chains, etc.
>
> >No best-guess must
> >be made automatically by the slab code, rounding it to 16 bytes.
> >
> If you pass 0 as align to kmem_cache_create, then it's rounded to L2
> size. It's questionable if that's really the best thing - on
> uniprocessor, 16-byte might result in better performance - there is no
> risk of false sharing.
My point is that 0 as align must not round to l2 size or you break
stuff. To take the most obvious example, you have likely broken the vma
alignment, because I don't think its size is an l1 cache multiple.
Removing the HW_ALIGN bitflag isn't nearly enough to avoid breaking
stuff after your changes to the kmem_cache_create API: you need to
change _all_ the other kmem_cache_create callers and replace 0 with 4
where they didn't pass the HW_ALIGN flag. You didn't do that, so things
break, not just for small objects but for the ones bigger than the
cachesize too.
If you fix up all the other kmem_cache_create callers to pass 4 instead
of 0 that could be fine with me, but I still think this removal of the
HW_ALIGN bitflag is completely worthless: you still have a branch
checking !align, so you don't seem to gain anything and you break stuff.
Note: I really appreciate the runtime evaluation of the cachesize, those
parts are fine, and changing offset to "align" is also more than welcome;
only the removal of the HW_ALIGN bitflag without a s/0/4/ where HW_ALIGN
was missing is bogus and breaks stuff.
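Concretely, the sweep I'm asking for would look like this at every such
call site (hypothetical "foo" cache, just to show the pattern):

	-	cachep = kmem_cache_create("foo", sizeof(struct foo),
	-				   0, 0, NULL, NULL);
	+	cachep = kmem_cache_create("foo", sizeof(struct foo),
	+				   4, 0, NULL, NULL);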
Manfred Spraul <[email protected]> wrote:
>
> > No best-guess must
> >be made automatically by the slab code, rounding it to 16 bytes.
> >
> If you pass 0 as align to kmem_cache_create, then it's rounded to L2
> size. It's questionable if that's really the best thing - on
> uniprocessor, 16-byte might result in better performance - there is no
> risk of false sharing.
Just about every time we've cared to investigate slab alignment, we've
ended up removing SLAB_HWCACHE_ALIGN. It seems inappropriate that _any_ of
the inode caches have this set, for example.
And this patch appears to have taken the effective size of the buffer_head
from 48 bytes up to 64, which hurts.
So I do think that we should either make "align=0" translate to "pack them
densely" or do the big sweep across all kmem_cache_create() callsites.
If the latter, while we're there, let's remove SLAB_HWCACHE_ALIGN where it
isn't obviously appropriate. I'd imagine that being able to fit more inodes
into memory is a net win over the occasional sharing effect, for example.
On Tue, Apr 20, 2004 at 12:24:23AM -0700, Andrew Morton wrote:
> So I do think that we should either make "align=0" translate to "pack them
> densely" or do the big sweep across all kmem_cache_create() callsites.
agreed.
> If the latter, while we're there, let's remove SLAB_HWCACHE_ALIGN where it
> isn't obviously appropriate. I'd imagine that being able to fit more inodes
> into memory is a net win over the occasional sharing effect, for example.
One warning here: false sharing isn't the only reason for hw alignment.
Structures like inodes are often coded so that the fields used at the
same time are packed together in the same cacheline; in practice this
can reduce the cache utilization to 1 cacheline instead of 2 cachelines
at runtime (even if there's no false sharing at all, because the
structure is much bigger than the l1 line size anyway).
So the hardware alignment should be removed with care, looking at the
layout of the structures and evaluating whether we're losing cacheline
packing. For example the task_struct definitely must be fully l1
aligned, not because of false sharing issues, which are probably
non-existent in the task struct anyway, but because the most important
fields in the task struct are packed to maximize the cache utilization
at runtime.
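To illustrate the packing point, a made-up layout (nothing like the real
task_struct field list):

	struct example {
		/* fields touched together in the hot path: with the
		 * object l1 aligned they all sit in the first cacheline */
		volatile long state;
		unsigned long flags;
		int prio;
		/* ... colder fields spill into later cachelines ... */
	};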
For small 12-byte things including locks, like the anon-vma, false
sharing is the biggest issue (but it still isn't worth l1-aligning the
anon-vma); for buffer_heads and task_structs the cacheline packing
provided by the l1 alignment of the structure is the primary reason for
wanting l1 alignment. Each case should be evaluated separately.
Andrea Arcangeli wrote:
>On Tue, Apr 20, 2004 at 12:24:23AM -0700, Andrew Morton wrote:
>
>>So I do think that we should either make "align=0" translate to "pack them
>>densely" or do the big sweep across all kmem_cache_create() callsites.
>
>agreed.
>
What about this proposal:
SLAB_HWCACHE_ALIGN clear: align to max(sizeof(void*), align).
SLAB_HWCACHE_ALIGN set: align to max(cpu_align(), align).
cpu_align is the cpu cache line size - either runtime or compile time.
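In code the rule would be roughly (a sketch of the rule only, not a
patch):

	if (flags & SLAB_HWCACHE_ALIGN)
		align = max(cpu_align(), align);
	else
		align = max(sizeof(void *), align);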
Or are there users that want an alignment smaller than sizeof(void*)?
--
Manfred
Manfred Spraul <[email protected]> wrote:
>
> Andrea Arcangeli wrote:
>
> >On Tue, Apr 20, 2004 at 12:24:23AM -0700, Andrew Morton wrote:
> >
> >>So I do think that we should either make "align=0" translate to "pack them
> >>densely" or do the big sweep across all kmem_cache_create() callsites.
> >
> >agreed.
> >
> What about this proposal:
> SLAB_HWCACHE_ALIGN clear: align to max(sizeof(void*), align).
> SLAB_HWCACHE_ALIGN set: align to max(cpu_align(), align).
>
> cpu_align is the cpu cache line size - either runtime or compile time.
>
> Or are there users that want an alignment smaller than sizeof(void*)?
I doubt if this is likely to cause problems, and in cases where we expect
to have really large numbers of objects we could explicitly select an
alignment of 4 anyway.
But why would you choose to make the "SLAB_HWCACHE_ALIGN clear" case use
sizeof(void*) rather than sizeof(int)?
On Tue, 20 Apr 2004, Andrew Morton wrote:
>
> But why would you choose to make the "SLAB_HWCACHE_ALIGN clear" case use
> sizeof(void*) rather than sizeof(int)?
Because a lot of architectures will cause unaligned faults on structures
that have pointers (or longs) in them if they are only aligned to "int"?
So "int"-aligned is no better than "char" alignment.
I suspect we should make the "minimum normal alignment" be architecture-
dependent, since some architectures can have even stricter requirements
(ie they may have compilers that assume 64-bit alignment for doing things
like multi-word loads).
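A trivial made-up example:

	struct foo {
		void *p;	/* wants 8-byte alignment on most 64-bit archs */
	};

If the allocator hands such a struct out at a merely int-aligned
(4-byte) address, loading foo->p traps on the alignment-strict 64-bit
architectures.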
My suggestion:
- if explicit alignment is passed in (non-zero), always use that. The
user knows best.
This allows a user to specify unaligned ("byte alignment", aka
"align=1") if he wants to.
- if the passed-in alignment was zero, use a CPU-specific alignment which
will depend on SLAB_HWCACHE_ALIGN, and will _usually_ be something like
align = (flags & SLAB_HWCACHE_ALIGN) ? cachelinesize : sizeof(ptr);
but some architecture might choose to do something different here.
Wouldn't that make everybody happy.
Linus
> But why would you choose to make the "SLAB_HWCACHE_ALIGN clear" case use
> sizeof(void*) rather than sizeof(int)?
so that if you kmalloc a struct the pointers in it don't cause unaligned
exceptions on 64 bit ;)
Hi,
below is a patch that redefines align:
- align not zero: use the specified alignment. I think values smaller than
sizeof(void*) will work, even on archs with strict alignment requirements
(or at least: slab shouldn't crash. Obviously the user must handle the
alignment properly).
- align zero:
* debug on: align to sizeof(void*)
* debug off, SLAB_HWCACHE_ALIGN clear: align to sizeof(void*)
* debug off, SLAB_HWCACHE_ALIGN set: align to the smaller of
- cache_line_size()
- the object size, rounded up to the next power of two.
Slab never honored cache alignment for tiny objects: otherwise the
32-byte kmalloc cache would end up with 128-byte objects on CPUs with
128-byte cachelines.
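The while loop in the patch below keeps that behaviour; e.g. for a
32-byte object on a CPU with 128-byte cachelines:

	align = 128;			/* cache_line_size() */
	while (size <= align/2)		/* 32 <= 64: 64; 32 <= 32: 32; stop */
		align /= 2;		/* final align is 32, not 128 */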
The patch is against 2.6.6-rc2. What do you think?
There is one additional point: right now slab uses ints for the bufctls.
Using short would save two bytes for each object. Initially I had used
short, but davem objected. IIRC because some archs do not handle short
efficiently. Should I allow arch overrides for the bufctls? On i386,
saving two bytes might allow a few additional anon_vma objects in each
page.
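Rough numbers, ignoring the slab management header and assuming 4K pages
with 12-byte anon_vma objects:

	4096 / (12 + 4) = 256 objects per page with int bufctls
	4096 / (12 + 2) = 292 objects per page with short bufctls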
--
Manfred
<<<
--- 2.6/mm/slab.c	2004-04-16 22:29:56.000000000 +0200
+++ build-2.6/mm/slab.c	2004-04-21 18:16:03.000000000 +0200
@@ -1150,14 +1150,22 @@
 		BUG();
 	if (align) {
-		/* minimum supported alignment: */
-		if (align < BYTES_PER_WORD)
-			align = BYTES_PER_WORD;
-
 		/* combinations of forced alignment and advanced debugging is
 		 * not yet implemented.
 		 */
 		flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
+	} else {
+		if (flags & SLAB_HWCACHE_ALIGN) {
+			/* Default alignment: as specified by the arch code.
+			 * Except if an object is really small, then squeeze multiple
+			 * into one cacheline.
+			 */
+			align = cache_line_size();
+			while (size <= align/2)
+				align /= 2;
+		} else {
+			align = BYTES_PER_WORD;
+		}
 	}
 	/* Get cache's description obj. */
@@ -1210,15 +1218,6 @@
 		 */
 		flags |= CFLGS_OFF_SLAB;
-	if (!align) {
-		/* Default alignment: compile time specified l1 cache size.
-		 * Except if an object is really small, then squeeze multiple
-		 * into one cacheline.
-		 */
-		align = cache_line_size();
-		while (size <= align/2)
-			align /= 2;
-	}
 	size = ALIGN(size, align);
 	/* Cal size (in pages) of slabs, and the num of objs per slab.
<<<
Manfred Spraul <[email protected]> wrote:
>
> There is one additional point: right now slab uses ints for the bufctls.
> Using short would save two bytes for each object. Initially I had used
> short, but davem objected. IIRC because some archs do not handle short
> efficiently. Should I allow arch overrides for the bufctls? On i386,
> saving two bytes might allow a few additional anon_vma objects in each
> page.
There's one bufctl per object, yes?
Sounds like a worthwhile optimisation.