Currently, when enabling slab debugging, we lose the property of
having the objects aligned on a cache line size.
This is, imho, an error, especially if GFP_DMA is passed. Such an
object _must_ be cache aligned (and its size rounded to a multiple
of the cache line size).
There is a simple performance reason on cache coherent archs, but
there's also the fact that it will just _not_ work properly on
non cache-coherent archs. Actually, I also have to deal with some
old machines that have a SCSI controller which has a problem accessing
buffers that aren't aligned on a cache line boundary.
This typically causes me trouble in various parts of SCSI which
abuse kmalloc(512, GFP_KERNEL | GFP_DMA) for buffers passed to some
SCSI commands, typically "utility" commands used to read a disk
capacity, read the read/write protect flags, some sense buffers, etc...
While I know SCSI should use the consistent-alloc interfaces, that has
not been fully fixed yet, and kmalloc with GFP_DMA is still valid despite
not being efficient, so we should make sure that in this case the returned
buffer is really suitable for DMA, that is, cache aligned.
Ben.
--- mm/slab.c~ 2003-07-14 03:36:48.000000000 +0000
+++ mm/slab.c 2003-08-15 18:40:14.000000000 +0000
@@ -682,7 +682,7 @@
sizes->cs_dmacachep = kmem_cache_create(
names->name_dma, sizes->cs_size,
- 0, SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_CACHE_DMA|SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
if (!sizes->cs_dmacachep)
BUG();
> Attached untested patch should fix it (vs 2.6.0-test1), I've no
> idea if it's acceptable.
That will just disable redzoning, which isn't what I want...
Benjamin Herrenschmidt <[email protected]> wrote:
>
> Currently, when enabling slab debugging, we lose the property of
> having the objects aligned on a cache line size.
>
> This is, imho, an error, especially if GFP_DMA is passed. Such an
> object _must_ be cache alined (and it's size rounded to a multiple
> of the cache line size).
Well the theory is that SLAB_HWCACHE_ALIGN is advisory and
SLAB_MUST_HWCACHE_ALIGN is compulsory.
You're looking for compulsory alignment on the generic kmalloc() slab
pools, so we're a bit screwed.
I guess the redzoning would need to be taught to always pad by a full
cacheline.
Ben wrote:
>Currently, when enabling slab debugging, we lose the property of
>having the objects aligned on a cache line size.
>
>
Correct. Cache line alignment is advisory. Slab debugging is not the
only case that violates the alignment, for example 32-byte allocations
are not padded to the 128 byte cache line size of the Pentium 4 cpus. I
really doubt we want that.
Have you looked at pci_pool_{create,alloc,free,destroy}? The functions
were specifically written to provide aligned buffers for DMA operations.
Perhaps SCSI should use them?
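Roughly like this, untested and from memory - the pool name and the pdev
argument are placeholders, and the exact pci_pool_create() argument list
changed between 2.4 and 2.6, so check it against your tree:

static int scsi_util_cmd_example(struct pci_dev *pdev)
{
	struct pci_pool *pool;
	dma_addr_t dma_handle;
	void *buf;

	/* 512-byte, cache-line aligned, DMA-able buffers */
	pool = pci_pool_create("scsi_util_buf", pdev, 512, L1_CACHE_BYTES, 0);
	if (!pool)
		return -ENOMEM;

	buf = pci_pool_alloc(pool, GFP_KERNEL, &dma_handle);
	if (!buf) {
		pci_pool_destroy(pool);
		return -ENOMEM;
	}

	/* ... issue the "utility" SCSI command using buf / dma_handle ... */

	pci_pool_free(pool, buf, dma_handle);
	pci_pool_destroy(pool);
	return 0;
}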
--
Manfred
On Fri, 2003-08-15 at 23:50, Manfred Spraul wrote:
> Ben wrote:
>
> >Currently, when enabling slab debugging, we lose the property of
> >having the objects aligned on a cache line size.
> >
> >
> Correct. Cache line alignment is advisory. Slab debugging is not the
> only case that violates the alignment, for example 32-byte allocations
> are not padded to the 128 byte cache line size of the Pentium 4 cpus. I
> really doubt we want that.
Yes, I understand that, but that is wrong for GFP_DMA imho. Also,
SLAB_MUST_HWCACHE_ALIGN just disables redzoning, which is not smart.
I'd rather allocate more and keep both redzoning and cache alignment;
that would help catch some of those subtle problems when a chip DMA
engine plays funny tricks.
> Have you looked at pci_pool_{create,alloc,free,destroy}? The functions
> were specifically written to provide aligned buffers for DMA operations.
> Perhaps SCSI should use them?
Of course it should, but it doesn't yet. And other drivers haven't been
ported yet either, though that is slowly happening. Still, I think the
point that a GFP_DMA allocation should be cache aligned stands, don't you
think?
Ben.
Benjamin Herrenschmidt wrote:
>On Fri, 2003-08-15 at 23:50, Manfred Spraul wrote:
>
>
>>Ben wrote:
>>
>>>Currently, when enabling slab debugging, we lose the property of
>>>having the objects aligned on a cache line size.
>>>
>>Correct. Cache line alignment is advisory. Slab debugging is not the
>>only case that violates the alignment, for example 32-byte allocations
>>are not padded to the 128 byte cache line size of the Pentium 4 cpus. I
>>really doubt we want that.
>>
>>
>
>Yes, I understand that, but that is wrong for GFP_DMA imho. Also,
>SLAB_MUST_HWCACHE_ALIGN just disables redzoning, which is not smart.
>I'd rather allocate more and keep both redzoning and cache alignment;
>that would help catch some of those subtle problems when a chip DMA
>engine plays funny tricks.
>
I don't want to upgrade SLAB_HWCACHE_ALIGN to SLAB_MUST_HWCACHE_ALIGN
depending on GFP_DMA: IIRC one arch (ppc64?) marks everything as
GFP_DMA, because all memory is DMA capable.
Which arch do you use? Perhaps alignment could be added for broken archs.
Actually I think you should fix your arch, perhaps by double buffering
in pci_map_ if the input pointers are not aligned. What if someone uses
O_DIRECT with an unaligned pointer?
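Purely for the sake of argument, something like the sketch below - the
function name is invented, it ignores the unmap/copy-back bookkeeping a
real implementation would need, and the error handling is bogus:

dma_addr_t arch_map_single_bounced(struct pci_dev *pdev, void *ptr,
				   size_t size, int direction)
{
	/* bounce only if the caller's buffer is not cache-line aligned */
	if ((unsigned long)ptr & (L1_CACHE_BYTES - 1)) {
		void *bounce = kmalloc(L1_CACHE_ALIGN(size), GFP_ATOMIC);

		if (!bounce)
			return 0;	/* real code needs a proper error path */
		if (direction != PCI_DMA_FROMDEVICE)
			memcpy(bounce, ptr, size);
		ptr = bounce;		/* must be copied back/freed on unmap */
	}
	return pci_map_single(pdev, ptr, size, direction);
}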
--
Manfred
> >
> >Yes, I understand that, but that is wrong for GFP_DMA imho. Also,
> >SLAB_MUST_HWCACHE_ALIGN just disables redzoning, which is not smart.
> >I'd rather allocate more and keep both redzoning and cache alignment;
> >that would help catch some of those subtle problems when a chip DMA
> >engine plays funny tricks.
> >
> I don't want to upgrade SLAB_HWCACHE_ALIGN to SLAB_MUST_HWCACHE_ALIGN
> depending on GFP_DMA: IIRC one arch (ppc64?) marks everything as
> GFP_DMA, because all memory is DMA capable.
Same for ppc32. Anyway, I don't like MUST_HWCACHE_ALIGN because it
just disables redzoning; I'd rather allocate more and do both redzoning
and cache alignment.
In fact, I think cache alignment should be a property of all kmalloc
calls for any size > L1 cache line size, for obvious performance reasons,
and redzoning shouldn't change such a property...
> Which arch do you use? Perhaps alignment could be added for broken archs.
>
> Actually I think you should fix your arch, perhaps by double buffering
> in pci_map_ if the input pointers are not aligned. What if someone uses
> O_DIRECT with an unaligned pointer?
ppc32 "normally" is cache coherent, that is with high end or desktop CPUs,
embedded CPUs aren't. Some archs like ARM arent't. But in this case, the
problem is related to a broken SCSI controller on the motherboard which
will cause corruption on non-aligned buffers (it basically always issues
memory write and invalidate cycles).
Anyway, I _still_ think it's stupid to return non-aligned buffers, both
for performances, and because that prevents from dealing with such cases,
typically the SCSI layer assumes alignement here among others...
Regarding O_DIRECT with an unaligned pointer, I haven't looked at this
case yet, I suppose it will be broken in a whole lot of cases.
Ben.
Benjamin Herrenschmidt wrote:
>>>Yes, I understand that, but that is wrong for GFP_DMA imho. Also,
>>>SLAB_MUST_HWCACHE_ALIGN just disables redzoning, which is not smart.
>>>I'd rather allocate more and keep both redzoning and cache alignment;
>>>that would help catch some of those subtle problems when a chip DMA
>>>engine plays funny tricks.
>>>
>>I don't want to upgrade SLAB_HWCACHE_ALIGN to SLAB_MUST_HWCACHE_ALIGN
>>depending on GFP_DMA: IIRC one arch (ppc64?) marks everything as
>>GFP_DMA, because all memory is DMA capable.
>>
>>
>
>Same for ppc32. Anyway, I don't like MUST_HWCACHE_ALIGN because it
>just disables redzoning; I'd rather allocate more and do both redzoning
>and cache alignment.
>
>
I have a patch that creates helper functions that make that simple. The
patch is stuck right now, because it exposes a bug in the i386 debug
register handling. I'll add redzoning with MUST_HWCACHE_ALIGN after
that one is in.
>Anyway, I _still_ think it's stupid to return non-aligned buffers, both
>for performance, and because that prevents us from dealing with such cases;
>typically the SCSI layer assumes alignment here, among others...
>
>
I don't care about performance with slab debugging on. kmalloc(4096,)
usually takes ~40 cpu ticks on i386. With debugging on, it includes a
memset and an open-coded memcmp - my guess is a few thousand cpu ticks,
and that's intentional. Do not enable it on production systems.
>Regarding O_DIRECT with an unaligned pointer, I haven't looked at this
>case yet, I suppose it will be broken in a whole lot of cases.
>
>
Hmm. That means slab debugging did its job: the driver contains the
wrong assumption that all pointers are cache line aligned. Without slab
debugging, this would result in rare data corruptions in O_DIRECT apps.
With slab debugging on, it's exposed immediately.
--
Manfred
> >
> Hmm. That means slab debugging did its job: the driver contains the
> wrong assumption that all pointers are cache line aligned. Without slab
> debugging, this would result in rare data corruptions in O_DIRECT apps.
> With slab debugging on, it's exposed immediately.
As we discussed on IRC, I think SCSI host drivers (at least) can assume
being passed aligned buffers; if somebody doesn't agree, please speak
up ;) Christoph said that O_DIRECT has strict alignment rules too.
Ben.
On Sat, 16 Aug 2003, Benjamin Herrenschmidt wrote:
>
> > >
> > Hmm. That means slab debugging did its job: the driver contains the
> > wrong assumption that all pointers are cache line aligned. Without slab
> > debugging, this would result in rare data corruptions in O_DIRECT apps.
> > With slab debugging on, it's exposed immediately.
>
> As we discussed on IRC, I think SCSI host drivers (at least) can assume
> being passed aligned buffers; if somebody doesn't agree, please speak
> up ;) Christoph said that O_DIRECT has strict alignment rules too.
>
A character device (like st) doing direct i/o from user buffer to/from a
SCSI device does not currently have any alignment restrictions. I think
restricted alignment can't be required from a user of an ordinary
character device. This must then be handled by the driver. The solution is
to use bounce buffers in the driver if the alignment does not meet the
lower level requirements. This leads to surprises with performance if the
user buffer alignment does not satisfy the requirements (e.g., malloc()
may or may not return properly aligned blocks). These surprises should be
avoided as far as the hardware allows.
If an architecture has restrictions, they must, of course, be taken into
account. However, this should not punish architectures that don't have the
restrictions. Specifying that DMA buffers must be cache-line aligned would
be too strict. A separate alignment constraint for DMA in general, and for
a specific device, would be a better alternative (a device may have
tighter restrictions than an architecture). The same applies to buffer
sizes. This would mean adding two more masks for each device (like the
current DMA address mask for a device).
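To be concrete, the two extra masks could look something like this. None of
these fields or helpers exist; the names are made up just to illustrate:

/* hypothetical per-device DMA constraints, alongside the existing
 * DMA address mask; nothing like this is in the tree today */
struct dma_constraints {
	u64		dma_mask;		/* existing: addressable range */
	unsigned long	dma_align_mask;		/* e.g. L1_CACHE_BYTES - 1 */
	unsigned long	dma_len_mask;		/* buffer length granularity */
};

/* a high-level driver could then decide whether bouncing is needed */
static inline int dma_buffer_ok(const struct dma_constraints *c,
				unsigned long addr, unsigned long len)
{
	return !((addr & c->dma_align_mask) || (len & c->dma_len_mask));
}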
--
Kai
> A character device (like st) doing direct i/o from user buffer to/from a
> SCSI device does not currently have any alignment restrictions. I think
> restricted alignment can't be required from a user of an ordinary
> character device. This must then be handled by the driver. The solution is
> to use bounce buffers in the driver if the alignment does not meet the
> lower level requirements. This leads to surprises with performance if the
> user buffer alignment does not satisfy the requirements (e.g., malloc()
> may or may not return properly aligned blocks). These surprises should be
> avoided as far as the hardware allows.
The low-level driver can't do the bounce buffer thing; it has to be
done at higher layers.
> If an architecture has restrictions, they must, of course, be taken into
> account. However, this should not punish architectures that don't have the
> restrictions. Specifying that DMA buffers must be cache-line aligned would
> be too strict. A separate alignment constraint for DMA in general, and for
> a specific device, would be a better alternative (a device may have
> tighter restrictions than an architecture). The same applies to buffer
> sizes. This would mean adding two more masks for each device (like the
> current DMA address mask for a device).
That won't help for buffers coming from higher layers that don't know
which device they'll end up going to.
Ben.
Manfred Spraul wrote:
> Benjamin Herrenschmidt wrote:
>
>>Same for ppc32. Anyway, I don't like MUST_HWCACHE_ALIGN because it
>>just disables redzoning; I'd rather allocate more and do both redzoning
>>and cache alignment.
>>
>>
> I have a patch that creates helper functions that make that simple. The
> patch is stuck right now, because it exposes a bug in the i386 debug
> register handling. I'll add redzoning with MUST_HWCACHE_ALIGN after
> that one is in.
I have a related problem on s390: Some of my GFP_DMA data must be 8-byte
aligned, while my cache lines are 256 bytes wide. With slab debugging,
the whole structure is only 4-byte aligned and I get addressing exceptions.
When I go to MUST_HWCACHE_ALIGN, the data already grows from ~100 bytes to
a full cache line; adding redzoning would make it far worse.
Is it possible to make kmem_cache_create accept a user-specified alignment
parameter? I suppose there are other cases where you want to force
a specific alignment that is different from the L1 cache line size.
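With such an align parameter (like in the patch that follows), I could then
write something along these lines - struct my_dma_desc and the cache name
are of course made up:

/* illustrative only: a GFP_DMA-backed cache whose objects need just
 * 8-byte alignment, independent of the 256-byte s390 cache line */
static kmem_cache_t *dma_desc_cache;

static int __init dma_desc_cache_init(void)
{
	dma_desc_cache = kmem_cache_create("s390_dma_desc",
					   sizeof(struct my_dma_desc),
					   8,		/* required alignment */
					   SLAB_CACHE_DMA,
					   NULL, NULL);
	if (!dma_desc_cache)
		return -ENOMEM;
	return 0;
}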
Arnd <><
--- 2.4/mm/slab.c Wed Feb 28 14:05:52 2001
+++ build-2.4/mm/slab.c Thu Mar 1 17:45:11 2001
@@ -207,6 +207,7 @@
size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
unsigned int colour_next; /* cache colouring */
+ size_t slab_size;
kmem_cache_t *slabp_cache;
unsigned int growing;
unsigned int dflags; /* dynamic flags */
@@ -356,7 +357,7 @@
objsize: sizeof(kmem_cache_t),
flags: SLAB_NO_REAP,
spinlock: SPIN_LOCK_UNLOCKED,
- colour_off: L1_CACHE_BYTES,
+ colour_off: SMP_CACHE_BYTES,
name: "kmem_cache",
};
@@ -379,8 +380,13 @@
static void enable_all_cpucaches (void);
#endif
+static size_t aligned_size(size_t x, size_t alignment)
+{
+ return (x+alignment-1)&(~(alignment-1));
+}
+
/* Cal the num objs, wastage, and bytes left over for a given slab size. */
-static void kmem_cache_estimate (unsigned long gfporder, size_t size,
+static void kmem_cache_estimate (unsigned long gfporder, size_t size, size_t align,
int flags, size_t *left_over, unsigned int *num)
{
int i;
@@ -393,7 +399,7 @@
extra = sizeof(kmem_bufctl_t);
}
i = 0;
- while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
+ while (i*size + aligned_size(base+i*extra, align) <= wastage)
i++;
if (i > 0)
i--;
@@ -403,7 +409,7 @@
*num = i;
wastage -= i*size;
- wastage -= L1_CACHE_ALIGN(base+i*extra);
+ wastage -= aligned_size(base+i*extra, align);
*left_over = wastage;
}
@@ -415,13 +421,15 @@
init_MUTEX(&cache_chain_sem);
INIT_LIST_HEAD(&cache_chain);
- kmem_cache_estimate(0, cache_cache.objsize, 0,
+ kmem_cache_estimate(0, cache_cache.objsize, SMP_CACHE_BYTES, 0,
&left_over, &cache_cache.num);
if (!cache_cache.num)
BUG();
cache_cache.colour = left_over/cache_cache.colour_off;
cache_cache.colour_next = 0;
+ cache_cache.slab_size = aligned_size(sizeof(slab_t)
+ + cache_cache.num*sizeof(kmem_bufctl_t), SMP_CACHE_BYTES);
}
@@ -589,7 +597,7 @@
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
- * @offset: The offset to use within the page.
+ * @align: The required alignment for the objects.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
@@ -614,12 +622,12 @@
* as davem.
*/
kmem_cache_t *
-kmem_cache_create (const char *name, size_t size, size_t offset,
+kmem_cache_create (const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
const char *func_nm = KERN_ERR "kmem_create: ";
- size_t left_over, align, slab_size;
+ size_t left_over, slab_size;
kmem_cache_t *cachep = NULL;
/*
@@ -631,7 +639,7 @@
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor) ||
- (offset < 0 || offset > size))
+ (align < 0))
BUG();
#if DEBUG
@@ -647,7 +655,7 @@
flags &= ~SLAB_POISON;
}
#if FORCED_DEBUG
- if (size < (PAGE_SIZE>>3))
+ if (size < (PAGE_SIZE>>3) && (align <= BYTES_PER_WORD))
/*
* do not red zone large object, causes severe
* fragmentation.
@@ -680,38 +688,42 @@
size &= ~(BYTES_PER_WORD-1);
printk("%sForcing size word alignment - %s\n", func_nm, name);
}
-
-#if DEBUG
- if (flags & SLAB_RED_ZONE) {
- /*
- * There is no point trying to honour cache alignment
- * when redzoning.
+
+ if (align < BYTES_PER_WORD)
+ align = BYTES_PER_WORD;
+
+ /*
+ * There is no point trying to honour cache alignment
+ * when redzoning.
+ */
+ if ((flags & SLAB_HWCACHE_ALIGN) && !(flags & SLAB_RED_ZONE)) {
+ int autoalign = SMP_CACHE_BYTES;
+ /* HWCACHE_ALIGN is only a hint, squeeze multiple objects
+ * into one cache line if they fit.
+ * Otherwise we would 128 byte align the 32 byte kmalloc
+ * block on a P IV...
*/
- flags &= ~SLAB_HWCACHE_ALIGN;
- size += 2*BYTES_PER_WORD; /* words for redzone */
+ while (size < autoalign/2)
+ autoalign /= 2;
+ if (autoalign > align)
+ align = autoalign;
}
+
+#if DEBUG
+ if (flags & SLAB_RED_ZONE)
+ size += 2*BYTES_PER_WORD; /* words for redzone */
#endif
- align = BYTES_PER_WORD;
- if (flags & SLAB_HWCACHE_ALIGN)
- align = L1_CACHE_BYTES;
+
+ /* Need to adjust size so that objs are cache aligned. */
+ size = aligned_size(size, align);
/* Determine if the slab management is 'on' or 'off' slab. */
- if (size >= (PAGE_SIZE>>3))
+ if (size >= (PAGE_SIZE>>3) || align >= (PAGE_SIZE>>3))
/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
flags |= CFLGS_OFF_SLAB;
-
- if (flags & SLAB_HWCACHE_ALIGN) {
- /* Need to adjust size so that objs are cache aligned. */
- /* Small obj size, can get at least two per cache line. */
- /* FIXME: only power of 2 supported, was better */
- while (size < align/2)
- align /= 2;
- size = (size+align-1)&(~(align-1));
- }
-
/* Cal size (in pages) of slabs, and the num of objs per slab.
* This could be made much more intelligent. For now, try to avoid
* using high page-orders for slabs. When the gfp() funcs are more
@@ -720,7 +732,7 @@
do {
unsigned int break_flag = 0;
cal_wastage:
- kmem_cache_estimate(cachep->gfporder, size, flags,
+ kmem_cache_estimate(cachep->gfporder, size, align, flags,
&left_over, &cachep->num);
if (break_flag)
break;
@@ -754,24 +766,24 @@
cachep = NULL;
goto opps;
}
- slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(slab_t));
-
/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
- if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
+ slab_size = aligned_size(cachep->num*sizeof(kmem_bufctl_t)+sizeof(slab_t), align);
+
+ if ((flags & CFLGS_OFF_SLAB) && left_over >= slab_size) {
flags &= ~CFLGS_OFF_SLAB;
left_over -= slab_size;
}
+ if (flags & CFLGS_OFF_SLAB) {
+ /* really off slab. No need for manual alignment */
+ slab_size = cachep->num*sizeof(kmem_bufctl_t)+sizeof(slab_t);
+ }
- /* Offset must be a multiple of the alignment. */
- offset += (align-1);
- offset &= ~(align-1);
- if (!offset)
- offset = L1_CACHE_BYTES;
- cachep->colour_off = offset;
- cachep->colour = left_over/offset;
+ cachep->slab_size = slab_size;
+ cachep->colour_off = align;
+ cachep->colour = left_over/align;
/* init remaining fields */
if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB))
@@ -1016,8 +1028,7 @@
* if you enable OPTIMIZE
*/
slabp = objp+colour_off;
- colour_off += L1_CACHE_ALIGN(cachep->num *
- sizeof(kmem_bufctl_t) + sizeof(slab_t));
+ colour_off += cachep->slab_size;
}
slabp->inuse = 0;
slabp->colouroff = colour_off;
On Sat, 16 Aug 2003, Benjamin Herrenschmidt wrote:
...
> The low-level driver can't do the bounce buffer thing; it has to be
> done at higher layers.
>
Agreed. A high-level driver must be able to handle this anyway because of
other constraints. The point is that this costs cpu cycles and bouncing
should be avoided if possible.
> > If an architecture has restrictions, they must, of course, be taken into
> > account. However, this should not punish architectures that don't have the
...
> > sizes. This would mean adding two more masks for each device (like the
> > current DMA address mask for a device).
>
> That won't help for buffers coming from higher layers that don't know
> which device they'll end up going to.
>
Agreed. This just enables the high-level driver to avoid bouncing. For
instance, in a P4 system it is usually not necessary to require 128 byte
alignment for DMA.
--
Kai
Perhaps we should remind ourselves what the alignment rules actually are
for kmalloc:
1) No two kmalloc allocations may share cache lines (otherwise data
corruption will result).
2) kmalloc should be on a power-of-two alignment consistent with the
widest processor data access requirements.
Now note that 1) is optional (but an efficiency speedup) for a coherent
architecture. It is an *absolute requirement* for a non-coherent
architecture.
2) is essentially what's causing Ben the problems. His driver appears
to be insisting that DMA be a full PCI cache line width. I can see
arguments for making this a driver problem.
However, as far as the redzoning issue goes, I think in order to avoid
cache line interference, the redzone should be sized to be a cache line
width, at least on non-coherent architectures.
In theory, the above should solve everyone's problems.
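Roughly, as an untested sketch (only loosely following the current slab
field names, so treat the details as illustrative): when both redzoning and
cache alignment are wanted, each redzone could simply be grown to a whole
cache line so the object itself still starts and ends on a line boundary.

if ((flags & SLAB_RED_ZONE) && (flags & SLAB_HWCACHE_ALIGN)) {
	/* leading redzone occupies a full line, so the object start
	 * stays cache-line aligned */
	cachep->dbghead = L1_CACHE_BYTES;
	/* object rounded up to whole lines, plus one line of redzone
	 * on each side */
	size = L1_CACHE_ALIGN(size);
	size += 2 * L1_CACHE_BYTES;
}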
As far as I/O from user land goes (especially to tape), the users
usually can work out the alignment constraints and act accordingly. I'm
agnostic as to whether we should fail (with an error indicating
alignment problems) or rebuffer, at the cost of some throughput, in
the misaligned case.
James
On Sun, 17 Aug 2003, James Bottomley wrote:
...
> As far as I/O from user land goes (especially to tape), the users
> usually can work out the alignment constraints and act accordingly. I'm
> agnostic as to whether we should fail (with an error indicating
> alignment problems) or rebuffer, at the cost of some throughput, in
> the misaligned case.
>
I think we should rebuffer so that we don't fail writes and reads that
other systems can do.
However, I am not so optimistic about the users aligning the buffers.
According to its documentation, glibc aligns at 8 bytes, or 16 bytes on
64-bit architectures. I made st fail writes if the test
#define ST_DIO_ALIGN_OK(x) \
(((unsigned long)(x) & (L1_CACHE_BYTES - 1)) == 0)
fails on the buffer address. With a P4 kernel the result was that tar to
tape failed ;-(
A solution would be to define the address test for user buffers based on
the configuration, for example:
#if defined(CONFIG_XXX)
#define ST_DIO_ALIGN_OK(x) \
(((unsigned long)(x) & (L1_CACHE_BYTES - 1)) == 0)
#elif defined(CONFIG_YYY)
#define ST_DIO_ALIGN_OK(x) \
(((unsigned long)(x) & 7) == 0)
#else
#define ST_DIO_ALIGN_OK(x) (1)
#endif
Of course, it would be better if this were defined in a more general
place than st.c (some SCSI header, dma-mapping.h, ...?).
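Something like the following is what I have in mind; the macro names are
invented here and nothing like them exists yet:

/* hypothetical, e.g. in asm/dma-mapping.h or a SCSI header; a
 * non-coherent arch would override it with L1_CACHE_BYTES */
#ifndef ARCH_DMA_MINALIGN
#define ARCH_DMA_MINALIGN	1	/* coherent archs: no restriction */
#endif

#define DMA_DIO_ALIGN_OK(x) \
	((((unsigned long)(x)) & (ARCH_DMA_MINALIGN - 1)) == 0)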
--
Kai
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 0
// EXTRAVERSION = -test6
--- 2.6/mm/slab.c 2003-09-30 19:42:29.000000000 +0200
+++ build-2.6/mm/slab.c 2003-09-30 20:32:44.000000000 +0200
@@ -96,6 +96,12 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#ifdef __ARCH_FORCE_KMALLOCALIGN
+#define KMALLOC_ALIGN __ARCH_FORCE_KMALLOCALIGN
+#else
+#define KMALLOC_ALIGN 0
+#endif
+
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
* SLAB_RED_ZONE & SLAB_POISON.
@@ -268,6 +274,7 @@
unsigned int colour_off; /* colour offset */
unsigned int colour_next; /* cache colouring */
kmem_cache_t *slabp_cache;
+ unsigned int slab_size;
unsigned int dflags; /* dynamic flags */
/* constructor func */
@@ -488,8 +495,11 @@
.objsize = sizeof(kmem_cache_t),
.flags = SLAB_NO_REAP,
.spinlock = SPIN_LOCK_UNLOCKED,
- .colour_off = L1_CACHE_BYTES,
+ .colour_off = SMP_CACHE_BYTES,
.name = "kmem_cache",
+#if DEBUG
+ .reallen = sizeof(kmem_cache_t),
+#endif
};
/* Guard access to the cache-chain. */
@@ -523,7 +533,7 @@
static void enable_cpucache (kmem_cache_t *cachep);
/* Cal the num objs, wastage, and bytes left over for a given slab size. */
-static void cache_estimate (unsigned long gfporder, size_t size,
+static void cache_estimate (unsigned long gfporder, size_t size, size_t align,
int flags, size_t *left_over, unsigned int *num)
{
int i;
@@ -536,7 +546,7 @@
extra = sizeof(kmem_bufctl_t);
}
i = 0;
- while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
+ while (i*size + ALIGN(base+i*extra, align) <= wastage)
i++;
if (i > 0)
i--;
@@ -546,7 +556,7 @@
*num = i;
wastage -= i*size;
- wastage -= L1_CACHE_ALIGN(base+i*extra);
+ wastage -= ALIGN(base+i*extra, align);
*left_over = wastage;
}
@@ -690,14 +700,15 @@
list_add(&cache_cache.next, &cache_chain);
cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
- cache_estimate(0, cache_cache.objsize, 0,
- &left_over, &cache_cache.num);
+ cache_estimate(0, cache_cache.objsize, SMP_CACHE_BYTES, 0,
+ &left_over, &cache_cache.num);
if (!cache_cache.num)
BUG();
cache_cache.colour = left_over/cache_cache.colour_off;
cache_cache.colour_next = 0;
-
+ cache_cache.slab_size = ALIGN(cache_cache.num*sizeof(kmem_bufctl_t) + sizeof(struct slab),
+ SMP_CACHE_BYTES);
/* 2+3) create the kmalloc caches */
sizes = malloc_sizes;
@@ -711,7 +722,7 @@
* allow tighter packing of the smaller caches. */
sizes->cs_cachep = kmem_cache_create(
names->name, sizes->cs_size,
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ KMALLOC_ALIGN, 0, NULL, NULL);
if (!sizes->cs_cachep)
BUG();
@@ -723,7 +734,7 @@
sizes->cs_dmacachep = kmem_cache_create(
names->name_dma, sizes->cs_size,
- 0, SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
+ KMALLOC_ALIGN, SLAB_CACHE_DMA, NULL, NULL);
if (!sizes->cs_dmacachep)
BUG();
@@ -993,7 +1004,7 @@
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
- * @offset: The offset to use within the page.
+ * @align: The required alignment for the objects.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
@@ -1018,17 +1029,15 @@
* %SLAB_NO_REAP - Don't automatically reap this cache when we're under
* memory pressure.
*
- * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
- * cacheline. This can be beneficial if you're counting cycles as closely
- * as davem.
+ * %SLAB_HWCACHE_ALIGN - This flag has no effect and will be removed soon.
*/
kmem_cache_t *
-kmem_cache_create (const char *name, size_t size, size_t offset,
+kmem_cache_create (const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
const char *func_nm = KERN_ERR "kmem_create: ";
- size_t left_over, align, slab_size;
+ size_t left_over, slab_size;
kmem_cache_t *cachep = NULL;
/*
@@ -1039,7 +1048,7 @@
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor) ||
- (offset < 0 || offset > size))
+ (align < 0))
BUG();
#if DEBUG
@@ -1052,22 +1061,16 @@
#if FORCED_DEBUG
/*
- * Enable redzoning and last user accounting, except
- * - for caches with forced alignment: redzoning would violate the
- * alignment
- * - for caches with large objects, if the increased size would
- * increase the object size above the next power of two: caches
- * with object sizes just above a power of two have a significant
- * amount of internal fragmentation
+ * Enable redzoning and last user accounting, except for caches with
+ * large objects, if the increased size would increase the object size
+ * above the next power of two: caches with object sizes just above a
+ * power of two have a significant amount of internal fragmentation.
*/
- if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD))
- && !(flags & SLAB_MUST_HWCACHE_ALIGN)) {
+ if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD)))
flags |= SLAB_RED_ZONE|SLAB_STORE_USER;
- }
flags |= SLAB_POISON;
#endif
#endif
-
/*
* Always checks flags, a caller might be expecting debug
* support which isn't available.
@@ -1075,15 +1078,23 @@
if (flags & ~CREATE_MASK)
BUG();
+ if (align) {
+ /* minimum supported alignment: */
+ if (align < BYTES_PER_WORD)
+ align = BYTES_PER_WORD;
+
+ /* combinations of forced alignment and advanced debugging is
+ * not yet implemented.
+ */
+ flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
+ }
+
/* Get cache's description obj. */
cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto opps;
memset(cachep, 0, sizeof(kmem_cache_t));
-#if DEBUG
- cachep->reallen = size;
-#endif
/* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
@@ -1095,20 +1106,25 @@
}
#if DEBUG
+ cachep->reallen = size;
+
if (flags & SLAB_RED_ZONE) {
- /*
- * There is no point trying to honour cache alignment
- * when redzoning.
- */
- flags &= ~SLAB_HWCACHE_ALIGN;
+ /* redzoning only works with word aligned caches */
+ align = BYTES_PER_WORD;
+
/* add space for red zone words */
cachep->dbghead += BYTES_PER_WORD;
size += 2*BYTES_PER_WORD;
}
if (flags & SLAB_STORE_USER) {
- flags &= ~SLAB_HWCACHE_ALIGN;
- size += BYTES_PER_WORD; /* add space */
+ /* user store requires word alignment and
+ * one word storage behind the end of the real
+ * object.
+ */
+ align = BYTES_PER_WORD;
+ size += BYTES_PER_WORD;
}
+
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
if (size > 128 && cachep->reallen > L1_CACHE_BYTES && size < PAGE_SIZE) {
cachep->dbghead += PAGE_SIZE - size;
@@ -1116,9 +1132,6 @@
}
#endif
#endif
- align = BYTES_PER_WORD;
- if (flags & SLAB_HWCACHE_ALIGN)
- align = L1_CACHE_BYTES;
/* Determine if the slab management is 'on' or 'off' slab. */
if (size >= (PAGE_SIZE>>3))
@@ -1128,13 +1141,16 @@
*/
flags |= CFLGS_OFF_SLAB;
- if (flags & SLAB_HWCACHE_ALIGN) {
- /* Need to adjust size so that objs are cache aligned. */
- /* Small obj size, can get at least two per cache line. */
+ if (!align) {
+ /* Default alignment: compile time specified l1 cache size.
+ * But if an object is really small, then squeeze multiple
+ * into one cacheline.
+ */
+ align = L1_CACHE_BYTES;
while (size <= align/2)
align /= 2;
- size = (size+align-1)&(~(align-1));
}
+ size = ALIGN(size, align);
/* Cal size (in pages) of slabs, and the num of objs per slab.
* This could be made much more intelligent. For now, try to avoid
@@ -1144,7 +1160,7 @@
do {
unsigned int break_flag = 0;
cal_wastage:
- cache_estimate(cachep->gfporder, size, flags,
+ cache_estimate(cachep->gfporder, size, align, flags,
&left_over, &cachep->num);
if (break_flag)
break;
@@ -1178,7 +1194,7 @@
cachep = NULL;
goto opps;
}
- slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab));
+ slab_size = ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab),align);
/*
* If the slab has been placed off-slab, and we have enough space then
@@ -1189,14 +1205,17 @@
left_over -= slab_size;
}
+ if (flags & CFLGS_OFF_SLAB) {
+ /* really off slab. No need for manual alignment */
+ slab_size = cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab);
+ }
+
+ cachep->colour_off = L1_CACHE_BYTES;
/* Offset must be a multiple of the alignment. */
- offset += (align-1);
- offset &= ~(align-1);
- if (!offset)
- offset = L1_CACHE_BYTES;
- cachep->colour_off = offset;
- cachep->colour = left_over/offset;
-
+ if (cachep->colour_off < align)
+ cachep->colour_off = align;
+ cachep->colour = left_over/cachep->colour_off;
+ cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
@@ -1459,8 +1478,7 @@
return NULL;
} else {
slabp = objp+colour_off;
- colour_off += L1_CACHE_ALIGN(cachep->num *
- sizeof(kmem_bufctl_t) + sizeof(struct slab));
+ colour_off += cachep->slab_size;
}
slabp->inuse = 0;
slabp->colouroff = colour_off;
--- 2.6/arch/i386/mm/init.c 2003-09-30 19:40:44.000000000 +0200
+++ build-2.6/arch/i386/mm/init.c 2003-09-30 19:55:49.000000000 +0200
@@ -517,8 +517,8 @@
if (PTRS_PER_PMD > 1) {
pmd_cache = kmem_cache_create("pmd",
PTRS_PER_PMD*sizeof(pmd_t),
+ PTRS_PER_PMD*sizeof(pmd_t),
0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
pmd_ctor,
NULL);
if (!pmd_cache)
@@ -526,8 +526,8 @@
}
pgd_cache = kmem_cache_create("pgd",
PTRS_PER_PGD*sizeof(pgd_t),
+ PTRS_PER_PGD*sizeof(pgd_t),
0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
pgd_ctor,
PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
if (!pgd_cache)
--- 2.6/mm/rmap.c 2003-09-30 19:41:54.000000000 +0200
+++ build-2.6/mm/rmap.c 2003-09-30 19:54:34.000000000 +0200
@@ -521,8 +521,8 @@
{
pte_chain_cache = kmem_cache_create( "pte_chain",
sizeof(struct pte_chain),
+ sizeof(struct pte_chain),
0,
- SLAB_MUST_HWCACHE_ALIGN,
pte_chain_ctor,
NULL);