LinuxLists.cc - [PATCH] patch-slab-split-03-tail

2002-10-04 17:28:46

Subject: [PATCH] patch-slab-split-03-tail

--- 2.5/mm/slab.c Fri Oct 4 18:59:01 2002
+++ build-2.5/mm/slab.c Fri Oct 4 18:59:11 2002
@@ -1478,7 +1478,7 @@
} else if (unlikely(inuse == cachep->num)) {
/* Was full. */
list_del(&slabp->list);
- list_add(&slabp->list, &cachep->slabs_partial);
+ list_add_tail(&slabp->list, &cachep->slabs_partial);
}
}
}

Attachments:

patch-slab-split-03-tail (331.00 B)

2002-10-04 19:10:12

by Manfred Spraul

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

Andrew Morton wrote:
>
> Makes sense. It would be nice to get this confirmed in
> targetted testing ;)
>
Not yet done.

The right way to test it would be to collect data in kernel about
alloc/free, and then run that data against both versions, and check
which version gives less internal fragmentation.

Or perhaps Bonwick has done that for his slab paper, but I don't have it :-(

* An implementation of the Slab Allocator as described in outline in;
* UNIX Internals: The New Frontiers by Uresh Vahalia
* Pub: Prentice Hall ISBN 0-13-101908-2
* or with a little more detail in;
* The Slab Allocator: An Object-Caching Kernel Memory Allocator
* Jeff Bonwick (Sun Microsystems).
* Presented at: USENIX Summer 1994 Technical Conference

--
Manfred

2002-10-04 19:00:38

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

Manfred Spraul wrote:
>
> part 3:
> [depends on -02-SMP]
>
> If an object is freed from a slab, then move the slab to the tail of the
> partial list - this should increase the probability that the other
> objects from the same page are freed, too, and that a page can be
> returned to gfp later.
>
> The cpu arrays are now always in front of the list, i.e. cache hit rates
> should not matter.
>

Run that by me again? So we're saying "if we just freed an
object from this page then make this page be the *last* page
which is eligible for new allocations"? Under the assumption
that other objects in that same page are about to be freed
up as well?

Makes sense. It would be nice to get this confirmed in
targetted testing ;)

2002-10-04 19:04:28

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

> Run that by me again? So we're saying "if we just freed an
> object from this page then make this page be the *last* page
> which is eligible for new allocations"? Under the assumption
> that other objects in that same page are about to be freed
> up as well?
>
> Makes sense. It would be nice to get this confirmed in
> targetted testing ;)

Just doing my normal boring kernel compile suggest Manfred's
last big rollup performs exactly the same as without it. Not
sure if that's any help or not ....

M.

2002-10-04 19:09:40

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

"Martin J. Bligh" wrote:
>
> > Run that by me again? So we're saying "if we just freed an
> > object from this page then make this page be the *last* page
> > which is eligible for new allocations"? Under the assumption
> > that other objects in that same page are about to be freed
> > up as well?
> >
> > Makes sense. It would be nice to get this confirmed in
> > targetted testing ;)
>
> Just doing my normal boring kernel compile suggest Manfred's
> last big rollup performs exactly the same as without it. Not
> sure if that's any help or not ....
>

Well. This patch is supposed to decrease internal fragmentation.
We need to prove that theory. An appropriate test would be:

- boot with `mem=48m'
- untar kernel
- build kernel
- capture /proc/slabinfo

- apply patch

- repeat

- compare and explain.

I know what your reboot times are like ;) I'll do it.

2002-10-04 20:19:07

by Randy.Dunlap

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

On Fri, 4 Oct 2002, Manfred Spraul wrote:

| Andrew Morton wrote:
| >
| > Makes sense. It would be nice to get this confirmed in
| > targetted testing ;)
| >
| Not yet done.
|
| The right way to test it would be to collect data in kernel about
| alloc/free, and then run that data against both versions, and check
| which version gives less internal fragmentation.
|
| Or perhaps Bonwick has done that for his slab paper, but I don't have it :-(

Did you look at http://www.usenix.org/events/usenix01/bonwick.html
for it?

| * An implementation of the Slab Allocator as described in outline in;
| * UNIX Internals: The New Frontiers by Uresh Vahalia
| * Pub: Prentice Hall ISBN 0-13-101908-2
| * or with a little more detail in;
| * The Slab Allocator: An Object-Caching Kernel Memory Allocator
| * Jeff Bonwick (Sun Microsystems).
| * Presented at: USENIX Summer 1994 Technical Conference
| --

--
~Randy

2002-10-04 21:20:26

by Manfred Spraul

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

Randy.Dunlap wrote:
>
> Did you look at http://www.usenix.org/events/usenix01/bonwick.html
> for it?
>
Thanks for the link - that describes the newer, per-cpu extensions to
slab. Quite similar to the Linux implementation.

The text also contains a link to the original paper:

http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html

Bonwick used one partially sorted list [as linux in 2.2, and 2.4.<10],
instead of seperate lists - move tail was not an option.

The new paper contains one interesting comment:
<<<<<<<
An object cache's CPU layer contains per-CPU state that must be
protected either by per-CPU locking or by disabling interrupts. We
selected per-CPU locking for several reasons:
[...]
x Performance. On most modern processors, grabbing an uncontended
lock is cheaper than modifying the processor interrupt level.
<<<<<<<<

Which cpus have slow local_irq_disable() implementations? At least for
my Duron, this doesn't seem to be the case [~ 4 cpu cycles for cli]

--
Manfred

2002-10-04 21:38:23

by Robert Love

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

On Fri, 2002-10-04 at 17:25, Manfred Spraul wrote:

> Which cpus have slow local_irq_disable() implementations? At least
> for my Duron, this doesn't seem to be the case [~ 4 cpu cycles
> for cli]

I believe there are pipeline effects to disabling interrupts, e.g. it
has to be flushed?

Robert Love

2002-10-04 23:05:44

by Manfred Spraul

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

/*
* cli.cpp: RDTSC based performance tester.
*
* Copyright (C) 1999, 2001, 2002 by Manfred Spraul.
* All rights reserved except the rights granted by the GPL.
*
* Redistribution of this file is permitted under the terms of the GNU
* General Public License (GPL) version 2 or later.
* $Header: /pub/home/manfred/cvs-tree/timetest/cli.cpp,v 1.4 2002/10/04 21:22:09 manfred Exp $
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>

// define a cache flushing function
#undef CACHE_FLUSH

// Intel recommends that a serializing instruction
// should be called before and after rdtsc.
// CPUID is a serializing instruction.
// ".align 128:" P 4 L2 cache line size
#define read_rdtsc_before(time) \
__asm__ __volatile__( \
".align 128\n\t" \
"xor %%eax,%%eax\n\t" \
"cpuid\n\t" \
"rdtsc\n\t" \
"mov %%eax,(%0)\n\t" \
"mov %%edx,4(%0)\n\t" \
"xor %%eax,%%eax\n\t" \
"cpuid\n\t" \
: /* no output */ \
: "S"(&time) \
: "eax", "ebx", "ecx", "edx", "memory")

#define read_rdtsc_after(time) \
__asm__ __volatile__( \
"xor %%eax,%%eax\n\t" \
"cpuid\n\t" \
"rdtsc\n\t" \
"mov %%eax,(%0)\n\t" \
"mov %%edx,4(%0)\n\t" \
"xor %%eax,%%eax\n\t" \
"cpuid\n\t" \
"sti\n\t" \
: /* no output */ \
: "S"(&time) \
: "eax", "ebx", "ecx", "edx", "memory")

#define BUILD_TESTFNC(name, text, instructions) \
void name##_dummy(void) \
{ \
__asm__ __volatile__( \
".align 4096\n\t" \
"xor %%eax, %%eax\n\t" \
: : : "eax"); \
} \
static unsigned long name##_best = 1024*1024*1024; \
\
static void name(void) \
{ \
unsigned long long time; \
unsigned long long time2; \
\
read_rdtsc_before(time); \
instructions; \
read_rdtsc_after(time2); \
if(time2-time < name##_best) { \
printf( text ":\t%10Ld ticks; \n", \
time2-time-zerotest_best); \
name##_best = time2-time; \
} \
}

void filler(void)
{
static int i = 3;
static int j;
j = i*i;
}

#define DO_3(x) \
do { x; x; x; } while(0)

#define DO_10(x) \
do { x; x; x; x; x; x; x; x; x; x;} while(0)

#define DO_50(x) \
do { DO_10(x); DO_10(x);DO_10(x); DO_10(x);DO_10(x);} while(0)

#define DO_T(y) do { \
DO_3(filler()); \
y; \
DO_3(filler());} while(0)

#ifdef CACHE_FLUSH
#define DRAIN_SZ (4*1024*1024)
int other[3*DRAIN_SZ] __attribute ((aligned (4096)));
static inline void drain_cache(void)
{
int i;
for(i=0;i<DRAIN_SZ;i++) other[DRAIN_SZ+i]=0;
for(i=0;i<DRAIN_SZ;i++) if(other[DRAIN_SZ+i]!=0) break;
}
#else
static inline void drain_cache(void)
{
}
#endif

#define DO_TEST(x) \
do { \
int i; \
for(i=0;i<500000;i++) \
x; \
} while(0)

//////////////////////////////////////////////////////////////////////////////

static inline void nothing()
{
__asm__ __volatile__("nop": : : "memory");
}

BUILD_TESTFNC(zerotest,"zerotest", DO_T(nothing()));

//////////////////////////////////////////////////////////////////////////////

static inline void test0()
{
__asm__ __volatile__("cli": : : "memory");
}

BUILD_TESTFNC(test_0, "cli", DO_T(test0()))

//////////////////////////////////////////////////////////////////////////////
extern "C" int iopl __P ((int __level));

int main()
{
printf("CLI bench\n");
iopl(3);

for(;;) {
DO_TEST(zerotest());
DO_TEST(test_0());
}
return 0;
}

Attachments:

cli.cpp (3.22 kB)

2002-10-05 01:22:10

by Anton Blanchard

[permalink] [raw]

Subject: Re: [PATCH] patch-slab-split-03-tail

> <<<<<<<
> An object cache's CPU layer contains per-CPU state that must be
> protected either by per-CPU locking or by disabling interrupts. We
> selected per-CPU locking for several reasons:
> [...]
> x Performance. On most modern processors, grabbing an uncontended
> lock is cheaper than modifying the processor interrupt level.
> <<<<<<<<
>
> Which cpus have slow local_irq_disable() implementations? At least for
> my Duron, this doesn't seem to be the case [~ 4 cpu cycles for cli]

Rusty did some tests and found on the intel chips he tested
local_irq_disable was slower. He posted the results to lkml a few weeks
ago.

On ppc64 it varies between chips.

Anton