2012-06-08 17:25:00

by Joonsoo Kim

Subject: [PATCH 1/4] slub: change declare of get_slab() to inline at all times

__kmalloc() and its variants are invoked very frequently and are
performance-critical functions, so their callee functions are declared
'__always_inline'. Currently, however, get_slab() is not declared
'__always_inline', and as a result __kmalloc() and its variants make an
actual call to get_slab() on x86. This is not desirable, so declare it
inline at all times.
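
As an aside, a minimal userspace sketch of the idea (not part of the patch):
in the kernel, __always_inline expands to inline __attribute__((always_inline)),
which forces the lookup to be expanded into every caller. The helper and
table below are invented for the example.

#include <stdio.h>
#include <stddef.h>

#define __always_inline inline __attribute__((always_inline))

/* Toy stand-in for the size-index table used by get_slab(). */
static const int size_index[8] = { 3, 4, 5, 5, 6, 6, 6, 6 };

/* Forced inline: the table lookup is expanded at every call site, so a
 * hot caller such as __kmalloc() pays no extra function-call overhead. */
static __always_inline int pick_cache_index(size_t size)
{
        return size_index[(size - 1) / 8];
}

int main(void)
{
        printf("index for a 24-byte allocation: %d\n", pick_cache_index(24));
        return 0;
}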

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/slub.c b/mm/slub.c
index 71de9b5..30ceb6d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3320,7 +3320,7 @@ static inline int size_index_elem(size_t bytes)
return (bytes - 1) / 8;
}

-static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+static __always_inline struct kmem_cache *get_slab(size_t size, gfp_t flags)
{
int index;

--
1.7.9.5


2012-06-08 17:25:08

by Joonsoo Kim

Subject: [PATCH 2/4] slub: use __cmpxchg_double_slab() at interrupt disabled place

get_freelist() and unfreeze_partials() are only called with interrupts
disabled, so __cmpxchg_double_slab(), which requires interrupts to be
disabled, is suitable for them.
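
For readers outside the slab code, a hedged userspace analogy (not part of
the patch) of why the cheaper helper is enough here: roughly, cmpxchg_double_slab()
protects its fallback path by disabling interrupts itself, while
__cmpxchg_double_slab() assumes the caller has already done so. The toy
functions below are invented for the sketch.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model: a flag standing in for the CPU interrupt-enable state and a
 * counter standing in for the data the real code updates. */
static bool irqs_enabled = true;
static long counter;

static void toy_irq_save(bool *flags)
{
        *flags = irqs_enabled;
        irqs_enabled = false;
}

static void toy_irq_restore(bool flags)
{
        irqs_enabled = flags;
}

/* Safe variant: pays for save/restore so it can be called from anywhere. */
static void update_counter(long delta)
{
        bool flags;

        toy_irq_save(&flags);
        counter += delta;
        toy_irq_restore(flags);
}

/* Cheap variant: relies on the caller having already disabled "interrupts",
 * the way get_freelist() and unfreeze_partials() already have. */
static void __update_counter(long delta)
{
        assert(!irqs_enabled);
        counter += delta;
}

int main(void)
{
        bool flags;

        update_counter(1);              /* callable anywhere */

        toy_irq_save(&flags);           /* caller provides the protection... */
        __update_counter(1);            /* ...so the cheap variant suffices */
        toy_irq_restore(flags);

        printf("counter = %ld\n", counter);
        return 0;
}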

Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/slub.c b/mm/slub.c
index 30ceb6d..686ed90 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1879,7 +1879,11 @@ redo:
}
}

-/* Unfreeze all the cpu partial slabs */
+/*
+ * Unfreeze all the cpu partial slabs.
+ *
+ * This function must be called with interrupt disabled.
+ */
static void unfreeze_partials(struct kmem_cache *s)
{
struct kmem_cache_node *n = NULL;
@@ -1935,7 +1939,7 @@ static void unfreeze_partials(struct kmem_cache *s)
l = m;
}

- } while (!cmpxchg_double_slab(s, page,
+ } while (!__cmpxchg_double_slab(s, page,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"));
@@ -2163,6 +2167,8 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
* The page is still frozen if the return value is not NULL.
*
* If this function returns NULL then the page has been unfrozen.
+ *
+ * This function must be called with interrupt disabled.
*/
static inline void *get_freelist(struct kmem_cache *s, struct page *page)
{
@@ -2179,7 +2185,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
new.inuse = page->objects;
new.frozen = freelist != NULL;

- } while (!cmpxchg_double_slab(s, page,
+ } while (!__cmpxchg_double_slab(s, page,
freelist, counters,
NULL, new.counters,
"get_freelist"));
--
1.7.9.5

2012-06-08 17:25:19

by Joonsoo Kim

Subject: [PATCH 4/4] slub: deactivate freelist of kmem_cache_cpu all at once in deactivate_slab()

The current implementation of deactivate_slab(), which deactivates the
kmem_cache_cpu freelist one object at a time, is inefficient. This patch
changes it to deactivate the freelist all at once. There is no overall
performance benefit, however, because deactivate_slab() is invoked
infrequently.
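
A hedged userspace sketch of the "all at once" idea (not the kernel code;
the structure and function names are invented): walk the per-cpu freelist
once to find its tail and length, then splice the whole chain onto the page
freelist and adjust the in-use count in a single step, rather than
returning objects one by one.

#include <stdio.h>

struct object {
        struct object *next;
};

/* Splice the whole cpu freelist onto the page freelist at once instead
 * of pushing the objects back one by one. */
static unsigned int splice_freelist(struct object **page_freelist,
                                    struct object *cpu_freelist)
{
        struct object *tail = NULL, *p;
        unsigned int nr_free = 0;

        /* One pass to find the tail and count the objects. */
        for (p = cpu_freelist; p; p = p->next) {
                tail = p;
                nr_free++;
        }

        if (tail) {
                tail->next = *page_freelist;    /* chain old list behind ours */
                *page_freelist = cpu_freelist;  /* single head update */
        }
        return nr_free;                         /* caller subtracts from inuse once */
}

int main(void)
{
        struct object a, b, c, d;
        struct object *page_freelist = &d;      /* objects already free in the page */
        struct object *cpu_freelist = &a;       /* per-cpu objects to give back */

        a.next = &b;
        b.next = &c;
        c.next = NULL;
        d.next = NULL;

        unsigned int nr_free = splice_freelist(&page_freelist, cpu_freelist);

        printf("returned %u objects in one step\n", nr_free);
        return 0;
}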

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/slub.c b/mm/slub.c
index b5f2108..7bcb434 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1733,16 +1733,14 @@ void init_kmem_cache_cpus(struct kmem_cache *s)
*/
static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
struct page *page = c->page;
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
- int lock = 0;
- enum slab_modes l = M_NONE, m = M_NONE;
- void *freelist;
- void *nextfree;
- int tail = DEACTIVATE_TO_HEAD;
+ void *freelist, *lastfree = NULL;
+ unsigned int nr_free = 0;
struct page new;
- struct page old;
+ void *prior;
+ unsigned long counters;
+ int lock = 0, tail = DEACTIVATE_TO_HEAD;

if (page->freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
@@ -1752,127 +1750,54 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
c->tid = next_tid(c->tid);
c->page = NULL;
freelist = c->freelist;
- c->freelist = NULL;
-
- /*
- * Stage one: Free all available per cpu objects back
- * to the page freelist while it is still frozen. Leave the
- * last one.
- *
- * There is no need to take the list->lock because the page
- * is still frozen.
- */
- while (freelist && (nextfree = get_freepointer(s, freelist))) {
- void *prior;
- unsigned long counters;
-
- do {
- prior = page->freelist;
- counters = page->counters;
- set_freepointer(s, freelist, prior);
- new.counters = counters;
- new.inuse--;
- VM_BUG_ON(!new.frozen);
-
- } while (!__cmpxchg_double_slab(s, page,
- prior, counters,
- freelist, new.counters,
- "drain percpu freelist"));
-
- freelist = nextfree;
+ while (freelist) {
+ lastfree = freelist;
+ freelist = get_freepointer(s, freelist);
+ nr_free++;
}

- /*
- * Stage two: Ensure that the page is unfrozen while the
- * list presence reflects the actual number of objects
- * during unfreeze.
- *
- * We setup the list membership and then perform a cmpxchg
- * with the count. If there is a mismatch then the page
- * is not unfrozen but the page is on the wrong list.
- *
- * Then we restart the process which may have to remove
- * the page from the list that we just put it on again
- * because the number of objects in the slab may have
- * changed.
- */
-redo:
+ freelist = c->freelist;
+ c->freelist = NULL;

- old.freelist = page->freelist;
- old.counters = page->counters;
- VM_BUG_ON(!old.frozen);
+ do {
+ if (lock) {
+ lock = 0;
+ spin_unlock(&n->list_lock);
+ }

- /* Determine target state of the slab */
- new.counters = old.counters;
- if (freelist) {
- new.inuse--;
- set_freepointer(s, freelist, old.freelist);
- new.freelist = freelist;
- } else
- new.freelist = old.freelist;
+ prior = page->freelist;
+ counters = page->counters;

- new.frozen = 0;
+ if (lastfree)
+ set_freepointer(s, lastfree, prior);
+ else
+ freelist = prior;

- if (!new.inuse && n->nr_partial > s->min_partial)
- m = M_FREE;
- else if (new.freelist) {
- m = M_PARTIAL;
- if (!lock) {
- lock = 1;
- /*
- * Taking the spinlock removes the possiblity
- * that acquire_slab() will see a slab page that
- * is frozen
- */
- spin_lock(&n->list_lock);
- }
- } else {
- m = M_FULL;
- if (kmem_cache_debug(s) && !lock) {
+ new.counters = counters;
+ VM_BUG_ON(!new.frozen);
+ new.inuse -= nr_free;
+ new.frozen = 0;
+
+ if (new.inuse || n->nr_partial <= s->min_partial) {
lock = 1;
- /*
- * This also ensures that the scanning of full
- * slabs from diagnostic functions will not see
- * any frozen slabs.
- */
spin_lock(&n->list_lock);
}
- }
-
- if (l != m) {
-
- if (l == M_PARTIAL)
-
- remove_partial(n, page);
-
- else if (l == M_FULL)
-
- remove_full(s, page);
-
- if (m == M_PARTIAL) {

+ } while (!__cmpxchg_double_slab(s, page,
+ prior, counters,
+ freelist, new.counters,
+ "drain percpu freelist"));
+ if (lock) {
+ if (kmem_cache_debug(s) && !freelist) {
+ add_full(s, n, page);
+ stat(s, DEACTIVATE_FULL);
+ } else {
add_partial(n, page, tail);
stat(s, tail);
-
- } else if (m == M_FULL) {
-
- stat(s, DEACTIVATE_FULL);
- add_full(s, n, page);
-
}
- }
-
- l = m;
- if (!__cmpxchg_double_slab(s, page,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab"))
- goto redo;
-
- if (lock)
spin_unlock(&n->list_lock);
-
- if (m == M_FREE) {
+ } else {
+ VM_BUG_ON(new.inuse);
stat(s, DEACTIVATE_EMPTY);
discard_slab(s, page);
stat(s, FREE_SLAB);
--
1.7.9.5

2012-06-08 17:25:51

by Joonsoo Kim

Subject: [PATCH 3/4] slub: refactoring unfreeze_partials()

The current implementation of unfreeze_partials() is quite complicated,
but the benefit from that complexity is insignificant. In addition, the
large amount of code inside the do {} while loop increases the failure
rate of cmpxchg_double_slab(). Under the current implementation, which
tests the state of the cpu partial slab and acquires list_lock inside
the do {} while loop, we avoid taking list_lock, and gain a little, only
when the slab at the front of the cpu partial list is to be discarded,
and that is a rare case. And if add_partial() has been performed and
cmpxchg_double_slab() then fails, remove_partial() has to be called to
back it out, case by case.

I think these are disadvantages of the current implementation, so this
patch refactors unfreeze_partials().
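
A hedged, single-word sketch of the pattern the refactoring moves toward
(names invented; the real code updates freelist and counters together with
__cmpxchg_double_slab()): keep nothing but reload, compute and retry inside
the loop, and make the lock and list decisions only after the update has
succeeded.

#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for page->counters: the low bit plays the role of "frozen". */
static _Atomic unsigned long counters = 1;

/* Keep the cmpxchg loop minimal: read, compute the new value, retry on
 * failure.  No locking or list manipulation happens inside the loop. */
static unsigned long unfreeze(void)
{
        unsigned long old, new;

        do {
                old = atomic_load(&counters);
                new = old & ~1UL;               /* clear the "frozen" bit */
        } while (!atomic_compare_exchange_weak(&counters, &old, new));

        return new;
}

int main(void)
{
        unsigned long new = unfreeze();

        /* Decide what to do with the slab only after the update succeeded,
         * as the refactored unfreeze_partials() does. */
        if (new == 0)
                printf("would discard the empty slab\n");
        else
                printf("would put the slab back on the partial list\n");
        return 0;
}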

Minimizing the code inside the do {} while loop reduces the failure rate
of cmpxchg_double_slab(). Below is the output of 'slabinfo -r kmalloc-256'
after running './perf stat -r 33 hackbench 50 process 4000 > /dev/null'.

** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0

** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1

We can see that the cmpxchg_double_slab() failure rate improves slightly.

Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.

** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):

108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )

13.691837307 seconds time elapsed ( +- 0.24% )

** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):

107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )

13.596272341 seconds time elapsed ( +- 0.22% )

No regression is found; in fact, the result is slightly better.

Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/slub.c b/mm/slub.c
index 686ed90..b5f2108 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1886,18 +1886,24 @@ redo:
*/
static void unfreeze_partials(struct kmem_cache *s)
{
- struct kmem_cache_node *n = NULL;
+ struct kmem_cache_node *n = NULL, *n2 = NULL;
struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
struct page *page, *discard_page = NULL;

while ((page = c->partial)) {
- enum slab_modes { M_PARTIAL, M_FREE };
- enum slab_modes l, m;
struct page new;
struct page old;

c->partial = page->next;
- l = M_FREE;
+
+ n2 = get_node(s, page_to_nid(page));
+ if (n != n2) {
+ if (n)
+ spin_unlock(&n->list_lock);
+
+ n = n2;
+ spin_lock(&n->list_lock);
+ }

do {

@@ -1910,43 +1916,17 @@ static void unfreeze_partials(struct kmem_cache *s)

new.frozen = 0;

- if (!new.inuse && (!n || n->nr_partial > s->min_partial))
- m = M_FREE;
- else {
- struct kmem_cache_node *n2 = get_node(s,
- page_to_nid(page));
-
- m = M_PARTIAL;
- if (n != n2) {
- if (n)
- spin_unlock(&n->list_lock);
-
- n = n2;
- spin_lock(&n->list_lock);
- }
- }
-
- if (l != m) {
- if (l == M_PARTIAL) {
- remove_partial(n, page);
- stat(s, FREE_REMOVE_PARTIAL);
- } else {
- add_partial(n, page,
- DEACTIVATE_TO_TAIL);
- stat(s, FREE_ADD_PARTIAL);
- }
-
- l = m;
- }
-
} while (!__cmpxchg_double_slab(s, page,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"));

- if (m == M_FREE) {
+ if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) {
page->next = discard_page;
discard_page = page;
+ } else {
+ add_partial(n, page, DEACTIVATE_TO_TAIL);
+ stat(s, FREE_ADD_PARTIAL);
}
}

--
1.7.9.5

Subject: Re: [PATCH 1/4] slub: change declare of get_slab() to inline at all times

On Sat, 9 Jun 2012, Joonsoo Kim wrote:

> -static struct kmem_cache *get_slab(size_t size, gfp_t flags)
> +static __always_inline struct kmem_cache *get_slab(size_t size, gfp_t flags)

I thought that the compiler felt totally free to inline static functions
at will? This may be a matter of compiler optimization settings. I can
understand the use of always_inline in a header file but why in a .c file?

Subject: Re: [PATCH 4/4] slub: deactivate freelist of kmem_cache_cpu all at once in deactivate_slab()

On Sat, 9 Jun 2012, Joonsoo Kim wrote:

> The current implementation of deactivate_slab(), which deactivates the
> kmem_cache_cpu freelist one object at a time, is inefficient. This patch
> changes it to deactivate the freelist all at once. There is no overall
> performance benefit, however, because deactivate_slab() is invoked
> infrequently.

Hmm, deactivate freelist can race with slab_free. Need to look at this in
detail.

2012-06-09 15:57:50

by Joonsoo Kim

Subject: Re: [PATCH 1/4] slub: change declare of get_slab() to inline at all times

2012/6/9 Christoph Lameter <[email protected]>:
> On Sat, 9 Jun 2012, Joonsoo Kim wrote:
>
>> -static struct kmem_cache *get_slab(size_t size, gfp_t flags)
>> +static __always_inline struct kmem_cache *get_slab(size_t size, gfp_t flags)
>
> I thought that the compiler felt totally free to inline static functions
> at will? This may be a matter of compiler optimization settings. I can
> understand the use of always_inline in a header file but why in a .c file?

Yes, but with -O2 the compiler does not inline get_slab() when it is
declared just 'static'. I think inlining get_slab() has a performance
benefit, so I add '__always_inline' to its declaration. Other functions
such as slab_alloc() and slab_free() also use '__always_inline' in the .c
file (slub.c).

2012-06-10 10:27:29

by Joonsoo Kim

Subject: Re: [PATCH 4/4] slub: deactivate freelist of kmem_cache_cpu all at once in deactivate_slab()

2012/6/9 Christoph Lameter <[email protected]>:
> On Sat, 9 Jun 2012, Joonsoo Kim wrote:
>
>> The current implementation of deactivate_slab(), which deactivates the
>> kmem_cache_cpu freelist one object at a time, is inefficient. This patch
>> changes it to deactivate the freelist all at once. There is no overall
>> performance benefit, however, because deactivate_slab() is invoked
>> infrequently.
>
> Hmm, deactivate freelist can race with slab_free. Need to look at this in
> detail.

The implemented logic is nearly the same as the previous one; I just
merged the first step of the previous deactivate_slab() into the second
one. On a cmpxchg_double_slab() failure, reloading page->freelist and
page->counters and recomputing inuse ensures that a race with slab_free()
is not possible. In the case where we need the lock, it is taken before
invoking cmpxchg_double_slab(), so a race with slab_free() cannot occur
there either.

The above is my humble opinion; please give me some comments.
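
A hedged userspace sketch of the reload-and-retry argument above (not from
the thread; the packed counter layout is invented): because a failed
compare-and-exchange refreshes the expected value and the new state is
recomputed from it, a racing "free" can never be lost.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Packed stand-in for page->counters: bit 0 is "frozen",
 * the remaining bits count objects in use. */
static _Atomic unsigned long counters = (10UL << 1) | 1;   /* inuse=10, frozen */

static const unsigned long nr_cpu_free = 3;                /* per-cpu objects */

/* Remote frees: each one retries until its decrement of inuse lands. */
static void *remote_free(void *arg)
{
        (void)arg;
        for (int i = 0; i < 4; i++) {
                unsigned long old = atomic_load(&counters), new;

                do {
                        new = old - (1UL << 1);
                } while (!atomic_compare_exchange_weak(&counters, &old, new));
        }
        return NULL;
}

/* Deactivate: on every retry, the new value is recomputed from the freshly
 * reloaded one, so a racing remote_free() can never be lost. */
static void *deactivate(void *arg)
{
        (void)arg;
        unsigned long old = atomic_load(&counters), new;

        do {
                new = (old - (nr_cpu_free << 1)) & ~1UL;   /* drop inuse, unfreeze */
        } while (!atomic_compare_exchange_weak(&counters, &old, new));
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, remote_free, NULL);
        pthread_create(&t2, NULL, deactivate, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Always 10 - 4 - 3 = 3, whatever the interleaving. */
        printf("inuse = %lu\n", atomic_load(&counters) >> 1);
        return 0;
}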

Subject: Re: [PATCH 1/4] slub: change declare of get_slab() to inline at all times

On Sun, 10 Jun 2012, JoonSoo Kim wrote:

> 2012/6/9 Christoph Lameter <[email protected]>:
> > On Sat, 9 Jun 2012, Joonsoo Kim wrote:
> >
> >> -static struct kmem_cache *get_slab(size_t size, gfp_t flags)
> >> +static __always_inline struct kmem_cache *get_slab(size_t size, gfp_t flags)
> >
> > I thought that the compiler felt totally free to inline static functions
> > at will? This may be a matter of compiler optimization settings. I can
> > understand the use of always_inline in a header file but why in a .c file?
>
> Yes, but with -O2 the compiler does not inline get_slab() when it is
> declared just 'static'. I think inlining get_slab() has a performance
> benefit, so I add '__always_inline' to its declaration. Other functions
> such as slab_alloc() and slab_free() also use '__always_inline' in the
> .c file (slub.c).

Yea I thought about removing those since I would think that the compiler
should be doing the right thing.

Does gcc inline with higher optimization settings?

2012-06-20 07:19:59

by Pekka Enberg

Subject: Re: [PATCH 3/4] slub: refactoring unfreeze_partials()

On Fri, Jun 8, 2012 at 8:23 PM, Joonsoo Kim <[email protected]> wrote:
> The current implementation of unfreeze_partials() is quite complicated,
> but the benefit from that complexity is insignificant. In addition, the
> large amount of code inside the do {} while loop increases the failure
> rate of cmpxchg_double_slab(). Under the current implementation, which
> tests the state of the cpu partial slab and acquires list_lock inside
> the do {} while loop, we avoid taking list_lock, and gain a little, only
> when the slab at the front of the cpu partial list is to be discarded,
> and that is a rare case. And if add_partial() has been performed and
> cmpxchg_double_slab() then fails, remove_partial() has to be called to
> back it out, case by case.
>
> I think these are disadvantages of the current implementation, so this
> patch refactors unfreeze_partials().
>
> Minimizing the code inside the do {} while loop reduces the failure rate
> of cmpxchg_double_slab(). Below is the output of 'slabinfo -r kmalloc-256'
> after running './perf stat -r 33 hackbench 50 process 4000 > /dev/null'.
>
> ** before **
> Cmpxchg_double Looping
> ------------------------
> Locked Cmpxchg Double redos 182685
> Unlocked Cmpxchg Double redos 0
>
> ** after **
> Cmpxchg_double Looping
> ------------------------
> Locked Cmpxchg Double redos 177995
> Unlocked Cmpxchg Double redos 1
>
> We can see that the cmpxchg_double_slab() failure rate improves slightly.
>
> Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
>
> ** before **
> Performance counter stats for './hackbench 50 process 4000' (30 runs):
>
> 108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
> 2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
> 100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
> 124,201 page-faults # 0.001 M/sec ( +- 0.15% )
> 401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
> <not supported> stalled-cycles-frontend
> <not supported> stalled-cycles-backend
> 250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
> 45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
> 188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
>
> 13.691837307 seconds time elapsed ( +- 0.24% )
>
> ** after **
> Performance counter stats for './hackbench 50 process 4000' (30 runs):
>
> 107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
> 2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
> 93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
> 123,967 page-faults # 0.001 M/sec ( +- 0.15% )
> 398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
> <not supported> stalled-cycles-frontend
> <not supported> stalled-cycles-backend
> 250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
> 45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
> 169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
>
> 13.596272341 seconds time elapsed ( +- 0.22% )
>
> No regression is found; in fact, the result is slightly better.
>
> Acked-by: Christoph Lameter <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>

Applied, thanks!


2012-06-22 18:34:29

by Joonsoo Kim

Subject: Re: [PATCH 4/4] slub: deactivate freelist of kmem_cache_cpu all at once in deactivate_slab()

2012/6/10 JoonSoo Kim <[email protected]>:
> 2012/6/9 Christoph Lameter <[email protected]>:
>> On Sat, 9 Jun 2012, Joonsoo Kim wrote:
>>
>>> The current implementation of deactivate_slab(), which deactivates the
>>> kmem_cache_cpu freelist one object at a time, is inefficient. This patch
>>> changes it to deactivate the freelist all at once. There is no overall
>>> performance benefit, however, because deactivate_slab() is invoked
>>> infrequently.
>>
>> Hmm, deactivate freelist can race with slab_free. Need to look at this in
>> detail.
>
> The implemented logic is nearly the same as the previous one; I just
> merged the first step of the previous deactivate_slab() into the second
> one. On a cmpxchg_double_slab() failure, reloading page->freelist and
> page->counters and recomputing inuse ensures that a race with slab_free()
> is not possible. In the case where we need the lock, it is taken before
> invoking cmpxchg_double_slab(), so a race with slab_free() cannot occur
> there either.
>
> The above is my humble opinion; please give me some comments.

Hi Pekka and Christoph.
Could you give me some comments about this, please?