Subject: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

We had to insert a preempt enable/disable pair in the fastpath a while ago.
This was mainly due to the amount of state that has to be kept consistent
when allocating from the per cpu freelist. In particular, the page field is
not covered by the this_cpu_cmpxchg used in the fastpath to do the necessary
atomic state change for fastpath allocation and freeing.

This patchset removes the need for the page field to describe the state of
the per cpu list. The freelist pointer can be used to determine the page
struct address if necessary.

However, this currently does not work for the termination value of a list,
which is NULL and the same for all slab pages. If we instead use a valid
pointer into the page with the low bit set, then all freelist pointers can
always be used to determine the address of the page struct and we no longer
need the page field in the per cpu area for a slab. Testing for the end of
the list becomes a test of whether the low bit is set.
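
A minimal sketch of the encoding (helper names here are illustrative, not
the ones used in the patches):

/*
 * Sketch only: the end-of-list marker is a pointer into the slab page
 * itself with the low bit set, so that every freelist value (terminator
 * included) identifies its page struct.
 */
static inline void *freelist_terminator(struct page *page)
{
	/* A valid address within the page, tagged via bit 0. */
	return (void *)((unsigned long)page_address(page) | 1UL);
}

static inline bool freelist_is_end(const void *fp)
{
	/* End of list <=> low bit set. */
	return (unsigned long)fp & 1UL;
}

static inline struct page *freelist_page(const void *fp)
{
	/* Works for object pointers and for the terminator alike. */
	return virt_to_head_page((void *)((unsigned long)fp & ~1UL));
}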

So the first patch changes the termination pointer for freelists to do just
that. The second removes the page field, and the third can then remove the
preempt enable/disable.

There are currently a number of caveats because we are adding calls to
page_address() and virt_to_head_page() in a number of code paths. These
can hopefully be removed one way or the other.

Removing the ->page field reduces the cache footprint of the fastpath, so
hopefully overall allocator effectiveness will increase further. Also, RT
uses full preemption, which means that currently pretty expensive code has
to be inserted into the fastpath. This approach allows the removal of that
code and a corresponding performance increase.



2014-10-23 08:08:51

by Joonsoo Kim

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Wed, Oct 22, 2014 at 10:55:17AM -0500, Christoph Lameter wrote:
> We had to insert a preempt enable/disable pair in the fastpath a while ago.
> This was mainly due to the amount of state that has to be kept consistent
> when allocating from the per cpu freelist. In particular, the page field is
> not covered by the this_cpu_cmpxchg used in the fastpath to do the necessary
> atomic state change for fastpath allocation and freeing.
>
> This patchset removes the need for the page field to describe the state of
> the per cpu list. The freelist pointer can be used to determine the page
> struct address if necessary.
>
> However, this currently does not work for the termination value of a list,
> which is NULL and the same for all slab pages. If we instead use a valid
> pointer into the page with the low bit set, then all freelist pointers can
> always be used to determine the address of the page struct and we no longer
> need the page field in the per cpu area for a slab. Testing for the end of
> the list becomes a test of whether the low bit is set.
>
> So the first patch changes the termination pointer for freelists to do just
> that. The second removes the page field, and the third can then remove the
> preempt enable/disable.
>
> There are currently a number of caveats because we are adding calls to
> page_address() and virt_to_head_page() in a number of code paths. These
> can hopefully be removed one way or the other.
>
> Removing the ->page field reduces the cache footprint of the fastpath, so
> hopefully overall allocator effectiveness will increase further. Also, RT
> uses full preemption, which means that currently pretty expensive code has
> to be inserted into the fastpath. This approach allows the removal of that
> code and a corresponding performance increase.
>

Hello, Christoph.

Does preemption disable over such a short code section really cause a large
problem for RT?

And, if the page_address() and virt_to_head_page() calls remain as in the
current patchset, this would perform worse than before.

I looked at the patchset quickly and found another idea to remove the
preemption disable. How about just retrieving s->cpu_slab->tid first,
before accessing s->cpu_slab, in slab_alloc() and slab_free()?
Retrieving the tid first may ensure that migration to another CPU is
detected, so we can remove the code for preemption disable.

The following patch implements the above idea.

Thanks.

------------->8------------------------
diff --git a/mm/slub.c b/mm/slub.c
index ae7b9f1..af622d8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2386,28 +2386,21 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
s = memcg_kmem_get_cache(s, gfpflags);
redo:
/*
+ * The transaction ids are globally unique per cpu and per operation on
+ * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
+ * occurs on the right processor and that there was no operation on the
+ * linked list in between.
+ */
+ tid = this_cpu_read(s->cpu_slab->tid);
+
+ /*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
* enabled. We may switch back and forth between cpus while
* reading from one cpu area. That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg.
- *
- * Preemption is disabled for the retrieval of the tid because that
- * must occur from the current processor. We cannot allow rescheduling
- * on a different processor between the determination of the pointer
- * and the retrieval of the tid.
*/
- preempt_disable();
c = this_cpu_ptr(s->cpu_slab);

- /*
- * The transaction ids are globally unique per cpu and per operation on
- * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
- * occurs on the right processor and that there was no operation on the
- * linked list in between.
- */
- tid = c->tid;
- preempt_enable();
-
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node))) {
@@ -2646,18 +2639,16 @@ static __always_inline void slab_free(struct kmem_cache *s,
slab_free_hook(s, x);

redo:
+ tid = this_cpu_read(s->cpu_slab->tid);
+
/*
* Determine the currently cpus per cpu slab.
* The cpu may change afterward. However that does not matter since
* data is retrieved via this pointer. If we are on the same cpu
* during the cmpxchg then the free will succedd.
*/
- preempt_disable();
c = this_cpu_ptr(s->cpu_slab);

- tid = c->tid;
- preempt_enable();
-
if (likely(page == c->page)) {
set_freepointer(s, object, c->freelist);

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Thu, 23 Oct 2014, Joonsoo Kim wrote:

> Does preemption disable over such a short code section really cause a large
> problem for RT?

This is the hotpath and preempt enable/disable adds a significant number
of cycles.

> And, if the page_address() and virt_to_head_page() calls remain as in the
> current patchset, this would perform worse than before.

Right.

> I looked at the patchset quickly and found another idea to remove the
> preemption disable. How about just retrieving s->cpu_slab->tid first,
> before accessing s->cpu_slab, in slab_alloc() and slab_free()?
> Retrieving the tid first may ensure that migration to another CPU is
> detected, so we can remove the code for preemption disable.

You cannot do any of these things because you need the tid from the right
cpu and the scheduler can preempt you and reschedule you on another
processor at will. tid and c may be from different per cpu areas.

2014-10-24 04:55:29

by Joonsoo Kim

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Thu, Oct 23, 2014 at 09:18:29AM -0500, Christoph Lameter wrote:
> On Thu, 23 Oct 2014, Joonsoo Kim wrote:
>
> > Does preemption disable over such a short code section really cause a large
> > problem for RT?
>
> This is the hotpath and preempt enable/disable adds a significant number
> of cycles.
>
> > And, if the page_address() and virt_to_head_page() calls remain as in the
> > current patchset, this would perform worse than before.
>
> Right.
>
> > I looked at the patchset quickly and found another idea to remove the
> > preemption disable. How about just retrieving s->cpu_slab->tid first,
> > before accessing s->cpu_slab, in slab_alloc() and slab_free()?
> > Retrieving the tid first may ensure that migration to another CPU is
> > detected, so we can remove the code for preemption disable.
>
> You cannot do any of these things because you need the tid from the right
> cpu and the scheduler can preempt you and reschedule you on another
> processor at will. tid and c may be from different per cpu areas.

I found that you said retrieving the tid first is sufficient to do
things right in an old discussion. :)

https://lkml.org/lkml/2013/1/18/430

Think about the following 4 cases.

Case   TID    CPU_CACHE   CMPX_DOUBLE
1.     cpu0   cpu0        cpu0
2.     cpu0   cpu0        cpu1
3.     cpu0   cpu1        cpu0
4.     cpu0   cpu1        cpu1

Case 1) has no problem and will succeed.
Cases 2) and 4) would fail due to a tid mismatch.
The only complicated case is scenario 3).

In this case, the object from cpu1's cpu_cache should be
different from cpu0's, so the allocation would fail.

The only problem with this method is that it's not easy to understand.

Am I missing something?

Thanks.

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Fri, 24 Oct 2014, Joonsoo Kim wrote:

> In this case, the object from cpu1's cpu_cache should be
> different from cpu0's, so the allocation would fail.

That is true for most object pointers unless the value is NULL. Which it
can be. But if this is the only case then the second patch + your approach
would work too.
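
Spelled out, the problematic interleaving on the free path would be
something like this (illustrative sketch of the concern only, not code from
the patchset):

/*
 * Scenario 3) from the table, with a NULL freelist:
 *
 *   task on CPU0:  tid = this_cpu_read(s->cpu_slab->tid)     <- cpu0's tid
 *     ... preempted, migrated to CPU1 ...
 *   task on CPU1:  c = this_cpu_ptr(s->cpu_slab)             <- cpu1's area
 *                  page == c->page happens to hold, and
 *                  c->freelist == NULL (cpu1's slab exhausted)
 *                  set_freepointer(s, object, c->freelist)   <- next = NULL
 *     ... preempted, migrated back to CPU0 ...
 *   task on CPU0:  this_cpu_cmpxchg_double(s->cpu_slab->freelist,
 *                                          s->cpu_slab->tid,
 *                                          c->freelist, tid,
 *                                          object, next_tid(tid))
 *                  cpu0's tid still matches, and if cpu0's freelist is
 *                  also NULL the compare succeeds: object, which belongs
 *                  to cpu1's slab, is now on cpu0's freelist.
 */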

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

> I found that you said retrieving the tid first is sufficient to do
> things right in an old discussion. :)

Right but the tid can be obtained from a different processor.


One other aspect of this patchset is that it reduces the cache footprint
of the alloc and free functions. This typically results in a performance
increase for the allocator. If we can avoid the page_address() and
virt_to_head_page() stuff that is required because we drop the ->page
field in a sufficient number of places then this may be a benefit that
goes beyond the RT and CONFIG_PREEMPT case.

2014-10-27 07:53:13

by Joonsoo Kim

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Fri, Oct 24, 2014 at 09:02:18AM -0500, Christoph Lameter wrote:
> On Fri, 24 Oct 2014, Joonsoo Kim wrote:
>
> > In this case, the object from cpu1's cpu_cache should be
> > different from cpu0's, so the allocation would fail.
>
> That is true for most object pointers unless the value is NULL. Which it
> can be. But if this is the only case then the second patch + your approach
> would work too.

Indeed... I missed the NULL value case.
Your second patch + mine would fix that situation, but I need to
think about it more. :)

Thanks.

2014-10-27 07:57:17

by Joonsoo Kim

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Fri, Oct 24, 2014 at 09:41:49AM -0500, Christoph Lameter wrote:
> > I found that you said retrieving the tid first is sufficient to do
> > things right in an old discussion. :)
>
> Right but the tid can be obtained from a different processor.
>
>
> One other aspect of this patchset is that it reduces the cache footprint
> of the alloc and free functions. This typically results in a performance
> increase for the allocator. If we can avoid the page_address() and
> virt_to_head_page() stuff that is required because we drop the ->page
> field in a sufficient number of places then this may be a benefit that
> goes beyond the RT and CONFIG_PREEMPT case.

Yeah... if we can avoid those function calls, it would be good.

But the current struct kmem_cache_cpu occupies just 32 bytes on a 64-bit
machine, and that means just one cacheline. Reducing the size of the struct
may have no remarkable performance benefit in this case.
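
For reference, the structure currently looks roughly like this (the optional
SLUB_STATS array omitted), i.e. four word-sized fields:

struct kmem_cache_cpu {
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
	struct page *page;	/* The slab from which we are allocating */
	struct page *partial;	/* Partially allocated frozen slabs */
};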

Thanks.

Subject: Re: [RFC 0/4] [RFC] slub: Fastpath optimization (especially for RT)

On Mon, 27 Oct 2014, Joonsoo Kim wrote:

> > One other aspect of this patchset is that it reduces the cache footprint
> > of the alloc and free functions. This typically results in a performance
> > increase for the allocator. If we can avoid the page_address() and
> > virt_to_head_page() stuff that is required because we drop the ->page
> > field in a sufficient number of places then this may be a benefit that
> > goes beyond the RT and CONFIG_PREEMPT case.
>
> Yeah... if we can avoid those function calls, it would be good.

One trick that may be possible is to have an address mask for the
page_address. If a pointer satisfies the mask requirements then it's on
the right page and we do not need to do virt_to_head_page().
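
Roughly like this, as a sketch of the idea (the mask would come from the
slab order of the cache; the helper name is made up):

/*
 * Sketch only: for a cache whose slabs are order-N compound pages, an
 * object pointer lies on the same slab as `base` iff the address bits
 * above the slab size match, so virt_to_head_page() can be skipped.
 */
static inline bool on_same_slab(const void *base, const void *p,
				unsigned int order)
{
	unsigned long mask = ~((PAGE_SIZE << order) - 1);

	return ((unsigned long)base & mask) == ((unsigned long)p & mask);
}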

> But the current struct kmem_cache_cpu occupies just 32 bytes on a 64-bit
> machine, and that means just one cacheline. Reducing the size of the struct
> may have no remarkable performance benefit in this case.

Hmmm... If we also drop the partial field then a 64 byte cacheline would
fit kmem_cache_cpu structs from 4 caches. If we place them correctly then
the frequently used caches could avoid fetching up to 3 cachelines.

You are right, just dropping ->page won't do anything since the
kmem_cache_cpu struct is aligned to a double word boundary.
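
A sketch of what would remain, assuming both ->page and ->partial can
indeed go:

/*
 * Sketch only: 16 bytes on 64-bit, so four such per cpu structs fit in a
 * 64-byte cacheline. Dropping just ->page would leave 24 bytes, which the
 * double word (16-byte) alignment required for the cmpxchg_double pads
 * straight back to 32.
 */
struct kmem_cache_cpu {
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
};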