Avoid using the page struct address on free by just doing an
address comparison. That is easily doable now that the page address
is available in the page struct, and we have already calculated the
page struct address of the object to be freed.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c 2014-12-09 12:25:45.770405462 -0600
+++ linux/mm/slub.c 2014-12-09 12:25:45.766405582 -0600
@@ -2625,6 +2625,13 @@ slab_empty:
discard_slab(s, page);
}
+static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
+{
+ long d = p - page->address;
+
+ return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
+}
+
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls.
@@ -2658,7 +2665,7 @@ redo:
tid = c->tid;
preempt_enable();
- if (likely(page == c->page)) {
+ if (likely(same_slab_page(s, page, c->freelist))) {
set_freepointer(s, object, c->freelist);
if (unlikely(!this_cpu_cmpxchg_double(
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter <[email protected]> wrote:
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct, and we have already calculated the
> page struct address of the object to be freed.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c 2014-12-09 12:25:45.770405462 -0600
> +++ linux/mm/slub.c 2014-12-09 12:25:45.766405582 -0600
> @@ -2625,6 +2625,13 @@ slab_empty:
> discard_slab(s, page);
> }
>
> +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
Why are you passing a pointer to struct kmem_cache here? You don't
seem to use it.
> +{
> + long d = p - page->address;
> +
> + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}
Can you elaborate on what this is doing? I don't really understand it.
- Pekka
On Wed, 10 Dec 2014, Pekka Enberg wrote:
> > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
>
> Why are you passing a pointer to struct kmem_cache here? You don't
> seem to use it.
True.
> > +{
> > + long d = p - page->address;
> > +
> > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > +}
>
> Can you elaborate on what this is doing? I don't really understand it.
It checks whether the pointer points into the slab page. It also tries to
avoid calling compound_order() needlessly. Not sure if that optimization
is worth it.
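In other words, the helper treats p as a byte offset from the slab page's base
address and compares it against a compile-time constant first, so
compound_order(), which may have to read the second struct page of a compound
page, is only evaluated for offsets that could plausibly lie inside a slab
page. A rough sketch of that intent, with the bounds spelled out in bytes and
a hypothetical helper name (not the literal hunk above):

static bool object_within_slab_page(struct page *page, void *p)
{
	long d = p - page->address;

	if (d < 0)
		return false;

	/* Cheap constant upper bound; no slab page is ever this large. */
	if (d >= (long)(PAGE_SIZE << MAX_ORDER))
		return false;

	/* Only now pay for compound_order(), which may touch another cacheline. */
	return d < (long)(PAGE_SIZE << compound_order(page));
}

Note that the constant check only skips compound_order() when the pointer is
far away from the page; when the object does belong to the slab,
compound_order() still runs, which is the point Joonsoo raises later in the
thread.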
On Wed, Dec 10, 2014 at 7:08 PM, Christoph Lameter <[email protected]> wrote:
>> > +{
>> > + long d = p - page->address;
>> > +
>> > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
>> > +}
>>
>> Can you elaborate on what this is doing? I don't really understand it.
>
> It checks whether the pointer points into the slab page. It also tries to
> avoid calling compound_order() needlessly. Not sure if that optimization
> is worth it.
Aah, it's the (1 << MAX_ORDER) optimization that confused me. Perhaps
add a comment there to make it more obvious?
I'm fine with the optimization:
Reviewed-by: Pekka Enberg <[email protected]>
- Pekka
On Wed, 10 Dec 2014, Pekka Enberg wrote:
> I'm fine with the optimization:
>
> Reviewed-by: Pekka Enberg <[email protected]>
There were some other issues, so it's now:
Subject: slub: Do not use c->page on free
Avoid using the page struct address on free by just doing an
address comparison. That is easily doable now that the page address
is available in the page struct, and we have already calculated the
page struct address of the object to be freed.
Reviewed-by: Pekka Enberg <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c 2014-12-10 11:35:32.538563734 -0600
+++ linux/mm/slub.c 2014-12-10 11:36:39.032447807 -0600
@@ -2625,6 +2625,17 @@ slab_empty:
discard_slab(s, page);
}
+static bool is_pointer_to_page(struct page *page, void *p)
+{
+ long d = p - page->address;
+
+ /*
+ * Do a comparison for a MAX_ORDER page first before using
+ * compound_order() to determine the actual page size.
+ */
+ return d >= 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
+}
+
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls.
@@ -2658,7 +2669,7 @@ redo:
tid = c->tid;
preempt_enable();
- if (likely(page == c->page)) {
+ if (likely(is_pointer_to_page(page, c->freelist))) {
set_freepointer(s, object, c->freelist);
if (unlikely(!this_cpu_cmpxchg_double(
On Wed, 10 Dec 2014 11:37:56 -0600 (CST) Christoph Lameter <[email protected]> wrote:
[...]
>
> There were some other issues, so it's now:
>
>
> Subject: slub: Do not use c->page on free
>
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct, and we have already calculated the
> page struct address of the object to be freed.
>
> Reviewed-by: Pekka Enberg <[email protected]>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c 2014-12-10 11:35:32.538563734 -0600
> +++ linux/mm/slub.c 2014-12-10 11:36:39.032447807 -0600
> @@ -2625,6 +2625,17 @@ slab_empty:
> discard_slab(s, page);
> }
>
> +static bool is_pointer_to_page(struct page *page, void *p)
> +{
> + long d = p - page->address;
> +
> + /*
> + * Do a comparison for a MAX_ORDER page first before using
> + * compound_order() to determine the actual page size.
> + */
> + return d >= 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}
My current compiler (gcc 4.9.1) chooses not to inline is_pointer_to_page().
(perf record of [1])
Samples: 8K of event 'cycles', Event count (approx.): 5737618489
+ 46.13% modprobe [kernel.kallsyms] [k] kmem_cache_free
+ 33.02% modprobe [kernel.kallsyms] [k] kmem_cache_alloc
+ 16.14% modprobe [kernel.kallsyms] [k] is_pointer_to_page
If I explicitly add "inline", then it gets inlined, and performance is good again.
Test[1] cost of kmem_cache_alloc+free:
* baseline: 47 cycles(tsc) 19.032 ns (net-next without patchset)
* patchset: 50 cycles(tsc) 20.028 ns
* inline : 45 cycles(tsc) 18.135 ns (inlined is_pointer_to_page())
> /*
> * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> * can perform fastpath freeing without additional function calls.
> @@ -2658,7 +2669,7 @@ redo:
> tid = c->tid;
> preempt_enable();
>
> - if (likely(page == c->page)) {
> + if (likely(is_pointer_to_page(page, c->freelist))) {
> set_freepointer(s, object, c->freelist);
>
> if (unlikely(!this_cpu_cmpxchg_double(
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:
> If I explicitly add "inline", then it gets inlined, and performance is good again.
Ok adding inline for the next release.
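For reference, the change being agreed on here is only the function qualifier,
roughly (an untested sketch against the v2 hunk above):

-static bool is_pointer_to_page(struct page *page, void *p)
+static inline bool is_pointer_to_page(struct page *page, void *p)

gcc is free to inline a small static function on its own, but as Jesper's
numbers show it did not here, so the explicit hint (or __always_inline, if the
hint alone ever proves insufficient) keeps the check on the kmem_cache_free()
fastpath without an extra call.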
On Wed, Dec 10, 2014 at 10:30:20AM -0600, Christoph Lameter wrote:
> Avoid using the page struct address on free by just doing an
> address comparison. That is easily doable now that the page address
> is available in the page struct, and we have already calculated the
> page struct address of the object to be freed.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c 2014-12-09 12:25:45.770405462 -0600
> +++ linux/mm/slub.c 2014-12-09 12:25:45.766405582 -0600
> @@ -2625,6 +2625,13 @@ slab_empty:
> discard_slab(s, page);
> }
>
> +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> +{
> + long d = p - page->address;
> +
> + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> +}
> +
Sometimes, compound_order() induces one more cacheline access, because
compound_order() accesses the second struct page in order to get the order.
Is there any way to remove this?
Thanks.
On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> > +{
> > + long d = p - page->address;
> > +
> > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > +}
> > +
>
> Sometimes, compound_order() induces one more cacheline access, because
> compound_order() accesses the second struct page in order to get the order.
> Is there any way to remove this?
I already have code there to avoid the access if it's within a MAX_ORDER
page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
>
> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> > > +{
> > > + long d = p - page->address;
> > > +
> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> > > +}
> > > +
> >
> > Sometimes, compound_order() induces one more cacheline access, because
> > compound_order() accesses the second struct page in order to get the order.
> > Is there any way to remove this?
>
> I already have code there to avoid the access if it's within a MAX_ORDER
> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
That avoids the compound_order() call when the object's slab does not match
the per-cpu slab. What I'm asking is whether there is a way to avoid the
compound_order() call when the object's slab does match the per-cpu slab.
Thanks.
2014-12-16 5:42 GMT+03:00 Joonsoo Kim <[email protected]>:
> On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
>> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
>>
>> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
>> > > +{
>> > > + long d = p - page->address;
>> > > +
>> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
>> > > +}
>> > > +
>> >
>> > Sometimes, compound_order() induces one more cacheline access, because
>> > compound_order() accesses the second struct page in order to get the order.
>> > Is there any way to remove this?
>>
>> I already have code there to avoid the access if it's within a MAX_ORDER
>> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
>
> That avoids the compound_order() call when the object's slab does not match
> the per-cpu slab.
>
> What I'm asking is whether there is a way to avoid the compound_order() call
> when the object's slab does match the per-cpu slab.
>
Can we use page->objects for that?
Like this:
return d > 0 && d < page->objects * s->size;
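Folded into the v2 helper, Andrey's suggestion would look roughly like this (a
sketch only; note it needs the kmem_cache pointer that Pekka earlier pointed
out was unused, and it keeps the v2 hunk's d >= 0 bound):

static inline bool is_pointer_to_page(struct kmem_cache *s,
				      struct page *page, void *p)
{
	long d = p - page->address;

	/*
	 * page->objects * s->size spans all objects in this slab, so
	 * compound_order() and its possible extra struct page cacheline
	 * access are not needed at all.
	 */
	return d >= 0 && d < page->objects * s->size;
}

The trade-off, as Christoph notes below, is an extra multiplication on the
fastpath.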
On Tue, Dec 16, 2014 at 11:54:12AM +0400, Andrey Ryabinin wrote:
> 2014-12-16 5:42 GMT+03:00 Joonsoo Kim <[email protected]>:
> > On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> >> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> >>
> >> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> >> > > +{
> >> > > + long d = p - page->address;
> >> > > +
> >> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> >> > > +}
> >> > > +
> >> >
> >> > Sometimes, compound_order() induces one more cacheline access, because
> >> > compound_order() accesses the second struct page in order to get the order.
> >> > Is there any way to remove this?
> >>
> >> I already have code there to avoid the access if it's within a MAX_ORDER
> >> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
> >
> > That avoids the compound_order() call when the object's slab does not match
> > the per-cpu slab.
> >
> > What I'm asking is whether there is a way to avoid the compound_order() call
> > when the object's slab does match the per-cpu slab.
> >
>
> Can we use page->objects for that?
>
> Like this:
>
> return d > 0 && d < page->objects * s->size;
>
Yes! That's what I'm looking for.
Christoph, how about above change?
Thanks.
On Tue, 16 Dec 2014 11:54:12 +0400
Andrey Ryabinin <[email protected]> wrote:
> 2014-12-16 5:42 GMT+03:00 Joonsoo Kim <[email protected]>:
> > On Mon, Dec 15, 2014 at 08:16:00AM -0600, Christoph Lameter wrote:
> >> On Mon, 15 Dec 2014, Joonsoo Kim wrote:
> >>
> >> > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p)
> >> > > +{
> >> > > + long d = p - page->address;
> >> > > +
> >> > > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
> >> > > +}
> >> > > +
> >> >
> >> > Sometimes, compound_order() induces one more cacheline access, because
> >> > compound_order() accesses the second struct page in order to get the order.
> >> > Is there any way to remove this?
> >>
> >> I already have code there to avoid the access if it's within a MAX_ORDER
> >> page. We could probably go for a smaller setting there. PAGE_COSTLY_ORDER?
> >
> > That avoids the compound_order() call when the object's slab does not match
> > the per-cpu slab.
> >
> > What I'm asking is whether there is a way to avoid the compound_order() call
> > when the object's slab does match the per-cpu slab.
> >
>
> Can we use page->objects for that?
>
> Like this:
>
> return d > 0 && d < page->objects * s->size;
I gave this change a quick micro benchmark spin (with Christoph's
tool); the results are below.
Notice that the "2. Kmalloc: alloc/free test" for small object sizes improves,
back to roughly where it was before this patchset.
Before (with curr patchset):
============================
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 50 cycles kfree -> 60 cycles
10000 times kmalloc(16) -> 52 cycles kfree -> 60 cycles
10000 times kmalloc(32) -> 56 cycles kfree -> 64 cycles
10000 times kmalloc(64) -> 67 cycles kfree -> 72 cycles
10000 times kmalloc(128) -> 86 cycles kfree -> 79 cycles
10000 times kmalloc(256) -> 97 cycles kfree -> 110 cycles
10000 times kmalloc(512) -> 88 cycles kfree -> 114 cycles
10000 times kmalloc(1024) -> 91 cycles kfree -> 115 cycles
10000 times kmalloc(2048) -> 119 cycles kfree -> 131 cycles
10000 times kmalloc(4096) -> 159 cycles kfree -> 163 cycles
10000 times kmalloc(8192) -> 269 cycles kfree -> 226 cycles
10000 times kmalloc(16384) -> 498 cycles kfree -> 291 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 112 cycles
10000 times kmalloc(16)/kfree -> 118 cycles
10000 times kmalloc(32)/kfree -> 117 cycles
10000 times kmalloc(64)/kfree -> 122 cycles
10000 times kmalloc(128)/kfree -> 133 cycles
10000 times kmalloc(256)/kfree -> 79 cycles
10000 times kmalloc(512)/kfree -> 79 cycles
10000 times kmalloc(1024)/kfree -> 79 cycles
10000 times kmalloc(2048)/kfree -> 72 cycles
10000 times kmalloc(4096)/kfree -> 78 cycles
10000 times kmalloc(8192)/kfree -> 78 cycles
10000 times kmalloc(16384)/kfree -> 596 cycles
After (with proposed change):
=============================
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 53 cycles kfree -> 62 cycles
10000 times kmalloc(16) -> 53 cycles kfree -> 64 cycles
10000 times kmalloc(32) -> 57 cycles kfree -> 66 cycles
10000 times kmalloc(64) -> 68 cycles kfree -> 72 cycles
10000 times kmalloc(128) -> 77 cycles kfree -> 80 cycles
10000 times kmalloc(256) -> 98 cycles kfree -> 110 cycles
10000 times kmalloc(512) -> 87 cycles kfree -> 113 cycles
10000 times kmalloc(1024) -> 90 cycles kfree -> 116 cycles
10000 times kmalloc(2048) -> 116 cycles kfree -> 131 cycles
10000 times kmalloc(4096) -> 160 cycles kfree -> 164 cycles
10000 times kmalloc(8192) -> 269 cycles kfree -> 226 cycles
10000 times kmalloc(16384) -> 499 cycles kfree -> 295 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 74 cycles
10000 times kmalloc(16)/kfree -> 73 cycles
10000 times kmalloc(32)/kfree -> 73 cycles
10000 times kmalloc(64)/kfree -> 74 cycles
10000 times kmalloc(128)/kfree -> 73 cycles
10000 times kmalloc(256)/kfree -> 72 cycles
10000 times kmalloc(512)/kfree -> 73 cycles
10000 times kmalloc(1024)/kfree -> 72 cycles
10000 times kmalloc(2048)/kfree -> 73 cycles
10000 times kmalloc(4096)/kfree -> 72 cycles
10000 times kmalloc(8192)/kfree -> 72 cycles
10000 times kmalloc(16384)/kfree -> 556 cycles
(kernel 3.18.0-net-next+ SMP PREEMPT on top of f96fe225677)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
On Tue, 16 Dec 2014, Joonsoo Kim wrote:
> > Like this:
> >
> > return d > 0 && d < page->objects * s->size;
> >
>
> Yes! That's what I'm looking for.
> Christoph, how about above change?
Ok but now there is a multiplication in the fast path.
On Tue, 16 Dec 2014 08:53:08 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
>
> > > Like this:
> > >
> > > return d > 0 && d < page->objects * s->size;
> > >
> >
> > Yes! That's what I'm looking for.
> > Christoph, how about above change?
>
> Ok but now there is a multiplication in the fast path.
Could we pre-calculate the value (page->objects * s->size) and e.g. store it
in struct kmem_cache, thus saving the imul?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
2014-12-16 17:53 GMT+03:00 Christoph Lameter <[email protected]>:
> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
>
>> > Like this:
>> >
>> > return d > 0 && d < page->objects * s->size;
>> >
>>
>> Yes! That's what I'm looking for.
>> Christoph, how about above change?
>
> Ok but now there is a multiplication in the fast path.
>
Another idea - store page's order in the lower bits of page->address.
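Since page->address is at least PAGE_SIZE-aligned, its low PAGE_SHIFT bits are
otherwise always zero, so the compound order could be stashed there when the
slab page is set up. A hypothetical sketch of that encoding (field layout and
helper body are assumptions, not a posted patch):

	/* when the slab page is set up: */
	page->address = (void *)((unsigned long)page_address(page) |
				 compound_order(page));

static inline bool is_pointer_to_page(struct page *page, void *p)
{
	unsigned long addr = (unsigned long)page->address;
	unsigned int order = addr & ~PAGE_MASK;	/* low bits hold the order */
	long d = p - (void *)(addr & PAGE_MASK);

	return d >= 0 && d < (long)(PAGE_SIZE << order);
}

Every other reader of page->address would of course have to mask the low bits
off as well.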
2014-12-16 18:15 GMT+03:00 Jesper Dangaard Brouer <[email protected]>:
> On Tue, 16 Dec 2014 08:53:08 -0600 (CST)
> Christoph Lameter <[email protected]> wrote:
>
>> On Tue, 16 Dec 2014, Joonsoo Kim wrote:
>>
>> > > Like this:
>> > >
>> > > return d > 0 && d < page->objects * s->size;
>> > >
>> >
>> > Yes! That's what I'm looking for.
>> > Christoph, how about above change?
>>
>> Ok but now there is a multiplication in the fast path.
>
> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
> in struct kmem_cache, thus saving the imul?
>
No, one kmem_cache could have several pages of different orders, and
therefore different page->objects values.
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Sr. Network Kernel Developer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
On Tue, 16 Dec 2014, Jesper Dangaard Brouer wrote:
> > Ok but now there is a multiplication in the fast path.
>
> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
> in struct kmem_cache, thus saving the imul?
I think I just used the last available field for the page->address.
2014-12-17 0:48 GMT+09:00 Christoph Lameter <[email protected]>:
> On Tue, 16 Dec 2014, Jesper Dangaard Brouer wrote:
>
>> > Ok but now there is a multiplication in the fast path.
>>
>> Could we pre-calculate the value (page->objects * s->size) and e.g. store it
>> in struct kmem_cache, thus saving the imul?
>
> I think I just used the last available field for the page->address.
Possibly, we can use the _count field.
Thanks.