2021-10-08 13:38:11

by Hyeonggon Yoo

Subject: [PATCH] mm, slub: Use prefetchw instead of prefetch

It's certain that an object will be not only read, but also
written after allocation.

Use prefetchw instead of prefetchw. On supported architecture
like x86, it helps to invalidate cache line when the object exists
in other processors' cache.

Signed-off-by: Hyeonggon Yoo <[email protected]>
---
mm/slub.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..2aca7523165e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
 	return freelist_dereference(s, object + s->offset);
 }
 
-static void prefetch_freepointer(const struct kmem_cache *s, void *object)
+static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
 {
-	prefetch(object + s->offset);
+	prefetchw(object + s->offset);
 }
 
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
@@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
 		}
-		prefetch_freepointer(s, next_object);
+		prefetchw_freepointer(s, next_object);
 		stat(s, ALLOC_FASTPATH);
 	}
-
 	maybe_wipe_obj_freeptr(s, object);
 
 	init = slab_want_init_on_alloc(gfpflags, s);

--
2.27.0


2021-10-11 01:13:13

by David Rientjes

Subject: Re: [PATCH] mm, slub: Use prefetchw instead of prefetch

On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.
>

Why is it certain? I think perhaps what you meant to say is that if we
are doing any prefetching here, then access will benefit from prefetchw
instead of prefetch. But it's not "certain" that allocated memory will be
accessed at all.
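
For what it's worth, the difference at stake is just the hint given to the
CPU: prefetch() says the address will be read soon, prefetchw() says it will
be written soon, which lets the CPU request the cache line for exclusive
ownership up front. A minimal userspace sketch of the two hints (my own
illustration of the compiler builtin that the generic kernel helpers wrap,
not the arch-specific implementations):

/*
 * Illustration only: read hint vs. write hint.  The generic kernel
 * prefetch()/prefetchw() fallbacks are thin wrappers around this
 * builtin; architectures like x86 may use dedicated instructions.
 */
#include <stdio.h>

static inline void prefetch_for_read(const void *addr)
{
	__builtin_prefetch(addr, 0);	/* rw = 0: we expect to read  */
}

static inline void prefetch_for_write(const void *addr)
{
	__builtin_prefetch(addr, 1);	/* rw = 1: we expect to write */
}

int main(void)
{
	static long buf[1024];
	long sum = 0;
	int i;

	prefetch_for_write(&buf[0]);	/* about to store to buf        */
	for (i = 0; i < 1024; i++)
		buf[i] = i;

	prefetch_for_read(&buf[0]);	/* about to only load from buf  */
	for (i = 0; i < 1024; i++)
		sum += buf[i];

	printf("%ld\n", sum);
	return 0;
}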

> Use prefetchw instead of prefetchw. On supported architecture

If we're using prefetchw instead of prefetchw, I think the diff would be
0 lines changed :)

> like x86, it helps to invalidate cache line when the object exists
> in other processors' cache.
>
> Signed-off-by: Hyeonggon Yoo <[email protected]>
> ---
> mm/slub.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..2aca7523165e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> return freelist_dereference(s, object + s->offset);
> }
>
> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
> {
> - prefetch(object + s->offset);
> + prefetchw(object + s->offset);
> }
>
> static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> note_cmpxchg_failure("slab_alloc", s, tid);
> goto redo;
> }
> - prefetch_freepointer(s, next_object);
> + prefetchw_freepointer(s, next_object);
> stat(s, ALLOC_FASTPATH);
> }
> -
> maybe_wipe_obj_freeptr(s, object);
> init = slab_want_init_on_alloc(gfpflags, s);
>
> --
> 2.27.0
>
>

2021-10-11 11:33:01

by Christoph Lameter

Subject: Re: [PATCH] mm, slub: Use prefetchw instead of prefetch

On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.

get_freepointer is used in multiple code paths, not only in allocation. It
is for example used when scanning through a freelist.

With this change all objects get needlessly dirtied and the cross cpu
cache contention is increased.


2021-10-11 11:34:16

by Hyeonggon Yoo

Subject: Re: [PATCH] mm, slub: Use prefetchw instead of prefetch

On Sun, Oct 10, 2021 at 03:49:07PM -0700, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
> > It's certain that an object will be not only read, but also
> > written after allocation.
> >
>
> Why is it certain? I think perhaps what you meant to say is that if we
> are doing any prefetching here, then access will benefit from prefetchw
> instead of prefetch. But it's not "certain" that allocated memory will be
> accessed at all.
>

Blame my English skill :(

When I wrote it I thought it was OK, but it was unclear.
Thank you for pointing that out!

What I meant was: "When accessing an object, it must be written before it is
read. So there's no situation where the caller only reads an object and does
not write to it. Thus it's better to use prefetchw instead of prefetch."

Let's rephrase:

commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because when other CPU(s)
have freed objects into a page that the current CPU owns, the freelist link
is hot in the cache of the CPU(s) which freed the objects and possibly very
cold on the current CPU.

But if the freelist link chain is hot on the CPU(s) which freed the objects,
it's better to invalidate that chain, because those CPUs are not going to
access it again within a short time.

So use prefetchw instead of prefetch. On supported architectures like x86,
it invalidates other CPUs' copies of a cache line when prefetching it.
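
To make the freelist link chain concrete, here is a simplified,
self-contained model (names, sizes and the offset are made up, and this is
not the actual SLUB code): the pointer to the next free object is stored
inside the free object itself at object + offset, so that line is written by
whichever CPU freed the object and read again by the CPU that allocates it.

/*
 * Simplified model of the freelist link chain (illustrative only; names,
 * sizes and the offset below are made up, and this is not the actual
 * SLUB code).  The pointer to the next free object lives inside the free
 * object itself, at object + offset, which is why the allocation path
 * can usefully prefetch that line for the allocation that follows.
 */
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE	64
#define FP_OFFSET	32	/* stand-in for s->offset            */
#define NR_OBJS		8

static void *freelist;		/* head: most recently freed object  */

static void *get_freepointer(void *object)
{
	void *fp;

	memcpy(&fp, (char *)object + FP_OFFSET, sizeof(fp));
	return fp;
}

static void set_freepointer(void *object, void *fp)
{
	memcpy((char *)object + FP_OFFSET, &fp, sizeof(fp));
}

/* Freeing writes the link into the object, so that line is hot here. */
static void obj_free(void *object)
{
	set_freepointer(object, freelist);
	freelist = object;
}

/* Allocating pops the head and warms the next link for the next alloc. */
static void *obj_alloc(void)
{
	void *object = freelist;

	if (!object)
		return NULL;
	freelist = get_freepointer(object);
	if (freelist)
		__builtin_prefetch((char *)freelist + FP_OFFSET, 1);
	return object;
}

int main(void)
{
	static char slab[NR_OBJS][OBJ_SIZE];
	void *object;
	int i;

	for (i = 0; i < NR_OBJS; i++)
		obj_free(slab[i]);

	while ((object = obj_alloc()))
		printf("allocated %p\n", object);

	return 0;
}

The prefetch in obj_alloc() plays the role of the prefetch in
slab_alloc_node(): it warms (or, with the write hint, takes ownership of)
the line that the next allocation will touch.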

> > Use prefetchw instead of prefetchw. On supported architecture
>
> If we're using prefetchw instead of prefetchw, I think the diff would be
> 0 lines changed :)
>

That was my typo. Thankfully Andrew fixed it.

> > like x86, it helps to invalidate cache line when the object exists
> > in other processors' cache.
> >
> > Signed-off-by: Hyeonggon Yoo <[email protected]>
> > ---
> > mm/slub.c | 7 +++----
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3d2025f7163b..2aca7523165e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> > return freelist_dereference(s, object + s->offset);
> > }
> >
> > -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> > +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
> > {
> > - prefetch(object + s->offset);
> > + prefetchw(object + s->offset);
> > }
> >
> > static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> > @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> > note_cmpxchg_failure("slab_alloc", s, tid);
> > goto redo;
> > }
> > - prefetch_freepointer(s, next_object);
> > + prefetchw_freepointer(s, next_object);
> > stat(s, ALLOC_FASTPATH);
> > }
> > -
> > maybe_wipe_obj_freeptr(s, object);
> > init = slab_want_init_on_alloc(gfpflags, s);
> >
> > --
> > 2.27.0
> >
> >

2021-10-11 11:48:48

by Vlastimil Babka

Subject: Re: [PATCH] mm, slub: Use prefetchw instead of prefetch

On 10/11/21 00:49, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
>> It's certain that an object will be not only read, but also
>> written after allocation.
>>
>
> Why is it certain? I think perhaps what you meant to say is that if we
> are doing any prefetching here, then access will benefit from prefetchw
> instead of prefetch. But it's not "certain" that allocated memory will be
> accessed at all.

I think the primary reason there's a prefetch is freelist traversal. The
cacheline we prefetch will be read during the next allocation, so if we
expect there to be one soon, prefetch might help. That the freepointer is
part of object itself and thus the cache line will be probably accessed also
after the allocation, is secondary. Yeah this might help some workloads, but
perhaps hurt others - these things might look obvious in theory but be
rather unpredictable in practice. At least some hackbench results would help...

>> Use prefetchw instead of prefetchw. On supported architecture
>
> If we're using prefetchw instead of prefetchw, I think the diff would be
> 0 lines changed :)
>
>> like x86, it helps to invalidate cache line when the object exists
>> in other processors' cache.
>>
>> Signed-off-by: Hyeonggon Yoo <[email protected]>
>> ---
>> mm/slub.c | 7 +++----
>> 1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 3d2025f7163b..2aca7523165e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>> return freelist_dereference(s, object + s->offset);
>> }
>>
>> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)

I wouldn't rename the function itself, unless we have both variants for
different situations (we don't). That it uses prefetchw() is an internal
detail at this point.

>> {
>> - prefetch(object + s->offset);
>> + prefetchw(object + s->offset);
>> }
>>
>> static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>> note_cmpxchg_failure("slab_alloc", s, tid);
>> goto redo;
>> }
>> - prefetch_freepointer(s, next_object);
>> + prefetchw_freepointer(s, next_object);
>> stat(s, ALLOC_FASTPATH);
>> }
>> -
>> maybe_wipe_obj_freeptr(s, object);
>> init = slab_want_init_on_alloc(gfpflags, s);
>>
>> --
>> 2.27.0
>>
>>
>

2021-10-11 11:50:18

by Hyeonggon Yoo

Subject: Re: [PATCH] mm, slub: Use prefetchw instead of prefetch

On Mon, Oct 11, 2021 at 09:20:36AM +0200, Christoph Lameter wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
> > It's certain that an object will be not only read, but also
> > written after allocation.
>
> get_freepointer is used in multiple code paths, not only in allocation. It
> is for example used when scanning through a freelist.
>
> With this change all objects get needlessly dirtied and the cross cpu
> cache contention is increased.

I didn't touch get_freepointer(), and there's
only one caller of prefetch_freepointer().

My change does not add an additional prefetch to get_freepointer();
it changes the existing prefetch in the allocation fast path into a prefetchw.

The prefetch was introduced by commit 0ad9500e16fe ("slub: prefetch next
freelist pointer in slab_alloc()"), which you acked in 2011.

Do you think removing the existing prefetch is better than changing it
from prefetch to prefetchw?

2021-10-11 13:33:51

by Hyeonggon Yoo

Subject: Perf and Hackbench results on my machine

Hello Vlastimil.

On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> >
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >>
> >
> > Why is it certain? I think perhaps what you meant to say is that if we
> > are doing any prefetching here, then access will benefit from prefetchw
> > instead of prefetch. But it's not "certain" that allocated memory will be
> > accessed at all.
>
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.

I agree with that.

> That the freepointer is
> part of object itself and thus the cache line will be probably accessed also
> after the allocation, is secondary.

Right. It depends on the cache line size and whether the first cache line of
an object is frequently accessed or not.

> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>

Below are my measurements. It seems prefetch(w) is not making things worse,
at least on hackbench.

Measured on 16 CPUs (ARM64) / 16G RAM
Without prefetch:

Time: 91.989
Performance counter stats for 'hackbench -g 100 -l 10000':
1467926.03 msec cpu-clock # 15.907 CPUs utilized
17782076 context-switches # 12.114 K/sec
957523 cpu-migrations # 652.296 /sec
104561 page-faults # 71.230 /sec
1622117569931 cycles # 1.105 GHz (54.54%)
2002981132267 instructions # 1.23 insn per cycle (54.32%)
5600876429 branch-misses (54.28%)
642657442307 cache-references # 437.800 M/sec (54.27%)
19404890844 cache-misses # 3.019 % of all cache refs (54.28%)
640413686039 L1-dcache-loads # 436.271 M/sec (46.85%)
19110650580 L1-dcache-load-misses # 2.98% of all L1-dcache accesses (46.83%)
651556334841 dTLB-loads # 443.862 M/sec (46.63%)
3193647402 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.84%)
538927659684 iTLB-loads # 367.135 M/sec (54.31%)
118503839 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
625750168840 L1-icache-loads # 426.282 M/sec (46.80%)
24348083282 L1-icache-load-misses # 3.89% of all L1-icache accesses (46.78%)

92.284351157 seconds time elapsed

44.524693000 seconds user
1426.214006000 seconds sys

With prefetch:

Time: 91.677

Performance counter stats for 'hackbench -g 100 -l 10000':
1462938.07 msec cpu-clock # 15.908 CPUs utilized
18072550 context-switches # 12.354 K/sec
1018814 cpu-migrations # 696.416 /sec
104558 page-faults # 71.471 /sec
2003670016013 instructions # 1.27 insn per cycle (54.31%)
5702204863 branch-misses (54.28%)
643368500985 cache-references # 439.778 M/sec (54.26%)
18475582235 cache-misses # 2.872 % of all cache refs (54.28%)
642206796636 L1-dcache-loads # 438.984 M/sec (46.87%)
18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%)
653842996501 dTLB-loads # 446.938 M/sec (46.63%)
3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%)
537531951350 iTLB-loads # 367.433 M/sec (54.33%)
114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%)
630135543177 L1-icache-loads # 430.733 M/sec (46.80%)
22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%)

91.964452802 seconds time elapsed

43.416742000 seconds user
1422.441123000 seconds sys

With prefetchw:

Time: 90.220

Performance counter stats for 'hackbench -g 100 -l 10000':
1437418.48 msec cpu-clock # 15.880 CPUs utilized
17694068 context-switches # 12.310 K/sec
958257 cpu-migrations # 666.651 /sec
100604 page-faults # 69.989 /sec
1583259429428 cycles # 1.101 GHz (54.57%)
2004002484935 instructions # 1.27 insn per cycle (54.37%)
5594202389 branch-misses (54.36%)
643113574524 cache-references # 447.409 M/sec (54.39%)
18233791870 cache-misses # 2.835 % of all cache refs (54.37%)
640205852062 L1-dcache-loads # 445.386 M/sec (46.75%)
17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%)
651747432274 dTLB-loads # 453.415 M/sec (46.59%)
3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%)
535395273064 iTLB-loads # 372.470 M/sec (54.38%)
113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
628871845924 L1-icache-loads # 437.501 M/sec (46.80%)
22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%)

90.514819303 seconds time elapsed

43.877656000 seconds user
1397.176001000 seconds sys

Thanks,
Hyeonggon

2021-10-11 16:31:44

by Hyeonggon Yoo

Subject: Re: Perf and Hackbench results on my machine

On Mon, Oct 11, 2021 at 10:33:02AM +0000, Hyeonggon Yoo wrote:
> Hello Vlastimil.
>
> On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> > On 10/11/21 00:49, David Rientjes wrote:
> > > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> > >
> > >> It's certain that an object will be not only read, but also
> > >> written after allocation.
> > >>
> > >
> > > Why is it certain? I think perhaps what you meant to say is that if we
> > > are doing any prefetching here, then access will benefit from prefetchw
> > > instead of prefetch. But it's not "certain" that allocated memory will be
> > > accessed at all.
> >
> > I think the primary reason there's a prefetch is freelist traversal. The
> > cacheline we prefetch will be read during the next allocation, so if we
> > expect there to be one soon, prefetch might help.
>
> I agree with that.
>
> > That the freepointer is
> > part of object itself and thus the cache line will be probably accessed also
> > after the allocation, is secondary.
>
> Right. It depends on the cache line size and whether the first cache line of
> an object is frequently accessed or not.

Not the first cache line, actually, because the free pointer is in the middle
of the object or outside the object area. My mistake.
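
To put a number on that, a toy calculation (sizes and offsets below are made
up, not taken from a real kmem_cache): which cache line of the object the
prefetched address object + s->offset falls into depends only on the offset
relative to the cache line size.

/*
 * Toy calculation with made-up sizes/offsets (not values read from a
 * real kmem_cache): shows which cache line of an object the prefetched
 * address object + offset falls into.  Only when the offset is smaller
 * than the cache line size does the prefetch touch the object's first
 * cache line.
 */
#include <stdio.h>

#define CACHE_LINE_SIZE	64U

static void show(const char *layout, unsigned int size, unsigned int offset)
{
	printf("%-24s size=%4u offset=%4u -> cache line %u%s\n",
	       layout, size, offset, offset / CACHE_LINE_SIZE,
	       offset >= size ? " (outside the object)" : "");
}

int main(void)
{
	show("fp at object start",  256,   0);	/* would be the first line   */
	show("fp mid-object",       256, 128);	/* middle of the object      */
	show("fp after the object", 256, 256);	/* e.g. ctor/poison layouts  */
	return 0;
}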

> >> Use prefetchw instead of prefetchw. On supported architecture
> >
> > If we're using prefetchw instead of prefetchw, I think the diff would be
> > 0 lines changed :)
> >
> >> like x86, it helps to invalidate cache line when the object exists
> >> in other processors' cache.
> >>
> >> Signed-off-by: Hyeonggon Yoo <[email protected]>
> >> ---
> >> mm/slub.c | 7 +++----
> >> 1 file changed, 3 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 3d2025f7163b..2aca7523165e 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> >> return freelist_dereference(s, object + s->offset);
> >> }
> >>
> >> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> >> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
>
> I wouldn't rename the function itself, unless we have both variants for
> different situations (we don't). That it uses prefetchw() is an internal
> detail at this point.

Looks good. That is simpler.