2008-02-21 18:00:32

by Eric Dumazet

Subject: [PATCH] alloc_percpu() fails to allocate percpu data

diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
index 7e58322..b0012e2 100644
--- a/mm/allocpercpu.c
+++ b/mm/allocpercpu.c
@@ -6,6 +6,10 @@
 #include <linux/mm.h>
 #include <linux/module.h>

+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
 /**
  * percpu_depopulate - depopulate per-cpu data for given cpu
  * @__pdata: per-cpu data to depopulate
@@ -52,6 +56,11 @@ void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
 	struct percpu_data *pdata = __percpu_disguise(__pdata);
 	int node = cpu_to_node(cpu);

+	/*
+	 * We should make sure each CPU gets private memory.
+	 */
+	size = roundup(size, cache_line_size());
+
 	BUG_ON(pdata->ptrs[cpu]);
 	if (node_online(node))
 		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
@@ -98,7 +107,11 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask);
  */
 void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
 {
-	void *pdata = kzalloc(nr_cpu_ids * sizeof(void *), gfp);
+	/*
+	 * We allocate whole cache lines to avoid false sharing
+	 */
+	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
+	void *pdata = kzalloc(sz, gfp);
 	void *__pdata = __percpu_disguise(pdata);

 	if (unlikely(!pdata))


Attachments:
percpu_populate.patch (1.20 kB)
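
For concreteness, a minimal sketch (illustrative only, not part of the patch) of what the added roundup() changes for a small per-cpu object such as the loopback statistics discussed in the replies below, assuming a 64-byte cache line:

/* Illustration only: struct pcpu_lstats is two longs (8 bytes on 32-bit).
 * With SLUB's 8-byte minimum allocation, kmalloc_node(8, ...) can pack
 * objects belonging to up to 8 different CPUs into one 64-byte cache
 * line; after the patch each per-cpu object is padded to a full line.
 */
struct pcpu_lstats {
	unsigned long packets;
	unsigned long bytes;
};

static size_t percpu_object_size(void)
{
	/* 8 bytes -> 64 bytes with a 64-byte cache_line_size() */
	return roundup(sizeof(struct pcpu_lstats), cache_line_size());
}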

2008-02-21 22:26:32

by Peter Zijlstra

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data


On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> Some oprofile results obtained while using tbench on a 2x2 cpu machine
> were very surprising.
>
> For example, the loopback_xmit() function was using a high number of cpu
> cycles to perform the statistic updates, which are supposed to be really
> cheap since they use percpu data:
>
> pcpu_lstats = netdev_priv(dev);
> lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> lb_stats->packets++; /* HERE : serious contention */
> lb_stats->bytes += skb->len;
>
>
> struct pcpu_lstats is a small structure containing two longs. It
> appears that on my 32-bit platform, alloc_percpu(8) allocates a single
> cache line, instead of giving each cpu a separate cache line.
>
> Using the following patch gave me an impressive boost in various
> benchmarks (6% in tbench) (all percpu_counters hit this bug too).
>
> The long-term fix (ie >= 2.6.26) would be to let each CPU allocate its
> own block of memory, so that we don't need to round up sizes to
> L1_CACHE_BYTES, or merging the SGI stuff of course...
>
> Note: SLUB vs SLAB is important here to *show* the improvement, since
> they don't have the same minimum allocation sizes (8 bytes vs 32
> bytes). This could very well explain regressions some guys reported
> when they switched to SLUB.

I've complained about this false sharing as well, so until we get the
new and improved percpu allocators,

Acked-by: Peter Zijlstra <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
>
> mm/allocpercpu.c | 15 ++++++++++++++-
> 1 files changed, 14 insertions(+), 1 deletion(-)
>
>
> plain text document attachment (percpu_populate.patch)
> diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
> index 7e58322..b0012e2 100644
> --- a/mm/allocpercpu.c
> +++ b/mm/allocpercpu.c
> @@ -6,6 +6,10 @@
> #include <linux/mm.h>
> #include <linux/module.h>
>
> +#ifndef cache_line_size
> +#define cache_line_size() L1_CACHE_BYTES
> +#endif
> +
> /**
> * percpu_depopulate - depopulate per-cpu data for given cpu
> * @__pdata: per-cpu data to depopulate
> @@ -52,6 +56,11 @@ void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
> struct percpu_data *pdata = __percpu_disguise(__pdata);
> int node = cpu_to_node(cpu);
>
> + /*
> + * We should make sure each CPU gets private memory.
> + */
> + size = roundup(size, cache_line_size());
> +
> BUG_ON(pdata->ptrs[cpu]);
> if (node_online(node))
> pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
> @@ -98,7 +107,11 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask);
> */
> void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
> {
> - void *pdata = kzalloc(nr_cpu_ids * sizeof(void *), gfp);
> + /*
> + * We allocate whole cache lines to avoid false sharing
> + */
> + size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
> + void *pdata = kzalloc(sz, gfp);
> void *__pdata = __percpu_disguise(pdata);
>
> if (unlikely(!pdata))

2008-02-23 08:23:41

by Andrew Morton

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Thu, 21 Feb 2008 19:00:03 +0100 Eric Dumazet <[email protected]> wrote:

> +#ifndef cache_line_size
> +#define cache_line_size() L1_CACHE_BYTES
> +#endif

argh, you made me look.

Really cache_line_size() should be implemented in include/linux/cache.h.
Then we tromp the stupid private implementations in slob.c and slub.c.

Then we wonder why x86 uses a custom cache_line_size(), but still uses
L1_CACHE_BYTES for its L1_CACHE_ALIGN().

Once we've answered that, we look at your

+ /*
+ * We should make sure each CPU gets private memory.
+ */
+ size = roundup(size, cache_line_size());

and wonder whether it should have used L1_CACHE_ALIGN().

I think I'd better stop looking.
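
One possible shape of the consolidation being suggested here, sketched under the assumption that boot-time-detecting architectures (x86) keep overriding cache_line_size():

/* Sketch only -- a common fallback in include/linux/cache.h instead of
 * private copies in slob.c and slub.c.  L1_CACHE_ALIGN() stays based on
 * the compile-time constant, which is the inconsistency pointed out above.
 */
#ifndef cache_line_size
#define cache_line_size()	L1_CACHE_BYTES
#endif

#ifndef L1_CACHE_ALIGN
#define L1_CACHE_ALIGN(x)	ALIGN(x, L1_CACHE_BYTES)
#endif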

2008-02-23 09:24:22

by Nick Piggin

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> > Some oprofile results obtained while using tbench on a 2x2 cpu machine
> > were very surprising.
> >
> > For example, loopback_xmit() function was using high number of cpu
> > cycles to perform the statistic updates, supposed to be real cheap
> > since they use percpu data
> >
> > pcpu_lstats = netdev_priv(dev);
> > lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> > lb_stats->packets++; /* HERE : serious contention */
> > lb_stats->bytes += skb->len;
> >
> >
> > struct pcpu_lstats is a small structure containing two longs. It
> > appears that on my 32bits platform, alloc_percpu(8) allocates a single
> > cache line, instead of giving to each cpu a separate cache line.
> >
> > Using the following patch gave me impressive boost in various
> > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> >
> > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> > own block of memory, so that we dont need to roudup sizes to
> > L1_CACHE_BYTES, or merging the SGI stuff of course...
> >
> > Note : SLUB vs SLAB is important here to *show* the improvement, since
> > they dont have the same minimum allocation sizes (8 bytes vs 32
> > bytes). This could very well explain regressions some guys reported
> > when they switched to SLUB.
>
> I've complained about this false sharing as well, so until we get the
> new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in it:

	if ((flags & SLAB_HWCACHE_ALIGN) &&
			size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand why they think they know better than the caller.
Things like this are just going to lead to very difficult-to-track
performance problems, and possibly correctness problems in rare cases.

There could be another flag for "maybe align".
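
A sketch of the two semantics being argued about; SLAB_MAYBE_HWCACHE_ALIGN is a hypothetical flag name used only for illustration:

/* Sketch only: SLAB_MAYBE_HWCACHE_ALIGN does not exist in the tree; the
 * "maybe" branch reproduces what SLAB_HWCACHE_ALIGN does today.
 */
#define SLAB_MAYBE_HWCACHE_ALIGN	0x01000000UL	/* hypothetical bit */

static unsigned long calculate_alignment_sketch(unsigned long flags,
						unsigned long align,
						unsigned long size)
{
	if (flags & SLAB_HWCACHE_ALIGN)			/* strict: always pad */
		return max_t(unsigned long, align, cache_line_size());

	if ((flags & SLAB_MAYBE_HWCACHE_ALIGN) &&
	    size > cache_line_size() / 2)		/* today's heuristic */
		return max_t(unsigned long, align, cache_line_size());

	return align;
}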

2008-02-27 19:44:54

by Christoph Lameter

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Sat, 23 Feb 2008, Nick Piggin wrote:

> What I don't understand is why the slab allocators have something like
> this in it:
>
> if ((flags & SLAB_HWCACHE_ALIGN) &&
> size > cache_line_size() / 2)
> return max_t(unsigned long, align, cache_line_size());
>
> If you ask for HWCACHE_ALIGN, then you should get it. I don't
> understand, why do they think they knows better than the caller?

Tradition... It irks me as well.

> Things like this are just going to lead to very difficult to track
> performance problems. Possibly correctness problems in rare cases.
>
> There could be another flag for "maybe align".

SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
code.

If we all agree then we could change this to have must have semantics? It
has the potential of enlarging objects for small caches.

SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
requirements of the architecture that the kernel is built on. We may be in
for some surprises if we change this.
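
Illustrative numbers for that enlargement, assuming a 128-byte cache line and a ~16-byte object:

	maybe-align (today): object size stays ~16 bytes, up to 8 objects per 128-byte line
	must-align:          object padded to 128 bytes, 112 of 128 bytes (87.5%) unused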

2008-02-27 19:59:48

by Christoph Lameter

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

Any decision made on what to do about this one? Mike or I can
repost the per cpu allocator against mm? The fix by Eric could be used
in the interim for 2.6.24?

2008-02-27 20:27:00

by Andrew Morton

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> Any decision made on what to do about this one? Mike or I can
> repost the per cpu allocator against mm? The fix by Eric could be used
> in the interim for 2.6.24?
>

I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
as I test stuff).

It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
mess sorted out. If it's a mess - I _think_ it is?

2008-02-27 21:57:31

by Christoph Lameter

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Wed, 27 Feb 2008, Andrew Morton wrote:

> On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
> > Any decision made on what to do about this one? Mike or I can
> > repost the per cpu allocator against mm? The fix by Eric could be used
> > in the interim for 2.6.24?
> >
>
> I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
> as I test stuff).

Urgh. You too?

> It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
> mess sorted out. If it's a mess - I _think_ it is?

Well, I tried it when SLUB first went in and it did not go well. The issue
is that x86 detects the cache line size at bootup, hence cache_line_size().
Most of the other arches have compile-time cache line sizes, hence
L1_CACHE_BYTES. So L1_CACHE_BYTES is the maximum value that
cache_line_size() can take.

What I was attempting to do was to make x86 have one compile-time cache
line size, L1_CACHE_BYTES. That raised objections because space was wasted.
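
Roughly what the two mechanisms look like on x86 of that era (a sketch; exact definitions vary by kernel version):

/* Compile time: a worst-case constant chosen by Kconfig (up to 128 bytes). */
#define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)

/* Boot time: the real line size read from CPUID. */
#define cache_line_size()	(boot_cpu_data.x86_cache_alignment)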

2008-03-01 13:53:42

by Eric Dumazet

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

Andrew Morton a écrit :
> On Wed, 27 Feb 2008 11:59:32 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
>> Any decision made on what to do about this one? Mike or I can
>> repost the per cpu allocator against mm? The fix by Eric could be used
>> in the interim for 2.6.24?
>>
>
> I suppose I'll merge Eric's patch when I've tested it fully (well, as fully
> as I test stuff).
>
> It'd be nice to get that cache_line_size()/L1_CACHE_BYTES/L1_CACHE_ALIGN()
> mess sorted out. If it's a mess - I _think_ it is?

Just coming back from holidays, sorry for the delay.

I can provide a patch so that L1_CACHE_BYTES is no longer a compile-time
constant if you want, but I am not sure it is worth the trouble? (and this
is certainly not 2.6.{24|25} stuff :))

Current situation:

L1_CACHE_BYTES is known at compile time, and can be quite large (128 bytes),
while cache_line_size() gives the real cache line size selected at boot time
given the hardware capabilities.

If L1_CACHE_BYTES is no longer a constant, the compiler will also use plain
divides to compute L1_CACHE_ALIGN().

Maybe uses of L1_CACHE_ALIGN() in fast paths would 'force' us to declare not
only a cache_line_size() but also cache_line_size_{mask|shift}() so that
x86 could use:

#define L1_CACHE_ALIGN(x) ((((x) + cache_line_mask()) >> cache_line_shift()) << cache_line_shift())

#define L1_CACHE_BYTES (cache_line_size())

But I am not sure we want to play these games (we must also make sure nothing
in the tree wants a constant L1_CACHE_BYTES and replace it by SMP_CACHE_BYTES).
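
A sketch of the helpers such a scheme would need (hypothetical, matching the names in the macro above; the shift would be filled in from CPUID at boot):

/* Hypothetical boot-time state behind cache_line_{size,mask,shift}(). */
static int cacheline_shift __read_mostly = L1_CACHE_SHIFT;

#define cache_line_shift()	(cacheline_shift)
#define cache_line_size()	(1 << cache_line_shift())
#define cache_line_mask()	(cache_line_size() - 1)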

2008-03-03 03:15:41

by Nick Piggin

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
> On Sat, 23 Feb 2008, Nick Piggin wrote:
> > What I don't understand is why the slab allocators have something like
> > this in it:
> >
> > if ((flags & SLAB_HWCACHE_ALIGN) &&
> > size > cache_line_size() / 2)
> > return max_t(unsigned long, align, cache_line_size());
> >
> > If you ask for HWCACHE_ALIGN, then you should get it. I don't
> > understand, why do they think they knows better than the caller?
>
> Tradition.... Its irks me as well.
>
> > Things like this are just going to lead to very difficult to track
> > performance problems. Possibly correctness problems in rare cases.
> >
> > There could be another flag for "maybe align".
>
> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
> code.
>
> If we all agree then we could change this to have must have semantics? It
> has the potential of enlarging objects for small caches.
>
> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
> requirements of the architecture that the kernel is build on. We may be in
> for some surprises if we change this.

I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
If some structures get too bloated for no good reason, then the problem
is not with the slab allocator but with the caller asking for
HWCACHE_ALIGN.

2008-03-03 07:49:05

by Eric Dumazet

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

Nick Piggin a écrit :
> On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
>> On Sat, 23 Feb 2008, Nick Piggin wrote:
>>> What I don't understand is why the slab allocators have something like
>>> this in it:
>>>
>>> if ((flags & SLAB_HWCACHE_ALIGN) &&
>>> size > cache_line_size() / 2)
>>> return max_t(unsigned long, align, cache_line_size());
>>>
>>> If you ask for HWCACHE_ALIGN, then you should get it. I don't
>>> understand, why do they think they knows better than the caller?
>> Tradition.... Its irks me as well.
>>
>>> Things like this are just going to lead to very difficult to track
>>> performance problems. Possibly correctness problems in rare cases.
>>>
>>> There could be another flag for "maybe align".
>> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
>> code.
>>
>> If we all agree then we could change this to have must have semantics? It
>> has the potential of enlarging objects for small caches.
>>
>> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
>> requirements of the architecture that the kernel is build on. We may be in
>> for some surprises if we change this.
>
> I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> If some structures get too bloated for no good reason, then the problem
> is not with the slab allocator but with the caller asking for
> HWCACHE_ALIGN.
>

HWCACHE_ALIGN is commonly used, even for large structures, because the
processor cache line size on x86 is not known at compile time (it can range
from 32 bytes to 128 bytes).

The problem that the above code is trying to address is small objects.

At the time the code using HWCACHE_ALIGN was written, the cache line size was
32 bytes. Now that we have CPUs with 128-byte cache lines, we would waste
space if SLAB_HWCACHE_ALIGN were honored for small objects.
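
A rough worked example of that waste (illustrative numbers):

	32-byte object, 32-byte lines (then): roundup(32, 32)  - 32 =  0 bytes wasted per object
	32-byte object, 128-byte lines (now): roundup(32, 128) - 32 = 96 bytes wasted per object (75%)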

Some occurrences of SLAB_HWCACHE_ALIGN are certainly not useful; we should
zap them. The last one I removed was the one for "struct flow_cache_entry"
(commit dd5a1843d566911dbb077c4022c4936697495af6: [IPSEC] flow: reorder
"struct flow_cache_entry" and remove SLAB_HWCACHE_ALIGN).

2008-03-03 09:42:55

by Nick Piggin

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Monday 03 March 2008 18:48, Eric Dumazet wrote:
> Nick Piggin a écrit :
> > On Thursday 28 February 2008 06:44, Christoph Lameter wrote:
> >> On Sat, 23 Feb 2008, Nick Piggin wrote:
> >>> What I don't understand is why the slab allocators have something like
> >>> this in it:
> >>>
> >>> if ((flags & SLAB_HWCACHE_ALIGN) &&
> >>> size > cache_line_size() / 2)
> >>> return max_t(unsigned long, align, cache_line_size());
> >>>
> >>> If you ask for HWCACHE_ALIGN, then you should get it. I don't
> >>> understand, why do they think they knows better than the caller?
> >>
> >> Tradition.... Its irks me as well.
> >>
> >>> Things like this are just going to lead to very difficult to track
> >>> performance problems. Possibly correctness problems in rare cases.
> >>>
> >>> There could be another flag for "maybe align".
> >>
> >> SLAB_HWCACHE_ALIGN *is* effectively a maybe align flag given the above
> >> code.
> >>
> >> If we all agree then we could change this to have must have semantics?
> >> It has the potential of enlarging objects for small caches.
> >>
> >> SLAB_HWCACHE_ALIGN has an effect that varies according to the alignment
> >> requirements of the architecture that the kernel is build on. We may be
> >> in for some surprises if we change this.
> >
> > I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> > If some structures get too bloated for no good reason, then the problem
> > is not with the slab allocator but with the caller asking for
> > HWCACHE_ALIGN.
>
> HWCACHE_ALIGN is commonly used, even for large structures, because the
> processor cache line on x86 is not known at compile time (can go from 32
> bytes to 128 bytes).

Sure.


> The problem that above code is trying to address is about small objects.
>
> Because at the time code using HWCACHE_ALIGN was written, cache line size
> was 32 bytes. Now we have CPU with 128 bytes cache lines, we would waste
> space if SLAB_HWCACHE_ALIGN was honored for small objects.

I understand that, but I don't think it is a good reason.
SLAB_HWCACHE_ALIGN should only be specified if it is really needed.
If it is not really needed, it should not be specified. And if it
is, then the allocator should not disregard it.

But let's see. There is a valid case where we want to align to a
power of 2 >= objsize and <= hw cache size. That is if we carefully
pack objects so that we know where cacheline boundaries are and only
take the minimum number of cache misses to access them, but are not
concerned about false sharing. That appears to be what HWCACHE_ALIGN
is for, but SLUB does not really get that right either, because it
drops that alignment restriction completely if the object is <= the
cache line size. It should use the same calculation that SLAB uses.
I would have preferred it to be called something else...

For the case where we want to avoid false sharing, we need a new
SLAB_SMP_ALIGN, which always pads out to cacheline size, but only
for num_possible_cpus() > 1.
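
A hypothetical sketch of what such a SLAB_SMP_ALIGN could look like (no such flag exists in the tree):

/* Hypothetical: pad to a full cache line, but only when false sharing is
 * actually possible, i.e. on a machine with more than one possible CPU.
 */
#define SLAB_SMP_ALIGN		0x02000000UL	/* hypothetical bit */

static unsigned long apply_smp_align(unsigned long flags, unsigned long align)
{
	if ((flags & SLAB_SMP_ALIGN) && num_possible_cpus() > 1)
		return max_t(unsigned long, align, cache_line_size());
	return align;
}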

That still leaves the problem of how to align kmalloc(). SLAB gives
it HWCACHE_ALIGN by default. Why not do the same for SLUB (which
could be changed if CONFIG_SMALL is set)? That would give a more
consistent allocation pattern, at least (eg. you wouldn't get your
structures suddenly straddling cachelines if you reduce it from 100
bytes to 96 bytes...).

And for kmalloc that requires SMP_ALIGN, I guess it is impossible.
Maybe the percpu allocator could just have its own kmem_cache of
size cache_line_size() and use that for all allocations <= that size.
Then just let the scalemp guys worry about using that wasted padding
for same-CPU allocations ;)
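
A hedged sketch of that idea (names made up, not an actual patch): a dedicated kmem_cache whose objects are exactly one cache line, with larger requests falling back to kmalloc_node():

/* Sketch only: a per-cpu allocator backed by a cache-line-sized kmem_cache. */
static struct kmem_cache *pcpu_line_cachep;

static int __init pcpu_line_cache_init(void)
{
	pcpu_line_cachep = kmem_cache_create("percpu_line",
					     cache_line_size(),
					     cache_line_size(), 0, NULL);
	return pcpu_line_cachep ? 0 : -ENOMEM;
}

static void *pcpu_alloc_chunk(size_t size, gfp_t gfp, int node)
{
	if (size <= cache_line_size())
		return kmem_cache_alloc_node(pcpu_line_cachep, gfp, node);
	return kmalloc_node(size, gfp, node);
}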

And I guess if there is some other situation where alignment is
required, it could be specified explicitly.

> Some occurences of SLAB_HWCACHE_ALIGN are certainly not usefull, we should
> zap them. Last one I removed was the one for "struct flow_cache_entry"
> (commit dd5a1843d566911dbb077c4022c4936697495af6 : [IPSEC] flow: reorder
> "struct flow_cache_entry" and remove SLAB_HWCACHE_ALIGN)

Sure.

But in general it isn't always easy to tell what should be aligned
and what should not. If you have a set of smallish objects that you
look up essentially at random and that are not likely to be in cache,
then SLAB_HWCACHE_ALIGN can be a good idea so that you hit as few
cachelines as possible when doing the lookup.

2008-03-03 19:30:34

by Christoph Lameter

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On Mon, 3 Mar 2008, Nick Piggin wrote:

> I think so. If we ask for HWCACHE_ALIGN, it must be for a good reason.
> If some structures get too bloated for no good reason, then the problem
> is not with the slab allocator but with the caller asking for
> HWCACHE_ALIGN.

Right. There should only be a few select users of this strange flag.
Otherwise, if a certain alignment is wanted, specify it in the alignment
parameter passed to kmem_cache_create().

2008-03-11 18:15:56

by Mike Snitzer

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On 2/21/08, Eric Dumazet <[email protected]> wrote:
> Some oprofile results obtained while using tbench on a 2x2 cpu machine
> were very surprising.
>
> For example, loopback_xmit() function was using high number of cpu
> cycles to perform
> the statistic updates, supposed to be real cheap since they use percpu data
>
> pcpu_lstats = netdev_priv(dev);
> lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> lb_stats->packets++; /* HERE : serious contention */
> lb_stats->bytes += skb->len;
>
>
> struct pcpu_lstats is a small structure containing two longs. It appears
> that on my 32bits platform,
> alloc_percpu(8) allocates a single cache line, instead of giving to
> each cpu a separate
> cache line.
>
> Using the following patch gave me impressive boost in various benchmarks
> ( 6 % in tbench)
> (all percpu_counters hit this bug too)
>
> Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
> block of memory, so that we
> dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuff of
> course...
>
> Note : SLUB vs SLAB is important here to *show* the improvement, since
> they dont have the same minimum
> allocation sizes (8 bytes vs 32 bytes).
> This could very well explain regressions some guys reported when they
> switched to SLUB.


I see that this fix was committed to mainline as commit
be852795e1c8d3829ddf3cb1ce806113611fa555

The commit didn't "Cc: <[email protected]>", and it doesn't appear to
be queued for 2.6.24.x. Should it be?

If I understand you correctly, SLAB doesn't create this particular
cache thrashing on 32-bit systems? Is SLAB OK on other architectures
too? Can you (or others) comment on the importance of this fix
relative to x86_64 (64-byte cache line) and SLAB?

I'm particularly interested in this given the use of percpu_counters
with the per bdi write throttling.

Mike

2008-03-11 19:06:51

by Eric Dumazet

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

Mike Snitzer a écrit :
> On 2/21/08, Eric Dumazet <[email protected]> wrote:
>
>> Some oprofile results obtained while using tbench on a 2x2 cpu machine
>> were very surprising.
>>
>> For example, loopback_xmit() function was using high number of cpu
>> cycles to perform
>> the statistic updates, supposed to be real cheap since they use percpu data
>>
>> pcpu_lstats = netdev_priv(dev);
>> lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
>> lb_stats->packets++; /* HERE : serious contention */
>> lb_stats->bytes += skb->len;
>>
>>
>> struct pcpu_lstats is a small structure containing two longs. It appears
>> that on my 32bits platform,
>> alloc_percpu(8) allocates a single cache line, instead of giving to
>> each cpu a separate
>> cache line.
>>
>> Using the following patch gave me impressive boost in various benchmarks
>> ( 6 % in tbench)
>> (all percpu_counters hit this bug too)
>>
>> Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
>> block of memory, so that we
>> dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuffof
>> course...
>>
>> Note : SLUB vs SLAB is important here to *show* the improvement, since
>> they dont have the same minimum
>> allocation sizes (8 bytes vs 32 bytes).
>> This could very well explain regressions some guys reported when they
>> switched to SLUB.
>>
>
>
> I see that this fix was committed to mainline as commit
> be852795e1c8d3829ddf3cb1ce806113611fa555
>
> The commit didn't "Cc: <[email protected]>", and it doesn't appear to
> be queued for 2.6.24.x. Should it be?
>
>
Yes, it should be queued for 2.6.24.x.

> If I understand you correctly, SLAB doesn't create this particular
> cache thrashing on 32bit systems? Is SLAB ok on other architectures
> too? Can you (or others) comment on the importance of this fix
> relative to x86_64 (64byte cacheline) and SLAB?
>
>

The fix is important both for 32-bit and 64-bit kernels, SLAB or SLUB.

SLAB does have this problem, but it is less prevalent than with SLUB, because
these allocators don't have the same minimal allocation size (32 bytes vs 8
bytes).

So with SLUB, it is possible that 8 CPUs share the same 64-byte cache line to
store their percpu counters, while only 2 CPUs can share the same cache line
with the SLAB allocator.
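
Spelling out the arithmetic behind those figures (64-byte cache line, 8-byte struct pcpu_lstats):

	SLUB:    minimum allocation  8 bytes -> 64 / 8  = up to 8 CPUs' counters per cache line
	SLAB:    minimum allocation 32 bytes -> 64 / 32 = up to 2 CPUs' counters per cache line
	patched: roundup(8, 64) = 64 bytes   -> exactly 1 CPU's counters per cache line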

> I'm particularly interested in this given the use of percpu_counters
> with the per bdi write throttling.
>
> Mike
> --
>


2008-03-11 19:39:20

by Mike Snitzer

Subject: Re: [PATCH] alloc_percpu() fails to allocate percpu data

On 3/11/08, Eric Dumazet <[email protected]> wrote:
> Mike Snitzer a écrit :
>
> > On 2/21/08, Eric Dumazet <[email protected]> wrote:
> >
> >> Some oprofile results obtained while using tbench on a 2x2 cpu machine
> >> were very surprising.
> >>
> >> For example, loopback_xmit() function was using high number of cpu
> >> cycles to perform
> >> the statistic updates, supposed to be real cheap since they use percpu data
> >>
> >> pcpu_lstats = netdev_priv(dev);
> >> lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> >> lb_stats->packets++; /* HERE : serious contention */
> >> lb_stats->bytes += skb->len;
> >>
> >>
> >> struct pcpu_lstats is a small structure containing two longs. It appears
> >> that on my 32bits platform,
> >> alloc_percpu(8) allocates a single cache line, instead of giving to
> >> each cpu a separate
> >> cache line.
> >>
> >> Using the following patch gave me impressive boost in various benchmarks
> >> ( 6 % in tbench)
> >> (all percpu_counters hit this bug too)
> >>
> >> Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
> >> block of memory, so that we
> >> dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuffof
> >> course...
> >>
> >> Note : SLUB vs SLAB is important here to *show* the improvement, since
> >> they dont have the same minimum
> >> allocation sizes (8 bytes vs 32 bytes).
> >> This could very well explain regressions some guys reported when they
> >> switched to SLUB.
> >>
> >
> >
> > I see that this fix was committed to mainline as commit
> > be852795e1c8d3829ddf3cb1ce806113611fa555
> >
> > The commit didn't "Cc: <[email protected]>", and it doesn't appear to
> > be queued for 2.6.24.x. Should it be?
> >
> >
>
> Yes, it should be queued fo 2.6.24.x

That means both of the following commits need to be cherry-picked into 2.6.24.x:
b3242151906372f30f57feaa43b4cac96a23edb1
be852795e1c8d3829ddf3cb1ce806113611fa555

> > If I understand you correctly, SLAB doesn't create this particular
> > cache thrashing on 32bit systems? Is SLAB ok on other architectures
> > too? Can you (or others) comment on the importance of this fix
> > relative to x86_64 (64byte cacheline) and SLAB?
> >
> >
>
>
> Fix is important both for 32 and 64 bits kernels, SLAB or SLUB.
>
> SLAB does have this problem, but less prevalent than SLUB, because these
> allocators dont have the same minimal size allocation (32 vs 8)
>
> So with SLUB, it is possible that 8 CPUS share the same 64 bytes
> cacheline to store their percpu counters, while only 2 cpus can share
> this same cache line with SLAB allocator.

Thanks for the clarification.

2008-03-12 00:20:01

by Chris Wright

Subject: Re: [stable] [PATCH] alloc_percpu() fails to allocate percpu data

* Mike Snitzer ([email protected]) wrote:
> On 3/11/08, Eric Dumazet <[email protected]> wrote:
> > Mike Snitzer a écrit :
> > > I see that this fix was committed to mainline as commit
> > > be852795e1c8d3829ddf3cb1ce806113611fa555
> > >
> > > The commit didn't "Cc: <[email protected]>", and it doesn't appear to
> > > be queued for 2.6.24.x. Should it be?
> >
> > Yes, it should be queued fo 2.6.24.x
>
> That means both of the following commits need to be cherry-picked into 2.6.24.x:
> b3242151906372f30f57feaa43b4cac96a23edb1
> be852795e1c8d3829ddf3cb1ce806113611fa555

Just send each with appropriate commit hash id and relevant Cc's to
[email protected] so we can pick them up.

thanks,
-chris