2008-12-15 13:47:46

by Steven Rostedt

Subject: local_add_return

Hi Rusty,

I'm trying to make the ring buffers lockless and reentrant, and it is
slowly getting there. The local_add_return is to reserve a part of the ring
buffer even when an interrupt can come in and reserve part of that same
ring buffer. The atomic add here only has to be atomic with respect to
interrupts.

On Intel, there is no reason to use a LOCK increment, since the ring
buffers are per cpu. A simple asm inc would work. I was thinking that is
what local_add_return gives me. I could replace the local_add_returns with
atomic_add_return, but that just seems to add overhead for archs that do
not need it.

Shouldn't local_add_return be a way for archs that can increment a memory
location atomically against interrupts to use that infrastructure? It can
simply fall back to atomic_add_return for those archs that do not have
a lesser equivalent of atomic_add_return.
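
To illustrate, here is a rough sketch of the kind of reservation I mean
(hypothetical struct and field names, not the actual ring-buffer code):

#include <asm/local.h>

struct rb_page {
        local_t         write;          /* bytes reserved so far */
        unsigned char   data[4096 - sizeof(local_t)];
};

/*
 * Reserve len bytes.  An interrupt arriving between the add and the use
 * of the returned value simply reserves the space after ours, because
 * local_add_return() is atomic with respect to local interrupts.
 */
static void *rb_reserve(struct rb_page *page, unsigned int len)
{
        long tail = local_add_return(len, &page->write);

        if (tail > sizeof(page->data))
                return NULL;            /* page full, caller moves on */
        return page->data + tail - len;
}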

-- Steve


2008-12-16 06:33:17

by Rusty Russell

Subject: Re: local_add_return

On Tuesday 16 December 2008 00:17:35 Steven Rostedt wrote:
> Shouldn't local_add_return be a way for archs that can increment a memory
> location atomically against interrupts to use that infrastructure? It can
> simply fall back to atomic_add_return for those archs that do not have
> a lesser equivalent of atomic_add_return.

local_t was originally introduced for (but never actually used by) the
SNMP counters. They use two counters to avoid atomics, but as the ancient
comment says:

/*
* FIXME: On x86 and some other CPUs the split into user and softirq parts
* is not needed because addl $1,memory is atomic against interrupts (but
* atomic_inc would be overkill because of the lock cycles). Wants new
* nonlocked_atomic_inc() primitives -AK
*/
#define DEFINE_SNMP_STAT(type, name) \
__typeof__(type) *name[2]
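
The two-counter trick works roughly like this (simplified sketch with
made-up macro names; the real macros in include/net/snmp.h also do the
per-cpu addressing):

/* One slot for softirq context, one for process context, so the two never
 * race with each other; reads sum both halves. */
#define SNMP_INC_SKETCH(mib, field)     ((mib)[!in_softirq()]->field++)
#define SNMP_READ_SKETCH(mib, field)    ((mib)[0]->field + (mib)[1]->field)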

Then last year Mathieu sent (and Andrew accepted) a "rich set of atomic
operations", including excellent documentation "local_ops.txt". Except
he thought they were atomics, so treated them accordingly. Also, there
were no users (you're now the only one).

But if these new operations are to become the norm, it changes how archs
should implement local_t, e.g. trivalue becomes less attractive, atomic_long
more. x86 has its own implementation so doesn't have these issues.
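
By "trivalue" I mean one counter per context class, so an update can never
be torn by an interrupt. Roughly (an illustrative sketch, names made up):

#include <linux/hardirq.h>

struct trivalue {
        unsigned long v[3];     /* [0]=hardirq, [1]=softirq, [2]=process */
};

static inline void trivalue_inc(struct trivalue *t)
{
        /* Pick the slot owned by the current context. */
        unsigned int idx = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))
                         + !(preempt_count() & HARDIRQ_MASK);
        t->v[idx]++;
}

static inline unsigned long trivalue_read(const struct trivalue *t)
{
        return t->v[0] + t->v[1] + t->v[2];     /* reads sum all three */
}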

Now, I posted a benchmark patch before for archs to test. I'm interested
in Sparc64. Does any arch win from using multiple counters? PowerPC has
soft interrupt disable, so that solution wins over atomic_long_t for them.

Cheers,
Rusty.

2008-12-16 06:57:17

by David Miller

Subject: Re: local_add_return

From: Rusty Russell <[email protected]>
Date: Tue, 16 Dec 2008 17:03:00 +1030

> Now, I posted a benchmark patch before for archs to test. I'm interested
> in Sparc64. Does any arch win from using multiple counters? PowerPC has
> soft interrupt disable, so that solution wins over atomic_long_t for them.

I'll get you some sparc64 numbers.

2008-12-16 07:13:22

by David Miller

Subject: Re: local_add_return

From: Rusty Russell <[email protected]>
Date: Tue, 16 Dec 2008 17:03:00 +1030

> Now, I posted a benchmark patch before for archs to test. I'm interested
> in Sparc64.

Here ya go:

UltraSPARC-IIIi:

atomic_long: local_inc=8180000021/243
local_add=7469999999/222
cpu_local_inc=1260000001/37
local_read=5690000000/169
(total was 1326071152640)

irqsave/restore:
local_inc=6909999997/205
local_add=6899999996/205
cpu_local_inc=860000000/25
local_read=5690000005/169
(total was 1326071152640)

trivalue:
local_inc=6490000000/193
local_add=6500000000/193
cpu_local_inc=370000002/11
local_read=7419999998/221
(total was 1326071152640)

local_t:
local_inc=7440000000/221
local_add=7440000000/221
cpu_local_inc=1260000004/37
local_read=5689999997/169
(total was 1326071152640, warm_total 34443624448)

Niagara-2:

atomic_long:
local_inc=6960000013/207
local_add=6940000002/206
cpu_local_inc=2440000002/72
local_read=5390000000/160
(total was 1326071152640)

irqsave/restore:
local_inc=7660000002/228
local_add=7669999999/228
cpu_local_inc=2650000004/78
local_read=5379999999/160
(total was 1326071152640)

trivalue:
local_inc=5789999998/172
local_add=5789999995/172
cpu_local_inc=689999994/20
local_read=7470000000/222
(total was 1326071152640)

local_t:
local_inc=6940000000/206
local_add=6950000000/207
cpu_local_inc=2460000000/73
local_read=5390000004/160
(total was 1326071152640, warm_total 34443624448)

2008-12-16 16:25:43

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Tuesday 16 December 2008 00:17:35 Steven Rostedt wrote:
> > Shouldn't local_add_return be a way for archs that can increment a memory
> > location atomically against interrupts to use that infrastructure? It can
> > simply fall back to atomic_add_return for those archs that do not have
> > a lesser equivalent of atomic_add_return.
>
> local_t was originally introduced (but actually never used for) the
> SNMP counters. They use two counters to avoid atomics, but as the ancient
> comment says:
>
> /*
> * FIXME: On x86 and some other CPUs the split into user and softirq parts
> * is not needed because addl $1,memory is atomic against interrupts (but
> * atomic_inc would be overkill because of the lock cycles). Wants new
> * nonlocked_atomic_inc() primitives -AK
> */
> #define DEFINE_SNMP_STAT(type, name) \
> __typeof__(type) *name[2]
>
> Then last year Mathieu sent (and Andrew accepted) a "rich set of atomic
> operations", including excellent documentation "local_ops.txt". Except
> he thought they were atomics, so treated them accordingly. Also, there
> were no users (you're now the only one).
>
> But if these new operations are to become the norm, it changes how archs
> should implement local_t. eg. trivalue becomes less attractive, atomic_long
> more. x86 has its own implementation so doesn't have these issues.
>
> Now, I posted a benchmark patch before for archs to test. I'm interested
> in Sparc64. Does any arch win from using multiple counters? PowerPC has
> soft interrupt disable, so that solution wins over atomic_long_t for them.
>

Hi Rusty,

I'd like to comment on your test case found at
http://groups.google.com/group/linux.kernel/msg/98c512fceda26351

Specifically on this comment :

+/* There are three obvious ways to implement local_t on an arch which
+ * can't do single-instruction inc/dec etc.
+ * 1) atomic_long
+ * 2) irq_save/irq_restore
+ * 3) multiple counters.
+ *
+ * This does a very rough benchmark on each one.
+ */

Option 3) is not workable for tracers, because it's not safe against
some exceptions (e.g. some hardware errors) or NMIs. Also, local_t
operations must have preemption disabled before playing with per-cpu data,
which I don't see in your test. This has to be taken into account in the
runtime cost. The "multiple counters" option should also disable
preemption, because a thread being moved to another CPU could corrupt
some other thread's data when being rescheduled.

Only two alternatives do not have this preempt_disable() requirement:
atomic_long_t and the CPU_OPS work done by Christoph Lameter, which uses
segments to address the per-cpu data, which effectively removes the need
for disabling preemption around local_t operations because the CPU ID
becomes encoded in a cpu register.

Otherwise, you can be moved to a different CPU between the moment you
read the CPU ID and the moment you access the local data, which can lead
to corruption with local_t and multiple counters options.
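
In other words, every update has to look roughly like this (a sketch;
my_counter is a hypothetical per-cpu counter):

#include <linux/percpu.h>
#include <asm/local.h>

DEFINE_PER_CPU(local_t, my_counter);

static void count_event(void)
{
        /* Without the preempt_disable() pair we could read the CPU ID,
         * get migrated, and then increment another CPU's counter. */
        preempt_disable();
        local_inc(&__get_cpu_var(my_counter));
        preempt_enable();
}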

Cheers,

Mathieu

> Cheers,
> Rusty.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-18 22:53:18

by Rusty Russell

Subject: Re: local_add_return

On Wednesday 17 December 2008 10:31:55 Mathieu Desnoyers wrote:
> I think we have two different use-cases here :
>
> - local_t is useful as-is for things such as a tracer, which need to
> modify an element of data atomically wrt local interrupts. The
> atomic_long_t, in this case, is the correct fallback.
> - local_count_t could be used for fast counters.

Hi Mathieu,

Complete agreement.

I guess I'm biased towards local_t == counter version, something else
== nmi-safe version because that's what it was originally. Looking through
the tree, there are only 5 users: module, dmaengine and percpu_counter want
a counter, and tracing and x86 nmi.c want nmi-safe. There are several other
places I know of which want local_t-the-counter.

I'll prepare a patch which adds nmi_safe_t, and see how it looks. There's
no amazing hurry on this, so I won't race to hit the merge window.

Thanks!
Rusty.

2008-12-19 03:35:32

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Wednesday 17 December 2008 10:31:55 Mathieu Desnoyers wrote:
> > I think we have two different use-cases here :
> >
> > - local_t is useful as-is for things such as a tracer, which need to
> > modify an element of data atomically wrt local interrupts. The
> > atomic_long_t, in this case, is the correct fallback.
> > - local_count_t could be used for fast counters.
>
> Hi Mathieu,
>
> Complete agreement.
>
> I guess I'm biassed towards local_t == counter version, something else
> == nmi-safe version because that's what it was originally. Looking through
> the tree, there are only 5 users: module, dmaengine and percpu_counter want
> a counter, and tracing and x86 nmi.c want nmi-safe. There are several other
> places I know of which want local_t-the-counter.
>
> I'll prepare a patch which adds nmi_safe_t, and see how it looks. There's
> no amazing hurry on this, so I won't race to hit the merge window.
>

OK,

But can we turn what you call "nmi_safe_t" into "local_atomic_t" then?
Because we have to specify that this type must only be used as part of
per-cpu data with preemption disabled, and we also specify that it is
atomic.

Plus, nmi_safe_t does not make much sense on architectures without NMIs,
where we sometimes disable interrupts to make the modification "atomic"
wrt all other interrupts that can happen.

Mathieu

> Thanks!
> Rusty.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-19 05:55:04

by Rusty Russell

Subject: Re: local_add_return

On Friday 19 December 2008 14:05:14 Mathieu Desnoyers wrote:
> * Rusty Russell ([email protected]) wrote:
> But can we turn what you call "nmi_safe_t" into "local_atomic_t" then ?
> Because we have to specify that this type must only be used as part of
> per-cpu data with preemption disabled, and we also specify that it is
> atomic.
>
> Plus, nmi_safe_t does not make much sense on architectures without NMIs,
> where we sometimes disable interrupts to make the modification "atomic"
> wrt all other interrupts that can happen.

But those archs can use local_t. I don't like either name, local_t or
atomic_local_t, but renaming sucks too.

OK, how about a different approach? Since there's really only one case
where we need this local_t property outside arch-specific code, how about
we define ARCH_LOCAL_T_TRACE_SAFE for x86?

Then some trace-specific typedef like "trace_counter_t" which goes to local_t
or atomic_(long?)_t?

Should be a simple patch and pretty clear.

Thanks,
Rusty.

2008-12-19 17:06:42

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Friday 19 December 2008 14:05:14 Mathieu Desnoyers wrote:
> > * Rusty Russell ([email protected]) wrote:
> > But can we turn what you call "nmi_safe_t" into "local_atomic_t" then ?
> > Because we have to specify that this type must only be used as part of
> > per-cpu data with preemption disabled, and we also specify that it is
> > atomic.
> >
> > Plus, nmi_safe_t does not make much sense on architectures without NMIs,
> > where we sometimes disable interrupts to make the modification "atomic"
> > wrt all other interrupts that can happen.
>
> But those archs can use local_t. I don't like either name local_t nor
> atomic_local_t, but renaming sucks too.
>
> OK, how about a different approach? Since there's really only one case
> where we need this local_t property outside arch-specific code, how about
> we define ARCH_LOCAL_T_TRACE_SAFE for x86?
>
> Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> or atomic_(long?)_t?
>
> Should be a simple patch and pretty clear.
>

Hrm, is it me, or does linking a basic type definition to a single user seem
like the wrong approach?

The idea behind declaring new types is, to me, that they should describe
as generally as possible what they provide and what they are. If we
think of the future, where we might want to use such local atomic types
for other purposes than tracing, I think we will end up regretting such
specific naming scheme. I don't think the argument "because the type has
only one arch-independent user" holds, because the idea behind new types
is that they _will_ be used by others eventually. For instance, we've
done some work on moving the slub allocator to such local atomic
operations last year, and it gave very good results on architectures
where disabling interrupts is costly (a threefold acceleration of the
fastpath).

In your trace_counter_t proposal, you don't take into account that (what
I call) local_atomic_long_t is a _new_ primitive, which cannot be
implemented by a trivalue and differs from atomic_long_t, on more
architectures than x86. On MIPS and PowerPC, at least, it can be
implemented as an atomic operation without the memory barriers, which
improves performance a lot.

I think the following scheme would be pretty simple and yet not tied to
any specific user :

local_long_t
- Fast per-cpu counter, not necessarily atomic.
Implements long trivalues, or uses local_atomic_long_t.
local_atomic_long_t
- Fast per-cpu atomic counter.
Implements per-cpu atomic counters or uses atomic_long_t.
atomic_long_t
- Global atomic counter.
Implements globally synchronized atomic operations.

We could do the same with "int" type for :
local_t
local_atomic_t
atomic_t

If we need smaller counters.
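
In terms of fallbacks, I picture something like the following sketch (the
type names are the proposal above; the config symbols are hypothetical):

#ifdef ARCH_HAS_LOCAL_ATOMIC_LONG               /* hypothetical symbol */
/* arch provides irq-atomic, barrier-free per-cpu ops */
typedef struct { long v; } local_atomic_long_t;
#else
typedef atomic_long_t local_atomic_long_t;      /* fall back to global atomics */
#endif

#ifdef ARCH_WANTS_TRIVALUE_COUNTERS             /* hypothetical symbol */
typedef struct { long v[3]; } local_long_t;     /* trivalue counter */
#else
typedef local_atomic_long_t local_long_t;       /* or reuse the atomic flavour */
#endif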

Mathieu


> Thanks,
> Rusty.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-20 01:34:21

by Rusty Russell

Subject: Re: local_add_return

On Saturday 20 December 2008 03:36:27 Mathieu Desnoyers wrote:
> * Rusty Russell ([email protected]) wrote:
> > Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> > or atomic_(long?)_t?
> >
> > Should be a simple patch and pretty clear.
>
> Hrm, is it me or linking a basic type definition to a single user seems
> like the wrong approach ?

Well, it's an ongoing debate. Old school kernel coders believe that
infrastructure should be resisted: you implement what you need to, then
if it turns out to be generically useful you put it somewhere that the
second user can access it.

Otherwise we end up with unused infrastructure, or overspecialized
infrastructure which doesn't actually meet the general need. local_t
displays both these properties.

> The idea behind declaring new types is, to me, that they should describe
> as generally as possible what they provide and what they are. If we
> think of the future, where we might want to use such local atomic types
> for other purposes than tracing, I think we will end up regretting such
> specific naming scheme.

I can be convinced, but I'll need more than speculation. Assuming
local_long_atomic_t, can you produce a patch which uses it somewhere else?

> local_atomic_long_t is a _new_ primitive, which cannot be
> implemented by a trivalue and differs from atomic_long_t, on more
> architectures than x86. On mips and powerpc, at least, it can be
> implemented as an atomic operation without the memory barriers, which
> improves performances a lot.

OK, you lost me here. I don't see a memory barrier in the powerpc atomic
ops. Is there an implied one I missed?

> I think the following scheme would be pretty simple and yet not tied to
> any specific user :
>
> local_long_t
> - Fast per-cpu counter, not necessarily atomic.
> Implements long trivalues, or uses local_atomic_long_t.
> local_atomic_long_t
> - Fast per-cpu atomic counter.
> Implements per-cpu atomic counters or uses atomic_long_t.

From the point of view of someone trying to decide what to use, the real
difference is: use local_long_t unless you need the atomic-style operators
which are only available on local_atomic_long_t, or NMI-safe behaviour.
Is this correct?

> We could do the same with "int" type for :
> local_t
> local_atomic_t
> atomic_t
>
> If we need smaller counters.

Erk... no, renaming one to two is bad enough. Changing the semantics of
one and introducing three more is horrible.

If we're going to rename, I'd choose local_counter_t and local_atomic_t
(both long: I don't think there's a real penalty, is there?).

Thanks,
Rusty.

2008-12-22 18:43:58

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Saturday 20 December 2008 03:36:27 Mathieu Desnoyers wrote:
> > * Rusty Russell ([email protected]) wrote:
> > > Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> > > or atomic_(long?)_t?
> > >
> > > Should be a simple patch and pretty clear.
> >
> > Hrm, is it me or linking a basic type definition to a single user seems
> > like the wrong approach ?
>
> Well, it's an ongoing debate. Old school kernel coders believe that
> infrastructure should be resisted: you implement what you need to, then
> if it turns out to be generically useful you put it somewhere that the
> second user can access it.
>
> Otherwise we end up with unused infrastructure, or overspecialized
> infrastructure which doesn't actually meet the general need. local_t
> displays both these properties.
>

Yes... well, given that every iteration on that kind of primitive touches _all_
architectures supported by Linux, I think it's good to think a bit about
the design in advance to minimize the amount of wasted effort, especially
because it requires some coordination between many architecture
maintainers.

> > The idea behind declaring new types is, to me, that they should describe
> > as generally as possible what they provide and what they are. If we
> > think of the future, where we might want to use such local atomic types
> > for other purposes than tracing, I think we will end up regretting such
> > specific naming scheme.
>
> I can be convinced, but I'll need more than speculation. Assuming
> local_long_atomic_t, can you produce a patch which uses it somewhere else?
>

I had this patch applying over Christoph Lameter's vm tree last
February. It did accelerate the slub fastpath allocator by using
cmpxchg_local rather than disabling interrupts. cmpxchg_local is not
using the local_t type, but behaves similarly to local_cmpxchg.

http://lkml.org/lkml/2008/2/28/568
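
The core of that idea looks roughly like this (a sketch from memory, not
the patch itself; the struct below is a simplified stand-in for slub's
kmem_cache_cpu of that era):

struct kmem_cache_cpu_sketch {
        void **freelist;        /* first free object, or NULL */
        unsigned int offset;    /* free-pointer offset, in words */
};

static void *slab_alloc_fast(struct kmem_cache_cpu_sketch *c)
{
        void **object;

        /* Retry loop on the per-cpu freelist: cmpxchg_local is atomic wrt
         * local interrupts, so we never have to disable them. */
        do {
                object = c->freelist;
                if (!object)
                        return NULL;            /* take the slow path */
        } while (cmpxchg_local(&c->freelist, object,
                               object[c->offset]) != object);

        return object;
}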


> > local_atomic_long_t is a _new_ primitive, which cannot be
> > implemented by a trivalue and differs from atomic_long_t, on more
> > architectures than x86. On mips and powerpc, at least, it can be
> > implemented as an atomic operation without the memory barriers, which
> > improves performances a lot.
>
> OK, you lost me here. I don't see a memory barrier in the powerpc atomic
> ops. Is there an implied one I missed?
>

Look for LWSYNC_ON_SMP and ISYNC_ON_SMP in
arch/powerpc/include/asm/atomic.h

They map to the lwsync and isync instructions, which are more or less
memory-op and instruction-execution-order barriers. They both become
unneeded when modifying per-cpu data from a single CPU.
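
A barrier-free ll/sc variant on ppc64 could look roughly like this
(simplified sketch of the idea, not the actual arch code):

static inline long local_add_return_sketch(long a, local_t *l)
{
        long t;

        /* Same ll/sc retry loop as atomic_long_add_return(), just without
         * the lwsync/isync barriers that only matter across CPUs. */
        __asm__ __volatile__(
"1:     ldarx   %0,0,%2\n"
"       add     %0,%1,%0\n"
"       stdcx.  %0,0,%2\n"
"       bne-    1b"
        : "=&r" (t)
        : "r" (a), "r" (&(l->a.counter))
        : "cc", "memory");

        return t;
}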

> > I think the following scheme would be pretty simple and yet not tied to
> > any specific user :
> >
> > local_long_t
> > - Fast per-cpu counter, not necessarily atomic.
> > Implements long trivalues, or uses local_atomic_long_t.
> > local_atomic_long_t
> > - Fast per-cpu atomic counter.
> > Implements per-cpu atomic counters or uses atomic_long_t.
>
> From the point of view of someone trying to decide what to use, the real
> difference is: use local_long_t unless you need the atomic-style operators
> which are only available on local_atomic_long_t, or NMI-safe behaviour.
> Is this correct?
>

Yes.

> > We could do the same with "int" type for :
> > local_t
> > local_atomic_t
> > atomic_t
> >
> > If we need smaller counters.
>
> Erk... no, renaming one to two is bad enough. Changing the semantics of
> one and introducing three more is horrible.
>
> If we're going to rename, I'd choose local_counter_t and local_atomic_t
> (both long: I don't think there's a real penalty is there?).
>

The penalty is only space and wasted cache-lines whenever the data fits
in smaller data types, but I think we can start with a single data type
(long) and add more if needed. I agree with you on renaming, it's bad.
People trying to forward port their code will have a hard time managing
the type behavior change, especially if the compiler does not complain.
local_counter_t and local_atomic_t seem good to me, except for the fact
that atomic_t maps to "int" and local_atomic_t would map to "long",
which might be unexpected. If possible, I'd try to follow the current
semantics of "atomic_t" for int and "atomic_long_t" for long, because I
think those types should offer a similar interface. I know that
local_counter_long_t and local_atomic_long_t are painful to write, but
that would follow the current atomic_t vs atomic_long_t semantics. Hm?

Mathieu

> Thanks,
> Rusty.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-24 11:43:19

by Rusty Russell

Subject: Re: local_add_return

On Tuesday 23 December 2008 05:13:28 Mathieu Desnoyers wrote:
> > I can be convinced, but I'll need more than speculation. Assuming
> > local_long_atomic_t, can you produce a patch which uses it somewhere else?
>
> I had this patch applying over Christoph Lameter's vm tree last
> February. It did accelerate the slub fastpath allocator by using
> cmpxchg_local rather than disabling interrupts. cmpxchg_local is not
> using the local_t type, but behaves similarly to local_cmpxchg.
>
> http://lkml.org/lkml/2008/2/28/568

OK, I'll buy that. So we split local_t into a counter and an atomic type.

> I know that
> local_counter_long_t and local_atomic_long_t are painful to write, but
> that would follow the current atomic_t vs atomic_long_t semantics. Hm ?

OK, I've looked at how they're used, to try to figure out whether long
is the right thing. Counters generally want to be long, but I was in doubt
about atomics; yet grep shows that atomic_long_t is quite popular. Then
I hit struct nfs_iostats which would want a u64 and a long. I don't think
we want local_counter_u64 etc.

Just thinking out loud, perhaps a new *type* is the wrong direction? How
about a set of macros which take a fundamental type, such as:

DECLARE_LOCAL_COUNTER(type, name);
local_counter_inc(type, addr);
...
DECLARE_LOCAL_ATOMIC(type, name);
local_atomic_add_return(type, addr);

This allows pointers, u32, u64, long, etc. If a 32-bit arch can't do 64-bit
local_counter_inc easily, at least the hairy 64-bit code can be eliminated at
compile time.
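
The generic fallback versions could be as dumb as this (sketch only; the
macro names just mirror the proposal above):

/* Correctness via irq disable, works for any scalar type. */
#define DECLARE_LOCAL_COUNTER(type, name)       type name

#define local_counter_inc(type, addr)                           \
do {                                                            \
        unsigned long __flags;                                  \
        local_irq_save(__flags);                                \
        (*(type *)(addr))++;                                    \
        local_irq_restore(__flags);                             \
} while (0)

#define local_atomic_add_return(type, addr, val)                \
({                                                              \
        unsigned long __flags;                                  \
        type __ret;                                             \
        local_irq_save(__flags);                                \
        __ret = (*(type *)(addr) += (val));                     \
        local_irq_restore(__flags);                             \
        __ret;                                                  \
})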

Or maybe that's overdesign?
Rusty.

2008-12-24 18:53:30

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Tuesday 23 December 2008 05:13:28 Mathieu Desnoyers wrote:
> > > I can be convinced, but I'll need more than speculation. Assuming
> > > local_long_atomic_t, can you produce a patch which uses it somewhere else?
> >
> > I had this patch applying over Christoph Lameter's vm tree last
> > February. It did accelerate the slub fastpath allocator by using
> > cmpxchg_local rather than disabling interrupts. cmpxchg_local is not
> > using the local_t type, but behaves similarly to local_cmpxchg.
> >
> > http://lkml.org/lkml/2008/2/28/568
>
> OK, I'll buy that. So we split local_t into a counter and an atomic type.
>
> > I know that
> > local_counter_long_t and local_atomic_long_t are painful to write, but
> > that would follow the current atomic_t vs atomic_long_t semantics. Hm ?
>
> OK, I've looked at how they're used, to try to figure out whether long
> is the right thing. Counters generally want to be long, but I was in doubt
> about atomics; yet grep shows that atomic_long_t is quite popular. Then
> I hit struct nfs_iostats which would want a u64 and a long. I don't think
> we want local_counter_u64 etc.
>
> Just thinking out loud, perhaps a new *type* is the wrong direction? How
> about a set of macros which take a fundamental type, such as:
>
> DECLARE_LOCAL_COUNTER(type, name);
> local_counter_inc(type, addr);
> ...
> DECLARE_LOCAL_ATOMIC(type, name);
> local_atomic_add_return(type, addr);
>
> This allows pointers, u32, u64, long, etc. If a 32-bit arch can't do 64-bit
> local_counter_inc easily, at least the hairy 64-bit code can be eliminated at
> compile time.
>
> Or maybe that's overdesign?
> Rusty.

Yeah, I also thought of this, but I am not sure every architecture
provides primitives to modify u16 or u8 data atomically like x86 does.
But yes, I remember hearing that Christoph Lameter was interested in using
unsigned char or short atomic counters for the vm allocator in the past.
The rationale was mostly that he wanted to keep a counter in a very
small data type, expecting to "poll" the counter periodically (e.g.
every X counter increment) and sum the total somewhere else.

So I think it would be the right design in the end if we want to allow
wider use of such atomic primitives for counters w/o interrupts
disabled. And I would propose we use a BUILD_BUG_ON() when the
architecture does not support an atomic operation on a specific type.
We should also document which type sizes are supported portably and
which are architecture-specific.
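
Something along these lines, say (sketch only; which sizes pass the check
would of course be per-arch):

/* Refuse, at compile time, any type the arch cannot update atomically
 * with respect to interrupts. */
#define local_counter_inc(type, addr)                           \
do {                                                            \
        BUILD_BUG_ON(sizeof(type) != sizeof(int) &&             \
                     sizeof(type) != sizeof(long));             \
        (*(volatile type *)(addr))++;                           \
} while (0)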

Or, as you say, maybe it's overdesign? If we have to pick something
simple, just supporting "long" would be a good start.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-16 22:38:26

by Rusty Russell

Subject: Re: local_add_return

On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> Here ya go:

Very interesting. There's a little noise there (that first local_inc of 243
is wrong), but the picture is clear: trivalue is the best implementation for
sparc64.

Note: trivalue uses 3 values, so instead of hitting random values across 8MB
it's across 24MB, and despite the resulting cache damage it's 15% faster. The
cpu_local_inc test is a single value, so no cache effects: it shows trivalue
to be 3 to 3.5 times faster in the cache-hot case.

This sucks, because it really does mean that there's no one-size-fits-all
implementation of local_t. There's also no platform yet where atomic_long_t
is the right choice; and that's the default!

Any chance of an IA64 or s390 run? You can normalize if you like, since
it's only to compare the different approaches.

Cheers,
Rusty.

Benchmarks for local_t variants

(This patch also fixes the x86 cpu_local_* macros, which are obviously
unused).

I chose a large array (1M longs) for the inc/add/add_return tests so
the trivalue case would show some cache pressure.

The cpu_local_inc case is always cache-hot, so it's not comparable to
the others.

Time in ns per iteration (bracketed values are with CONFIG_PREEMPT=y):

             inc    add    add_return  cpu_local_inc  read
x86-32: 2.13 GHz Core Duo 2
atomic_long  118    118    115         17             17
irqsave/rest 77     78     77          23             16
trivalue     45     45     127         3(6)           21
local_t      36     36     36          1(5)           17

x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
atomic_long  55     60     -           6              19
irqsave/rest 54     54     -           11             19
trivalue     47     47     -           5              28
local_t      47     46     -           1              19

PPC64: 2.7 GHz PPC970MP [normalized]
atomic_long  18     18     20          3(4)           8
irqsave/rest 10(4)  5(4)   4           8(9)           10(9)
trivalue     9      9      2           1(3)           10
local_t      18     18     18          3(4)           8

Sparc64: UltraSPARC-IIIi
atomic_long  243    222    -           37             169
irqsave/rest 205    205    -           25             169
trivalue     193    193    -           11             221
local_t      221    221    -           37             169

Sparc64: Niagara-2
atomic_long  207    206    -           72             160
irqsave/rest 228    228    -           78             160
trivalue     172    172    -           20             222
local_t      206    207    -           73             160

Signed-off-by: Rusty Russell <[email protected]>
---
arch/x86/include/asm/local.h | 20 +--
init/main.c | 223 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 233 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -220,16 +220,16 @@ static inline long local_sub_return(long
preempt_enable(); \
}) \

-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
+#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
+#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
+#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
+#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
+#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
+#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))

-#define __cpu_local_inc(l) cpu_local_inc((l))
-#define __cpu_local_dec(l) cpu_local_dec((l))
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
+#define __cpu_local_inc(l) cpu_local_inc(l)
+#define __cpu_local_dec(l) cpu_local_dec(l)
+#define __cpu_local_add(i, l) cpu_local_add((i), l)
+#define __cpu_local_sub(i, l) cpu_local_sub((i), l)

#endif /* _ASM_X86_LOCAL_H */
diff --git a/init/main.c b/init/main.c
--- a/init/main.c
+++ b/init/main.c
@@ -534,6 +534,225 @@ void __init __weak thread_info_cache_ini
{
}

+/* There are three obvious ways to implement local_t on an arch which
+ * can't do single-instruction inc/dec etc.
+ * 1) atomic_long
+ * 2) irq_save/irq_restore
+ * 3) multiple counters.
+ *
+ * This does a very rough benchmark on each one.
+ */
+struct local1 {
+ atomic_long_t v;
+};
+struct local2 {
+ unsigned long v;
+};
+struct local3 {
+ unsigned long v[3];
+};
+
+/* Enough to put some pressure on the caches. */
+#define NUM_LOCAL_TEST (1024*1024)
+#define NUM_LOCAL_RUNS (NUM_LOCAL_TEST*32)
+/* This will make it jump around looking random */
+#define STRIDE 514001
+
+static void *test_local_variants_mem;
+
+static void init_test_local_variants(void)
+{
+ unsigned long size;
+ size = max(sizeof(struct local1),
+ max(sizeof(struct local2),
+ max(sizeof(struct local3), sizeof(local_t))))
+ * NUM_LOCAL_TEST;
+ /* Assume this works in early boot. */
+ test_local_variants_mem = alloc_bootmem_nopanic(size);
+
+ if (!test_local_variants_mem) {
+ printk("test_local_variants: failed to allocate %lu bytes\n",
+ size);
+ return;
+ }
+}
+
+static void print_result(const char *str,
+ struct timespec start, struct timespec end)
+{
+ s64 diff;
+
+ diff = ktime_to_ns(ktime_sub(timespec_to_ktime(end), timespec_to_ktime(start)));
+ printk("%s=%lli/%lli ",
+ str, diff, diff/NUM_LOCAL_RUNS);
+}
+
+static unsigned int warm_local_test_cache(const void *mem, size_t len)
+{
+ unsigned int i, total = 0;
+ for (i = 0; i < len; i++)
+ total += ((char *)mem)[i];
+ return total;
+}
+
+#define TEST_LOOP(expr) \
+ n = 0; \
+ getnstimeofday(&start); \
+ for (i = 0; i < NUM_LOCAL_RUNS; i++) { \
+ expr; \
+ n += STRIDE; \
+ n %= NUM_LOCAL_TEST; \
+ } \
+ getnstimeofday(&end);
+
+/* This doesn't test cache effects at all */
+#define NUM_PERCPU_VARS 16
+DEFINE_PER_CPU(struct local1[NUM_PERCPU_VARS], local1_test);
+DEFINE_PER_CPU(struct local2[NUM_PERCPU_VARS], local2_test);
+DEFINE_PER_CPU(struct local3[NUM_PERCPU_VARS], local3_test);
+DEFINE_PER_CPU(local_t[NUM_PERCPU_VARS], local4_test);
+
+static void test_local_variants(void)
+{
+ struct timespec start, end;
+ unsigned int i, n;
+ unsigned long total, warm_total = 0;
+ struct local1 *l1;
+ struct local2 *l2;
+ struct local3 *l3;
+ local_t *l4;
+
+ if (!test_local_variants_mem)
+ return;
+
+ printk("Running local_t variant benchmarks\n");
+ l1 = test_local_variants_mem;
+ l2 = test_local_variants_mem;
+ l3 = test_local_variants_mem;
+ l4 = test_local_variants_mem;
+
+ printk("atomic_long: ");
+ memset(l1, 0, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_inc(&l1[n].v));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_add(1234, &l1[n].v));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_inc(&__get_cpu_var(local1_test)[n%NUM_PERCPU_VARS].v));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += atomic_long_read(&l1[n].v));
+ print_result("local_read", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(total += atomic_long_add_return(7, &l1[n].v));
+ print_result("local_add_return", start, end);
+
+ printk("(total was %lu)\n", total);
+
+ printk("irqsave/restore: ");
+ memset(l2, 0, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l2[n].v++;
+ local_irq_restore(flags));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l2[n].v += 1234;
+ local_irq_restore(flags));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ __get_cpu_var(local2_test)[n%NUM_PERCPU_VARS].v++;
+ local_irq_restore(flags));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += l2[n].v);
+ print_result("local_read", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l2[n].v += 7;
+ total += l2[n].v;
+ local_irq_restore(flags));
+ print_result("local_add_return", start, end);
+ printk("(total was %lu)\n", total);
+
+ printk("trivalue: ");
+ memset(l3, 0, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ l3[n].v[idx]++);
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ l3[n].v[idx] += 1234);
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ get_cpu_var(local3_test)[n%NUM_PERCPU_VARS].v[idx]++;
+ put_cpu_var());
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2]);
+ print_result("local_read", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l3[n].v[0] += 7;
+ total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2];
+ local_irq_restore(flags));
+ print_result("local_add_return", start, end);
+
+ printk("(total was %lu)\n", total);
+
+ printk("local_t: ");
+ memset(l4, 0, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(local_inc(&l4[n]));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(local_add(1234, &l4[n]));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(cpu_local_inc(local4_test[n%NUM_PERCPU_VARS]));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += local_read(&l4[n]));
+ print_result("local_read", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(total += local_add_return(7, &l1[n].v));
+ print_result("local_add_return", start, end);
+ printk("(total was %lu, warm_total %lu)\n", total, warm_total);
+}
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -630,6 +849,8 @@ asmlinkage void __init start_kernel(void
*/
locking_selftest();

+ init_test_local_variants();
+
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
@@ -687,6 +908,8 @@ asmlinkage void __init start_kernel(void
acpi_early_init(); /* before LAPIC and SMP init */

ftrace_init();
+
+ test_local_variants();

/* Do the rest non-__init'ed, we're now alive */
rest_init();

2008-12-16 23:26:15

by Tony Luck

Subject: RE: local_add_return

> Any chance of an IA64 or s390 run? You can normalize if you like, since
> it's only to compare the different approaches.

-EDOESNOTCOMPILE

CC init/main.o
init/main.c: In function 'test_local_variants':
init/main.c:756: error: 'atomic_long_t' has no member named 'a'
make[1]: *** [init/main.o] Error 1

-Tony

2008-12-16 23:44:25

by Heiko Carstens

Subject: Re: local_add_return

On Wed, Dec 17, 2008 at 09:08:04AM +1030, Rusty Russell wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> > Here ya go:
>
> Very interesting. There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
>
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster. The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
>
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t. There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
>
> Any chance of an IA64 or s390 run? You can normalize if you like, since
> it's only to compare the different approaches.

atomic_long_t seems to be the right choice on s390. IRQ disable/enable is
expensive, but the compare and swap instruction is cheap. I just gave it
a quick shot, but please note that there were two hypervisors running below
my system (add_return is missing since I used your first patch):

atomic_long 19 18 - 3 17
irqsave/rest 57 58 - 39 22
trivalue 43 43 - 4 45
local_t 18 20 - 2 16

2008-12-17 00:00:28

by Eric Dumazet

Subject: Re: local_add_return

Rusty Russell wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
>> Here ya go:
>
> Very interesting. There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
>
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster. The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
>
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t. There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
>
> Any chance of an IA64 or s390 run? You can normalize if you like, since
> it's only to compare the different approaches.
>
> Cheers,
> Rusty.
>
> Benchmarks for local_t variants
>
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
>
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
>
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.

It would be good to differentiate the results depending on whether the data is already in cache or not...

>
> Time in ns per iteration (brackets is with CONFIG_PREEMPT=y):
>
> inc add add_return cpu_local_inc read
> x86-32: 2.13 Ghz Core Duo 2
> atomic_long 118 118 115 17 17

It's really strange that atomic_long performs so badly here.
LOCK + data not in cache -> really really bad...

> irqsave/rest 77 78 77 23 16
> trivalue 45 45 127 3(6) 21
> local_t 36 36 36 1(5) 17
>
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long 55 60 - 6 19
> irqsave/rest 54 54 - 11 19
> trivalue 47 47 - 5 28
> local_t 47 46 - 1 19
>

Running local_t variant benchmarks
atomic_long: local_inc=395001846/11 local_add=395000325/11 cpu_local_inc=362000295/10 local_read=49000040/1 local_add_return=396000322/11 (total was 1728053248)
irqsave/restore: local_inc=498000400/14 local_add=496000395/14 cpu_local_inc=486000384/14 local_read=68000054/2 local_add_return=502000394/14 (total was 1728053248)
trivalue: local_inc=1325001024/39 local_add=1324001226/39 cpu_local_inc=81000080/2 local_read=786000766/23 local_add_return=4193003781/124 (total was 1728053248)
local_t: local_inc=69000059/2 local_add=69000058/2 cpu_local_inc=42000035/1 local_read=50000043/1 local_add_return=90000076/2 (total was 1728053248, warm_total 62914562)


Intel(R) Xeon(R) CPU E5450 @ 3.00GHz

two quadcore cpus, x86-32 kernel

It seems the Core2 is really better than the Core Duo 2,
or its cache is big enough to hold the array of your test...

(at least for l1 & l2, their 4 MB working set fits in cache)

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
stepping : 6
cpu MHz : 3000.099
cache size : 6144 KB <<<< yes, that's big :) >>>>

If I double the size of the working set

#define NUM_LOCAL_TEST (2*1024*1024)

then I get quite different numbers:

Running local_t variant benchmarks
atomic_long: local_inc=6729007264/100 local_add=6727005943/100 cpu_local_inc=724000569/10 local_read=1030000784/15 local_add_return=6623004616/98 (total was 3456106496)
irqsave/restore: local_inc=4458002796/66 local_add=4459001998/66 cpu_local_inc=971000381/14 local_read=1060000389/15 local_add_return=4528001388/67 (total was 3456106496)
trivalue: local_inc=2871000855/42 local_add=2867000976/42 cpu_local_inc=162000052/2 local_read=1747000551/26 local_add_return=8829002352/131 (total was 3456106496)
local_t: local_inc=2210000492/32 local_add=2206000460/32 cpu_local_inc=84000017/1 local_read=1029000203/15 local_add_return=2216000415/33 (total was 3456106496, warm_total 125829124)

If I now reduce NUM_LOCAL_TEST to 256*1024 so that even the trivalue l3 fits in cache:

Running local_t variant benchmarks
atomic_long: local_inc=98984929/11 local_add=98984889/11 cpu_local_inc=89986248/10 local_read=11998165/1 local_add_return=99003292/11 (total was 2579496960)
irqsave/restore: local_inc=124000102/14 local_add=124000102/14 cpu_local_inc=121000100/14 local_read=17000013/2 local_add_return=126000103/15 (total was 2579496960)
trivalue: local_inc=21000017/2 local_add=20000016/2 cpu_local_inc=20000017/2 local_read=25000021/2 local_add_return=136000110/16 (total was 2579496960)
local_t: local_inc=17000014/2 local_add=17000015/2 cpu_local_inc=11000009/1 local_read=12000010/1 local_add_return=23000019/2 (total was 2579496960, warm_total 15728642)



About trivalues, their use in percpu_counter local storage (one trivalue for each cpu)
would make the accuracy a little bit more lazy...

2008-12-17 00:02:19

by Mathieu Desnoyers

Subject: Re: local_add_return

* Rusty Russell ([email protected]) wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> > Here ya go:
>
> Very interesting. There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
>
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster. The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
>
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t. There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
>

This problem could be fixed by introducing a local_count_t, which maps
to either local_t or to a trivalue, along with read accessors which sum
the trivalues.

I think we have two different use-cases here :

- local_t is useful as-is for things such as a tracer, which need to
modify an element of data atomically wrt local interrupts. The
atomic_long_t, in this case, is the correct fallback.
- local_count_t could be used for fast counters. It could be a
requirement to only use it from thread/softirq/irq context (never NMI
or exception) so we are sure the trivalue approach will not lead to
corruption. local_count_t could use either local_t or trivalue
depending on which is the fastest on a given architecture.

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-12-17 11:23:22

by Rusty Russell

Subject: Re: local_add_return

On Wednesday 17 December 2008 02:55:32 Mathieu Desnoyers wrote:
> Specifically on this comment :
>
> +/* There are three obvious ways to implement local_t on an arch which
> + * can't do single-instruction inc/dec etc.
> + * 1) atomic_long
> + * 2) irq_save/irq_restore
> + * 3) multiple counters.
>
> Option 3) is not workable for tracers, because it's not safe against
> some exceptions (e.g. some hardware errors) nor NMIs.

Hmm, nor is option 2. Now I understand where you were coming from and
I sympathize with your dilemma, but I don't think that non-x86 archs should
pay for it where local_t is used as intended, so I don't think local_t should
be (have been) hijacked for this. nmi_safe_t?

> Also, local_t
> operations must have preemption disabled before playing on per-cpu data,
> which I don't see in your test. This has to be taken into account in the
> runtime cost.

atomic_long_t implementations don't have to. local_irq_save does it
as a side effect. You're right about multiple counters, though. We can
either do it conditionally or unconditionally, but I think unconditional
makes sense (CONFIG_PREEMPT=y seems to be less popular than it was).

> the CPU_OPS work done by Christoph Lameter which use
> segments to address the per-cpu data, which effectively removes the need
> for disabling preemption around local_t operations because the CPU ID
> becomes encoded in a cpu register.

Well, we did this for 32-bit x86 some time ago, so that works today.
64-bit was delayed because of the stack protection code, which needs
a fixed offset for the canary so needs zero-based percpu, but IIRC
that's orthogonal to the CPU_OPS work itself.

Here's the timing diff when trivalue is fixed here (preempt on)
Before:
local_inc=45 local_add=45 cpu_local_inc=6 local_read=21 local_add_return=127
After:
local_inc=47 local_add=47 cpu_local_inc=6 local_read=41 local_add_return=127

Since sparc64 has CONFIG_PREEMPT=n in its defconfig, I think it is still
ahead with trivalue.

Thanks,
Rusty.