2008-06-02 03:34:48

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Saturday 31 May 2008 04:00:40 Christoph Lameter wrote:
> On Fri, 30 May 2008, Rusty Russell wrote:
> > > No it's not! In order to increment a per cpu value you need to calculate
> > > the per cpu pointer address in the current per cpu segment.
> >
> > Christoph, you just missed it, that's all. Look at cpu_local_read et al
> > in include/asm-i386/local.h (ie. before the x86 mergers chose the lowest
> > common denominator one).
>
> There is no doubt that local_t does perform an atomic vs. interrupt inc
> for example. But it's not usable, because you need to determine the address
> of the local_t belonging to the current processor first.

Christoph!

STOP typing, and START reading.

cpu_local_inc() does all this: it takes the name of a local_t var, and is
expected to increment this cpu's version of that. You ripped this out and
called it CPU_INC().

Do not make me explain it a third time.

> As soon as you
> have loaded a processor specific address you can no longer be preempted
> because that may change the processor and then the wrong address may be
> incremented (and then we have a race again since now we are incrementing
> counters belonging to other processors). So local_t at minimum requires
> disabling preempt.

Think for a moment. What are the chances that I didn't understand this when I
wrote the multiple implementations of local_t?

You are wasting my time explaining the obvious, and wasting your own.

> Believe me I have tried to use local_t repeatedly for vm statistics etc.
> It always fails on that issue.

Frankly, I am finding it increasingly easy to believe that you failed. But
you are blaming the wrong thing.

There are three implementations of local_t which are obvious. The best is for
architectures which can locate and increment a per-cpu var in one instruction
(eg. x86). Otherwise, using atomic_t/atomic64_t for local_t provides a
general solution. The other general solution would involve
local_irq_disable()/increment/local_irq_enable().

My (fading) hope is that this idiocy is an aberration,
Rusty.


2008-06-04 18:18:32

by Mike Travis

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations


> cpu_local_inc() does all this: it takes the name of a local_t var, and is
> expected to increment this cpu's version of that. You ripped this out and
> called it CPU_INC().

Hi,

I'm attempting to test both approaches to compare the object generated in order
to understand the issues involved here. Here's my code:

void test_cpu_inc(int *s)
{
	__CPU_INC(s);
}

void test_local_inc(local_t *t)
{
	__local_inc(THIS_CPU(t));
}

void test_cpu_local_inc(local_t *t)
{
	__cpu_local_inc(t);
}

But I don't know how I can use cpu_local_inc because the pointer to the object
is not &__get_cpu_var(l):

#define __cpu_local_inc(l) cpu_local_inc((l))
#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var((l))))

At the minimum, we would need a new local_t op to get the correct CPU_ALLOC'd
pointer value for the increment. These new local_t ops for CPU_ALLOC'd variables
could use CPU_XXX primitives to implement them, or just a base val_to_ptr primitive
to replace __get_cpu_var().

I did notice this in local.h:

* X86_64: This could be done better if we moved the per cpu data directly
* after GS.

... which it now is, so true per_cpu variables could be optimized better as well.

Also, the above cpu_local_wrap(...) adds:

#define cpu_local_wrap(l)		\
({					\
	preempt_disable();		\
	(l);				\
	preempt_enable();		\
})

... and there isn't a non-preemption version that I can find.

Here are the objects.

0000000000000000 <test_cpu_inc>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 08 sub $0x8,%rsp
8: 48 89 7d f8 mov %rdi,0xfffffffffffffff8(%rbp)
c: 65 48 ff 45 f8 incq %gs:0xfffffffffffffff8(%rbp)
11: c9 leaveq
12: c3 retq

0000000000000013 <test_local_inc>:
13: 55 push %rbp
14: 65 48 8b 05 00 00 00 mov %gs:0(%rip),%rax # 1c <test_local_inc+0x9>
1b: 00
1c: 48 89 e5 mov %rsp,%rbp
1f: 48 ff 04 07 incq (%rdi,%rax,1)
23: c9 leaveq
24: c3 retq


With a new local_t op then test_local_inc probably could be optimized to be
the same instructions as test_cpu_inc.

One other distinction is that CPU_INC increments an arbitrarily sized variable
while local_inc requires a local_t variable. This may not make it usable
in all cases.

Thanks,
Mike

2008-06-05 23:59:29

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thursday 05 June 2008 04:18:19 Mike Travis wrote:
> > cpu_local_inc() does all this: it takes the name of a local_t var, and is
> > expected to increment this cpu's version of that. You ripped this out
> > and called it CPU_INC().
>
> Hi,
>
> I'm attempting to test both approaches to compare the object generated in
> order to understand the issues involved here. Here's my code:
>
> void test_cpu_inc(int *s)
> {
> 	__CPU_INC(s);
> }
>
> void test_local_inc(local_t *t)
> {
> 	__local_inc(THIS_CPU(t));
> }
>
> void test_cpu_local_inc(local_t *t)
> {
> 	__cpu_local_inc(t);
> }
>
> But I don't know how I can use cpu_local_inc because the pointer to the
> object is not &__get_cpu_var(l):

Yes. Because the only true per-cpu vars are the static ones, cpu_local_inc()
only works on identifiers, not arbitrary pointers. Once this is fixed, we
should be enhancing the infrastructure to allow that (AFAICT it's not too
hard, but we should add a __percpu marker for sparse).

> At the minimum, we would need a new local_t op to get the correct
> CPU_ALLOC'd pointer value for the increment. These new local_t ops for
> CPU_ALLOC'd variables could use CPU_XXX primitives to implement them, or
> just a base val_to_ptr primitive to replace __get_cpu_var().

I think the latter: __get_cpu_ptr() perhaps?

> I did notice this in local.h:
>
> * X86_64: This could be done better if we moved the per cpu data directly
> * after GS.
>
> ... which it now is, so true per_cpu variables could be optimized better as
> well.

Indeed.

>
> Also, the above cpu_local_wrap(...) adds:
>
> #define cpu_local_wrap(l)		\
> ({					\
> 	preempt_disable();		\
> 	(l);				\
> 	preempt_enable();		\
> })
>
> ... and there isn't a non-preemption version that I can find.

Yes, this should be fixed. I thought i386 had optimized versions pre-merge,
but I was wrong (%gs for per-cpu came later, and no one cleaned up these naive
versions). Did you want me to write them?

I actually think that using local_t == atomic_t is better than
preempt_disable/enable for most archs which can't do atomic deref-and-inc.

> One other distinction is CPU_INC increments an arbitrary sized variable
> while local_inc requires a local_t variable. This may not make it usable
> in all cases.

You might be right, but note that local_t is 64 bit on 64-bit platforms. And
speculation of possible use cases isn't a good reason to rip out working
infrastructure :)

Cheers,
Rusty.


>
> Thanks,
> Mike

2008-06-09 19:00:47

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Fri, 6 Jun 2008, Rusty Russell wrote:

> > Also, the above cpu_local_wrap(...) adds:
> >
> > #define cpu_local_wrap(l)		\
> > ({					\
> > 	preempt_disable();		\
> > 	(l);				\
> > 	preempt_enable();		\
> > })
> >
> > ... and there isn't a non-preemption version that I can find.
>
> Yes, this should be fixed. I thought i386 had optimized versions pre-merge,
> but I was wrong (%gs for per-cpu came later, and no one cleaned up these naive
> versions). Did you want me to write them?

How can that be fixed? You have no atomic instruction that calculates the
per cpu address in one go. And as long as that is the case you need to
disable preempt. Otherwise you may increment the per cpu variable of
another processor because the process was rescheduled after the address
was calculated but before the increment was done.

> > One other distinction is CPU_INC increments an arbitrary sized variable
> > while local_inc requires a local_t variable. This may not make it usable
> > in all cases.
>
> You might be right, but note that local_t is 64 bit on 64-bit platforms. And
> speculation of possible use cases isn't a good reason to rip out working
> infrastructure :)

It's fundamentally broken because of the preemption issue. This is
also why local_t is rarely used.

2008-06-09 23:09:55

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Wed, 4 Jun 2008, Mike Travis wrote:

> 0000000000000013 <test_local_inc>:
> 13: 55 push %rbp
> 14: 65 48 8b 05 00 00 00 mov %gs:0(%rip),%rax # 1c <test_local_inc+0x9>
> 1b: 00
> 1c: 48 89 e5 mov %rsp,%rbp
> 1f: 48 ff 04 07 incq (%rdi,%rax,1)
> 23: c9 leaveq
> 24: c3 retq

Note also that the address calculation occurs before the incq. That is why
disabling preemption is required; otherwise the processor may change
between the determination of the per cpu area address and the increment.

The local_t operations could be modified to avoid the preemption issues
with the zero based patches applied. Then there would still be the
inflexibility of not being able to increment an arbitrary variable.

I think it is also bad to treat a per cpu variable like an atomic. It's not
truly atomic, nor are strictly atomic accesses used. It is fine to use
regular operations on the per cpu variable provided one has either
disabled preemption or interrupts. The per cpu atomic wrt interrupt ops
are only useful when preemption and/or interrupts are off.

2008-06-09 23:29:32

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tuesday 10 June 2008 05:00:36 Christoph Lameter wrote:
> On Fri, 6 Jun 2008, Rusty Russell wrote:
> > > Also, the above cpu_local_wrap(...) adds:
> > >
> > > #define cpu_local_wrap(l)		\
> > > ({					\
> > > 	preempt_disable();		\
> > > 	(l);				\
> > > 	preempt_enable();		\
> > > })
> > >
> > > ... and there isn't a non-preemption version that I can find.
> >
> > Yes, this should be fixed. I thought i386 had optimized versions
> > pre-merge, but I was wrong (%gs for per-cpu came later, and no one cleaned
> > up these naive versions). Did you want me to write them?
>
> How can that be fixed? You have no atomic instruction that calculates the
> per cpu address in one go.

Huh? "incl %fs:varname" does exactly this.

> And as long as that is the case you need to
> disable preempt. Otherwise you may increment the per cpu variable of
> another processor because the process was rescheduled after the address
> was calculated but before the increment was done.

But of course, that is not a problem. You make local_t an atomic_t, and then
it doesn't matter which CPU you incremented.

By definition, if the caller cared, they would have had preemption disabled.

Hope that clarifies,
Rusty.

2008-06-09 23:54:20

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > > Yes, this should be fixed. I thought i386 had optimized versions
> > > pre-merge, but I was wrong (%gs for per-cpu came later, and no one cleaned
> > > up these naive versions). Did you want me to write them?
> >
> > How can that be fixed? You have no atomic instruction that calculates the
> > per cpu address in one go.
>
> Huh? "incl %fs:varname" does exactly this.

Right that is what the cpu alloc patches do. So you could implement
cpu_local_inc on top of some of the cpu alloc patches.

> > And as long as that is the case you need to
> > disable preempt. Otherwise you may increment the per cpu variable of
> > another processor because the process was rescheduled after the address
> > was calculated but before the increment was done.
>
> But of course, that is not a problem. You make local_t an atomic_t, and then
> it doesn't matter which CPU you incremented.

But then the whole point of local_t is gone. Why not use atomic_t in the
first place?

> By definition if the caller cared, they would have had preemption disabled.

There are numerous instances where the caller does not care about
preemption. It's just important that a per cpu counter is incremented in
the least intrusive way. See e.g. the VM event counters.

2008-06-10 02:56:55

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tuesday 10 June 2008 09:54:09 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > > Yes, this should be fixed. I thought i386 had optimized versions
> > > > pre-merge, but I was wrong (%gs for per-cpu came later, and no one
> > > > cleaned up these naive versions). Did you want me to write them?
> > >
> > > How can that be fixed? You have no atomic instruction that calculates
> > > the per cpu address in one go.
> >
> > Huh? "incl %fs:varname" does exactly this.
>
> Right that is what the cpu alloc patches do. So you could implement
> cpu_local_inc on top of some of the cpu alloc patches.

Or you could just implement it today as a standalone patch.

> > > And as long as that is the case you need to
> > > disable preempt. Otherwise you may increment the per cpu variable of
> > > another processor because the process was rescheduled after the address
> > > was calculated but before the increment was done.
> >
> > But of course, that is not a problem. You make local_t an atomic_t, and
> > then it doesn't matter which CPU you incremented.
>
> But then the whole point of local_t is gone. Why not use atomic_t in the
> first place?

Because some archs can do better.

> > By definition if the caller cared, they would have had preemption
> > disabled.
>
> There are numerous instances where the caller does not care about
> preemption. It's just important that a per cpu counter is incremented in
> the least intrusive way. See e.g. the VM event counters.

Yes, and that's exactly the point. The VM event counters are exactly a case
where you should have used cpu_local_inc.

Rusty.

2008-06-10 03:18:35

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > Right that is what the cpu alloc patches do. So you could implement
> > cpu_local_inc on top of some of the cpu alloc patches.
>
> Or you could just implement it today as a standalone patch.

You need at least the zero basing to enable the use of the segment
register on x86_64.

> > But then the whole point of local_t is gone. Why not use atomic_t in the
> > first place?
>
> Because some archs can do better.

The argument does not make any sense. First you want to use atomic_t then
not?

> > > By definition if the caller cared, they would have had preemption
> > > disabled.
> >
> > There are numerous instances where the caller does not care about
> > preemption. It's just important that a per cpu counter is incremented in
> > the least intrusive way. See e.g. the VM event counters.
>
> Yes, and that's exactly the point. The VM event counters are exactly a case
> where you should have used cpu_local_inc.

I tried it and it did not give any benefit; it first failed due to bugs
because local_t did not disable preemption... This led to Andi fixing
local_t.

But with the preempt disabling I could not discern what the benefit
would be.

CPU_INC does not require disabling of preempt and the cpu alloc patches
shorten the code sequence to increment a VM counter significantly.

Here is the header from the patch. How would cpu_local_inc be able to do
better unless you adopt this patchset and add a shim layer?

Subject: VM statistics: Use CPU ops

The use of CPU ops here avoids the offset calculations that we used to
have to do with per cpu operations. The result of this patch is that event
counters are coded with a single instruction the following way:

incq %gs:offset(%rip)

Without these patches this was:

mov %gs:0x8,%rdx
mov %eax,0x38(%rsp)
mov xxx(%rip),%eax
mov %eax,0x48(%rsp)
mov varoffset,%rax
incq 0x110(%rax,%rdx,1)

2008-06-10 17:42:35

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Mon, 2 Jun 2008, Rusty Russell wrote:

> > Believe me I have tried to use local_t repeatedly for vm statistics etc.
> > It always fails on that issue.
>
> Frankly, I am finding it increasingly easy to believe that you failed. But
> you are blaming the wrong thing.
>
> There are three implementations of local_t which are obvious. The best is for
> architectures which can locate and increment a per-cpu var in one instruction
> (eg. x86). Otherwise, using atomic_t/atomic64_t for local_t provides a
> general solution. The other general solution would involve
> local_irq_disable()/increment/local_irq_enable().
>
> My (fading) hope is that this idiocy is an aberration,

1. The x86 implementation does not exist because the segment register has
so far not been available on x86_64, so you could not implement that solution.
You need the zero basing; then you can use per_xxx_add in cpu_inc.

2. The general solution created overhead that is often not needed. If we
had done vm event counters with local_t then we would have had
atomic overhead for each increment on e.g. IA64. That was not
acceptable. cpu_alloc never falls back to atomic operations.

3. local_t is based on the atomic logic. But percpu handling is
fundamentally different in that accesses without the special macros
are okay provided you are in a non preemptible or irq context!
A local_t declaration makes such accesses impossible.

4. The modeling of local_t on atomic_t limits it to 32bit! There is no
way to use this with pointers or 64 bit entities. Adding that would
duplicate the API for each type added.

2008-06-11 00:04:21

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tuesday 10 June 2008 13:18:25 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > Right that is what the cpu alloc patches do. So you could implement
> > > cpu_local_inc on top of some of the cpu alloc patches.
> >
> > Or you could just implement it today as a standalone patch.
>
> You need at least the zero basing to enable the use of the segment
> register on x86_64.

Indeed. Works for i386 as is, but 64 bit will need that patch.

> > > But then the whole point of local_t is gone. Why not use atomic_t in
> > > the first place?
> >
> > Because some archs can do better.
>
> The argument does not make any sense. First you want to use atomic_t then
> not?

You're being obtuse. See previous mail about the three possible
implementations of local_t, and the comment in asm-generic/local.h.

The paths forward are clear:
1) Improve x86 local_t (mostly orthogonal to the others, but useful).
2) Implement extensible per-cpu areas.
3) Generalize per-cpu accessors.
4) Extend or replace the module.c per-cpu allocator to alloc from the other
areas.
5) Convert alloc_percpu et al. to use the new code.

Hope that clarifies,
Rusty.

2008-06-11 00:16:01

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Wed, 11 Jun 2008, Rusty Russell wrote:

> You're being obtuse. See previous mail about the three possible
> implementations of local_t, and the comment in asm-generic/local.h.

OK. I hope I responded in the other email in a more intelligent
fashion.

> The paths forward are clear:
> 1) Improve x86 local_t (mostly orthogonal to the others, but useful).

Not sure about that. It's rarely used and the more general cpu alloc stuff
can be used in lots of places, as evidenced by the rest of the patchset. But
let's leave it if it's important for some reason.

> 2) Implement extensible per-cpu areas.
> 3) Generalize per-cpu accessors.
> 4) Extend or replace the module.c per-cpu allocator to alloc from the other
> areas.
> 5) Convert alloc_percpu et al. to use the new code.

Yes thanks. We are mostly on the same wavelength.

2008-06-11 11:11:19

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Wednesday 11 June 2008 03:42:15 Christoph Lameter wrote:
> 1. The x86 implementation does not exist because the segment register has
> so far not been available on x86_64. So you could not do the solution.
> You need the zero basing. Then you can use per_xxx_add in cpu_inc.

Yes: for 64 bit x86, getting rid of the PDA or zero-basing is required.

> 2. The general solution created overhead that is often not needed. If we
> would have done vm event counters with local_t then we would have
> atomic overhead for each increment on f.e. IA64. That was not
> acceptable. cpu_alloc never falls back to atomic operations.

You can implement it either way. I've said that three times now. The current
generic one uses atomics, but preempt disable/enable is possible.

> 3. local_t is based on the atomic logic. But percpu handling is
> fundamentally different in that accesses without the special macros
> are okay provided you are in a non preemptible or irq context!
> A local_t declaration makes such accesses impossible.

Again, untrue. The interface is already there. So feel free to implement
__cpu_local_inc et al in terms of preempt enable and disable so it doesn't
need to use atomics.

> 4. The modeling of local_t on atomic_t limits it to 32bit!

Again wrong. And adding an exclamation mark doesn't make it true.

Rusty.

2008-06-11 23:39:20

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Wed, 11 Jun 2008, Rusty Russell wrote:

> > 4. The modeling of local_t on atomic_t limits it to 32bit!
>
> Again wrong. And adding an exclamation mark doesn't make it true.

Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What about pointers?

2008-06-12 00:58:21

by Nick Piggin

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> >
> > Again wrong. And adding an exclamation mark doesn't make it true.
>
> Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What about
> pointers?

sizeof(long) == sizeof(void *) in Linux, right?

If you were to support just a single data type, long would probably
be the most useful. Still, it might be more consistent to support
int and long, same as atomic.

2008-06-12 02:44:33

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > >
> > > Again wrong. And adding an exclamation mark doesn't make it true.
> >
> > Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What
> > about pointers?
>
> sizeof(long) == sizeof(void *) in Linux, right?
>
> If you were to support just a single data type, long would probably
> be the most useful. Still, it might be more consistent to support
> int and long, same as atomic.

Sure, but in practice these tend to be simple counters: that could well change
when dynamic percpu allocs become first class citizens, but let's not put the
cart before the horse...

Per-cpu seems to be particularly prone to over-engineering: see commit
7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. Grepping
here reveals that this infrastructure is still not used.

Cheers,
Rusty.

2008-06-12 03:40:46

by Nick Piggin

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > >
> > > > Again wrong. And adding an exclamation mark doesn't make it true.
> > >
> > > Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What
> > > about pointers?
> >
> > sizeof(long) == sizeof(void *) in Linux, right?
> >
> > If you were to support just a single data type, long would probably
> > be the most useful. Still, it might be more consistent to support
> > int and long, same as atomic.
>
> Sure, but in practice these tend to be simple counters: that could well
> change when dynamic percpu allocs become first class citizens, but let's
> not put the cart before the horse...

Right, I was just responding to Christoph's puzzling question.


> Per-cpu seems to be particularly prone to over-engineering: see commit
> 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.
> Grepping here reveals that this infrastructure is still not used.

Hmm. Something like that needs the question asked "who uses this?"
before it is merged I guess. If it were a trivial patch maybe not,
but something like this that sits untested for so long is almost
broken by definition ;)

2008-06-12 09:38:19

by Martin Peschke

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> > On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > > >
> > > > > Again wrong. And adding an exclamation mark doesn't make it true.
> > > >
> > > > Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What
> > > > about pointers?
> > >
> > > sizeof(long) == sizeof(void *) in Linux, right?
> > >
> > > If you were to support just a single data type, long would probably
> > > be the most useful. Still, it might be more consistent to support
> > > int and long, same as atomic.
> >
> > Sure, but in practice these tend to be simple counters: that could well
> > change when dynamic percpu allocs become first class citizens, but let's
> > not put the cart before the horse...
>
> Right, I was just responding to Christoph's puzzling question.
>
>
> > Per-cpu seems to be particularly prone to over-engineering: see commit
> > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.
> > Grepping here reveals that this infrastructure is still not used.
>
> Hmm. Something like that needs the question asked "who uses this?"
> before it is merged I guess. If it were a trivial patch maybe not,
> but something like this that sits untested for so long is almost
> broken by definition ;)

Some code of mine which didn't make it beyond -mm used this small
per-cpu extension. So the commit you refer to was tested.

2008-06-12 11:21:30

by Nick Piggin

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Thursday 12 June 2008 19:37, Martin Peschke wrote:
> On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> > On Thursday 12 June 2008 12:44, Rusty Russell wrote:

> > > Per-cpu seems to be particularly prone to over-engineering: see commit
> > > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.
> > > Grepping here reveals that this infrastructure is still not used.
> >
> > Hmm. Something like that needs the question asked "who uses this?"
> > before it is merged I guess. If it were a trivial patch maybe not,
> > but something like this that sits untested for so long is almost
> > broken by definition ;)
>
> Some code of mine which didn't make it beyond -mm used this small
> per-cpu extension. So the commit you refer to was tested.

Right, but it can easily rot after initial testing if it isn't
continually used.

Maybe this isn't the best example because maybe it still works
fine. But in general, unused, non-trivial code isn't good just
to leave around "just in case" IMO.

2008-06-12 17:20:11

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

Populating the per cpu areas on demand is a good thing, especially for
configurations with a large number of processors. If we really go to
support 4k processors by default then we need to allocate the smallest
amount of per cpu structures necessary. Maybe ACPI or so can tell us how
many processors are possible and we only allocate those. But it would be
best if the percpu structures were only allocated for actually active
processors.

2008-06-13 00:38:38

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Friday 13 June 2008 03:19:51 Christoph Lameter wrote:
> Populating the per cpu areas on demand is a good thing, especially for
> configurations with a large number of processors. If we really go to
> support 4k processors by default then we need to allocate the smallest
> amount of per cpu structures necessary. Maybe ACPI or so can tell us how
> many processors are possible and we only allocate those. But it would be
> best if the percpu structures were only allocated for actually active
> processors.

cpu_possible_map should definitely be minimal, but your point is well made:
dynamic percpu could actually cut memory allocation. If we go for a hybrid
scheme where static percpu is always allocated from the initial chunk,
however, we still need the current pessimistic overallocation.

Mike's a clever guy, I'm sure he'll think of something :)
Rusty.

2008-06-13 02:27:22

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Fri, 13 Jun 2008, Rusty Russell wrote:

> cpu_possible_map should definitely be minimal, but your point is well made:
> dynamic percpu could actually cut memory allocation. If we go for a hybrid
> scheme where static percpu is always allocated from the initial chunk,
> however, we still need the current pessimistic overallocation.

The initial chunk would mean that the percpu areas all come from the same
NUMA node. We really need to allocate from the node that is nearest to a
processor (not all processors have processor local memory!).

It would be good to standardize the way that percpu areas are allocated.
We have various ways of allocation now in various arches.
init/main.c:setup_per_cpu_areas() needs to be generalized:

1. Allocate the per cpu areas in a NUMA aware fashion.

2. Have a function for instantiating a single per cpu area that
can be used during cpu hotplug.

3. Some hooks for arches to override particular behavior as needed.
F.e. IA64 allocates percpu structures in a special way. x86_64
needs to do some tricks for the pda etc etc.

> Mike's a clever guy, I'm sure he'll think of something :)

Hopefully. Otherwise he will ask me =-).

2008-06-15 10:33:28

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Friday 13 June 2008 12:27:07 Christoph Lameter wrote:
> On Fri, 13 Jun 2008, Rusty Russell wrote:
> > cpu_possible_map should definitely be minimal, but your point is well
> > made: dynamic percpu could actually cut memory allocation. If we go for
> > a hybrid scheme where static percpu is always allocated from the initial
> > chunk, however, we still need the current pessimistic overallocation.
>
> The initial chunk would mean that the percpu areas all come from the same
> NUMA node. We really need to allocate from the node that is nearest to a
> processor (not all processors have processor local memory!).

Yes, this is where it gets nasty. We shouldn't even allocate the initial
chunk in a non-NUMA aware way (I'm using the term chunk loosely, it's a chunk
per cpu of course).

> It would be good to standardize the way that percpu areas are allocated.
> We have various ways of allocation now in various arches.
> init/main.c:setup_per_cpu_areas() needs to be generalized:
>
> 1. Allocate the per cpu areas in a NUMA-aware fashion.

Definitely. We also need to reserve virtual address space to create more
areas with congruent mappings; that's the fun part.

Maybe a simpler non-NUMA variant too, but it's trivial if we want it.

> 2. Have a function for instantiating a single per cpu area that
> can be used during cpu hotplug.

Unfortunately this breaks the current percpu semantics: that if you iterate
over all possible cpus you can access percpu vars. This means you don't need
to have hotplug CPU notifiers for simple percpu counters. We could do this
with helpers, but AFAICT it's orthogonal to the other plans.

> 3. Some hooks for arches to override particular behavior as needed.
> F.e. IA64 allocates percpu structures in a special way. x86_64
> needs to do some tricks for the pda etc etc.

IA64 is going to need some work, since dynamic percpu addresses won't be able
to use their pinned TLB trick to get the local version.

> > Mike's a clever guy, I'm sure he'll think of something :)
>
> Hopefully. Otherwise he will ask me =-).

And as always, lkml will offer feedback; useful and otherwise :)

Cheers,
Rusty.

2008-06-16 14:52:23

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Sun, 15 Jun 2008, Rusty Russell wrote:

> > 3. Some hooks for arches to override particular behavior as needed.
> > F.e. IA64 allocates percpu structures in a special way. x86_64
> > needs to do some tricks for the pda etc etc.
>
> IA64 is going to need some work, since dynamic percpu addresses won't be able
> to use their pinned TLB trick to get the local version.

The ia64 hook could simply return the address of the percpu area that
was reserved when the per node memory layout was generated (which happens
very early during node bootstrap).

2008-06-17 00:25:16

by Rusty Russell

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
> On Sun, 15 Jun 2008, Rusty Russell wrote:
> > > 3. Some hooks for arches to override particular behavior as needed.
> > > F.e. IA64 allocates percpu structures in a special way. x86_64
> > > needs to do some tricks for the pda etc etc.
> >
> > IA64 is going to need some work, since dynamic percpu addresses won't be
> > able to use their pinned TLB trick to get the local version.
>
> The ia64 hook could simply return the address of the percpu area that
> was reserved when the per node memory layout was generated (which happens
> very early during node bootstrap).

Apologies, this time I read the code. I thought IA64 used the pinned TLB area
to access per-cpu vars under some circumstances, but they only do that via an
arch-specific macro.

So creating new congruent mappings to expand the percpu area(s) is our main
concern now?

Rusty.

2008-06-17 02:29:27

by Christoph Lameter

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

On Tue, 17 Jun 2008, Rusty Russell wrote:

> > The ia64 hook could simply return the address of the percpu area that
> > was reserved when the per node memory layout was generated (which happens
> > very early during node bootstrap).
>
> Apologies, this time I read the code. I thought IA64 used the pinned TLB area
> to access per-cpu vars under some circumstances, but they only do that via an
> arch-specific macro.
>
> So creating new congruent mappings to expand the percpu area(s) is our main
> concern now?

The concern here was just consolidating the setup code for the per cpu
areas.

2008-06-17 14:21:20

by Mike Travis

Subject: Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

Rusty Russell wrote:
> On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
>> On Sun, 15 Jun 2008, Rusty Russell wrote:
>>>> 3. Some hooks for arches to override particular behavior as needed.
>>>> F.e. IA64 allocates percpu structures in a special way. x86_64
>>>> needs to do some tricks for the pda etc etc.
>>> IA64 is going to need some work, since dynamic percpu addresses won't be
>>> able to use their pinned TLB trick to get the local version.
>> The ia64 hook could simply return the address of the percpu area that
>> was reserved when the per node memory layout was generated (which happens
>> very early during node bootstrap).
>
> Apologies, this time I read the code. I thought IA64 used the pinned TLB area
> to access per-cpu vars under some circumstances, but they only do that via an
> arch-specific macro.
>
> So creating new congruent mappings to expand the percpu area(s) is our main
> concern now?
>
> Rusty.

Not exactly. Getting the system to not panic early in the boot (before
x86_64_start_kernel()) is the primary problem right now. This happens
in the tip tree with the change to use zero-based percpu offsets. It
gets much farther on the linux-next tree.

Thanks,
Mike