Subject: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Use cmpxchg instead of xchg to realize this_cpu_xchg.

xchg will cause LOCK overhead since a LOCK prefix is always implied, but
cmpxchg will not.

Baselines:

xchg() = 18 cycles (no segment prefix, LOCK semantics)
__this_cpu_xchg = 1 cycle

(simulated using this_cpu_read/write, i.e. two segment-prefixed
instructions. It looks like the CPU can use loop optimization to get rid
of most of the overhead.)

Cycles before:

this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg))

After:

this_cpu_xchg = 11 cycles (using cmpxchg without lock semantics)

Signed-off-by: Christoph Lameter <[email protected]>

---
arch/x86/include/asm/percpu.h | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-10 12:46:31.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-10 13:25:21.000000000 -0600
@@ -213,8 +213,9 @@ do { \
})

/*
- * Beware: xchg on x86 has an implied lock prefix. There will be the cost of
- * full lock semantics even though they are not needed.
+ * xchg is implemented using cmpxchg without a lock prefix. xchg is
+ * expensive due to the implied lock prefix. The processor cannot prefetch
+ * cachelines if xchg is used.
*/
#define percpu_xchg_op(var, nval) \
({ \
@@ -222,25 +223,33 @@ do { \
typeof(var) __new = (nval); \
switch (sizeof(var)) { \
case 1: \
- asm("xchgb %2, "__percpu_arg(1) \
+ asm("\n1:mov "__percpu_arg(1)",%%al" \
+ "\n\tcmpxchgb %2, "__percpu_arg(1) \
+ "\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "q" (__new) \
: "memory"); \
break; \
case 2: \
- asm("xchgw %2, "__percpu_arg(1) \
+ asm("\n1:mov "__percpu_arg(1)",%%ax" \
+ "\n\tcmpxchgw %2, "__percpu_arg(1) \
+ "\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 4: \
- asm("xchgl %2, "__percpu_arg(1) \
+ asm("\n1:mov "__percpu_arg(1)",%%eax" \
+ "\n\tcmpxchgl %2, "__percpu_arg(1) \
+ "\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 8: \
- asm("xchgq %2, "__percpu_arg(1) \
+ asm("\n1:mov "__percpu_arg(1)",%%rax" \
+ "\n\tcmpxchgq %2, "__percpu_arg(1) \
+ "\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \


2010-12-14 16:35:14

by Mathieu Desnoyers

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

* Christoph Lameter ([email protected]) wrote:
> Use cmpxchg instead of xchg to realize this_cpu_xchg.
>
> xchg will cause LOCK overhead since LOCK is always implied but cmpxchg
> will not.
>
> Baselines:
>
> xchg() = 18 cycles (no segment prefix, LOCK semantics)
> __this_cpu_xchg = 1 cycle
>
> (simulated using this_cpu_read/write, two prefixes. Looks like the
> cpu can use loop optimization to get rid of most of the overhead)
>
> Cycles before:
>
> this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg))
>
> After:
>
> this_cpu_xchg = 11 cycle (using cmpxchg without lock semantics)

Cool! Thanks for benchmarking these, it's really worth it.

Acked-by: Mathieu Desnoyers <[email protected]>

>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> arch/x86/include/asm/percpu.h | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-10 12:46:31.000000000 -0600
> +++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-10 13:25:21.000000000 -0600
> @@ -213,8 +213,9 @@ do { \
> })
>
> /*
> - * Beware: xchg on x86 has an implied lock prefix. There will be the cost of
> - * full lock semantics even though they are not needed.
> + * xchg is implemented using cmpxchg without a lock prefix. xchg is
> + * expensive due to the implied lock prefix. The processor cannot prefetch
> + * cachelines if xchg is used.
> */
> #define percpu_xchg_op(var, nval) \
> ({ \
> @@ -222,25 +223,33 @@ do { \
> typeof(var) __new = (nval); \
> switch (sizeof(var)) { \
> case 1: \
> - asm("xchgb %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%al" \
> + "\n\tcmpxchgb %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "q" (__new) \
> : "memory"); \
> break; \
> case 2: \
> - asm("xchgw %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%ax" \
> + "\n\tcmpxchgw %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
> break; \
> case 4: \
> - asm("xchgl %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%eax" \
> + "\n\tcmpxchgl %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
> break; \
> case 8: \
> - asm("xchgq %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%rax" \
> + "\n\tcmpxchgq %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-12-14 16:44:42

by Eric Dumazet

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On Tuesday, 14 December 2010 at 10:28 -0600, Christoph Lameter wrote:
> plain text document attachment (cpuops_xchg_with_cmpxchg)
> Use cmpxchg instead of xchg to realize this_cpu_xchg.
>
> xchg will cause LOCK overhead since LOCK is always implied but cmpxchg
> will not.
>
> Baselines:
>
> xchg() = 18 cycles (no segment prefix, LOCK semantics)
> __this_cpu_xchg = 1 cycle
>
> (simulated using this_cpu_read/write, two prefixes. Looks like the
> cpu can use loop optimization to get rid of most of the overhead)
>
> Cycles before:
>
> this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg))
>
> After:
>
> this_cpu_xchg = 11 cycle (using cmpxchg without lock semantics)
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> arch/x86/include/asm/percpu.h | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-10 12:46:31.000000000 -0600
> +++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-10 13:25:21.000000000 -0600
> @@ -213,8 +213,9 @@ do { \
> })
>
> /*
> - * Beware: xchg on x86 has an implied lock prefix. There will be the cost of
> - * full lock semantics even though they are not needed.
> + * xchg is implemented using cmpxchg without a lock prefix. xchg is
> + * expensive due to the implied lock prefix. The processor cannot prefetch
> + * cachelines if xchg is used.
> */
> #define percpu_xchg_op(var, nval) \
> ({ \
> @@ -222,25 +223,33 @@ do { \
> typeof(var) __new = (nval); \
> switch (sizeof(var)) { \
> case 1: \
> - asm("xchgb %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%al" \
> + "\n\tcmpxchgb %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \


You should use the fact that a failed cmpxchg loads the current value
into al/ax/eax/rax, so:

"\n\tmov "__percpu_arg(1)",%%al"
"\n1:\tcmpxchgb %2, "__percpu_arg(1)
"\n\tjnz 1b"

(No need to reload the value again)


> : "=a" (__ret), "+m" (var) \
> : "q" (__new) \
> : "memory"); \
> break; \
> case 2: \
> - asm("xchgw %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%ax" \
> + "\n\tcmpxchgw %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
> break; \
> case 4: \
> - asm("xchgl %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%eax" \
> + "\n\tcmpxchgl %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
> break; \
> case 8: \
> - asm("xchgq %2, "__percpu_arg(1) \
> + asm("\n1:mov "__percpu_arg(1)",%%rax" \
> + "\n\tcmpxchgq %2, "__percpu_arg(1) \
> + "\n\tjnz 1b" \
> : "=a" (__ret), "+m" (var) \
> : "r" (__new) \
> : "memory"); \
>

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On Tue, 14 Dec 2010, Eric Dumazet wrote:

> You should use the fact that the failed cmpxchg loads in al/ax/eax/rax
> the current value, so :
>
> "\n\tmov "__percpu_arg(1)",%%al"
> "\n1:\tcmpxchgb %2, "__percpu_arg(1)
> "\n\tjnz 1b"
>
> (No need to reload the value again)

Subject: cpuops xchg: Exploit the fact that failed cmpxchges return old value

As pointed out by Eric, we do not need to reload the value: it is already
in the correct register, so just rerun the cmpxchg without the load.

Signed-off-by: Christoph Lameter <[email protected]>

---
arch/x86/include/asm/percpu.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-14 10:51:57.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-14 10:54:46.000000000 -0600
@@ -223,32 +223,32 @@ do { \
typeof(var) __new = (nval); \
switch (sizeof(var)) { \
case 1: \
- asm("\n1:mov "__percpu_arg(1)",%%al" \
- "\n\tcmpxchgb %2, "__percpu_arg(1) \
+ asm("\n\tmov "__percpu_arg(1)",%%al" \
+ "\n1:cmpxchgb %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "q" (__new) \
: "memory"); \
break; \
case 2: \
- asm("\n1:mov "__percpu_arg(1)",%%ax" \
- "\n\tcmpxchgw %2, "__percpu_arg(1) \
+ asm("\n\tmov "__percpu_arg(1)",%%ax" \
+ "\n1:cmpxchgw %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 4: \
- asm("\n1:mov "__percpu_arg(1)",%%eax" \
- "\n\tcmpxchgl %2, "__percpu_arg(1) \
+ asm("\n\tmov "__percpu_arg(1)",%%eax" \
+ "\n1:cmpxchgl %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 8: \
- asm("\n1:mov "__percpu_arg(1)",%%rax" \
- "\n\tcmpxchgq %2, "__percpu_arg(1) \
+ asm("\n\tmov "__percpu_arg(1)",%%rax" \
+ "\n1:cmpxchgq %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \

2010-12-14 17:02:12

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/14/2010 08:55 AM, Christoph Lameter wrote:
>
> We do not need to reload the value as pointed out by Eric. It is already in
> the correct register so just rerun the cmpxchg without the load.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>

Is it genuinely faster to do the pre-load mov, or can we drop that too?
My guess would be that yes it is, but if it happens not to be it would
be nice to reduce the code size.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics



On Tue, 14 Dec 2010, H. Peter Anvin wrote:

> On 12/14/2010 08:55 AM, Christoph Lameter wrote:
> >
> > We do not need to reload the value as pointed out by Eric. It is already in
> > the correct register so just rerun the cmpxchg without the load.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
> >
>
> Is it genuinely faster to do the pre-load mov, or can we drop that too?
> My guess would be that yes it is, but if it happens not to be it would
> be nice to reduce the code size.

Dropping the load increases the cycle count from 11 to 16.

---
arch/x86/include/asm/percpu.h | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-14 11:07:18.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-14 11:14:37.000000000 -0600
@@ -223,32 +223,28 @@ do { \
typeof(var) __new = (nval); \
switch (sizeof(var)) { \
case 1: \
- asm("\n1:mov "__percpu_arg(1)",%%al" \
- "\n\tcmpxchgb %2, "__percpu_arg(1) \
+ asm("\n1:cmpxchgb %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "q" (__new) \
: "memory"); \
break; \
case 2: \
- asm("\n1:mov "__percpu_arg(1)",%%ax" \
- "\n\tcmpxchgw %2, "__percpu_arg(1) \
+ asm("\n1:cmpxchgw %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 4: \
- asm("\n1:mov "__percpu_arg(1)",%%eax" \
- "\n\tcmpxchgl %2, "__percpu_arg(1) \
+ asm("\n1:cmpxchgl %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \
: "memory"); \
break; \
case 8: \
- asm("\n1:mov "__percpu_arg(1)",%%rax" \
- "\n\tcmpxchgq %2, "__percpu_arg(1) \
+ asm("\n1:cmpxchgq %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=a" (__ret), "+m" (var) \
: "r" (__new) \

2010-12-14 17:23:16

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/14/2010 09:19 AM, Christoph Lameter wrote:
>>
>> Is it genuinely faster to do the pre-load mov, or can we drop that too?
>> My guess would be that yes it is, but if it happens not to be it would
>> be nice to reduce the code size.
>
> Dropping the load increases the cycle count from 11 to 16.
>

Great, that answers that! I'll pick up the patch hopefully today (I'm
finally ramping back up on arch/x86 again after having been diverted to
an internal project for a while...)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-14 17:30:03

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Hello,

On 12/14/2010 06:22 PM, H. Peter Anvin wrote:
> On 12/14/2010 09:19 AM, Christoph Lameter wrote:
>>>
>>> Is it genuinely faster to do the pre-load mov, or can we drop that too?
>>> My guess would be that yes it is, but if it happens not to be it would
>>> be nice to reduce the code size.
>>
>> Dropping the load increases the cycle count from 11 to 16.
>
> Great, that answers that! I'll pick up the patch hopefully today (I'm
> finally ramping back up on arch/x86 again after having been diverted to
> an internal project for a while...)

How do you want to route these? All patches before this series are
already in the percpu tree. I can pull the generic bits and leave out
the x86 bits so that x86 tree can pull in percpu bits and then put x86
stuff on top of it. If you wanna go that way, I would drop all x86
related patches from the previous patchsets too.

Thanks.

--
tejun

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On Tue, 14 Dec 2010, Tejun Heo wrote:

> > Great, that answers that! I'll pick up the patch hopefully today (I'm
> > finally ramping back up on arch/x86 again after having been diverted to
> > an internal project for a while...)
>
> How do you want to route these? All patches before this series are
> already in the percpu tree. I can pull the generic bits and leave out
> the x86 bits so that x86 tree can pull in percpu bits and then put x86
> stuff on top of it. If you wanna go that way, I would drop all x86
> related patches from the previous patchsets too.

I think it is better to merge the generic pieces and x86 at the same time.
Since this has worked so well with the percpu tree so far I'd say we
continue that way.

Regardless of how this will be merged: It would be good if hpa would take
a look at the patches we have accumulated so far.

2010-12-15 01:08:17

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/14/2010 09:29 AM, Tejun Heo wrote:
> Hello,
>
> On 12/14/2010 06:22 PM, H. Peter Anvin wrote:
>> On 12/14/2010 09:19 AM, Christoph Lameter wrote:
>>>>
>>>> Is it genuinely faster to do the pre-load mov, or can we drop that too?
>>>> My guess would be that yes it is, but if it happens not to be it would
>>>> be nice to reduce the code size.
>>>
>>> Dropping the load increases the cycle count from 11 to 16.
>>
>> Great, that answers that! I'll pick up the patch hopefully today (I'm
>> finally ramping back up on arch/x86 again after having been diverted to
>> an internal project for a while...)
>
> How do you want to route these? All patches before this series are
> already in the percpu tree. I can pull the generic bits and leave out
> the x86 bits so that x86 tree can pull in percpu bits and then put x86
> stuff on top of it. If you wanna go that way, I would drop all x86
> related patches from the previous patchsets too.
>

It's probably easiest if you just add them with my Acked-by: then.
Alternatively, I could pick them up so they go into -tip and the tip
test machinery.

-hpa

2010-12-15 16:30:10

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Hello,

On 12/15/2010 02:06 AM, H. Peter Anvin wrote:
> It's probably easiest if you just add them with my Acked-by: then.
> Alternatively, I could pick them up so they go into -tip and the tip
> test machinery.

All patches accepted into percpu till now are in the following branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next

The tree is based on v2.6.37-rc5, and v2.6.37-rc5..for-next gives
21 patches. I think most of these don't fit the x86 tree, but there are
two which would fit there much better: d80aadf9 (x86: Replace
uses of current_cpu_data with this_cpu ops) and e5195e91 (x86: Use
this_cpu_inc_return for nmi counter).

As it would be beneficial to push these through -tip testing anyway,
how about the following?

* If you review and ack the x86 related bits in the series, I'll
regenerate and add the ACKs.

* It would be better if the two commits mentioned above get routed
through x86 tree rather than percpu tree, so I'll drop the above two
from percpu tree and you can pull percpu into x86 and then apply
those in x86 tree.

Thanks.

--
tejun

2010-12-15 16:36:58

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/15/2010 08:29 AM, Tejun Heo wrote:
> Hello,
>
> On 12/15/2010 02:06 AM, H. Peter Anvin wrote:
>> It's probably easiest if you just add them with my Acked-by: then.
>> Alternatively, I could pick them up so they go into -tip and the tip
>> test machinery.
>
> All patches accepted into percpu till now are in the following branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next
>
> The tree is based on v2.6.37-rc5, and v2.6.37-rc5..for-next gives
> 21 patches. I think most of these don't fit the x86 tree, but there are
> two which would fit there much better: d80aadf9 (x86: Replace
> uses of current_cpu_data with this_cpu ops) and e5195e91 (x86: Use
> this_cpu_inc_return for nmi counter).
>
> As it would be beneficial to push these through -tip testing anyway,
> how about the following?
>
> * If you review and ack the x86 related bits in the series, I'll
> regenerate and add the ACKs.
>
> * It would be better if the two commits mentioned above get routed
> through x86 tree rather than percpu tree, so I'll drop the above two
> from percpu tree and you can pull percpu into x86 and then apply
> those in x86 tree.
>

I think we can do that... let me look through the tree first officially
committing. Alternatively, I can pull the patches from your tree and
individually review them and add them to a tip branch, if you prefer.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-15 16:40:47

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Hello,

On 12/15/2010 05:35 PM, H. Peter Anvin wrote:
> I think we can do that... let me look through the tree first officially
> committing. Alternatively, I can pull the patches from your tree and
> individually review them and add them to a tip branch, if you prefer.

I'd prefer percpu things going through percpu tree, if for nothing
else for git history's sake, but I don't think it really matters. The
series is spread all over the place anyway. As long as each
maintainer is properly alerted about the changes, it should be okay.
Please let me know whether you agree with the changes currently queued
in percpu#for-next. I'll update the tree with your Acked-by's and
freeze it.

Thanks.

--
tejun

2010-12-15 16:48:36

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/14/2010 05:28 PM, Christoph Lameter wrote:
> Use cmpxchg instead of xchg to realize this_cpu_xchg.
>
> xchg will cause LOCK overhead since LOCK is always implied but cmpxchg
> will not.
>
> Baselines:
>
> xchg() = 18 cycles (no segment prefix, LOCK semantics)
> __this_cpu_xchg = 1 cycle
>
> (simulated using this_cpu_read/write, two prefixes. Looks like the
> cpu can use loop optimization to get rid of most of the overhead)
>
> Cycles before:
>
> this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg))
>
> After:
>
> this_cpu_xchg = 11 cycle (using cmpxchg without lock semantics)
>
> Signed-off-by: Christoph Lameter <[email protected]>

It's not a bad idea to keep this patch separate from the original one,
but as neither is applied yet, it is probably better to put this
right after the original addition if you end up re-posting the series;
otherwise, I'll just reorder it when I apply.

Thanks.

--
tejun

2010-12-16 16:14:55

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Hey, again.

On 12/15/2010 05:39 PM, Tejun Heo wrote:
> I'd prefer percpu things going through percpu tree, if for nothing
> else for git history's sake, but I don't think it really matters. The
> series is spread all over the place anyway. As long as each
> maintainer is properly alerted about the changes, it should be okay.
> Please let me know whether you agree with the changes currently queued
> in percpu#for-next. I'll update the tree with your Acked-by's and
> freeze it.

Are you okay with the patches currently in percpu#for-next? If so,
I'll regenerate patches with your acked-by and pop the two previously
mentioned commits and proceed with the rest of the series.

Thank you.

--
tejun

Subject: x86: Use this_cpu_has for thermal_interrupt

It is more effective to use a segment prefix instead of calculating the
address of the current cpu area and then testing flags.

Signed-off-by: Christoph Lameter <[email protected]>

---
arch/x86/kernel/cpu/mcheck/therm_throt.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/mcheck/therm_throt.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/therm_throt.c 2010-12-16 11:41:48.000000000 -0600
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/therm_throt.c 2010-12-16 11:43:42.000000000 -0600
@@ -317,7 +317,6 @@ device_initcall(thermal_throttle_init_de
static void intel_thermal_interrupt(void)
{
__u64 msr_val;
- struct cpuinfo_x86 *c = &cpu_data(smp_processor_id());

rdmsrl(MSR_IA32_THERM_STATUS, msr_val);

@@ -326,19 +325,19 @@ static void intel_thermal_interrupt(void
CORE_LEVEL) != 0)
mce_log_therm_throt_event(CORE_THROTTLED | msr_val);

- if (cpu_has(c, X86_FEATURE_PLN))
+ if (this_cpu_has(X86_FEATURE_PLN))
if (therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
POWER_LIMIT_EVENT,
CORE_LEVEL) != 0)
mce_log_therm_throt_event(CORE_POWER_LIMIT | msr_val);

- if (cpu_has(c, X86_FEATURE_PTS)) {
+ if (this_cpu_has(X86_FEATURE_PTS)) {
rdmsrl(MSR_IA32_PACKAGE_THERM_STATUS, msr_val);
if (therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT,
THERMAL_THROTTLING_EVENT,
PACKAGE_LEVEL) != 0)
mce_log_therm_throt_event(PACKAGE_THROTTLED | msr_val);
- if (cpu_has(c, X86_FEATURE_PLN))
+ if (this_cpu_has(X86_FEATURE_PLN))
if (therm_throt_process(msr_val &
PACKAGE_THERM_STATUS_POWER_LIMIT,
POWER_LIMIT_EVENT,

Subject: x86: udelay: Use this_cpu_read to avoid address calculation

The code will use a segment prefix instead of doing the lookup and calculation.

Signed-off-by: Christoph Lameter <[email protected]>

---
arch/x86/lib/delay.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/lib/delay.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/delay.c 2010-12-16 11:45:12.000000000 -0600
+++ linux-2.6/arch/x86/lib/delay.c 2010-12-16 11:45:37.000000000 -0600
@@ -121,7 +121,7 @@ inline void __const_udelay(unsigned long
asm("mull %%edx"
:"=d" (xloops), "=&a" (d0)
:"1" (xloops), "0"
- (cpu_data(raw_smp_processor_id()).loops_per_jiffy * (HZ/4)));
+ (this_cpu_read(cpu_info.loops_per_jiffy) * (HZ/4)));

__delay(++xloops);
}

Subject: gameport: use this_cpu_read instead of lookup



Signed-off-by: Christoph Lameter <[email protected]>

---
drivers/input/gameport/gameport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/input/gameport/gameport.c
===================================================================
--- linux-2.6.orig/drivers/input/gameport/gameport.c 2010-12-16 11:46:18.000000000 -0600
+++ linux-2.6/drivers/input/gameport/gameport.c 2010-12-16 11:46:56.000000000 -0600
@@ -123,7 +123,7 @@ static int gameport_measure_speed(struct
}

gameport_close(gameport);
- return (cpu_data(raw_smp_processor_id()).loops_per_jiffy *
+ return (this_cpu_read(cpu_info.loops_per_jiffy) *
(unsigned long)HZ / (1000 / 50)) / (tx < 1 ? 1 : tx);

#else

Subject: acpi throttling: Use this_cpu_has and simplify code


With the this_cpu_xx operations we no longer need to pass an acpi
structure to the MSR management code. This simplifies the code and
improves performance.

Signed-off-by: Christoph Lameter <[email protected]>

---
drivers/acpi/processor_throttling.c | 32 ++++++++++----------------------
1 file changed, 10 insertions(+), 22 deletions(-)

Index: linux-2.6/drivers/acpi/processor_throttling.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_throttling.c 2010-12-16 11:56:57.000000000 -0600
+++ linux-2.6/drivers/acpi/processor_throttling.c 2010-12-16 12:00:17.000000000 -0600
@@ -662,20 +662,14 @@ static int acpi_processor_get_throttling
}

#ifdef CONFIG_X86
-static int acpi_throttling_rdmsr(struct acpi_processor *pr,
- u64 *value)
+static int acpi_throttling_rdmsr(u64 *value)
{
- struct cpuinfo_x86 *c;
u64 msr_high, msr_low;
- unsigned int cpu;
u64 msr = 0;
int ret = -1;

- cpu = pr->id;
- c = &cpu_data(cpu);
-
- if ((c->x86_vendor != X86_VENDOR_INTEL) ||
- !cpu_has(c, X86_FEATURE_ACPI)) {
+ if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
+ !this_cpu_has(X86_FEATURE_ACPI)) {
printk(KERN_ERR PREFIX
"HARDWARE addr space,NOT supported yet\n");
} else {
@@ -690,18 +684,13 @@ static int acpi_throttling_rdmsr(struct
return ret;
}

-static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
+static int acpi_throttling_wrmsr(u64 value)
{
- struct cpuinfo_x86 *c;
- unsigned int cpu;
int ret = -1;
u64 msr;

- cpu = pr->id;
- c = &cpu_data(cpu);
-
- if ((c->x86_vendor != X86_VENDOR_INTEL) ||
- !cpu_has(c, X86_FEATURE_ACPI)) {
+ if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
+ !this_cpu_has(X86_FEATURE_ACPI)) {
printk(KERN_ERR PREFIX
"HARDWARE addr space,NOT supported yet\n");
} else {
@@ -713,15 +702,14 @@ static int acpi_throttling_wrmsr(struct
return ret;
}
#else
-static int acpi_throttling_rdmsr(struct acpi_processor *pr,
- u64 *value)
+static int acpi_throttling_rdmsr(u64 *value)
{
printk(KERN_ERR PREFIX
"HARDWARE addr space,NOT supported yet\n");
return -1;
}

-static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
+static int acpi_throttling_wrmsr(u64 value)
{
printk(KERN_ERR PREFIX
"HARDWARE addr space,NOT supported yet\n");
@@ -753,7 +741,7 @@ static int acpi_read_throttling_status(s
ret = 0;
break;
case ACPI_ADR_SPACE_FIXED_HARDWARE:
- ret = acpi_throttling_rdmsr(pr, value);
+ ret = acpi_throttling_rdmsr(value);
break;
default:
printk(KERN_ERR PREFIX "Unknown addr space %d\n",
@@ -786,7 +774,7 @@ static int acpi_write_throttling_state(s
ret = 0;
break;
case ACPI_ADR_SPACE_FIXED_HARDWARE:
- ret = acpi_throttling_wrmsr(pr, value);
+ ret = acpi_throttling_wrmsr(value);
break;
default:
printk(KERN_ERR PREFIX "Unknown addr space %d\n",

2010-12-16 18:20:19

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/16/2010 08:14 AM, Tejun Heo wrote:
>
> Are you okay with the patches currently in percpu#for-next? If so,
> I'll regenerate patches with your acked-by and pop the two previously
> mentioned commits and proceed with the rest of the series.
>
> Thank you.
>

Still looking through them (sorry). I note that we probably do need to
get Christoph's followup patches into -tip, so we need to get it all
into tip; as such, even if it goes through your tree I'll need to pull
it into a tip branch.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-16 18:56:05

by Tejun Heo

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

Hello,

On 12/16/2010 07:19 PM, H. Peter Anvin wrote:
> On 12/16/2010 08:14 AM, Tejun Heo wrote:
>>
>> Are you okay with the patches currently in percpu#for-next? If so,
>> I'll regenerate patches with your acked-by and pop the two previously
>> mentioned commits and proceed with the rest of the series.
>
> Still looking through them (sorry.) I note that we probably do need to
> get Christoph's followup patches into -tip, so we need to get it all
> into tip; as such, even if it goes through your tree I'll need to pull
> it into a tip branch.

Yeah, no problem. Pekka also wants to pull the essential part into
the memory allocator tree, so I think it would be best to keep at
least the proper percpu part and the x86-specific ops in the percpu
tree, though.

Thanks.

--
tejun

2010-12-16 20:43:10

by H. Peter Anvin

Subject: Re: [cpuops cmpxchg V2 5/5] cpuops: Use cmpxchg for xchg to avoid lock semantics

On 12/16/2010 08:14 AM, Tejun Heo wrote:
> Hey, again.
>
> On 12/15/2010 05:39 PM, Tejun Heo wrote:
>> I'd prefer percpu things going through percpu tree, if for nothing
>> else for git history's sake, but I don't think it really matters. The
>> series is spread all over the place anyway. As long as each
>> maintainer is properly alerted about the changes, it should be okay.
>> Please let me know whether you agree with the changes currently queued
>> in percpu#for-next. I'll update the tree with your Acked-by's and
>> freeze it.
>
> Are you okay with the patches currently in percpu#for-next? If so,
> I'll regenerate patches with your acked-by and pop the two previously
> mentioned commits and proceed with the rest of the series.
>

Just finished reviewing the patches in percpu#for-next.

Acked-by: H. Peter Anvin <[email protected]>

Looks good. Let me know when you have a baseline I can pull into a tip
branch so we can build the rest of the patches on top.

-hpa

2010-12-18 15:34:50

by Tejun Heo

Subject: Re: gameport: use this_cpu_read instead of lookup

On 12/16/2010 07:15 PM, Christoph Lameter wrote:
>
> Signed-off-by: Christoph Lameter <[email protected]>

Acked-by: Tejun Heo <[email protected]>

--
tejun

2010-12-18 15:35:39

by Tejun Heo

Subject: Re: x86: Use this_cpu_has for thermal_interrupt

On 12/16/2010 07:13 PM, Christoph Lameter wrote:
> It is more effective to use a segment prefix instead of calculating the
> address of the current cpu area and then testing flags.
>
> Signed-off-by: Christoph Lameter <[email protected]>

For this and the other x86 patch.

Acked-by: Tejun Heo <[email protected]>

I suppose these two x86 patches and the gameport one would go through
the x86 tree, right?

Thank you.

--
tejun

2010-12-18 15:50:30

by Tejun Heo

[permalink] [raw]
Subject: Re: acpi throttling: Use this_cpu_has and simplify code

(cc'ing ACPI ppl and quoting the whole body)

On 12/16/2010 07:16 PM, Christoph Lameter wrote:
>
> With the this_cpu_xx we no longer need to pass an acpi
> structure to the msr management code. Simplifies code and improves
> performance.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> drivers/acpi/processor_throttling.c | 32 ++++++++++----------------------
> 1 file changed, 10 insertions(+), 22 deletions(-)
>
> Index: linux-2.6/drivers/acpi/processor_throttling.c
> ===================================================================
> --- linux-2.6.orig/drivers/acpi/processor_throttling.c 2010-12-16 11:56:57.000000000 -0600
> +++ linux-2.6/drivers/acpi/processor_throttling.c 2010-12-16 12:00:17.000000000 -0600
> @@ -662,20 +662,14 @@ static int acpi_processor_get_throttling
> }
>
> #ifdef CONFIG_X86
> -static int acpi_throttling_rdmsr(struct acpi_processor *pr,
> - u64 *value)
> +static int acpi_throttling_rdmsr(u64 *value)
> {
> - struct cpuinfo_x86 *c;
> u64 msr_high, msr_low;
> - unsigned int cpu;
> u64 msr = 0;
> int ret = -1;
>
> - cpu = pr->id;
> - c = &cpu_data(cpu);
> -
> - if ((c->x86_vendor != X86_VENDOR_INTEL) ||
> - !cpu_has(c, X86_FEATURE_ACPI)) {
> + if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
> + !this_cpu_has(X86_FEATURE_ACPI)) {
> printk(KERN_ERR PREFIX
> "HARDWARE addr space,NOT supported yet\n");
> } else {
> @@ -690,18 +684,13 @@ static int acpi_throttling_rdmsr(struct
> return ret;
> }
>
> -static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
> +static int acpi_throttling_wrmsr(u64 value)
> {
> - struct cpuinfo_x86 *c;
> - unsigned int cpu;
> int ret = -1;
> u64 msr;
>
> - cpu = pr->id;
> - c = &cpu_data(cpu);
> -
> - if ((c->x86_vendor != X86_VENDOR_INTEL) ||
> - !cpu_has(c, X86_FEATURE_ACPI)) {
> + if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
> + !this_cpu_has(X86_FEATURE_ACPI)) {
> printk(KERN_ERR PREFIX
> "HARDWARE addr space,NOT supported yet\n");
> } else {
> @@ -713,15 +702,14 @@ static int acpi_throttling_wrmsr(struct
> return ret;
> }
> #else
> -static int acpi_throttling_rdmsr(struct acpi_processor *pr,
> - u64 *value)
> +static int acpi_throttling_rdmsr(u64 *value)
> {
> printk(KERN_ERR PREFIX
> "HARDWARE addr space,NOT supported yet\n");
> return -1;
> }
>
> -static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
> +static int acpi_throttling_wrmsr(u64 value)
> {
> printk(KERN_ERR PREFIX
> "HARDWARE addr space,NOT supported yet\n");
> @@ -753,7 +741,7 @@ static int acpi_read_throttling_status(s
> ret = 0;
> break;
> case ACPI_ADR_SPACE_FIXED_HARDWARE:
> - ret = acpi_throttling_rdmsr(pr, value);
> + ret = acpi_throttling_rdmsr(value);
> break;
> default:
> printk(KERN_ERR PREFIX "Unknown addr space %d\n",
> @@ -786,7 +774,7 @@ static int acpi_write_throttling_state(s
> ret = 0;
> break;
> case ACPI_ADR_SPACE_FIXED_HARDWARE:
> - ret = acpi_throttling_wrmsr(pr, value);
> + ret = acpi_throttling_wrmsr(value);
> break;
> default:
> printk(KERN_ERR PREFIX "Unknown addr space %d\n",
>

It's bothersome that these methods don't have any indication that
they're bound to the local CPU; they can't be called with @pr for
another CPU, as MSRs can only be accessed from the local CPU.

In the longer run, it would be nice if there's an indication that this
is only for the local CPU and maybe a WARN_ON_ONCE(). Maybe dropping
@pr and using this_cpu_*() is better for performance too?

Anyways, the above doesn't make the situation any worse, so...

Acked-by: Tejun Heo <[email protected]>

I think this one too fits the x86 tree better.

Thanks.

--
tejun

2010-12-21 00:57:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: x86: Use this_cpu_has for thermal_interrupt

On 12/18/2010 07:35 AM, Tejun Heo wrote:
> On 12/16/2010 07:13 PM, Christoph Lameter wrote:
>> It is more effective to use a segment prefix instead of calculating the
>> address of the current cpu area and then testing flags.
>>
>> Signed-off-by: Christoph Lameter <[email protected]>
>
> For this and the other x86 patch.
>
> Acked-by: Tejun Heo <[email protected]>
>
> I suppose these two x86 patches and the gameport one would go through
> the x86 tree, right?
>

Yes, as long as I can get a stable base to pull into -tip.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-21 01:57:45

by Zhao, Yakui

[permalink] [raw]
Subject: Re: acpi throttling: Use this_cpu_has and simplify code

On Sat, 2010-12-18 at 23:50 +0800, Tejun Heo wrote:
> (cc'ing ACPI ppl and quoting the whole body)
>
> On 12/16/2010 07:16 PM, Christoph Lameter wrote:
> >
> > With the this_cpu_xx we no longer need to pass an acpi
> > structure to the msr management code. Simplifies code and improves
> > performance.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
> >
> > ---
> > drivers/acpi/processor_throttling.c | 32 ++++++++++----------------------
> > 1 file changed, 10 insertions(+), 22 deletions(-)
> >
> > Index: linux-2.6/drivers/acpi/processor_throttling.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/acpi/processor_throttling.c 2010-12-16 11:56:57.000000000 -0600
> > +++ linux-2.6/drivers/acpi/processor_throttling.c 2010-12-16 12:00:17.000000000 -0600
> > @@ -662,20 +662,14 @@ static int acpi_processor_get_throttling
> > }
> >
> > #ifdef CONFIG_X86
> > -static int acpi_throttling_rdmsr(struct acpi_processor *pr,
> > - u64 *value)
> > +static int acpi_throttling_rdmsr(u64 *value)
> > {
> > - struct cpuinfo_x86 *c;
> > u64 msr_high, msr_low;
> > - unsigned int cpu;
> > u64 msr = 0;
> > int ret = -1;
> >
> > - cpu = pr->id;
> > - c = &cpu_data(cpu);
> > -
> > - if ((c->x86_vendor != X86_VENDOR_INTEL) ||
> > - !cpu_has(c, X86_FEATURE_ACPI)) {
> > + if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
> > + !this_cpu_has(X86_FEATURE_ACPI)) {
> > printk(KERN_ERR PREFIX
> > "HARDWARE addr space,NOT supported yet\n");
> > } else {
> > @@ -690,18 +684,13 @@ static int acpi_throttling_rdmsr(struct
> > return ret;
> > }
> >
> > -static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
> > +static int acpi_throttling_wrmsr(u64 value)
> > {
> > - struct cpuinfo_x86 *c;
> > - unsigned int cpu;
> > int ret = -1;
> > u64 msr;
> >
> > - cpu = pr->id;
> > - c = &cpu_data(cpu);
> > -
> > - if ((c->x86_vendor != X86_VENDOR_INTEL) ||
> > - !cpu_has(c, X86_FEATURE_ACPI)) {
> > + if ((this_cpu_read(cpu_info.x86_vendor) != X86_VENDOR_INTEL) ||
> > + !this_cpu_has(X86_FEATURE_ACPI)) {
> > printk(KERN_ERR PREFIX
> > "HARDWARE addr space,NOT supported yet\n");
> > } else {
> > @@ -713,15 +702,14 @@ static int acpi_throttling_wrmsr(struct
> > return ret;
> > }
> > #else
> > -static int acpi_throttling_rdmsr(struct acpi_processor *pr,
> > - u64 *value)
> > +static int acpi_throttling_rdmsr(u64 *value)
> > {
> > printk(KERN_ERR PREFIX
> > "HARDWARE addr space,NOT supported yet\n");
> > return -1;
> > }
> >
> > -static int acpi_throttling_wrmsr(struct acpi_processor *pr, u64 value)
> > +static int acpi_throttling_wrmsr(u64 value)
> > {
> > printk(KERN_ERR PREFIX
> > "HARDWARE addr space,NOT supported yet\n");
> > @@ -753,7 +741,7 @@ static int acpi_read_throttling_status(s
> > ret = 0;
> > break;
> > case ACPI_ADR_SPACE_FIXED_HARDWARE:
> > - ret = acpi_throttling_rdmsr(pr, value);
> > + ret = acpi_throttling_rdmsr(value);
> > break;
> > default:
> > printk(KERN_ERR PREFIX "Unknown addr space %d\n",
> > @@ -786,7 +774,7 @@ static int acpi_write_throttling_state(s
> > ret = 0;
> > break;
> > case ACPI_ADR_SPACE_FIXED_HARDWARE:
> > - ret = acpi_throttling_wrmsr(pr, value);
> > + ret = acpi_throttling_wrmsr(value);
> > break;
> > default:
> > printk(KERN_ERR PREFIX "Unknown addr space %d\n",
> >
>
> It's bothersome that these methods don't have any indication that
> they're bound to local CPU when they can't be called with @pr for
> another CPU as MSRs can only be accessed from local CPU.

The above function may be called with irqs disabled. In that case,
if the corresponding MSR is accessed using the remote method, it
will oops. Maybe it would be safer to first switch to the local CPU
and then read/write the MSR register related to the throttling.

>
> In the longer run, it would be nice if there's an indication that this
> is only for the local CPU and maybe a WARN_ON_ONCE(). Maybe dropping
> @pr and using this_cpu_*() is better for performance too?

It is also OK with me to drop the pr input argument of
acpi_throttling_rdmsr/wrmsr, as we have already switched to the local CPU.

>
> Anyways, the above doesn't make the situation any worse, so...
>
> Acked-by: Tejun Heo <[email protected]>
>
> I think this one too fits the x86 tree better.
>
> Thanks.
>

2010-12-21 04:28:29

by Len Brown

[permalink] [raw]
Subject: Re: acpi throttling: Use this_cpu_has and simplify code

This patch is a valid cleanup, and I've applied it to the acpi-test tree.

The routines named *rdmsr() and *wrmsr() run on the local CPU
by definition. I think that after this change to get rid of pr*,
that is even clearer.

The part about checking that you are on the right processor
should really be done in the caller. Yakui should send a patch
for the driver to check for successful return from
set_cpus_allowed_ptr() before invoking this code.

> With the this_cpu_xx we no longer need to pass an acpi
> structure to the msr management code. Simplifies code and improves
> performance.

Simplifies, yes.
Performance -- not the focus here.
If you are running routines to throttle the clock frequency,
then you have already decided that performance isn't your top priority:-)

thanks,
Len Brown, Intel Open Source Technology Center

> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> drivers/acpi/processor_throttling.c | 32 ++++++++++----------------------
> 1 file changed, 10 insertions(+), 22 deletions(-)

Subject: Re: acpi throttling: Use this_cpu_has and simplify code

On Sat, 18 Dec 2010, Tejun Heo wrote:

> It's bothersome that these methods don't have any indication that
> they're bound to local CPU when they can't be called with @pr for
> another CPU as MSRs can only be accessed from local CPU.

Right.

> In the longer run, it would be nice if there's an indication that this
> is only for the local CPU and maybe a WARN_ON_ONCE(). Maybe dropping
> @pr and using this_cpu_*() is better for performance too?

That is precisely the difficulty with many other functions that take
pointers to cpu_info. If we can establish that they are called *only* with
pointers to the current cpu then we can improve the function by dropping
the parameter and using this_cpu ops to access the cpu_info fields.

Using this_cpu ops instead of accessing a field of a struct reached
through an array avoids pointer arithmetic and replaces it with an
operation that essentially takes a global address with a segment
prefix. It reduces register pressure and avoids pointer additions,
but it encodes the address in the instruction.

2010-12-30 11:29:40

by Tejun Heo

[permalink] [raw]
Subject: Re: x86: Use this_cpu_has for thermal_interrupt

On Mon, Dec 20, 2010 at 04:56:19PM -0800, H. Peter Anvin wrote:
> On 12/18/2010 07:35 AM, Tejun Heo wrote:
> > On 12/16/2010 07:13 PM, Christoph Lameter wrote:
> >> It is more effective to use a segment prefix instead of calculating the
> >> address of the current cpu area and then testing flags.
> >>
> >> Signed-off-by: Christoph Lameter <[email protected]>
> >
> > For this and the other x86 patch.
> >
> > Acked-by: Tejun Heo <[email protected]>
> >
> > I suppose these two x86 patches and the gameport one would go through
> > the x86 tree, right?
> >
>
> Yes, as long as I can get a stable base to pull into -tip.

I suppose I should take the two x86 and the gameport patches here into
percpu tree with your ACKs too. Would that be okay with you?

Thanks.

--
tejun

2010-12-30 18:21:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: x86: Use this_cpu_has for thermal_interrupt

On 12/30/2010 03:29 AM, Tejun Heo wrote:
> On Mon, Dec 20, 2010 at 04:56:19PM -0800, H. Peter Anvin wrote:
>> On 12/18/2010 07:35 AM, Tejun Heo wrote:
>>> On 12/16/2010 07:13 PM, Christoph Lameter wrote:
>>>> It is more effective to use a segment prefix instead of calculating the
>>>> address of the current cpu area and then testing flags.
>>>>
>>>> Signed-off-by: Christoph Lameter <[email protected]>
>>>
>>> For this and the other x86 patch.
>>>
>>> Acked-by: Tejun Heo <[email protected]>
>>>
>>> I suppose these two x86 patches and the gameport one would go through
>>> the x86 tree, right?
>>>
>>
>> Yes, as long as I can get a stable base to pull into -tip.
>
> I suppose I should take the two x86 and the gameport patches here into
> percpu tree with your ACKs too. Would that be okay with you?
>

Yes, please.

Acked-by: H. Peter Anvin <[email protected]>

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-31 12:43:58

by Tejun Heo

[permalink] [raw]
Subject: Re: x86: Use this_cpu_has for thermal_interrupt

On Thu, Dec 30, 2010 at 10:19:52AM -0800, H. Peter Anvin wrote:
> Acked-by: H. Peter Anvin <[email protected]>

Three patches committed to percpu#for-2.6.38. Thanks.

--
tejun