2016-03-10 01:56:26

by Davidlohr Bueso

Subject: [PATCH -tip 0/2] kernel/smp: Small csd_lock optimizations

From: Davidlohr Bueso <[email protected]>

Hi,

Justifications are in each patch. There is a slight impact (patch 2)
on some TLB-flush-intensive benchmarks (albeit these use IPI batching
nowadays). Specifically, for the pft benchmark on a 12-core box:

pft faults
                                       4.4                       4.4
                                   vanilla                       smp
Hmean    faults/cpu-1        801432.1608 (  0.00%)     795719.8859 ( -0.71%)
Hmean    faults/cpu-3        702578.6659 (  0.00%)     752796.6960 (  7.15%)
Hmean    faults/cpu-5        606080.3473 (  0.00%)     595890.0451 ( -1.68%)
Hmean    faults/cpu-7        460369.0724 (  0.00%)     485283.6343 (  5.41%)
Hmean    faults/cpu-12       294445.4701 (  0.00%)     298300.6011 (  1.31%)
Hmean    faults/cpu-18       213156.0860 (  0.00%)     213584.2741 (  0.20%)
Hmean    faults/cpu-24       153104.2995 (  0.00%)     153198.8473 (  0.06%)
Hmean    faults/sec-1        796329.3184 (  0.00%)     614222.4594 (-22.87%)
Hmean    faults/sec-3       1947806.7372 (  0.00%)    2169267.1582 ( 11.37%)
Hmean    faults/sec-5       2611152.0422 (  0.00%)    2544652.6871 ( -2.55%)
Hmean    faults/sec-7       2493705.4668 (  0.00%)    2674847.5270 (  7.26%)
Hmean    faults/sec-12      2583139.7724 (  0.00%)    2614404.6002 (  1.21%)
Hmean    faults/sec-18      2661410.8170 (  0.00%)    2683427.0703 (  0.83%)
Hmean    faults/sec-24      2670463.4814 (  0.00%)    2666221.6332 ( -0.16%)
Stddev   faults/cpu-1         27537.6676 (  0.00%)      25753.4945 (  6.48%)
Stddev   faults/cpu-3         62616.8041 (  0.00%)      44728.0990 ( 28.57%)
Stddev   faults/cpu-5         70976.9184 (  0.00%)      74720.5716 ( -5.27%)
Stddev   faults/cpu-7         47426.5952 (  0.00%)      32758.2705 ( 30.93%)
Stddev   faults/cpu-12         6951.8792 (  0.00%)       9097.0782 (-30.86%)
Stddev   faults/cpu-18         4293.1696 (  0.00%)       5826.9446 (-35.73%)
Stddev   faults/cpu-24         3195.0939 (  0.00%)       3373.7230 ( -5.59%)
Stddev   faults/sec-1         27315.3093 (  0.00%)     148601.7795 (-444.02%)
Stddev   faults/sec-3        271560.5941 (  0.00%)     193681.0177 ( 28.68%)
Stddev   faults/sec-5        429633.7378 (  0.00%)     458426.3306 ( -6.70%)
Stddev   faults/sec-7        338229.0746 (  0.00%)     226146.3450 ( 33.14%)
Stddev   faults/sec-12        57766.4604 (  0.00%)      82734.3638 (-43.22%)
Stddev   faults/sec-18       118572.1909 (  0.00%)     134966.7210 (-13.83%)
Stddev   faults/sec-24        57452.7350 (  0.00%)      57542.7755 ( -0.16%)

               4.4         4.4
           vanilla         smp
User         11.91       11.85
System      197.11      194.69
Elapsed      44.24       40.26

While the single-thread case is an outlier, overall we don't seem
to do any harm (results are within noise range). Could be give or
take, but the patches at least make some sense afaict.

Thanks!

Davidlohr Bueso (2):
kernel/smp: Explicitly inline csd_lock helpers
kernel/smp: Make csd_lock_wait be smp_cond_acquire

kernel/smp.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

--
2.1.4


2016-03-10 01:56:33

by Davidlohr Bueso

Subject: [PATCH 1/2] kernel/smp: Explicitly inline csd_lock helpers

From: Davidlohr Bueso <[email protected]>

While the compiler already tends to do this for us (except for
csd_unlock()), make it explicit. These helpers mainly deal with
the ->flags, are short-lived, and can be called, for example,
from smp_call_function_many().
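
For context, __always_inline forces inlining regardless of the
compiler's cost heuristics, whereas a plain "inline" is only a hint
that may be ignored (e.g. under CONFIG_OPTIMIZE_INLINING). A minimal
sketch of the definition, paraphrased from the kernel's
include/linux/compiler-gcc.h (the exact form varies by compiler and
kernel version):

/*
 * Unconditionally force inlining; plain "inline" is merely a hint
 * the compiler is free to ignore.
 */
#define __always_inline inline __attribute__((always_inline))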

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/smp.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 822ffb1ada3f..c91e00178f8f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -105,13 +105,13 @@ void __init call_function_init(void)
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
-static void csd_lock_wait(struct call_single_data *csd)
+static __always_inline void csd_lock_wait(struct call_single_data *csd)
{
while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
cpu_relax();
}

-static void csd_lock(struct call_single_data *csd)
+static __always_inline void csd_lock(struct call_single_data *csd)
{
csd_lock_wait(csd);
csd->flags |= CSD_FLAG_LOCK;
@@ -124,7 +124,7 @@ static void csd_lock(struct call_single_data *csd)
smp_wmb();
}

-static void csd_unlock(struct call_single_data *csd)
+static __always_inline void csd_unlock(struct call_single_data *csd)
{
WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

--
2.1.4

2016-03-10 01:56:47

by Davidlohr Bueso

Subject: [PATCH 2/2] kernel/smp: Make csd_lock_wait be smp_cond_acquire

From: Davidlohr Bueso <dave@stgolabs>

We can micro-optimize this call and mildly relax the
barrier requirements by relying on ctrl + rmb while keeping
the acquire semantics. In addition, this is pretty much
now the standard for busy-waiting under such constraints.
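
For reference, smp_cond_acquire() spins on a plain (non-acquire) load
and only issues a read barrier once the condition becomes true; the
control dependency from the loop exit combined with smp_rmb() is what
provides the acquire ordering. A sketch of its definition at the time,
paraphrased from include/linux/compiler.h (circa v4.5):

/*
 * Spin with plain loads; the ctrl dependency of the loop exit
 * plus the trailing smp_rmb() together yield ACQUIRE semantics.
 */
#define smp_cond_acquire(cond) do {             \
        while (!(cond))                         \
                cpu_relax();                    \
        smp_rmb(); /* ctrl + rmb := acquire */  \
} while (0)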

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/smp.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index c91e00178f8f..74165443c240 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -107,8 +107,7 @@ void __init call_function_init(void)
*/
static __always_inline void csd_lock_wait(struct call_single_data *csd)
{
- while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
- cpu_relax();
+ smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));
}

static __always_inline void csd_lock(struct call_single_data *csd)
--
2.1.4

2016-03-10 09:17:30

by Peter Zijlstra

Subject: Re: [PATCH -tip 0/2] kernel/smp: Small csd_lock optimizations

On Wed, Mar 09, 2016 at 05:55:34PM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <[email protected]>
>
> Hi,
>
> Justifications are in each patch. There is a slight impact (patch 2)
> on some TLB-flush-intensive benchmarks (albeit these use IPI batching
> nowadays). Specifically, for the pft benchmark on a 12-core box:

>                4.4         4.4
>            vanilla         smp
> User         11.91       11.85
> System      197.11      194.69
> Elapsed      44.24       40.26
>
> While the single-thread case is an outlier, overall we don't seem
> to do any harm (results are within noise range). Could be give or
> take, but the patches at least make some sense afaict.

Acked-by: Peter Zijlstra (Intel) <[email protected]>

Subject: [tip:locking/core] locking/csd_lock: Explicitly inline csd_lock*() helpers

Commit-ID: 90d1098478fb08a1ef166fe91622d8046869e17b
Gitweb: http://git.kernel.org/tip/90d1098478fb08a1ef166fe91622d8046869e17b
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Wed, 9 Mar 2016 17:55:35 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Mar 2016 10:28:35 +0100

locking/csd_lock: Explicitly inline csd_lock*() helpers

While the compiler already tends to do this for us (except for
csd_unlock()), make it explicit. These helpers mainly deal with
the ->flags, are short-lived, and can be called, for example,
from smp_call_function_many().

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/smp.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index d903c02..5099db1 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -105,13 +105,13 @@ void __init call_function_init(void)
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
-static void csd_lock_wait(struct call_single_data *csd)
+static __always_inline void csd_lock_wait(struct call_single_data *csd)
{
while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
cpu_relax();
}

-static void csd_lock(struct call_single_data *csd)
+static __always_inline void csd_lock(struct call_single_data *csd)
{
csd_lock_wait(csd);
csd->flags |= CSD_FLAG_LOCK;
@@ -124,7 +124,7 @@ static void csd_lock(struct call_single_data *csd)
smp_wmb();
}

-static void csd_unlock(struct call_single_data *csd)
+static __always_inline void csd_unlock(struct call_single_data *csd)
{
WARN_ON(!(csd->flags & CSD_FLAG_LOCK));


Subject: [tip:locking/core] locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()

Commit-ID: 38460a2178d225b39ade5ac66586c3733391cf86
Gitweb: http://git.kernel.org/tip/38460a2178d225b39ade5ac66586c3733391cf86
Author: Davidlohr Bueso <dave@stgolabs>
AuthorDate: Wed, 9 Mar 2016 17:55:36 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Mar 2016 10:28:35 +0100

locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()

We can micro-optimize this call and mildly relax the
barrier requirements by relying on ctrl + rmb while keeping
the acquire semantics. In addition, this is pretty much
now the standard for busy-waiting under such constraints.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/smp.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 5099db1..300d293 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -107,8 +107,7 @@ void __init call_function_init(void)
*/
static __always_inline void csd_lock_wait(struct call_single_data *csd)
{
- while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
- cpu_relax();
+ smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));
}

static __always_inline void csd_lock(struct call_single_data *csd)