2013-04-17 19:23:41

by Waiman Long

[permalink] [raw]
Subject: [PATCH v4 0/4] mutex: Improve mutex performance by doing less atomic-ops & better spinning

v3->v4
- Merge patch 4 into patch 2
- Move patch 5 forward to become patch 1

v2->v3
- Add patch 4 to remove new typedefs introduced in patch 2.
- Add patch 5 to remove SCHED_FEAT_OWNER_SPIN and move the mutex
spinning code to mutex.c.

v1->v2
- Remove the 2 mutex spinner patches and replaced it by another one
to improve the mutex spinning process.
- Remove changes made to kernel/mutex.h & localize changes in
kernel/mutex.c.
- Add an optional patch to remove architecture specific check in patch
1.

This patch set is a collection of 4 different mutex related patches
aimed at improving mutex performance especially for system with large
number of CPUs. This is achieved by doing less atomic operations and
better mutex spinning (when the CONFIG_MUTEX_SPIN_ON_OWNER is on).

Patch 1 removes SCHED_FEAT_OWNER_SPIN which was just an earlier hack
for testing purpose. It also moves the mutex spinning code back to
mutex.c.

Patch 2 reduces the number of atomic operations executed. It can
produce dramatic performance improvement in the AIM7 benchmark with
large number of CPUs. For example, there was a more than 3X improvement
in the high_systime workload with a 3.7.10 kernel on an 8-socket
x86-64 system with 80 cores. The 3.8 kernels, on the other hand,
are not mutex limited for that workload anymore. So the performance
improvement is only about 1% for the high_systime workload.

Patch 3 improves the mutex spinning process by reducing contention
among the spinners when competing for the mutex. This is done by
using a MCS lock to put the spinners in a queue so that only the
first spinner will try to acquire the mutex when it is available. This
patch showed significant performance improvement of +30% on the AIM7
fserver and new_fserver workload.

Compared with patches 2&3 in v1, the new patch 3 consistently provided
better performance improvement at high user load (1100-2000) for the
fserver and new_fserver AIM7 workloads. The old patches had around 10%
and less improvement at high user load while the new patch produced
30% better performance for the same workloads.

Patch 4 is an optional one for backing out architecture specific
check in patch 2, if so desired.

Waiman Long (4):
mutex: Move mutex spinning code from sched/core.c back to mutex.c
mutex: Make more scalable by doing less atomic operations
mutex: Queue mutex spinners with MCS lock to reduce cacheline
contention
mutex: back out architecture specific check for negative mutex count

include/linux/mutex.h | 3 +
include/linux/sched.h | 1 -
kernel/mutex.c | 151 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 45 --------------
kernel/sched/features.h | 7 --
5 files changed, 150 insertions(+), 57 deletions(-)


2013-04-17 19:23:47

by Waiman Long

[permalink] [raw]
Subject: [PATCH v4 2/4] mutex: Make more scalable by doing less atomic operations

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch will reduce the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be up to 2 order of magnitude or even
more in cost when compared with atomic_read(). By using atomic_read()
to check the value first before calling atomic_xchg(), we can avoid a
lot of unnecessary cache coherency traffic. The only downside with this
change is that a task on the slow path will have a tiny bit
less chance of getting the mutex when competing with another task
in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers to cover a larger case.

The following table shows the jobs per minutes (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the high_systime workloads
to 1/2/4/8 nodes with hyperthreading on and off.

+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+------------+-----------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to random
nature of the subtests executed. So a difference of less than +-5%
may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node count. The patch has practically no impact on 1 and 2
nodes system.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below show the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ++++++++++
kernel/mutex.c | 19 ++++++++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..bc2a0b0 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,13 @@
#else
# include <asm/mutex_64.h>
#endif
+
+#ifndef __ASM_MUTEX_H
+#define __ASM_MUTEX_H
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex count to indicate that some other threads are waiting on the
+ * mutex.
+ */
+#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 262d717..70ebd85 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -37,6 +37,17 @@
# include <asm/mutex.h>
#endif

+/*
+ * A mutex count of -1 indicates that waiters are sleeping waiting for the
+ * mutex. Some architectures can allow any negative number, not just -1, for
+ * this purpose.
+ */
+#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+#else
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -217,7 +228,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
@@ -251,7 +263,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;

- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
goto done;

lock_contended(&lock->dep_map, ip);
@@ -266,7 +278,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* that when we release the lock, we properly wake up the
* other waiters:
*/
- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
break;

/*
--
1.7.1

2013-04-17 19:23:52

by Waiman Long

[permalink] [raw]
Subject: [PATCH v4 optional 4/4] mutex: back out architecture specific check for negative mutex count

If it is confirmed that all the supported architectures can allow a
negative mutex count without incorrect behavior, we can then back
out the architecture specific change and allow the mutex count to
go to any negative number. That should further reduce contention for
non-x86 architecture.

If this is not the case, this patch should be dropped.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ----------
kernel/mutex.c | 9 ++-------
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index bc2a0b0..7d3a482 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,13 +3,3 @@
#else
# include <asm/mutex_64.h>
#endif
-
-#ifndef __ASM_MUTEX_H
-#define __ASM_MUTEX_H
-/*
- * For the x86 architecture, it allows any negative number (besides -1) in
- * the mutex count to indicate that some other threads are waiting on the
- * mutex.
- */
-#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
-#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 1dbd421..ad53a66 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -38,15 +38,10 @@
#endif

/*
- * A mutex count of -1 indicates that waiters are sleeping waiting for the
- * mutex. Some architectures can allow any negative number, not just -1, for
- * this purpose.
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
*/
-#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
-#else
-#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
-#endif

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
--
1.7.1

2013-04-17 19:24:18

by Waiman Long

[permalink] [raw]
Subject: [PATCH v4 1/4] mutex: Move mutex spinning code from sched/core.c back to mutex.c

As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature
bit was really just an early hack to make with/without mutex-spinning
testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the
mutex spinning code from kernel/sched/core.c back to kernel/mutex.c
which is where they should belong.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched.h | 1 -
kernel/mutex.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 45 ---------------------------------------------
kernel/sched/features.h | 7 -------
4 files changed, 46 insertions(+), 53 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..2d02c76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -321,7 +321,6 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);

struct nsproxy;
struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..262d717 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -95,6 +95,52 @@ void __sched mutex_lock(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock);
#endif

+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+static noinline
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+#endif
+
static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..4205354 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2999,51 +2999,6 @@ void __sched schedule_preempt_disabled(void)
preempt_disable();
}

-#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-
-static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
-{
- if (lock->owner != owner)
- return false;
-
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_ checking
- * lock->owner still matches owner, if that fails, owner might
- * point to free()d memory, if it still matches, the rcu_read_lock()
- * ensures the memory stays valid.
- */
- barrier();
-
- return owner->on_cpu;
-}
-
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
-{
- if (!sched_feat(OWNER_SPIN))
- return 0;
-
- rcu_read_lock();
- while (owner_running(lock, owner)) {
- if (need_resched())
- break;
-
- arch_mutex_cpu_relax();
- }
- rcu_read_unlock();
-
- /*
- * We break out the loop above on need_resched() and when the
- * owner changed, which is a sign for heavy contention. Return
- * success only when lock->owner is NULL.
- */
- return lock->owner == NULL;
-}
-#endif
-
#ifdef CONFIG_PREEMPT
/*
* this is the entry point to schedule() from in-kernel preemption
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1ad1d2b..99399f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,13 +46,6 @@ SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)

/*
- * Spin-wait on mutex acquisition when the mutex owner is running on
- * another cpu -- assumes that when the owner is running, it will soon
- * release the lock. Decreases scheduling overhead.
- */
-SCHED_FEAT(OWNER_SPIN, true)
-
-/*
* Decrement CPU power based on time not spent running tasks
*/
SCHED_FEAT(NONTASK_POWER, true)
--
1.7.1

2013-04-17 19:24:14

by Waiman Long

[permalink] [raw]
Subject: [PATCH v4 3/4] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
on) allow multiple tasks to spin on a single mutex concurrently. A
potential problem with the current approach is that when the mutex
becomes available, all the spinning tasks will try to acquire the
mutex more or less simultaneously. As a result, there will be a lot of
cacheline bouncing especially on systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting the
mutex spinners into a queue so that only the first one in the queue
will try to acquire the mutex. This will reduce contention and allow
all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation. This
patch will add a new field into the mutex data structure for holding
the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
and 4 bytes for 32-bit system. This overhead will be avoid if the
MUTEX_SPIN_ON_OWNER option is turned off.

The following table shows the jobs per minute (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the fserver workloads to
1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable change
in performance. Rather, there is a slight drop in performance. This
mutex spinning patch more than recovers the lost performance and show
a significant increase of +30% at high user load with the full 8 nodes.
Similar improvements were also seen in a 3.8 kernel.

The table below shows the %time spent by different kernel functions
as reported by perf when running the fserver workload at 1500 users
with all 8 nodes.

+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as shown by
the __mutex_lock_slowpath figure) which help a little bit. We saw only
a few percents improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner figure is
now split out into an additional mspin_lock figure. The time increases
from 3.42% to 11.23%. It shows a great reduction in contention among
the spinners leading to a 30% improvement. The time ratio 9.9/2.33=4.3
indicates that there are on average 4+ spinners waiting in the spin_lock
loop for each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.

The table below shows the performance change of both patches 1 & 2 over
patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in performance
probably due to the spinner locking and queuing overhead.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/mutex.h | 3 ++
kernel/mutex.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 93 insertions(+), 1 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..433da8a 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -53,6 +53,9 @@ struct mutex {
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct *owner;
#endif
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ void *spin_mlock; /* Spinner MCS lock */
+#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
void *magic;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 70ebd85..1dbd421 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -55,6 +55,9 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ lock->spin_mlock = NULL;
+#endif

debug_mutex_init(lock, name, key);
}
@@ -108,6 +111,60 @@ EXPORT_SYMBOL(mutex_lock);

#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
* Mutex spinning code migrated from kernel/sched/core.c
*/

@@ -150,6 +207,24 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
return lock->owner == NULL;
}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
#endif

static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
@@ -215,26 +290,39 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*
* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
* to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
*/
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;

for (;;) {
struct task_struct *owner;
+ struct mspin_node node;

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
+ mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner))
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
break;
+ }

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
preempt_enable();
return 0;
}
+ mspin_unlock(MLOCK(lock), &node);

/*
* When there's no owner, we might have preempted between the
@@ -253,6 +341,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

--
1.7.1

2013-04-18 03:00:24

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

On Wed, 2013-04-17 at 15:23 -0400, Waiman Long wrote:
> The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
> on) allow multiple tasks to spin on a single mutex concurrently. A
> potential problem with the current approach is that when the mutex
> becomes available, all the spinning tasks will try to acquire the
> mutex more or less simultaneously. As a result, there will be a lot of
> cacheline bouncing especially on systems with a large number of CPUs.
>
> This patch tries to reduce this kind of contention by putting the
> mutex spinners into a queue so that only the first one in the queue
> will try to acquire the mutex. This will reduce contention and allow
> all the tasks to move forward faster.
>
> The queuing of mutex spinners is done using an MCS lock based
> implementation which will further reduce contention on the mutex
> cacheline than a similar ticket spinlock based implementation. This
> patch will add a new field into the mutex data structure for holding
> the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
> and 4 bytes for 32-bit system. This overhead will be avoid if the
> MUTEX_SPIN_ON_OWNER option is turned off.
>
> The following table shows the jobs per minute (JPM) scalability data
> on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
> command is used to restrict the running of the fserver workloads to
> 1/2/4/8 nodes with hyperthreading off.
[...]
>
> The short workload is the only one that shows a decline in performance
> probably due to the spinner locking and queuing overhead.
>
> Signed-off-by: Waiman Long <[email protected]>
> Acked-by: Rik van Riel <[email protected]>

Reviewed-by: Davidlohr Bueso <[email protected]>

2013-04-19 07:45:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] mutex: Improve mutex performance by doing less atomic-ops & better spinning


* Waiman Long <[email protected]> wrote:

> v3->v4
> - Merge patch 4 into patch 2
> - Move patch 5 forward to become patch 1
>
> v2->v3
> - Add patch 4 to remove new typedefs introduced in patch 2.
> - Add patch 5 to remove SCHED_FEAT_OWNER_SPIN and move the mutex
> spinning code to mutex.c.
>
> v1->v2
> - Remove the 2 mutex spinner patches and replaced it by another one
> to improve the mutex spinning process.
> - Remove changes made to kernel/mutex.h & localize changes in
> kernel/mutex.c.
> - Add an optional patch to remove architecture specific check in patch
> 1.

The patches look pretty nice now - thanks Waiman. I have applied them to
tip:core/locking and started testing them.

Thanks,

Ingo

Subject: [tip:core/locking] mutex: Move mutex spinning code from sched/ core.c back to mutex.c

Commit-ID: 41fcb9f230bf773656d1768b73000ef720bf00c3
Gitweb: http://git.kernel.org/tip/41fcb9f230bf773656d1768b73000ef720bf00c3
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 17 Apr 2013 15:23:11 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 19 Apr 2013 09:33:34 +0200

mutex: Move mutex spinning code from sched/core.c back to mutex.c

As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
move the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c which is where they should belong.

Signed-off-by: Waiman Long <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Chandramouleeswaran Aswin <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Norton Scott J <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 -
kernel/mutex.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 45 ---------------------------------------------
kernel/sched/features.h | 7 -------
4 files changed, 46 insertions(+), 53 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..aefe45d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -320,7 +320,6 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);

struct nsproxy;
struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..262d717 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -95,6 +95,52 @@ void __sched mutex_lock(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock);
#endif

+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+static noinline
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+#endif
+
static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..b37a22b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2997,51 +2997,6 @@ void __sched schedule_preempt_disabled(void)
preempt_disable();
}

-#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-
-static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
-{
- if (lock->owner != owner)
- return false;
-
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_ checking
- * lock->owner still matches owner, if that fails, owner might
- * point to free()d memory, if it still matches, the rcu_read_lock()
- * ensures the memory stays valid.
- */
- barrier();
-
- return owner->on_cpu;
-}
-
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
-{
- if (!sched_feat(OWNER_SPIN))
- return 0;
-
- rcu_read_lock();
- while (owner_running(lock, owner)) {
- if (need_resched())
- break;
-
- arch_mutex_cpu_relax();
- }
- rcu_read_unlock();
-
- /*
- * We break out the loop above on need_resched() and when the
- * owner changed, which is a sign for heavy contention. Return
- * success only when lock->owner is NULL.
- */
- return lock->owner == NULL;
-}
-#endif
-
#ifdef CONFIG_PREEMPT
/*
* this is the entry point to schedule() from in-kernel preemption
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1ad1d2b..99399f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,13 +46,6 @@ SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)

/*
- * Spin-wait on mutex acquisition when the mutex owner is running on
- * another cpu -- assumes that when the owner is running, it will soon
- * release the lock. Decreases scheduling overhead.
- */
-SCHED_FEAT(OWNER_SPIN, true)
-
-/*
* Decrement CPU power based on time not spent running tasks
*/
SCHED_FEAT(NONTASK_POWER, true)

Subject: [tip:core/locking] mutex: Make more scalable by doing less atomic operations

Commit-ID: 0dc8c730c98a06a4d927f8d08bd0dd6de973b8dd
Gitweb: http://git.kernel.org/tip/0dc8c730c98a06a4d927f8d08bd0dd6de973b8dd
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 17 Apr 2013 15:23:12 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 19 Apr 2013 09:33:35 +0200

mutex: Make more scalable by doing less atomic operations

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a
total of three atomic read-modify-write instructions will be
issued in rapid succession. This can cause a lot of cache
bouncing when many tasks are trying to acquire the mutex at the
same time.

This patch will reduce the number of atomic_xchg instructions
used by checking the counter value first before issuing the
instruction. The atomic_read() function is just a simple memory
read. The atomic_xchg() function, on the other hand, can be up
to 2 order of magnitude or even more in cost when compared with
atomic_read(). By using atomic_read() to check the value first
before calling atomic_xchg(), we can avoid a lot of unnecessary
cache coherency traffic. The only downside with this change is
that a task on the slow path will have a tiny bit less chance of
getting the mutex when competing with another task in the fast
path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed
before calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture
can allow any negative number to be used in the mutex count to
indicate that some tasks are waiting for the mutex. I am not so
sure if that is the case for the other architectures. So the
default is to avoid atomic_xchg() if the count has already been
set to -1. For x86, the check is modified to include all
negative numbers to cover a larger case.

The following table shows the jobs per minutes (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the
high_systime workloads to 1/2/4/8 nodes with hyperthreading on
and off.

+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+------------+-----------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to
random nature of the subtests executed. So a difference of less
than +-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant
drop-off at high node count. The patch has practically no
impact on 1 and 2 nodes system.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function
by the high_systime workload at 1500 users for 2/4/8-node
configurations with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below show the improvements in other AIM7 workloads
(at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Chandramouleeswaran Aswin <[email protected]>
Cc: Norton: Scott J <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ++++++++++
kernel/mutex.c | 19 ++++++++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..bc2a0b0 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,13 @@
#else
# include <asm/mutex_64.h>
#endif
+
+#ifndef __ASM_MUTEX_H
+#define __ASM_MUTEX_H
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex count to indicate that some other threads are waiting on the
+ * mutex.
+ */
+#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 262d717..70ebd85 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -37,6 +37,17 @@
# include <asm/mutex.h>
#endif

+/*
+ * A mutex count of -1 indicates that waiters are sleeping waiting for the
+ * mutex. Some architectures can allow any negative number, not just -1, for
+ * this purpose.
+ */
+#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+#else
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -217,7 +228,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
@@ -251,7 +263,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;

- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
goto done;

lock_contended(&lock->dep_map, ip);
@@ -266,7 +278,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* that when we release the lock, we properly wake up the
* other waiters:
*/
- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
break;

/*

Subject: [tip:core/locking] mutex: Back out architecture specific check for negative mutex count

Commit-ID: cc189d2513d1f45cde87a9043fe3be28559c7490
Gitweb: http://git.kernel.org/tip/cc189d2513d1f45cde87a9043fe3be28559c7490
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 17 Apr 2013 15:23:14 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 19 Apr 2013 09:33:36 +0200

mutex: Back out architecture specific check for negative mutex count

Linus suggested that probably all the supported architectures can
allow a negative mutex count without incorrect behavior, so we can
then back out the architecture specific change and allow the
mutex count to go to any negative number. That should further
reduce contention for non-x86 architecture.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Chandramouleeswaran Aswin <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Norton Scott J <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ----------
kernel/mutex.c | 9 ++-------
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index bc2a0b0..7d3a482 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,13 +3,3 @@
#else
# include <asm/mutex_64.h>
#endif
-
-#ifndef __ASM_MUTEX_H
-#define __ASM_MUTEX_H
-/*
- * For the x86 architecture, it allows any negative number (besides -1) in
- * the mutex count to indicate that some other threads are waiting on the
- * mutex.
- */
-#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
-#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 1dbd421..ad53a66 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -38,15 +38,10 @@
#endif

/*
- * A mutex count of -1 indicates that waiters are sleeping waiting for the
- * mutex. Some architectures can allow any negative number, not just -1, for
- * this purpose.
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
*/
-#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
-#else
-#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
-#endif

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)

Subject: [tip:core/locking] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

Commit-ID: 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Gitweb: http://git.kernel.org/tip/2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 17 Apr 2013 15:23:13 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 19 Apr 2013 09:33:36 +0200

mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option
turned on) allow multiple tasks to spin on a single mutex
concurrently. A potential problem with the current approach is
that when the mutex becomes available, all the spinning tasks
will try to acquire the mutex more or less simultaneously. As a
result, there will be a lot of cacheline bouncing especially on
systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting
the mutex spinners into a queue so that only the first one in
the queue will try to acquire the mutex. This will reduce
contention and allow all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation.
This patch will add a new field into the mutex data structure
for holding the MCS lock. This expands the mutex size by 8 bytes
for 64-bit system and 4 bytes for 32-bit system. This overhead
will be avoid if the MUTEX_SPIN_ON_OWNER option is turned off.

The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the fserver
workloads to 1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable
change in performance. Rather, there is a slight drop in
performance. This mutex spinning patch more than recovers the
lost performance and show a significant increase of +30% at high
user load with the full 8 nodes. Similar improvements were also
seen in a 3.8 kernel.

The table below shows the %time spent by different kernel
functions as reported by perf when running the fserver workload
at 1500 users with all 8 nodes.

+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as
shown by the __mutex_lock_slowpath figure) which help a little
bit. We saw only a few percents improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner
figure is now split out into an additional mspin_lock figure.
The time increases from 3.42% to 11.23%. It shows a great
reduction in contention among the spinners leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there
are on average 4+ spinners waiting in the spin_lock loop for
each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.

The table below shows the performance change of both patches 1 &
2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in
performance probably due to the spinner locking and queuing
overhead.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Chandramouleeswaran Aswin <[email protected]>
Cc: Norton Scott J <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mutex.h | 3 ++
kernel/mutex.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..433da8a 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -53,6 +53,9 @@ struct mutex {
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct *owner;
#endif
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ void *spin_mlock; /* Spinner MCS lock */
+#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
void *magic;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 70ebd85..1dbd421 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -55,6 +55,9 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ lock->spin_mlock = NULL;
+#endif

debug_mutex_init(lock, name, key);
}
@@ -108,6 +111,60 @@ EXPORT_SYMBOL(mutex_lock);

#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
* Mutex spinning code migrated from kernel/sched/core.c
*/

@@ -150,6 +207,24 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
return lock->owner == NULL;
}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
#endif

static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
@@ -215,26 +290,39 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*
* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
* to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
*/
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;

for (;;) {
struct task_struct *owner;
+ struct mspin_node node;

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
+ mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner))
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
break;
+ }

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
preempt_enable();
return 0;
}
+ mspin_unlock(MLOCK(lock), &node);

/*
* When there's no owner, we might have preempted between the
@@ -253,6 +341,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);