2013-04-15 14:38:30

by Waiman Long

Subject: [PATCH 0/3 v2] mutex: Improve mutex performance by doing less atomic-ops & better spinning

v1->v2
- Remove the 2 mutex spinner patches and replace them with another one
to improve the mutex spinning process.
- Remove changes made to kernel/mutex.h & localize changes in
kernel/mutex.c.
- Add an optional patch to remove architecture specific check in patch
1.

This patch set is a collection of 3 different mutex related patches
aimed at improving mutex performance, especially for systems with a
large number of CPUs. This is achieved by doing fewer atomic operations
and better mutex spinning (when CONFIG_MUTEX_SPIN_ON_OWNER is on).

The first patch reduces the number of atomic operations executed. It
can produce a dramatic performance improvement in the AIM7 benchmark
on systems with a large number of CPUs. For example, there was a more
than 3X improvement in the high_systime workload with a 3.7.10 kernel
on an 8-socket x86-64 system with 80 cores. The 3.8 kernels, on the
other hand, are no longer mutex limited for that workload, so the
performance improvement is only about 1% for the high_systime workload.

Patch 2 improves the mutex spinning process by reducing contention
among the spinners when competing for the mutex. This is done by
using an MCS lock to put the spinners in a queue so that only the
first spinner will try to acquire the mutex when it is available. This
patch showed a significant performance improvement of +30% on the AIM7
fserver and new_fserver workloads.

The last patch is an optional one for backing out architecture specific
check in patch 1, if so desired.

Waiman Long (3):
mutex: Make more scalable by doing less atomic operations
mutex: Queue mutex spinners with MCS lock to reduce cacheline
contention
mutex: back out architecture specific check for negative mutex count

include/linux/mutex.h | 3 ++
include/linux/sched.h | 3 ++
kernel/mutex.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/core.c | 24 +++++++++++--
4 files changed, 115 insertions(+), 7 deletions(-)


2013-04-15 14:38:32

by Waiman Long

Subject: [PATCH v2 optional 3/3] mutex: back out architecture specific check for negative mutex count

If it is confirmed that all the supported architectures can allow a
negative mutex count without incorrect behavior, we can then back
out the architecture specific change and allow the mutex count to
go to any negative number. That should further reduce contention for
non-x86 architectures.

If this is not the case, this patch should be dropped.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ----------
kernel/mutex.c | 9 ++-------
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index bc2a0b0..7d3a482 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,13 +3,3 @@
#else
# include <asm/mutex_64.h>
#endif
-
-#ifndef __ASM_MUTEX_H
-#define __ASM_MUTEX_H
-/*
- * For the x86 architecture, it allows any negative number (besides -1) in
- * the mutex count to indicate that some other threads are waiting on the
- * mutex.
- */
-#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
-#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index e6a90de..5600bdf 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -38,15 +38,10 @@
#endif

/*
- * A mutex count of -1 indicates that waiters are sleeping waiting for the
- * mutex. Some architectures can allow any negative number, not just -1, for
- * this purpose.
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
*/
-#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
-#else
-#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
-#endif

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
--
1.7.1

2013-04-15 14:38:50

by Waiman Long

Subject: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
on) allows multiple tasks to spin on a single mutex concurrently. A
potential problem with the current approach is that when the mutex
becomes available, all the spinning tasks will try to acquire the
mutex more or less simultaneously. As a result, there will be a lot of
cacheline bouncing especially on systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting the
mutex spinners into a queue so that only the first one in the queue
will try to acquire the mutex. This will reduce contention and allow
all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation, which reduces contention on the mutex cacheline
further than a similar ticket spinlock based implementation would. This
patch adds a new field to the mutex data structure for holding
the MCS lock. This expands the mutex size by 8 bytes on 64-bit systems
and 4 bytes on 32-bit systems. This overhead is avoided if the
MUTEX_SPIN_ON_OWNER option is turned off.
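
For readers unfamiliar with MCS locks, the queuing idea described above
can be sketched in user space with C11 atomics. This is only an
illustration under stated assumptions, not the kernel code in this
patch: the names mcs_lock, mcs_node, mcs_lock_acquire() and
mcs_lock_release() are hypothetical, and C11 acquire/release orderings
stand in for the kernel's xchg()/cmpxchg()/smp_wmb() primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-spinner queue node (user-space analogue, not kernel code). */
struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_int locked;		/* 1 once the lock has been handed to us */
};

/* The lock is just a pointer to the queue tail; NULL means free. */
struct mcs_lock {
	_Atomic(struct mcs_node *) tail;
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, 0, memory_order_relaxed);

	/* Atomically append ourselves at the tail of the queue. */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL) {
		/* Queue was empty: we own the lock. */
		atomic_store_explicit(&node->locked, 1, memory_order_relaxed);
		return;
	}

	/* Link behind the previous tail, then spin on our own node only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;	/* a real implementation would issue a CPU relax hint */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next;

	/* Only the current holder may release. */
	assert(atomic_load_explicit(&node->locked, memory_order_relaxed) == 1);

	next = atomic_load_explicit(&node->next, memory_order_acquire);
	if (next == NULL) {
		struct mcs_node *expected = node;

		/* No known successor: try to mark the lock free. */
		if (atomic_compare_exchange_strong_explicit(&lock->tail,
				&expected, NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;

		/* A successor swapped itself in; wait until it links to us. */
		while ((next = atomic_load_explicit(&node->next,
				memory_order_acquire)) == NULL)
			;
	}

	/* Hand the lock to the successor. */
	atomic_store_explicit(&next->locked, 1, memory_order_release);
}
```

Each waiter spins on the locked flag in its own node, so a release
touches only the successor's cacheline instead of one that every
spinner is polling.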

The following table shows the jobs per minute (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the fserver workloads to
1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable change
in performance. Rather, there is a slight drop in performance. This
mutex spinning patch more than recovers the lost performance and shows
a significant increase of +30% at high user load with the full 8 nodes.
Similar improvements were also seen in a 3.8 kernel.

The table below shows the %time spent by different kernel functions
as reported by perf when running the fserver workload at 1500 users
with all 8 nodes.

+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as shown by
the __mutex_lock_slowpath figure), which helps a little bit. We saw only
a few percent improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner figure is
now split out into an additional mspin_lock figure. The combined time
increases from 3.42% to 12.23% (2.33% + 9.90%). It shows a great
reduction in contention among the spinners, leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there are on
average 4+ spinners waiting in the mspin_lock loop for each spinner in
the mutex_spin_on_owner loop. Contention in other locking functions
also goes down by quite a lot.

The table below shows the performance change of both patches 1 & 2 over
patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in performance
probably due to the spinner locking and queuing overhead.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/mutex.h | 3 ++
include/linux/sched.h | 3 ++
kernel/mutex.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 24 +++++++++++++--
4 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..433da8a 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -53,6 +53,9 @@ struct mutex {
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct *owner;
#endif
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ void *spin_mlock; /* Spinner MCS lock */
+#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
void *magic;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..8af6f13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -320,7 +320,10 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
+extern int mutex_can_spin_on_owner(struct mutex *lock);
+#endif

struct nsproxy;
struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index e3496df..e6a90de 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -55,6 +55,9 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ lock->spin_mlock = NULL;
+#endif

debug_mutex_init(lock, name, key);
}
@@ -106,6 +109,65 @@ void __sched mutex_lock(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock);
#endif

+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+typedef struct mspin_node {
+ struct mspin_node *next;
+ int locked; /* 1 if lock acquired */
+} mspin_node_t;
+
+typedef mspin_node_t *mspin_lock_t;
+
+#define MLOCK(mutex) ((mspin_lock_t *)&((mutex)->spin_mlock))
+
+static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
+{
+ mspin_node_t *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
+{
+ mspin_node_t *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ barrier();
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+#endif
+
static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);

/**
@@ -169,26 +231,39 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*
* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
* to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
*/
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;

for (;;) {
struct task_struct *owner;
+ mspin_node_t node;

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
+ mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner))
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
break;
+ }

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
preempt_enable();
return 0;
}
+ mspin_unlock(MLOCK(lock), &node);

/*
* When there's no owner, we might have preempted between the
@@ -207,6 +282,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..176e82a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3021,9 +3021,6 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
*/
int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- if (!sched_feat(OWNER_SPIN))
- return 0;
-
rcu_read_lock();
while (owner_running(lock, owner)) {
if (need_resched())
@@ -3040,6 +3037,27 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
return lock->owner == NULL;
}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ if (!sched_feat(OWNER_SPIN))
+ return 0;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
#endif

#ifdef CONFIG_PREEMPT
--
1.7.1

2013-04-15 14:39:12

by Waiman Long

Subject: [PATCH v2 1/3] mutex: Make more scalable by doing less atomic operations

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch will reduce the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be up to two orders of magnitude or even
more in cost when compared with atomic_read(). By using atomic_read()
to check the value first before calling atomic_xchg(), we can avoid a
lot of unnecessary cache coherency traffic. The only downside with this
change is that a task on the slow path will have a tiny bit
less chance of getting the mutex when competing with another task
in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().
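
The check-before-RMW idea described above can be illustrated in user
space with C11 atomics. This is a minimal sketch, not the kernel code:
try_acquire_checked() and release_lock() are hypothetical names, and
the only thing borrowed from the mutex code is the count convention
(1 = unlocked, 0 = locked).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Count convention borrowed from the mutex code: 1 = unlocked, 0 = locked. */
static bool try_acquire_checked(atomic_int *count)
{
	int expected = 1;

	/* Cheap plain load first: skip the RMW when the lock is visibly taken. */
	if (atomic_load_explicit(count, memory_order_relaxed) != 1)
		return false;

	/* Only now pay for the atomic read-modify-write. */
	return atomic_compare_exchange_strong_explicit(count, &expected, 0,
			memory_order_acquire, memory_order_relaxed);
}

static void release_lock(atomic_int *count)
{
	/* Only a held lock (count != 1) may be released. */
	assert(atomic_load_explicit(count, memory_order_relaxed) != 1);
	atomic_store_explicit(count, 1, memory_order_release);
}
```

The load does not make the acquisition any more likely to succeed. It
only avoids taking the cacheline exclusive when the compare-and-swap
would fail anyway.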

The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers, which covers more cases.
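
To make the two variants of the waiter check concrete, here is a small
user-space sketch with C11 atomics. It is an illustration only: the
function names and the any_negative_ok parameter are hypothetical
stand-ins for the compile-time __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
switch, and the count convention (1 = unlocked, 0 = locked, negative =
locked with waiters) is taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Count convention: 1 = unlocked, 0 = locked, <0 = locked with waiters. */
static bool show_no_waiter(atomic_int *count, bool any_negative_ok)
{
	int c = atomic_load_explicit(count, memory_order_relaxed);

	assert(c <= 1);		/* the count never exceeds "unlocked" */

	/* Either treat every negative value as "waiters present", or only -1. */
	return any_negative_ok ? (c >= 0) : (c != -1);
}

/* Slow-path attempt: advertise a waiter and report whether we got the lock. */
static bool slowpath_try_acquire(atomic_int *count, bool any_negative_ok)
{
	if (!show_no_waiter(count, any_negative_ok))
		return false;	/* waiters already advertised: skip the xchg */

	return atomic_exchange_explicit(count, -1, memory_order_acquire) == 1;
}
```

With the default check, a count of -3 still triggers the exchange
(and is reset to -1), whereas the any-negative variant skips it.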

The following table shows the jobs per minute (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the high_systime workloads
to 1/2/4/8 nodes with hyperthreading on and off.

+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+-----------+------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

The AIM7 benchmark has a pretty large run-to-run variance due to the
random nature of the subtests executed. So a difference of less than
+/-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node count. The patch has practically no impact on 1- and
2-node systems.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ++++++++++
kernel/mutex.c | 19 ++++++++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..bc2a0b0 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,13 @@
#else
# include <asm/mutex_64.h>
#endif
+
+#ifndef __ASM_MUTEX_H
+#define __ASM_MUTEX_H
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex count to indicate that some other threads are waiting on the
+ * mutex.
+ */
+#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..e3496df 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -37,6 +37,17 @@
# include <asm/mutex.h>
#endif

+/*
+ * A mutex count of -1 indicates that waiters are sleeping waiting for the
+ * mutex. Some architectures can allow any negative number, not just -1, for
+ * this purpose.
+ */
+#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+#else
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -171,7 +182,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
@@ -205,7 +217,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;

- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
goto done;

lock_contended(&lock->dep_map, ip);
@@ -220,7 +232,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* that when we release the lock, we properly wake up the
* other waiters:
*/
- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
break;

/*
--
1.7.1

2013-04-15 14:46:24

by Rik van Riel

Subject: Re: [PATCH v2 1/3] mutex: Make more scalable by doing less atomic operations

On 04/15/2013 10:37 AM, Waiman Long wrote:
> In the __mutex_lock_common() function, an initial entry into
> the lock slow path will cause two atomic_xchg instructions to be
> issued. Together with the atomic decrement in the fast path, a total
> of three atomic read-modify-write instructions will be issued in
> rapid succession. This can cause a lot of cache bouncing when many
> tasks are trying to acquire the mutex at the same time.
>
> This patch will reduce the number of atomic_xchg instructions used by
> checking the counter value first before issuing the instruction. The
> atomic_read() function is just a simple memory read. The atomic_xchg()
> function, on the other hand, can be up to 2 order of magnitude or even
> more in cost when compared with atomic_read(). By using atomic_read()
> to check the value first before calling atomic_xchg(), we can avoid a
> lot of unnecessary cache coherency traffic. The only downside with this
> change is that a task on the slow path will have a tiny bit
> less chance of getting the mutex when competing with another task
> in the fast path.
>

> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Davidlohr Bueso <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>


--
All rights reversed

2013-04-15 16:28:34

by Rik van Riel

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

On 04/15/2013 10:37 AM, Waiman Long wrote:
> The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
> on) allow multiple tasks to spin on a single mutex concurrently. A
> potential problem with the current approach is that when the mutex
> becomes available, all the spinning tasks will try to acquire the
> mutex more or less simultaneously. As a result, there will be a lot of
> cacheline bouncing especially on systems with a large number of CPUs.
>
> This patch tries to reduce this kind of contention by putting the
> mutex spinners into a queue so that only the first one in the queue
> will try to acquire the mutex. This will reduce contention and allow
> all the tasks to move forward faster.
>
> The queuing of mutex spinners is done using an MCS lock based
> implementation which will further reduce contention on the mutex
> cacheline than a similar ticket spinlock based implementation. This
> patch will add a new field into the mutex data structure for holding
> the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
> and 4 bytes for 32-bit system. This overhead will be avoid if the
> MUTEX_SPIN_ON_OWNER option is turned off.

> Signed-off-by: Waiman Long <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-04-16 04:24:43

by Davidlohr Bueso

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

On Mon, 2013-04-15 at 10:37 -0400, Waiman Long wrote:
[...]
> +typedef struct mspin_node {
> + struct mspin_node *next;
> + int locked; /* 1 if lock acquired */
> +} mspin_node_t;
> +
> +typedef mspin_node_t *mspin_lock_t;

I think we could do without the typedefs, especially mspin_lock_t.

> +
> +#define MLOCK(mutex) ((mspin_lock_t *)&((mutex)->spin_mlock))
> +
> +static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
> +{
> + mspin_node_t *prev;
> +
> + /* Init node */
> + node->locked = 0;
> + node->next = NULL;
> +
> + prev = xchg(lock, node);
> + if (likely(prev == NULL)) {
> + /* Lock acquired */
> + node->locked = 1;
> + return;
> + }
> + ACCESS_ONCE(prev->next) = node;
> + smp_wmb();
> + /* Wait until the lock holder passes the lock down */
> + while (!ACCESS_ONCE(node->locked))
> + arch_mutex_cpu_relax();
> +}
> +
> +static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
> +{
> + mspin_node_t *next = ACCESS_ONCE(node->next);
> +
> + if (likely(!next)) {
> + /*
> + * Release the lock by setting it to NULL
> + */
> + if (cmpxchg(lock, node, NULL) == node)
> + return;
> + /* Wait until the next pointer is set */
> + while (!(next = ACCESS_ONCE(node->next)))
> + arch_mutex_cpu_relax();
> + }
> + barrier();
> + ACCESS_ONCE(next->locked) = 1;
> + smp_wmb();

Do we really need the compiler barrier call? The CPUs can reorder
anyway. I assume the smp_wmb() call makes sure there's no funny
business before the next lock is acquired, might be worth commenting.

Thanks,
Davidlohr

2013-04-16 09:10:34

by Ingo Molnar

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention


* Waiman Long <[email protected]> wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3021,9 +3021,6 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
> */
> int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
> {
> - if (!sched_feat(OWNER_SPIN))
> - return 0;
> -
> rcu_read_lock();
> while (owner_running(lock, owner)) {
> if (need_resched())
> @@ -3040,6 +3037,27 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
> */
> return lock->owner == NULL;
> }
> +
> +/*
> + * Initial check for entering the mutex spinning loop
> + */
> +int mutex_can_spin_on_owner(struct mutex *lock)
> +{
> + int retval = 1;
> +
> + if (!sched_feat(OWNER_SPIN))
> + return 0;
> +
> + rcu_read_lock();
> + if (lock->owner)
> + retval = lock->owner->on_cpu;
> + rcu_read_unlock();
> + /*
> + * if lock->owner is not set, the mutex owner may have just acquired
> + * it and not set the owner yet or the mutex has been released.
> + */
> + return retval;
> +}

The SCHED_FEAT_OWNER_SPIN was really just an early hack we did to make
with/without mutex-spinning testable.

I'd suggest a preparatory patch that gets rid of that flag and moves these two
functions from sched/core.c to mutex.c where they belong.

This will also allow the removal of the mutex prototypes from sched.h.

Thanks,

Ingo

2013-04-16 09:13:07

by Ingo Molnar

Subject: Re: [PATCH 0/3 v2] mutex: Improve mutex performance by doing less atomic-ops & better spinning


* Waiman Long <[email protected]> wrote:

> [...]
>
> Patches 2 improves the mutex spinning process by reducing contention among the
> spinners when competing for the mutex. This is done by using a MCS lock to put
> the spinners in a queue so that only the first spinner will try to acquire the
> mutex when it is available. This patch showed significant performance
> improvement of +30% on the AIM7 fserver and new_fserver workload.

Ok, that's really nice - and this approach has no arbitrary limits/tunings in it.

Do you have a performance comparison to your first series (patches 1+2+3 IIRC) -
how does this new series with MCS locking compare to the best previous result from
that old series? Do we now achieve that level of performance?

Thanks,

Ingo

2013-04-16 10:05:32

by Will Deacon

Subject: Re: [PATCH v2 optional 3/3] mutex: back out architecture specific check for negative mutex count

On Mon, Apr 15, 2013 at 03:37:59PM +0100, Waiman Long wrote:
> If it is confirmed that all the supported architectures can allow a
> negative mutex count without incorrect behavior, we can then back
> out the architecture specific change and allow the mutex count to
> go to any negative number. That should further reduce contention for
> non-x86 architecture.
>
> If this is not the case, this patch should be dropped.

A good starting point might be to look at the asm-generic mutex
implementations, which clears up the majority of architectures. A cursory
glance at mutex-dec.h suggests that it's OK to me...

Will

2013-04-16 11:49:55

by Waiman Long

Subject: Re: [PATCH 0/3 v2] mutex: Improve mutex performance by doing less atomic-ops & better spinning

On 04/16/2013 05:12 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> [...]
>>
>> Patches 2 improves the mutex spinning process by reducing contention among the
>> spinners when competing for the mutex. This is done by using a MCS lock to put
>> the spinners in a queue so that only the first spinner will try to acquire the
>> mutex when it is available. This patch showed significant performance
>> improvement of +30% on the AIM7 fserver and new_fserver workload.
> Ok, that's really nice - and this approach has no arbitrary limits/tunings in it.
>
> Do you have a performance comparison to your first series (patches 1+2+3 IIRC) -
> how does this new series with MCS locking compare to the best previous result from
> that old series? Do we now achieve that level of performance?

Compared with the old patch set, the new patches 1+2 show a performance
gain of over 30% at high user load (1100-1500) in the fserver and
new_fserver workloads. The old patches 1+2 or 1+3 only manage around
a 10% gain. In the intermediate range of 200-1000, the 2 sets are more
comparable in performance gain.

Regards,
Longman

2013-04-16 12:05:31

by Waiman Long

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

On 04/16/2013 12:24 AM, Davidlohr Bueso wrote:
> On Mon, 2013-04-15 at 10:37 -0400, Waiman Long wrote:
> [...]
>> +typedef struct mspin_node {
>> + struct mspin_node *next;
>> + int locked; /* 1 if lock acquired */
>> +} mspin_node_t;
>> +
>> +typedef mspin_node_t *mspin_lock_t;
> I think we could do without the typedefs, especially mspin_lock_t.
Yes, we can do without the typedefs.

>> +
>> +#define MLOCK(mutex) ((mspin_lock_t *)&((mutex)->spin_mlock))
>> +
>> +static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
>> +{
>> +	mspin_node_t *prev;
>> +
>> +	/* Init node */
>> +	node->locked = 0;
>> +	node->next = NULL;
>> +
>> +	prev = xchg(lock, node);
>> +	if (likely(prev == NULL)) {
>> +		/* Lock acquired */
>> +		node->locked = 1;
>> +		return;
>> +	}
>> +	ACCESS_ONCE(prev->next) = node;
>> +	smp_wmb();
>> +	/* Wait until the lock holder passes the lock down */
>> +	while (!ACCESS_ONCE(node->locked))
>> +		arch_mutex_cpu_relax();
>> +}
>> +
>> +static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
>> +{
>> +	mspin_node_t *next = ACCESS_ONCE(node->next);
>> +
>> +	if (likely(!next)) {
>> +		/*
>> +		 * Release the lock by setting it to NULL
>> +		 */
>> +		if (cmpxchg(lock, node, NULL) == node)
>> +			return;
>> +		/* Wait until the next pointer is set */
>> +		while (!(next = ACCESS_ONCE(node->next)))
>> +			arch_mutex_cpu_relax();
>> +	}
>> +	barrier();
>> +	ACCESS_ONCE(next->locked) = 1;
>> +	smp_wmb();
> Do we really need the compiler barrier call? The CPUs can reorder
> anyway. I assume the smp_wmb() call makes sure there's no funny
> business before the next lock is acquired, might be worth commenting.

The smp_wmb() calls are to make sure that the writes are committed to
memory rather than staying in the cache only. They are safety measures.
The barrier() call probably is not needed because of the next pointer
data dependency, but it doesn't have an actual cost either as it doesn't
translate to any instruction.

Regards,
Longman
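
For readers following the queueing and barrier discussion above, here is a
minimal userspace sketch of the same MCS handoff, written with C11 atomics
and pthreads instead of the kernel's xchg()/cmpxchg()/ACCESS_ONCE(). It is
not the patch itself: the names mcs_node, mcs_lock, mcs_acquire, mcs_release
and mcs_demo are made up for illustration, and sched_yield() is only a
stand-in for arch_mutex_cpu_relax(). In the C11 model the release store on
the handoff flag, paired with the acquire load in the spinner, expresses an
ordering guarantee between the critical section and the handoff; it says
nothing about flushing caches.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 16

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_int locked;		/* 1 once the lock is handed to this node */
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;	/* last queued node, NULL if free */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, 0, memory_order_relaxed);

	/* Swap ourselves in as the new tail; the old tail is our predecessor. */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;			/* queue was empty: lock acquired */

	/* Publish our node to the predecessor, then spin on our own flag only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		sched_yield();		/* userspace stand-in for cpu_relax */
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load_explicit(&node->next, memory_order_acquire);

	if (next == NULL) {
		struct mcs_node *expected = node;

		/* No successor visible: try to reset the tail to NULL. */
		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected, NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;
		/* A successor swapped itself in; wait for it to link to us. */
		while ((next = atomic_load_explicit(&node->next, memory_order_acquire)) == NULL)
			sched_yield();
	}
	/* Hand the lock down; release ordering publishes the critical section. */
	atomic_store_explicit(&next->locked, 1, memory_order_release);
}

struct worker_arg {
	struct mcs_lock *lock;
	long *counter;
	int iters;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	struct mcs_node node;		/* one queue node per acquirer */

	for (int i = 0; i < arg->iters; i++) {
		mcs_acquire(arg->lock, &node);
		(*arg->counter)++;	/* plain increment, protected by the lock */
		mcs_release(arg->lock, &node);
	}
	return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
static long mcs_demo(int nthreads, int iters)
{
	struct mcs_lock lock;
	struct worker_arg arg = { &lock, NULL, iters };
	pthread_t tid[MAX_THREADS];
	long counter = 0;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	atomic_init(&lock.tail, NULL);
	arg.counter = &counter;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, &arg);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return counter;
}
```

Built with -pthread, a call such as mcs_demo(4, 2000) should return 8000 if
the lock provides mutual exclusion. Each waiter spins only on its own node,
which is the property the patch relies on to keep spinners off the mutex
cacheline.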

2013-04-16 12:10:58

by Waiman Long

Subject: Re: [PATCH v2 optional 3/3] mutex: back out architecture specific check for negative mutex count

On 04/16/2013 06:05 AM, Will Deacon wrote:
> On Mon, Apr 15, 2013 at 03:37:59PM +0100, Waiman Long wrote:
>> If it is confirmed that all the supported architectures can allow a
>> negative mutex count without incorrect behavior, we can then back
>> out the architecture specific change and allow the mutex count to
>> go to any negative number. That should further reduce contention for
>> non-x86 architecture.
>>
>> If this is not the case, this patch should be dropped.
> A good starting point might be to look at the asm-generic mutex
> implementations, which clears up the majority of architectures. A cursory
> glance at mutex-dec.h suggests that it's OK to me...

I think the generic version is fine with a negative mutex count. However,
it is the architecture-specific versions (we have 22 of them as of 3.8)
that I am worried about. I just don't have enough know-how and test
machines to verify that.

Regards,
Longman
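
To illustrate why a decrement-and-test fast path tolerates arbitrarily
negative counts, here is a small userspace model using C11 atomics. It is
only a sketch in the spirit of the generic decrement-based fast paths, not a
copy of any kernel header; the names model_mutex, model_fastpath_lock,
model_fastpath_unlock and model_selfcheck are hypothetical. The point is
that both paths test only the sign of the new count, so -1 and -6 behave the
same. Whether each architecture-specific implementation has that property
is exactly what would need checking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* count: 1 = unlocked, 0 = locked, negative = locked with possible waiters */
struct model_mutex {
	atomic_int count;
};

/* Returns true if the fast path took the lock, false if the slow path is needed. */
static bool model_fastpath_lock(struct model_mutex *m)
{
	/* atomic_fetch_sub() returns the old value, so old - 1 is the new count */
	return atomic_fetch_sub(&m->count, 1) - 1 >= 0;
}

/* Returns true if nobody needs waking, false if the slow path is needed. */
static bool model_fastpath_unlock(struct model_mutex *m)
{
	/* atomic_fetch_add() returns the old value, so old + 1 is the new count */
	return atomic_fetch_add(&m->count, 1) + 1 > 0;
}

/* Walks the count below -1 and checks that only its sign is consulted. */
static void model_selfcheck(void)
{
	struct model_mutex m;

	atomic_init(&m.count, 1);
	bool first = model_fastpath_lock(&m);	/* 1 -> 0: acquired */
	bool second = model_fastpath_lock(&m);	/* 0 -> -1: contended */
	bool third = model_fastpath_lock(&m);	/* -1 -> -2: still contended */
	bool quiet = model_fastpath_unlock(&m);	/* -2 -> -1: waiters may exist */

	assert(first && !second && !third && !quiet);
	assert(atomic_load(&m.count) == -1);
}
```

The model leaves out the slow paths entirely, so it says nothing about how
the count is reset once waiters have been handled.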

2013-04-16 13:09:58

by Ingo Molnar

Subject: Re: [PATCH 0/3 v2] mutex: Improve mutex performance by doing less atomic-ops & better spinning


* Waiman Long <[email protected]> wrote:

> On 04/16/2013 05:12 AM, Ingo Molnar wrote:
> >* Waiman Long<[email protected]> wrote:
> >
> >>[...]
> >>
> >>Patch 2 improves the mutex spinning process by reducing contention among the
> >>spinners when competing for the mutex. This is done by using an MCS lock to put
> >>the spinners in a queue so that only the first spinner will try to acquire the
> >>mutex when it is available. This patch showed significant performance
> >>improvement of +30% on the AIM7 fserver and new_fserver workloads.
> >Ok, that's really nice - and this approach has no arbitrary limits/tunings in it.
> >
> >Do you have a performance comparison to your first series (patches 1+2+3 IIRC) -
> >how does this new series with MCS locking compare to the best previous result from
> >that old series? Do we now achieve that level of performance?
>
> Compared with the old patch set, the new patches 1+2 show a performance
> gain of over 30% under high user load (1100-1500) in the fserver and
> new_fserver workloads. The old patches 1+2 or 1+3 manage only
> around a 10% gain. In the intermediate range of 200-1000, the two sets
> are more comparable in performance gain.

Ok, that's cool!

Thanks,

Ingo

2013-04-16 14:26:37

by Waiman Long

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

On 04/16/2013 05:10 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3021,9 +3021,6 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
>>   */
>>  int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
>>  {
>> -	if (!sched_feat(OWNER_SPIN))
>> -		return 0;
>> -
>>  	rcu_read_lock();
>>  	while (owner_running(lock, owner)) {
>>  		if (need_resched())
>> @@ -3040,6 +3037,27 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
>>  	 */
>>  	return lock->owner == NULL;
>>  }
>> +
>> +/*
>> + * Initial check for entering the mutex spinning loop
>> + */
>> +int mutex_can_spin_on_owner(struct mutex *lock)
>> +{
>> +	int retval = 1;
>> +
>> +	if (!sched_feat(OWNER_SPIN))
>> +		return 0;
>> +
>> +	rcu_read_lock();
>> +	if (lock->owner)
>> +		retval = lock->owner->on_cpu;
>> +	rcu_read_unlock();
>> +	/*
>> +	 * if lock->owner is not set, the mutex owner may have just acquired
>> +	 * it and not set the owner yet or the mutex has been released.
>> +	 */
>> +	return retval;
>> +}
> The SCHED_FEAT_OWNER_SPIN was really just an early hack we did to make
> with/without mutex-spinning testable.
I see.

> I'd suggest a preparatory patch that gets rid of that flag and moves these two
> functions from sched/core.c to mutex.c where they belong.
>
> This will also allow the removal of the mutex prototypes from sched.h.

Yes, I can certainly prepare a patch to remove SCHED_FEAT_OWNER_SPIN &
move those functions back to mutex.c after my patch set goes in. As for
the timing, do you want me to do it now, or can it wait? I will start
my vacation later this week and will be back by the end of the month.

Regards,
Longman
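
As a side note on the pre-spin check quoted above: the posted
mutex_can_spin_on_owner() tests lock->owner and then reads lock->owner a
second time to dereference it, so the two reads could observe different
values if the owner releases the mutex in between. One possible way to
avoid that, sketched here as a userspace model with C11 atomics, is to load
the owner pointer once into a local. The names model_task, model_lock,
model_can_spin_on_owner and model_owner_selfcheck are hypothetical, and the
model ignores the RCU protection the kernel relies on to keep the task
structure alive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	atomic_int on_cpu;		/* nonzero while the task is running */
};

struct model_lock {
	_Atomic(struct model_task *) owner;	/* NULL when no owner is recorded */
};

/* Returns true if spinning looks worthwhile, false if the owner is off-CPU. */
static bool model_can_spin_on_owner(struct model_lock *lock)
{
	/* Single load: the NULL test and the dereference use the same snapshot. */
	struct model_task *owner =
		atomic_load_explicit(&lock->owner, memory_order_relaxed);

	/*
	 * No owner recorded: the mutex may have just been acquired or
	 * released, so allow spinning, as the quoted patch does.
	 */
	if (owner == NULL)
		return true;
	return atomic_load_explicit(&owner->on_cpu, memory_order_relaxed) != 0;
}

/* Exercises the three cases: no owner, running owner, sleeping owner. */
static void model_owner_selfcheck(void)
{
	struct model_task task;
	struct model_lock lock;

	atomic_init(&task.on_cpu, 1);
	atomic_init(&lock.owner, NULL);
	bool no_owner = model_can_spin_on_owner(&lock);

	atomic_store(&lock.owner, &task);
	bool running = model_can_spin_on_owner(&lock);

	atomic_store(&task.on_cpu, 0);
	bool sleeping = model_can_spin_on_owner(&lock);

	assert(no_owner && running && !sleeping);
}
```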

2013-04-17 07:51:01

by Ingo Molnar

Subject: Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention


* Waiman Long <[email protected]> wrote:

> > I'd suggest a preparatory patch that gets rid of that flag and moves these two
> > functions from sched/core.c to mutex.c where they belong.
> >
> > This will also allow the removal of the mutex prototypes from sched.h.
>
> Yes, I can certainly prepare a patch to remove SCHED_FEAT_OWNER_SPIN & move
> those functions back to mutex.c after my patch set goes in. As for the timing,
> do you want me to do it now, or can it wait? I will start my vacation later
> this week and will be back by the end of the month.

I'd suggest waiting with the latest series until you are back, so that you can
address regressions, should there be any.

Thanks,

Ingo