LinuxLists.cc - [PATCH v3 0/5] mutex: Improve mutex performance by doing less atomic-ops & better spinning

2013-04-17 04:28:39

Subject: [PATCH v3 0/5] mutex: Improve mutex performance by doing less atomic-ops & better spinning

v2->v3
- Add patch 4 to remove new typedefs introduced in patch 2.
- Add patch 5 to remove SCHED_FEAT_OWNER_SPIN and move the mutex
spinning code to mutex.c.

v1->v2
- Remove the 2 mutex spinner patches and replaced it by another one
to improve the mutex spinning process.
- Remove changes made to kernel/mutex.h & localize changes in
kernel/mutex.c.
- Add an optional patch to remove architecture specific check in patch
1.

This patch set is a collection of 5 different mutex related patches
aimed at improving mutex performance especially for system with large
number of CPUs. This is achieved by doing less atomic operations and
better mutex spinning (when the CONFIG_MUTEX_SPIN_ON_OWNER is on).

The first patch reduces the number of atomic operations executed. It
can produce dramatic performance improvement in the AIM7 benchmark
with large number of CPUs. For example, there was a more than 3X
improvement in the high_systime workload with a 3.7.10 kernel on
an 8-socket x86-64 system with 80 cores. The 3.8 kernels, on the
other hand, are not mutex limited for that workload anymore. So the
performance improvement is only about 1% for the high_systime workload.

Patch 2 improves the mutex spinning process by reducing contention
among the spinners when competing for the mutex. This is done by
using a MCS lock to put the spinners in a queue so that only the
first spinner will try to acquire the mutex when it is available. This
patch showed significant performance improvement of +30% on the AIM7
fserver and new_fserver workload.

Compared with patches 2&3 in v1, the new patch 2 consistently provided
better performance improvement at high user load (1100-2000) for the
fserver and new_fserver AIM7 workloads. The old patches had around 10%
and less improvement at high user load while the new patch produced
30% better performance for the same workloads.

Patch is an optional one for backing out architecture specific check
in patch 1, if so desired.

Patch 4 removes new typedefs introduced in patch 2 and an unnecessary
barrier() call.

Patch 5 removes SCHED_FEAT_OWNER_SPIN which was just an earlier hack
for testing purpose. It also moves the mutex spinning code back to
mutex.c.

Waiman Long (5):
mutex: Make more scalable by doing less atomic operations
mutex: Queue mutex spinners with MCS lock to reduce cacheline
contention
mutex: back out architecture specific check for negative mutex count
mutex: Remove new typedefs introduced in patch 2
mutex: Move mutex spinning code from sched/core.c back to mutex.c

include/linux/mutex.h | 3 +
include/linux/sched.h | 1 -
kernel/mutex.c | 151 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 45 --------------
kernel/sched/features.h | 7 --
5 files changed, 150 insertions(+), 57 deletions(-)

2013-04-17 04:28:42

by Waiman Long

[permalink] [raw]

Subject: [PATCH v3 2/5] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
on) allow multiple tasks to spin on a single mutex concurrently. A
potential problem with the current approach is that when the mutex
becomes available, all the spinning tasks will try to acquire the
mutex more or less simultaneously. As a result, there will be a lot of
cacheline bouncing especially on systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting the
mutex spinners into a queue so that only the first one in the queue
will try to acquire the mutex. This will reduce contention and allow
all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation. This
patch will add a new field into the mutex data structure for holding
the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
and 4 bytes for 32-bit system. This overhead will be avoid if the
MUTEX_SPIN_ON_OWNER option is turned off.

The following table shows the jobs per minute (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the fserver workloads to
1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable change
in performance. Rather, there is a slight drop in performance. This
mutex spinning patch more than recovers the lost performance and show
a significant increase of +30% at high user load with the full 8 nodes.
Similar improvements were also seen in a 3.8 kernel.

The table below shows the %time spent by different kernel functions
as reported by perf when running the fserver workload at 1500 users
with all 8 nodes.

+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as shown by
the __mutex_lock_slowpath figure) which help a little bit. We saw only
a few percents improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner figure is
now split out into an additional mspin_lock figure. The time increases
from 3.42% to 11.23%. It shows a great reduction in contention among
the spinners leading to a 30% improvement. The time ratio 9.9/2.33=4.3
indicates that there are on average 4+ spinners waiting in the spin_lock
loop for each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.

The table below shows the performance change of both patches 1 & 2 over
patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in performance
probably due to the spinner locking and queuing overhead.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/mutex.h | 3 ++
include/linux/sched.h | 3 ++
kernel/mutex.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 24 +++++++++++++--
4 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..433da8a 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -53,6 +53,9 @@ struct mutex {
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct *owner;
#endif
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ void *spin_mlock; /* Spinner MCS lock */
+#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
void *magic;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..8af6f13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -320,7 +320,10 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
+extern int mutex_can_spin_on_owner(struct mutex *lock);
+#endif

struct nsproxy;
struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index e3496df..e6a90de 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -55,6 +55,9 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ lock->spin_mlock = NULL;
+#endif

debug_mutex_init(lock, name, key);
}
@@ -106,6 +109,65 @@ void __sched mutex_lock(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock);
#endif

+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+typedef struct mspin_node {
+ struct mspin_node *next;
+ int locked; /* 1 if lock acquired */
+} mspin_node_t;
+
+typedef mspin_node_t *mspin_lock_t;
+
+#define MLOCK(mutex) ((mspin_lock_t *)&((mutex)->spin_mlock))
+
+static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
+{
+ mspin_node_t *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
+{
+ mspin_node_t *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ barrier();
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+#endif
+
static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);

/**
@@ -169,26 +231,39 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*
* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
* to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
*/
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;

for (;;) {
struct task_struct *owner;
+ mspin_node_t node;

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
+ mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner))
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
break;
+ }

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
preempt_enable();
return 0;
}
+ mspin_unlock(MLOCK(lock), &node);

/*
* When there's no owner, we might have preempted between the
@@ -207,6 +282,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..176e82a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3021,9 +3021,6 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
*/
int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- if (!sched_feat(OWNER_SPIN))
- return 0;
-
rcu_read_lock();
while (owner_running(lock, owner)) {
if (need_resched())
@@ -3040,6 +3037,27 @@ int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
return lock->owner == NULL;
}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ if (!sched_feat(OWNER_SPIN))
+ return 0;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
#endif

#ifdef CONFIG_PREEMPT
--
1.7.1

2013-04-17 04:28:47

by Waiman Long

[permalink] [raw]

Subject: [PATCH v3 5/5] mutex: Move mutex spinning code from sched/core.c back to mutex.c

As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature
bit was really just an early hack to make with/without mutex-spinning
testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the
mutex spinning code from kernel/sched/core.c back to kernel/mutex.c
which is where they should belong.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched.h | 4 ---
kernel/mutex.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 63 -----------------------------------------------
kernel/sched/features.h | 7 -----
4 files changed, 62 insertions(+), 74 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8af6f13..aefe45d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -320,10 +320,6 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
-#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
-extern int mutex_can_spin_on_owner(struct mutex *lock);
-#endif

struct nsproxy;
struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 140f113..ad53a66 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -158,6 +158,68 @@ static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
ACCESS_ONCE(next->locked) = 1;
smp_wmb();
}
+
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+static noinline
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
#endif

static __used noinline void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 176e82a..b37a22b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2997,69 +2997,6 @@ void __sched schedule_preempt_disabled(void)
preempt_disable();
}

-#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-
-static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
-{
- if (lock->owner != owner)
- return false;
-
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_ checking
- * lock->owner still matches owner, if that fails, owner might
- * point to free()d memory, if it still matches, the rcu_read_lock()
- * ensures the memory stays valid.
- */
- barrier();
-
- return owner->on_cpu;
-}
-
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
-{
- rcu_read_lock();
- while (owner_running(lock, owner)) {
- if (need_resched())
- break;
-
- arch_mutex_cpu_relax();
- }
- rcu_read_unlock();
-
- /*
- * We break out the loop above on need_resched() and when the
- * owner changed, which is a sign for heavy contention. Return
- * success only when lock->owner is NULL.
- */
- return lock->owner == NULL;
-}
-
-/*
- * Initial check for entering the mutex spinning loop
- */
-int mutex_can_spin_on_owner(struct mutex *lock)
-{
- int retval = 1;
-
- if (!sched_feat(OWNER_SPIN))
- return 0;
-
- rcu_read_lock();
- if (lock->owner)
- retval = lock->owner->on_cpu;
- rcu_read_unlock();
- /*
- * if lock->owner is not set, the mutex owner may have just acquired
- * it and not set the owner yet or the mutex has been released.
- */
- return retval;
-}
-#endif
-
#ifdef CONFIG_PREEMPT
/*
* this is the entry point to schedule() from in-kernel preemption
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1ad1d2b..99399f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,13 +46,6 @@ SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)

/*
- * Spin-wait on mutex acquisition when the mutex owner is running on
- * another cpu -- assumes that when the owner is running, it will soon
- * release the lock. Decreases scheduling overhead.
- */
-SCHED_FEAT(OWNER_SPIN, true)
-
-/*
* Decrement CPU power based on time not spent running tasks
*/
SCHED_FEAT(NONTASK_POWER, true)
--
1.7.1

2013-04-17 04:29:06

by Waiman Long

[permalink] [raw]

Subject: [PATCH v3 4/5] mutex: Remove new typedefs introduced in patch 2

In response to the review comment from Davidlohr, this patch will
remove the new typedefs introduced by patch 2. It also removes an
unnecessary barrier() call.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/mutex.c | 25 +++++++++++--------------
1 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 5600bdf..140f113 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -113,18 +113,16 @@ EXPORT_SYMBOL(mutex_lock);
* We don't inline mspin_lock() so that perf can correctly account for the
* time spent in this lock function.
*/
-typedef struct mspin_node {
- struct mspin_node *next;
- int locked; /* 1 if lock acquired */
-} mspin_node_t;
-
-typedef mspin_node_t *mspin_lock_t;
-
-#define MLOCK(mutex) ((mspin_lock_t *)&((mutex)->spin_mlock))
-
-static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
{
- mspin_node_t *prev;
+ struct mspin_node *prev;

/* Init node */
node->locked = 0;
@@ -143,9 +141,9 @@ static noinline void mspin_lock(mspin_lock_t *lock, mspin_node_t *node)
arch_mutex_cpu_relax();
}

-static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
{
- mspin_node_t *next = ACCESS_ONCE(node->next);
+ struct mspin_node *next = ACCESS_ONCE(node->next);

if (likely(!next)) {
/*
@@ -157,7 +155,6 @@ static void mspin_unlock(mspin_lock_t *lock, mspin_node_t *node)
while (!(next = ACCESS_ONCE(node->next)))
arch_mutex_cpu_relax();
}
- barrier();
ACCESS_ONCE(next->locked) = 1;
smp_wmb();
}
@@ -237,7 +234,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,

for (;;) {
struct task_struct *owner;
- mspin_node_t node;
+ struct mspin_node node;

/*
* If there's an owner, wait for it to either
--
1.7.1

2013-04-17 04:29:32

by Waiman Long

[permalink] [raw]

Subject: [PATCH v3 optional 3/5] mutex: back out architecture specific check for negative mutex count

If it is confirmed that all the supported architectures can allow a
negative mutex count without incorrect behavior, we can then back
out the architecture specific change and allow the mutex count to
go to any negative number. That should further reduce contention for
non-x86 architecture.

If this is not the case, this patch should be dropped.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ----------
kernel/mutex.c | 9 ++-------
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index bc2a0b0..7d3a482 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,13 +3,3 @@
#else
# include <asm/mutex_64.h>
#endif
-
-#ifndef __ASM_MUTEX_H
-#define __ASM_MUTEX_H
-/*
- * For the x86 architecture, it allows any negative number (besides -1) in
- * the mutex count to indicate that some other threads are waiting on the
- * mutex.
- */
-#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
-#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index e6a90de..5600bdf 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -38,15 +38,10 @@
#endif

/*
- * A mutex count of -1 indicates that waiters are sleeping waiting for the
- * mutex. Some architectures can allow any negative number, not just -1, for
- * this purpose.
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
*/
-#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
-#else
-#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
-#endif

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
--
1.7.1

2013-04-17 04:29:49

by Waiman Long

[permalink] [raw]

Subject: [PATCH v3 1/5] mutex: Make more scalable by doing less atomic operations

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch will reduce the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be up to 2 order of magnitude or even
more in cost when compared with atomic_read(). By using atomic_read()
to check the value first before calling atomic_xchg(), we can avoid a
lot of unnecessary cache coherency traffic. The only downside with this
change is that a task on the slow path will have a tiny bit
less chance of getting the mutex when competing with another task
in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers to cover a larger case.

The following table shows the jobs per minutes (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the high_systime workloads
to 1/2/4/8 nodes with hyperthreading on and off.

+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+------------+-----------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to random
nature of the subtests executed. So a difference of less than +-5%
may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node count. The patch has practically no impact on 1 and 2
nodes system.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below show the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mutex.h | 10 ++++++++++
kernel/mutex.c | 19 ++++++++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..bc2a0b0 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,13 @@
#else
# include <asm/mutex_64.h>
#endif
+
+#ifndef __ASM_MUTEX_H
+#define __ASM_MUTEX_H
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex count to indicate that some other threads are waiting on the
+ * mutex.
+ */
+#define __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT 1
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..e3496df 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -37,6 +37,17 @@
# include <asm/mutex.h>
#endif

+/*
+ * A mutex count of -1 indicates that waiters are sleeping waiting for the
+ * mutex. Some architectures can allow any negative number, not just -1, for
+ * this purpose.
+ */
+#ifdef __ARCH_ALLOW_ANY_NEGATIVE_MUTEX_COUNT
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+#else
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) != -1)
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -171,7 +182,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
@@ -205,7 +217,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;

- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
goto done;

lock_contended(&lock->dep_map, ip);
@@ -220,7 +232,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* that when we release the lock, we properly wake up the
* other waiters:
*/
- if (atomic_xchg(&lock->count, -1) == 1)
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
break;

/*
--
1.7.1

2013-04-17 07:17:31

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH v3 4/5] mutex: Remove new typedefs introduced in patch 2

* Waiman Long <[email protected]> wrote:

> In response to the review comment from Davidlohr, this patch will
> remove the new typedefs introduced by patch 2. It also removes an
> unnecessary barrier() call.

Please backmerge this fix of patch #2 into patch #2, that reduces the number of
patches by one.

Thanks,

Ingo

2013-04-17 07:18:33

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH v3 5/5] mutex: Move mutex spinning code from sched/core.c back to mutex.c

* Waiman Long <[email protected]> wrote:

> As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature
> bit was really just an early hack to make with/without mutex-spinning
> testable. So it is no longer necessary.
>
> This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the
> mutex spinning code from kernel/sched/core.c back to kernel/mutex.c
> which is where they should belong.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/sched.h | 4 ---
> kernel/mutex.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/core.c | 63 -----------------------------------------------
> kernel/sched/features.h | 7 -----
> 4 files changed, 62 insertions(+), 74 deletions(-)

Please move this patch first in the series - so that later, functional patches are
easier to revert, should the need arise. When possible we sort patches by risk:
low risk ones first, riskier come later.

Thanks,

Ingo

2013-04-17 19:28:43

by Waiman Long

[permalink] [raw]

Subject: Re: [PATCH v3 5/5] mutex: Move mutex spinning code from sched/core.c back to mutex.c

On 04/17/2013 03:18 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature
>> bit was really just an early hack to make with/without mutex-spinning
>> testable. So it is no longer necessary.
>>
>> This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the
>> mutex spinning code from kernel/sched/core.c back to kernel/mutex.c
>> which is where they should belong.
>>
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> include/linux/sched.h | 4 ---
>> kernel/mutex.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/core.c | 63 -----------------------------------------------
>> kernel/sched/features.h | 7 -----
>> 4 files changed, 62 insertions(+), 74 deletions(-)
> Please move this patch first in the series - so that later, functional patches are
> easier to revert, should the need arise. When possible we sort patches by risk:
> low risk ones first, riskier come later.

I have sent out a new series (v4) with the requested changes. Please let
me know what else do you think I need to do.

Regards,
Longman