2013-06-13 23:26:30

by Tim Chen

Subject: Performance regression from switching lock to rw-sem for anon-vma tree

Ingo,

At the time of switching the anon-vma tree's lock from a mutex to
rw-sem (commit 5a505085), we encountered regressions for fork-heavy workloads.
A lot of optimizations to rw-sem (e.g. lock stealing) helped to
mitigate the problem. I tried an experiment on the 3.10-rc4 kernel
to compare the performance of rw-sem to an implementation that uses a mutex. I saw
an 8% regression in throughput for rw-sem vs the mutex implementation in
3.10-rc4.

For the experiments, I used the exim mail server workload in
the MOSBENCH test suite on a 4-socket (Westmere) and a 4-socket
(Ivy Bridge) machine, with the number of clients sending mail equal
to the number of cores. The mail server
forks off a process to handle each incoming mail and put it into the mail
spool. The lock protecting the anon-vma tree is stressed due to the
heavy forking. On both machines, I saw that the mutex implementation
has 8% more throughput. I pinned the cpu frequency to maximum
in the experiments.

I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested
each tweak individually.

1) Add an owner field set when a writer holds the lock and introduce
optimistic spinning while an active writer is holding the semaphore.
This reduced the context switching by 30%, to a level very close to the
mutex implementation. However, I did not see any throughput improvement
in exim.

2) When the sem->count's active field is non-zero (i.e. someone
is holding the lock), we can skip directly to the down_write_failed
path, without adding RWSEM_ACTIVE_WRITE_BIAS to sem->count and taking
it off again, saving us two atomic operations.
Since we will try the lock stealing again later, this should be okay.
Unfortunately it did not improve the exim workload either.
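
To make the second tweak concrete, here is a rough sketch of the idea
(illustrative only, not the patch I actually tested;
rwsem_down_write_failed_nobias() is a made-up name for a slow path that
does not expect the write bias to have already been added):

/* Hypothetical sketch of tweak 2: peek at the active count so a
 * contended down_write() skips the atomic add/sub pair on sem->count.
 */
static void down_write_sketch(struct rw_semaphore *sem)
{
	if (sem->count & RWSEM_ACTIVE_MASK) {
		/* Someone holds the lock: go straight to the wait path
		 * and rely on lock stealing when we retry there.
		 */
		rwsem_down_write_failed_nobias(sem);
		return;
	}
	/* Uncontended: the normal fast path with a single atomic op. */
	__down_write(sem);
}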

Any suggestions on the difference between rwsem and mutex performance
and possible improvements to recover this regression?

Thanks.

Tim

vmstat for mutex implementation:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0

vmstat for rw-sem implementation (3.10-rc4)
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0


Profile for mutex implementation:
5.02% exim [kernel.kallsyms] [k] page_fault
3.67% exim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.66% exim [kernel.kallsyms] [k] unmap_single_vma
2.15% exim [kernel.kallsyms] [k] do_raw_spin_lock
2.14% exim [kernel.kallsyms] [k] page_cache_get_speculative
2.04% exim [kernel.kallsyms] [k] copy_page_rep
1.58% exim [kernel.kallsyms] [k] clear_page_c
1.55% exim [kernel.kallsyms] [k] cpu_relax
1.55% exim [kernel.kallsyms] [k] mutex_unlock
1.42% exim [kernel.kallsyms] [k] __slab_free
1.16% exim [kernel.kallsyms] [k] mutex_lock
1.12% exim libc-2.13.so [.] vfprintf
0.99% exim [kernel.kallsyms] [k] find_vma
0.95% exim [kernel.kallsyms] [k] __list_del_entry

Profile for rw-sem implementation
4.88% exim [kernel.kallsyms] [k] page_fault
3.43% exim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.65% exim [kernel.kallsyms] [k] unmap_single_vma
2.46% exim [kernel.kallsyms] [k] do_raw_spin_lock
2.25% exim [kernel.kallsyms] [k] copy_page_rep
2.01% exim [kernel.kallsyms] [k] page_cache_get_speculative
1.81% exim [kernel.kallsyms] [k] clear_page_c
1.51% exim [kernel.kallsyms] [k] __slab_free
1.12% exim libc-2.13.so [.] vfprintf
1.06% exim [kernel.kallsyms] [k] __list_del_entry
1.02% swapper [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.00% exim [kernel.kallsyms] [k] find_vma
0.93% exim [kernel.kallsyms] [k] mutex_unlock


turbostat for mutex implementation:
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 CTMP %pc3 %pc6
82.91 2.39 2.39 11.65 2.76 2.68 51 0.00 0.00

turbostat of rw-sem implementation (3.10-rc4):
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 CTMP %pc3 %pc6
80.10 2.39 2.39 14.96 2.80 2.13 52 0.00 0.00




2013-06-14 16:10:00

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

Adding a copy to the mailing list, which I forgot in my previous reply:

On Thu, 2013-06-13 at 16:43 -0700, Davidlohr Bueso wrote:
> On Thu, 2013-06-13 at 16:15 -0700, Tim Chen wrote:
> > Ingo,
> >
> > At the time of switching the anon-vma tree's lock from mutex to
> > rw-sem (commit 5a505085), we encountered regressions for fork heavy workload.
> > A lot of optimizations to rw-sem (e.g. lock stealing) helped to
> > mitigate the problem. I tried an experiment on the 3.10-rc4 kernel
> > to compare the performance of rw-sem to one that uses mutex. I saw
> > a 8% regression in throughput for rw-sem vs a mutex implementation in
> > 3.10-rc4.
>
> Funny, just yesterday I was discussing this issue with Michel. While I
> didn't measure the anon-vma mutex->rwem conversion, I did convert the
> i_mmap_mutex to a rwsem and noticed a performance regression on a few
> aim7 workloads on a 8 socket, 80 core box, when keeping all writers,
> which should perform very similarly to a mutex. While some of these
> workloads recovered when I shared the lock among readers (similar to
> anon-vma), it left me wondering.
>
> > For the experiments, I used the exim mail server workload in
> > the MOSBENCH test suite on 4 socket (westmere) and a 4 socket
> > (ivy bridge) with the number of clients sending mail equal
> > to number of cores. The mail server will
> > fork off a process to handle an incoming mail and put it into mail
> > spool. The lock protecting the anon-vma tree is stressed due to
> > heavy forking. On both machines, I saw that the mutex implementation
> > has 8% more throughput. I've pinned the cpu frequency to maximum
> > in the experiments.
>
> I got some similar -8% throughput on high_systime and shared.
>

That's interesting. Another perspective on rwsem vs mutex.

> >
> > I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested
> > each tweak individually.
> >
> > 1) Add an owner field when a writer holds the lock and introduce
> > optimistic spinning when an active writer is holding the semaphore.
> > It reduced the context switching by 30% to a level very close to the
> > mutex implementation. However, I did not see any throughput improvement
> > of exim.
>
> I was hoping that the lack of spin on owner was the main difference with
> rwsems and am/was in the middle of implementing it. Could you send your
> patch so I can give it a try on my workloads?
>
> Note that there have been a few recent (3.10) changes to mutexes that
> give a nice performance boost, specially on large systems, most
> noticeably:
>
> commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> operations)
>
> commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> cacheline contention)
>
> It might be worth looking into doing something similar to commit
> 0dc8c730, in addition to the optimistic spinning.

Okay. Here's my ugly experimental hack, with some code lifted from the optimistic
spinning in mutex. I've thought about doing the MCS lock thing but decided
to keep the first try at optimistic spinning simple.

Matthew and I have also discussed possibly introducing some
limited spinning for a writer when the semaphore is held by readers.
His idea was to have readers as well as writers set ->owner.
Writers, as now, unconditionally clear owner. Readers clear
owner only if sem->owner == current. Writers spin on ->owner if ->owner
is non-NULL and still active. That gives us a reasonable chance
to spin, since we'll be spinning on the most recent acquirer of the lock.
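
For illustration only (a sketch of that reader-sets-owner idea with
made-up helper names, not part of the patch below):

static inline void rwsem_set_reader_owner(struct rw_semaphore *sem)
{
	sem->owner = current;		/* last reader to acquire */
}

static inline void rwsem_clear_reader_owner(struct rw_semaphore *sem)
{
	/* Only clear our own mark; a later reader may have overwritten it. */
	if (sem->owner == current)
		sem->owner = NULL;
}

static inline bool rwsem_owner_spinnable(struct rw_semaphore *sem)
{
	struct task_struct *owner;
	bool running = false;

	rcu_read_lock();
	owner = ACCESS_ONCE(sem->owner);
	if (owner)
		running = owner->on_cpu;	/* most recent acquirer still on a cpu */
	rcu_read_unlock();

	return running;
}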

Tim

Signed-off-by: Tim Chen <[email protected]>
---
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..331f5f0 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,7 @@ struct rw_semaphore {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
+ struct task_struct *owner;
};

extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..916747f 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@

#include <linux/atomic.h>

+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+ sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+ sem->owner = NULL;
+}
+
/*
* lock for reading
*/
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
{
int ret = __down_write_trylock(sem);

- if (ret == 1)
+ if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+ rwsem_set_owner(sem);
+ }
return ret;
}

@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);

__up_write(sem);
+ rwsem_clear_owner(sem);
}

EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
* dependency.
*/
__downgrade_write(sem);
+ rwsem_clear_owner(sem);
}

EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index 9be8a91..9d3edd5 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -49,6 +49,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->activity = 0;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+ sem->owner = NULL;
}
EXPORT_SYMBOL(__init_rwsem);

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..4dbd022 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -8,6 +8,7 @@
*/
#include <linux/rwsem.h>
#include <linux/sched.h>
+#include <linux/sched/rt.h>
#include <linux/init.h>
#include <linux/export.h>

@@ -27,6 +28,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+ sem->owner = NULL;
}

EXPORT_SYMBOL(__init_rwsem);
@@ -187,12 +189,126 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
}

+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ int retval = true;
+
+ /* Spin only if active writer running */
+ if (!sem->owner)
+ return false;
+
+ rcu_read_lock();
+ if (sem->owner)
+ retval = sem->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the sem owner may have just acquired
+ * it and not set the owner yet, or the sem has been released, or
+ * reader active.
+ */
+ return retval;
+}
+
+static inline bool owner_running(struct rw_semaphore *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+
+static inline int rwsem_try_write_lock(long count, bool need_lock, struct rw_semaphore *sem)
+{
+ if (!(count & RWSEM_ACTIVE_MASK)) {
+ /* Try acquiring the write lock. */
+ if (sem->count == RWSEM_WAITING_BIAS &&
+ cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
+ if (need_lock)
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (!list_is_singular(&sem->wait_list))
+ rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+ return 1;
+ }
+ }
+ return 0;
+}
+
+int rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ struct task_struct *owner;
+
+ /* sem->wait_lock should not be held when attempting optimistic spinning */
+ if (!rwsem_can_spin_on_owner(sem))
+ return 0;
+
+ for (;;) {
+ owner = ACCESS_ONCE(sem->owner);
+ if (owner && !rwsem_spin_on_owner(sem, owner))
+ break;
+
+ /* wait_lock will be acquired if write_lock is obtained */
+ if (rwsem_try_write_lock(sem->count, true, sem))
+ return 1;
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(current)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+
+ }
+
+ return 0;
+}
+
/*
* wait until we successfully acquire the write lock
*/
struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
+ bool try_optimistic_spin = true;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;

@@ -218,20 +334,16 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
/* wait until we successfully acquire the lock */
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (true) {
- if (!(count & RWSEM_ACTIVE_MASK)) {
- /* Try acquiring the write lock. */
- count = RWSEM_ACTIVE_WRITE_BIAS;
- if (!list_is_singular(&sem->wait_list))
- count += RWSEM_WAITING_BIAS;
-
- if (sem->count == RWSEM_WAITING_BIAS &&
- cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
- RWSEM_WAITING_BIAS)
- break;
- }
+ if (rwsem_try_write_lock(count, false, sem))
+ break;

raw_spin_unlock_irq(&sem->wait_lock);

+ /* do some optimistic spinning */
+ if (try_optimistic_spin && rwsem_optimistic_spin(sem))
+ break;
+
+ try_optimistic_spin = false;
/* Block until there are no active lockers. */
do {
schedule();



2013-06-14 22:31:51

by Davidlohr Bueso

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Fri, 2013-06-14 at 09:09 -0700, Tim Chen wrote:
> Added copy to mailing list which I forgot in my previous reply:
>
> On Thu, 2013-06-13 at 16:43 -0700, Davidlohr Bueso wrote:
> > On Thu, 2013-06-13 at 16:15 -0700, Tim Chen wrote:
> > > Ingo,
> > >
> > > At the time of switching the anon-vma tree's lock from mutex to
> > > rw-sem (commit 5a505085), we encountered regressions for fork heavy workload.
> > > A lot of optimizations to rw-sem (e.g. lock stealing) helped to
> > > mitigate the problem. I tried an experiment on the 3.10-rc4 kernel
> > > to compare the performance of rw-sem to one that uses mutex. I saw
> > > a 8% regression in throughput for rw-sem vs a mutex implementation in
> > > 3.10-rc4.
> >
> > Funny, just yesterday I was discussing this issue with Michel. While I
> > didn't measure the anon-vma mutex->rwem conversion, I did convert the
> > i_mmap_mutex to a rwsem and noticed a performance regression on a few
> > aim7 workloads on a 8 socket, 80 core box, when keeping all writers,
> > which should perform very similarly to a mutex. While some of these
> > workloads recovered when I shared the lock among readers (similar to
> > anon-vma), it left me wondering.
> >
> > > For the experiments, I used the exim mail server workload in
> > > the MOSBENCH test suite on 4 socket (westmere) and a 4 socket
> > > (ivy bridge) with the number of clients sending mail equal
> > > to number of cores. The mail server will
> > > fork off a process to handle an incoming mail and put it into mail
> > > spool. The lock protecting the anon-vma tree is stressed due to
> > > heavy forking. On both machines, I saw that the mutex implementation
> > > has 8% more throughput. I've pinned the cpu frequency to maximum
> > > in the experiments.
> >
> > I got some similar -8% throughput on high_systime and shared.
> >
>
> That's interesting. Another perspective on rwsem vs mutex.
>
> > >
> > > I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested
> > > each tweak individually.
> > >
> > > 1) Add an owner field when a writer holds the lock and introduce
> > > optimistic spinning when an active writer is holding the semaphore.
> > > It reduced the context switching by 30% to a level very close to the
> > > mutex implementation. However, I did not see any throughput improvement
> > > of exim.
> >
> > I was hoping that the lack of spin on owner was the main difference with
> > rwsems and am/was in the middle of implementing it. Could you send your
> > patch so I can give it a try on my workloads?
> >
> > Note that there have been a few recent (3.10) changes to mutexes that
> > give a nice performance boost, specially on large systems, most
> > noticeably:
> >
> > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > operations)
> >
> > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > cacheline contention)
> >
> > It might be worth looking into doing something similar to commit
> > 0dc8c730, in addition to the optimistic spinning.
>
> Okay. Here's my ugly experimental hack with some code lifted from optimistic spin
> within mutex. I've thought about doing the MCS lock thing but decided
> to keep the first try on the optimistic spinning simple.

Unfortunately this patch didn't make any difference; in fact, it hurt
several of the workloads even more. I also tried disabling preemption
when spinning on owner, to more closely resemble spinlocks, which was my
original plan, yet saw not much difference.

A few ideas that come to mind are avoiding taking the ->wait_lock and
avoiding dealing with waiters when doing the optimistic spinning (just like
mutexes do).
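
Roughly, the shape I have in mind (an illustrative sketch only; it
glosses over the sem->count bias bookkeeping that the real slow path
does, and rwsem_down_write_failed_sketch() is just a placeholder name):

/* Hypothetical write slow path that spins before touching any waiter
 * state, similar in spirit to the mutex slow path.
 */
struct rw_semaphore *rwsem_down_write_failed_sketch(struct rw_semaphore *sem)
{
	/* Spin first: no wait_lock, no wait_list manipulation. */
	if (rwsem_optimistic_spin(sem))
		return sem;

	/* Only when spinning gives up do we queue ourselves and sleep. */
	return rwsem_down_write_failed(sem);
}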

I agree that we should first deal with the optimistic spinning before
adding the MCS complexity.

> Matthew and I have also discussed possibly introducing some
> limited spinning for writer when semaphore is held by read.
> His idea was to have readers as well as writers set ->owner.
> Writers, as now, unconditionally clear owner. Readers clear
> owner if sem->owner == current. Writers spin on ->owner if ->owner
> is non-NULL and still active. That gives us a reasonable chance
> to spin since we'll be spinning on
> the most recent acquirer of the lock.

I also tried implementing this concept on top of your patch; it didn't make
much of a difference with or without it.

Thanks,
Davidlohr

2013-06-14 22:44:26

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


>
> Unfortunately this patch didn't make any difference, in fact it hurt
> several of the workloads even more. I also tried disabling preemption
> when spinning on owner to actually resemble spinlocks, which was my
> original plan, yet not much difference.
>

That's also similar to the performance I got. There must be something about
optimistic spinning that I've missed that accounts for the better mutex
performance. So I'm scratching my head.

> A few ideas that come to mind are avoiding taking the ->wait_lock and
> avoid dealing with waiters when doing the optimistic spinning (just like
> mutexes do).
>

For my patch, we actually spin without holding the wait_lock.

> I agree that we should first deal with the optimistic spinning before
> adding the MCS complexity.
>
> > Matthew and I have also discussed possibly introducing some
> > limited spinning for writer when semaphore is held by read.
> > His idea was to have readers as well as writers set ->owner.
> > Writers, as now, unconditionally clear owner. Readers clear
> > owner if sem->owner == current. Writers spin on ->owner if ->owner
> > is non-NULL and still active. That gives us a reasonable chance
> > to spin since we'll be spinning on
> > the most recent acquirer of the lock.
>
> I also tried implementing this concept on top of your patch, didn't make
> much of a difference with or without it.
>

It also didn't make a difference for me.

Tim

2013-06-14 22:47:16

by Michel Lespinasse

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Fri, Jun 14, 2013 at 3:31 PM, Davidlohr Bueso <[email protected]> wrote:
> A few ideas that come to mind are avoiding taking the ->wait_lock and
> avoid dealing with waiters when doing the optimistic spinning (just like
> mutexes do).
>
> I agree that we should first deal with the optimistic spinning before
> adding the MCS complexity.

Maybe it would be worth disabling the MCS patch in mutex and comparing
that to the rwsem patches? Just to make sure the rwsem performance
delta isn't related to that.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2013-06-16 09:50:55

by Alex Shi

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> I was hoping that the lack of spin on owner was the main difference with
> rwsems and am/was in the middle of implementing it. Could you send your
> patch so I can give it a try on my workloads?
>
> Note that there have been a few recent (3.10) changes to mutexes that
> give a nice performance boost, specially on large systems, most
> noticeably:
>
> commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> operations)
>
> commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> cacheline contention)
>
> It might be worth looking into doing something similar to commit
> 0dc8c730, in addition to the optimistic spinning.

It is a good tuning for large machines. I just followed what commit
0dc8c730 did and give an RFC patch here. I tried it on my NHM EP machine; it seems to give no
clear help on aim7, but maybe it is helpful on a large machine. :)


diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..240729a 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
- long tmp;
+ if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+ return 0;

- tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
+ return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
}

/*
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..9e54e20 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
- long oldcount, woken, loop, adjustment;
+ long woken, loop, adjustment;

waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
* will block as they will notice the queued writer.
*/
wake_up_process(waiter->task);
- goto out;
+ return sem;
}

/* Writers might steal the lock before we grant it to the next reader.
@@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
- oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
- if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
- /* A writer stole the lock. Undo our reader grant. */
+ while (1) {
+ long oldcount;
+
+ /* A writer stole the lock. */
+ if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
+ return sem;
+
+ if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
+ cpu_relax();
+ continue;
+ }
+
+ oldcount = rwsem_atomic_update(adjustment, sem)
+ - adjustment;
+ if (likely(oldcount >= RWSEM_WAITING_BIAS))
+ break;
+
+ /* A writer stole the lock. Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
- goto out;
+ return sem;
/* Last active locker left. Retry waking readers. */
- goto try_reader_grant;
}
}

@@ -136,7 +149,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
sem->wait_list.next = next;
next->prev = &sem->wait_list;

- out:
return sem;
}

--
Thanks
Alex



2013-06-17 16:22:10

by Davidlohr Bueso

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Sun, 2013-06-16 at 17:50 +0800, Alex Shi wrote:
> On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> > I was hoping that the lack of spin on owner was the main difference with
> > rwsems and am/was in the middle of implementing it. Could you send your
> > patch so I can give it a try on my workloads?
> >
> > Note that there have been a few recent (3.10) changes to mutexes that
> > give a nice performance boost, specially on large systems, most
> > noticeably:
> >
> > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > operations)
> >
> > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > cacheline contention)
> >
> > It might be worth looking into doing something similar to commit
> > 0dc8c730, in addition to the optimistic spinning.
>
> It is a good tunning for large machine. I just following what the commit
> 0dc8c730 done, give a RFC patch here. I tried it on my NHM EP machine. seems no
> clear help on aim7. but maybe it is helpful on large machine. :)

After a lot of benchmarking, I finally got the ideal results for aim7
so far: this patch + optimistic spinning with preemption disabled. Just
like optimistic spinning, this patch by itself makes little to no
difference, yet combined they are where we actually outperform 3.10-rc5. In
addition, I noticed extra throughput when disabling preemption in
try_optimistic_spin().

With i_mmap as a rwsem and these changes, I could see performance
benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
(+5%), shared (+15%) and short (+4%), most of them after around 500
users; for fewer users, it made little to no difference.

Thanks,
Davidlohr

>
>
> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> index bb1e2cd..240729a 100644
> --- a/include/asm-generic/rwsem.h
> +++ b/include/asm-generic/rwsem.h
> @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
>
> static inline int __down_write_trylock(struct rw_semaphore *sem)
> {
> - long tmp;
> + if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> + return 0;
>
> - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> - RWSEM_ACTIVE_WRITE_BIAS);
> - return tmp == RWSEM_UNLOCKED_VALUE;
> + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> }
>
> /*
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 19c5fa9..9e54e20 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> struct rwsem_waiter *waiter;
> struct task_struct *tsk;
> struct list_head *next;
> - long oldcount, woken, loop, adjustment;
> + long woken, loop, adjustment;
>
> waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> * will block as they will notice the queued writer.
> */
> wake_up_process(waiter->task);
> - goto out;
> + return sem;
> }
>
> /* Writers might steal the lock before we grant it to the next reader.
> @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> adjustment = 0;
> if (wake_type != RWSEM_WAKE_READ_OWNED) {
> adjustment = RWSEM_ACTIVE_READ_BIAS;
> - try_reader_grant:
> - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> - /* A writer stole the lock. Undo our reader grant. */
> + while (1) {
> + long oldcount;
> +
> + /* A writer stole the lock. */
> + if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> + return sem;
> +
> + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> + cpu_relax();
> + continue;
> + }
> +
> + oldcount = rwsem_atomic_update(adjustment, sem)
> + - adjustment;
> + if (likely(oldcount >= RWSEM_WAITING_BIAS))
> + break;
> +
> + /* A writer stole the lock. Undo our reader grant. */
> if (rwsem_atomic_update(-adjustment, sem) &
> RWSEM_ACTIVE_MASK)
> - goto out;
> + return sem;
> /* Last active locker left. Retry waking readers. */
> - goto try_reader_grant;
> }
> }
>
> @@ -136,7 +149,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> sem->wait_list.next = next;
> next->prev = &sem->wait_list;
>
> - out:
> return sem;
> }
>
> --
> Thanks
> Alex
>
>

2013-06-17 18:45:58

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-06-17 at 09:22 -0700, Davidlohr Bueso wrote:
> On Sun, 2013-06-16 at 17:50 +0800, Alex Shi wrote:
> > On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> > > I was hoping that the lack of spin on owner was the main difference with
> > > rwsems and am/was in the middle of implementing it. Could you send your
> > > patch so I can give it a try on my workloads?
> > >
> > > Note that there have been a few recent (3.10) changes to mutexes that
> > > give a nice performance boost, specially on large systems, most
> > > noticeably:
> > >
> > > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > > operations)
> > >
> > > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > > cacheline contention)
> > >
> > > It might be worth looking into doing something similar to commit
> > > 0dc8c730, in addition to the optimistic spinning.
> >
> > It is a good tunning for large machine. I just following what the commit
> > 0dc8c730 done, give a RFC patch here. I tried it on my NHM EP machine. seems no
> > clear help on aim7. but maybe it is helpful on large machine. :)
>
> After a lot of benchmarking, I finally got the ideal results for aim7,
> so far: this patch + optimistic spinning with preemption disabled. Just
> like optimistic spinning, this patch by itself makes little to no
> difference, yet combined is where we actually outperform 3.10-rc5. In
> addition, I noticed extra throughput when disabling preemption in
> try_optimistic_spin().
>
> With i_mmap as a rwsem and these changes I could see performance
> benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users, for fewer users, it made little to no difference.
>

Thanks. Those are encouraging numbers. On my exim workload I didn't
get a boost when I added in the preempt disable in optimistic spin and
put Alex's changes in. Can you send me your combined patch, so I can see if
there may be something you did that I've missed? I have a tweak to
Alex's patch below to simplify things a bit.

Tim

> Thanks,
> Davidlohr
>
> >
> >
> > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> > index bb1e2cd..240729a 100644
> > --- a/include/asm-generic/rwsem.h
> > +++ b/include/asm-generic/rwsem.h
> > @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
> >
> > static inline int __down_write_trylock(struct rw_semaphore *sem)
> > {
> > - long tmp;
> > + if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> > + return 0;
> >
> > - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > - RWSEM_ACTIVE_WRITE_BIAS);
> > - return tmp == RWSEM_UNLOCKED_VALUE;
> > + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> > }
> >
> > /*
> > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > index 19c5fa9..9e54e20 100644
> > --- a/lib/rwsem.c
> > +++ b/lib/rwsem.c
> > @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > struct rwsem_waiter *waiter;
> > struct task_struct *tsk;
> > struct list_head *next;
> > - long oldcount, woken, loop, adjustment;
> > + long woken, loop, adjustment;
> >
> > waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> > if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> > @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > * will block as they will notice the queued writer.
> > */
> > wake_up_process(waiter->task);
> > - goto out;
> > + return sem;
> > }
> >
> > /* Writers might steal the lock before we grant it to the next reader.
> > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > adjustment = 0;
> > if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > adjustment = RWSEM_ACTIVE_READ_BIAS;
> > - try_reader_grant:
> > - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > - /* A writer stole the lock. Undo our reader grant. */
> > + while (1) {
> > + long oldcount;
> > +
> > + /* A writer stole the lock. */
> > + if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > + return sem;
> > +
> > + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > + cpu_relax();
> > + continue;
> > + }

The above two if statements could be cleaned up as a single check:

if (unlikely(sem->count < RWSEM_WAITING_BIAS))
return sem;

This one statement is sufficient to check that a writer hasn't
stolen the lock before we attempt to acquire the read lock by modifying
sem->count.
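
For illustration, the reader-grant loop would then look something like
this (a sketch only; the rest is unchanged from your RFC):

	while (1) {
		long oldcount;

		/* A writer stole the lock: no reader grant to do. */
		if (unlikely(sem->count < RWSEM_WAITING_BIAS))
			return sem;

		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
		if (likely(oldcount >= RWSEM_WAITING_BIAS))
			break;

		/* A writer stole the lock. Undo our reader grant. */
		if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
			return sem;
		/* Last active locker left. Retry waking readers. */
	}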




2013-06-17 19:05:38

by Davidlohr Bueso

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-06-17 at 11:45 -0700, Tim Chen wrote:
> On Mon, 2013-06-17 at 09:22 -0700, Davidlohr Bueso wrote:
> > On Sun, 2013-06-16 at 17:50 +0800, Alex Shi wrote:
> > > On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> > > > I was hoping that the lack of spin on owner was the main difference with
> > > > rwsems and am/was in the middle of implementing it. Could you send your
> > > > patch so I can give it a try on my workloads?
> > > >
> > > > Note that there have been a few recent (3.10) changes to mutexes that
> > > > give a nice performance boost, specially on large systems, most
> > > > noticeably:
> > > >
> > > > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > > > operations)
> > > >
> > > > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > > > cacheline contention)
> > > >
> > > > It might be worth looking into doing something similar to commit
> > > > 0dc8c730, in addition to the optimistic spinning.
> > >
> > > It is a good tunning for large machine. I just following what the commit
> > > 0dc8c730 done, give a RFC patch here. I tried it on my NHM EP machine. seems no
> > > clear help on aim7. but maybe it is helpful on large machine. :)
> >
> > After a lot of benchmarking, I finally got the ideal results for aim7,
> > so far: this patch + optimistic spinning with preemption disabled. Just
> > like optimistic spinning, this patch by itself makes little to no
> > difference, yet combined is where we actually outperform 3.10-rc5. In
> > addition, I noticed extra throughput when disabling preemption in
> > try_optimistic_spin().
> >
> > With i_mmap as a rwsem and these changes I could see performance
> > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > users, for fewer users, it made little to no difference.
> >
>
> Thanks. Those are encouraging numbers. On my exim workload I didn't
> get a boost when I added in the preempt disable in optimistic spin and
> put Alex's changes in. Can you send me your combined patch to see if
> there may be something you did that I've missed. I have a tweak to
> Alex's patch below to simplify things a bit.
>

I'm using:

int rwsem_optimistic_spin(struct rw_semaphore *sem)
{
	struct task_struct *owner;

	/* sem->wait_lock should not be held when attempting optimistic spinning */
	if (!rwsem_can_spin_on_owner(sem))
		return 0;

	preempt_disable();
	for (;;) {
		owner = ACCESS_ONCE(sem->owner);
		if (owner && !rwsem_spin_on_owner(sem, owner))
			break;

		/* wait_lock will be acquired if write_lock is obtained */
		if (rwsem_try_write_lock(sem->count, true, sem)) {
			preempt_enable();
			return 1;
		}

		/*
		 * When there's no owner, we might have preempted between the
		 * owner acquiring the lock and setting the owner field. If
		 * we're an RT task that will live-lock because we won't let
		 * the owner complete.
		 */
		if (!owner && (need_resched() || rt_task(current)))
			break;

		/*
		 * The cpu_relax() call is a compiler barrier which forces
		 * everything in this loop to be re-loaded. We don't need
		 * memory barriers as we'll eventually observe the right
		 * values at the cost of a few extra spins.
		 */
		arch_mutex_cpu_relax();
	}

	preempt_enable();
	return 0;
}

> >
> > >
> > >
> > > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> > > index bb1e2cd..240729a 100644
> > > --- a/include/asm-generic/rwsem.h
> > > +++ b/include/asm-generic/rwsem.h
> > > @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
> > >
> > > static inline int __down_write_trylock(struct rw_semaphore *sem)
> > > {
> > > - long tmp;
> > > + if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> > > + return 0;
> > >
> > > - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > > - RWSEM_ACTIVE_WRITE_BIAS);
> > > - return tmp == RWSEM_UNLOCKED_VALUE;
> > > + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > > + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> > > }
> > >
> > > /*
> > > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > > index 19c5fa9..9e54e20 100644
> > > --- a/lib/rwsem.c
> > > +++ b/lib/rwsem.c
> > > @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > > struct rwsem_waiter *waiter;
> > > struct task_struct *tsk;
> > > struct list_head *next;
> > > - long oldcount, woken, loop, adjustment;
> > > + long woken, loop, adjustment;
> > >
> > > waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> > > if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> > > @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > > * will block as they will notice the queued writer.
> > > */
> > > wake_up_process(waiter->task);
> > > - goto out;
> > > + return sem;
> > > }
> > >
> > > /* Writers might steal the lock before we grant it to the next reader.
> > > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > > adjustment = 0;
> > > if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > > adjustment = RWSEM_ACTIVE_READ_BIAS;
> > > - try_reader_grant:
> > > - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > - /* A writer stole the lock. Undo our reader grant. */
> > > + while (1) {
> > > + long oldcount;
> > > +
> > > + /* A writer stole the lock. */
> > > + if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > > + return sem;
> > > +
> > > + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > > + cpu_relax();
> > > + continue;
> > > + }
>
> The above two if statements could be cleaned up as a single check:
>
> if (unlikely(sem->count < RWSEM_WAITING_BIAS))
> return sem;
>
> This one statement is sufficient to check that we don't have a writer
> stolen the lock before we attempt to acquire the read lock by modifying
> sem->count.

We probably still want to keep the cpu relaxation if the statement
doesn't comply.

Thanks,
Davidlohr

2013-06-17 22:27:44

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Fri, 2013-06-14 at 15:47 -0700, Michel Lespinasse wrote:
> On Fri, Jun 14, 2013 at 3:31 PM, Davidlohr Bueso <[email protected]> wrote:
> > A few ideas that come to mind are avoiding taking the ->wait_lock and
> > avoid dealing with waiters when doing the optimistic spinning (just like
> > mutexes do).
> >
> > I agree that we should first deal with the optimistic spinning before
> > adding the MCS complexity.
>
> Maybe it would be worth disabling the MCS patch in mutex and comparing
> that to the rwsem patches ? Just to make sure the rwsem performance
> delta isn't related to that.
>

I've tried backing out the MCS patch. In fact, for exim, it is about 1%
faster without MCS. So the better mutex performance I saw was not
due to MCS. Thanks for the suggestion.

Tim

2013-06-17 22:28:42

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-06-17 at 12:05 -0700, Davidlohr Bueso wrote:

> >
> > Thanks. Those are encouraging numbers. On my exim workload I didn't
> > get a boost when I added in the preempt disable in optimistic spin and
> > put Alex's changes in. Can you send me your combined patch to see if
> > there may be something you did that I've missed. I have a tweak to
> > Alex's patch below to simplify things a bit.
> >
>
> I'm using:
>
> int rwsem_optimistic_spin(struct rw_semaphore *sem)
> {
> struct task_struct *owner;
>
> /* sem->wait_lock should not be held when attempting optimistic spinning */
> if (!rwsem_can_spin_on_owner(sem))
> return 0;
>
> preempt_disable();
> for (;;) {
> owner = ACCESS_ONCE(sem->owner);
> if (owner && !rwsem_spin_on_owner(sem, owner))
> break;
>
> /* wait_lock will be acquired if write_lock is obtained */
> if (rwsem_try_write_lock(sem->count, true, sem)) {
> preempt_enable();
> return 1;
> }
>
> /*
> * When there's no owner, we might have preempted between the
> * owner acquiring the lock and setting the owner field. If
> * we're an RT task that will live-lock because we won't let
> * the owner complete.
> */
> if (!owner && (need_resched() || rt_task(current)))
> break;
>
> /*
> * The cpu_relax() call is a compiler barrier which forces
> * everything in this loop to be re-loaded. We don't need
> * memory barriers as we'll eventually observe the right
> * values at the cost of a few extra spins.
> */
> arch_mutex_cpu_relax();
>
> }
>
> preempt_enable();
> return 0;
> }

This is identical to the changes that I've tested. Thanks for sharing.

Tim

> > > > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > > > adjustment = 0;
> > > > if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > > > adjustment = RWSEM_ACTIVE_READ_BIAS;
> > > > - try_reader_grant:
> > > > - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > > - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > > - /* A writer stole the lock. Undo our reader grant. */
> > > > + while (1) {
> > > > + long oldcount;
> > > > +
> > > > + /* A writer stole the lock. */
> > > > + if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > > > + return sem;
> > > > +
> > > > + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > > > + cpu_relax();
> > > > + continue;
> > > > + }
> >
> > The above two if statements could be cleaned up as a single check:
> >
> > if (unlikely(sem->count < RWSEM_WAITING_BIAS))
> > return sem;
> >
> > This one statement is sufficient to check that we don't have a writer
> > stolen the lock before we attempt to acquire the read lock by modifying
> > sem->count.
>
> We probably still want to keep the cpu relaxation if the statement
> doesn't comply.
>
> Thanks,
> Davidlohr
>
>

2013-06-17 23:19:05

by Alex Shi

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On 06/18/2013 02:45 AM, Tim Chen wrote:
>>> + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
>>> > > + cpu_relax();
>>> > > + continue;
>>> > > + }
> The above two if statements could be cleaned up as a single check:
>
> if (unlikely(sem->count < RWSEM_WAITING_BIAS))
> return sem;
>
> This one statement is sufficient to check that we don't have a writer
> stolen the lock before we attempt to acquire the read lock by modifying
> sem->count.
>
>

Thanks. I will send out the patchset based on your suggestion.


--
Thanks
Alex

2013-06-17 23:20:24

by Alex Shi

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> After a lot of benchmarking, I finally got the ideal results for aim7,
> so far: this patch + optimistic spinning with preemption disabled. Just
> like optimistic spinning, this patch by itself makes little to no
> difference, yet combined is where we actually outperform 3.10-rc5. In
> addition, I noticed extra throughput when disabling preemption in
> try_optimistic_spin().
>
> With i_mmap as a rwsem and these changes I could see performance
> benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users, for fewer users, it made little to no difference.

A pretty good number. What's the CPU count on your machine? :)

--
Thanks
Alex

2013-06-17 23:35:27

by Davidlohr Bueso

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > After a lot of benchmarking, I finally got the ideal results for aim7,
> > so far: this patch + optimistic spinning with preemption disabled. Just
> > like optimistic spinning, this patch by itself makes little to no
> > difference, yet combined is where we actually outperform 3.10-rc5. In
> > addition, I noticed extra throughput when disabling preemption in
> > try_optimistic_spin().
> >
> > With i_mmap as a rwsem and these changes I could see performance
> > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > users, for fewer users, it made little to no difference.
>
> A pretty good number. what's the cpu number in your machine? :)

8-socket, 80 cores (ht off)

2013-06-18 00:08:01

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > so far: this patch + optimistic spinning with preemption disabled. Just
> > > like optimistic spinning, this patch by itself makes little to no
> > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > addition, I noticed extra throughput when disabling preemption in
> > > try_optimistic_spin().
> > >
> > > With i_mmap as a rwsem and these changes I could see performance
> > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > users, for fewer users, it made little to no difference.
> >
> > A pretty good number. what's the cpu number in your machine? :)
>
> 8-socket, 80 cores (ht off)
>
>

David,

I wonder if you are interested in trying the experimental patch below.
It tries to avoid unnecessary writes to sem->count when we are
going to fail the down_write, by executing rwsem_down_write_failed_s
instead of rwsem_down_write_failed. It should further reduce the
cache line bouncing. It didn't make a difference for my
workload, but I wonder if it may help yours more, in addition to the
other two patches. Right now the patch is an ugly hack. I'll merge
rwsem_down_write_failed_s and rwsem_down_write_failed into one
function if this approach actually helps things.

I'll clean these three patches after we have some idea of their
effectiveness.

Thanks.

Tim

Signed-off-by: Tim Chen <[email protected]>
---
commit 04c8ad3f21861746d5b7fff55a6ef186a4dd0765
Author: Tim Chen <[email protected]>
Date: Mon Jun 10 04:50:04 2013 -0700

Try skip write to rwsem->count when we have active lockers

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..83f9184 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -33,6 +33,7 @@ struct rw_semaphore {

extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed_s(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);

diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..188f6ea 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -42,12 +42,22 @@ EXPORT_SYMBOL(down_read_trylock);
/*
* lock for writing
*/
+
+static void ___down_write(struct rw_semaphore *sem)
+{
+ if (sem->count & RWSEM_ACTIVE_MASK) {
+ rwsem_down_write_failed_s(sem);
+ return;
+ }
+ __down_write(sem);
+}
+
void __sched down_write(struct rw_semaphore *sem)
{
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

- LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ LOCK_CONTENDED(sem, __down_write_trylock, ___down_write);
}

EXPORT_SYMBOL(down_write);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..25143b5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -248,6 +248,63 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
return sem;
}

+struct rw_semaphore __sched *rwsem_down_write_failed_s(struct rw_semaphore *sem)
+{
+ long count, adjustment = 0;
+ struct rwsem_waiter waiter;
+ struct task_struct *tsk = current;
+
+ /* set up my own style of waitqueue */
+ waiter.task = tsk;
+ waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (list_empty(&sem->wait_list))
+ adjustment += RWSEM_WAITING_BIAS;
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* If there were already threads queued before us and there are no
+ * active writers, the lock must be read owned; so we try to wake
+ * any read locks that were queued ahead of us. */
+ if (adjustment == 0) {
+ if (sem->count > RWSEM_WAITING_BIAS)
+ sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
+ } else
+ count = rwsem_atomic_update(adjustment, sem);
+
+ /* wait until we successfully acquire the lock */
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ while (true) {
+ if (!(sem->count & RWSEM_ACTIVE_MASK)) {
+ /* Try acquiring the write lock. */
+ count = RWSEM_ACTIVE_WRITE_BIAS;
+ if (!list_is_singular(&sem->wait_list))
+ count += RWSEM_WAITING_BIAS;
+
+ if (sem->count == RWSEM_WAITING_BIAS &&
+ cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
+ RWSEM_WAITING_BIAS)
+ break;
+ }
+
+ raw_spin_unlock_irq(&sem->wait_lock);
+
+ /* Block until there are no active lockers. */
+ do {
+ schedule();
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ }
+
+ list_del(&waiter.list);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ tsk->state = TASK_RUNNING;
+
+ return sem;
+}
+
/*
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
@@ -289,5 +346,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)

EXPORT_SYMBOL(rwsem_down_read_failed);
EXPORT_SYMBOL(rwsem_down_write_failed);
+EXPORT_SYMBOL(rwsem_down_write_failed_s);
EXPORT_SYMBOL(rwsem_wake);
EXPORT_SYMBOL(rwsem_downgrade_wake);

2013-06-19 13:16:20

by Ingo Molnar

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> Ingo,
>
> At the time of switching the anon-vma tree's lock from mutex to
> rw-sem (commit 5a505085), we encountered regressions for fork heavy workload.
> A lot of optimizations to rw-sem (e.g. lock stealing) helped to
> mitigate the problem. I tried an experiment on the 3.10-rc4 kernel
> to compare the performance of rw-sem to one that uses mutex. I saw
> a 8% regression in throughput for rw-sem vs a mutex implementation in
> 3.10-rc4.
>
> For the experiments, I used the exim mail server workload in
> the MOSBENCH test suite on 4 socket (westmere) and a 4 socket
> (ivy bridge) with the number of clients sending mail equal
> to number of cores. The mail server will
> fork off a process to handle an incoming mail and put it into mail
> spool. The lock protecting the anon-vma tree is stressed due to
> heavy forking. On both machines, I saw that the mutex implementation
> has 8% more throughput. I've pinned the cpu frequency to maximum
> in the experiments.
>
> I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested
> each tweak individually.
>
> 1) Add an owner field when a writer holds the lock and introduce
> optimistic spinning when an active writer is holding the semaphore.
> It reduced the context switching by 30% to a level very close to the
> mutex implementation. However, I did not see any throughput improvement
> of exim.
>
> 2) When the sem->count's active field is non-zero (i.e. someone
> is holding the lock), we can skip directly to the down_write_failed
> path, without adding the RWSEM_DOWN_WRITE_BIAS and taking
> it off again from sem->count, saving us two atomic operations.
> Since we will try the lock stealing again later, this should be okay.
> Unfortunately it did not improve the exim workload either.
>
> Any suggestions on the difference between rwsem and mutex performance
> and possible improvements to recover this regression?
>
> Thanks.
>
> Tim
>
> vmstat for mutex implementation:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
>
> vmstat for rw-sem implementation (3.10-rc4)
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0

It appears the main difference is that the rwsem variant context-switches
about 36% more than the mutex version, right?

I'm wondering how that's possible - the lock is mostly write-locked,
correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
ought to have brought roughly the same lock-stealing behavior as mutexes
do, right?

So the next analytical step would be to figure out why rwsem lock-stealing
is not behaving in an equivalent fashion on this workload. Do readers come
in frequently enough to disrupt write-lock-stealing perhaps?

Context-switch call-graph profiling might shed some light on where the
extra context switches come from...

Something like:

perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1

or a variant thereof might do the trick.

Thanks,

Ingo

2013-06-19 16:53:51

by Tim Chen

Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:

> > vmstat for mutex implementation:
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> >
> > vmstat for rw-sem implementation (3.10-rc4)
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
>
> It appears the main difference is that the rwsem variant context-switches
> about 36% more than the mutex version, right?
>
> I'm wondering how that's possible - the lock is mostly write-locked,
> correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> ought to have brought roughly the same lock-stealing behavior as mutexes
> do, right?
>
> So the next analytical step would be to figure out why rwsem lock-stealing
> is not behaving in an equivalent fashion on this workload. Do readers come
> in frequently enough to disrupt write-lock-stealing perhaps?
>
> Context-switch call-graph profiling might shed some light on where the
> extra context switches come from...
>
> Something like:
>
> perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1
>
> or a variant thereof might do the trick.
>

Ingo,

It appears that we have many more down_write failures causing a process to
block, versus going into the slowpath in the mutex case.

Here's the profile data from
perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1

3.10-rc4 (mutex implementation context switch profile)

- 59.51% exim [kernel.kallsyms] [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
- 99.98% schedule
+ 33.07% schedule_timeout
+ 23.84% pipe_wait
+ 20.24% do_wait
+ 12.37% do_exit
+ 5.34% sigsuspend
- 3.40% schedule_preempt_disabled
__mutex_lock_common.clone.8
__mutex_lock_slowpath
- mutex_lock <---------low rate mutex blocking
+ 65.71% lock_anon_vma_root.clone.24
+ 19.03% anon_vma_lock.clone.21
+ 7.14% dup_mm
+ 5.36% unlink_file_vma
+ 1.71% ima_file_check
+ 0.64% generic_file_aio_write
- 1.07% rwsem_down_write_failed
call_rwsem_down_write_failed
exit_shm
do_exit
do_group_exit
SyS_exit_group
system_call_fastpath
- 27.61% smtpbm [kernel.kallsyms] [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
- schedule
- schedule_timeout
+ 100.00% sk_wait_data
+ 0.46% swapper [kernel.kallsyms] [k] perf_trace_sched_switch


----------------------
3.10-rc4 (rw-sem implementation context switch profile)

81.91% exim [kernel.kallsyms] [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
- 99.99% schedule
- 65.26% rwsem_down_write_failed <------High write lock blocking
- call_rwsem_down_write_failed
- 79.36% lock_anon_vma_root.clone.27
+ 52.64% unlink_anon_vmas
+ 47.36% anon_vma_clone
+ 12.16% anon_vma_fork
+ 8.00% anon_vma_free
+ 11.96% schedule_timeout
+ 7.66% do_exit
+ 7.61% do_wait
+ 5.49% pipe_wait
+ 1.82% sigsuspend
13.55% smtpbm [kernel.kallsyms] [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
- schedule
- schedule_timeout
0.11% rcu_sched [kernel.kallsyms] [k] perf_trace_sched_switch


Thanks.

Tim

2013-06-19 23:11:59

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-06-17 at 17:08 -0700, Tim Chen wrote:
> On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> > On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > > so far: this patch + optimistic spinning with preemption disabled. Just
> > > > like optimistic spinning, this patch by itself makes little to no
> > > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > > addition, I noticed extra throughput when disabling preemption in
> > > > try_optimistic_spin().
> > > >
> > > > With i_mmap as a rwsem and these changes I could see performance
> > > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > > users, for fewer users, it made little to no difference.
> > >
> > > A pretty good number. what's the cpu number in your machine? :)
> >
> > 8-socket, 80 cores (ht off)
> >
> >
>
> David,
>
> I wonder if you are interested to try the experimental patch below.
> It tries to avoid unnecessary writes to the sem->count when we are
> going to fail the down_write by executing rwsem_down_write_failed_s
> instead of rwsem_down_write_failed. It should further reduce the
> cache line bouncing. It didn't make a difference for my
> workload. Wonder if it may help yours more in addition to the
> other two patches. Right now the patch is an ugly hack. I'll merge
> rwsem_down_write_failed_s and rwsem_down_write_failed into one
> function if this approach actually helps things.
>

I tried this on top of the patches we've already been dealing with. It
actually did more harm than good. I only got a slight increase in the
five_sec workload; for the rest, either no effect or a negative one. So far
the best results are still with spin on owner + preempt disable + Alex's
patches.

Thanks,
Davidlohr


> I'll clean these three patches after we have some idea of their
> effectiveness.
>
> Thanks.
>
> Tim
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> commit 04c8ad3f21861746d5b7fff55a6ef186a4dd0765
> Author: Tim Chen <[email protected]>
> Date: Mon Jun 10 04:50:04 2013 -0700
>
> Try skip write to rwsem->count when we have active lockers
>
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 0616ffe..83f9184 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -33,6 +33,7 @@ struct rw_semaphore {
>
> extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
> +extern struct rw_semaphore *rwsem_down_write_failed_s(struct rw_semaphore *sem);
> extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
> extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
>
> diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> index cfff143..188f6ea 100644
> --- a/kernel/rwsem.c
> +++ b/kernel/rwsem.c
> @@ -42,12 +42,22 @@ EXPORT_SYMBOL(down_read_trylock);
> /*
> * lock for writing
> */
> +
> +static void ___down_write(struct rw_semaphore *sem)
> +{
> + if (sem->count & RWSEM_ACTIVE_MASK) {
> + rwsem_down_write_failed_s(sem);
> + return;
> + }
> + __down_write(sem);
> +}
> +
> void __sched down_write(struct rw_semaphore *sem)
> {
> might_sleep();
> rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>
> - LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + LOCK_CONTENDED(sem, __down_write_trylock, ___down_write);
> }
>
> EXPORT_SYMBOL(down_write);
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 19c5fa9..25143b5 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -248,6 +248,63 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> return sem;
> }
>
> +struct rw_semaphore __sched *rwsem_down_write_failed_s(struct rw_semaphore *sem)
> +{
> + long count, adjustment = 0;
> + struct rwsem_waiter waiter;
> + struct task_struct *tsk = current;
> +
> + /* set up my own style of waitqueue */
> + waiter.task = tsk;
> + waiter.type = RWSEM_WAITING_FOR_WRITE;
> +
> + raw_spin_lock_irq(&sem->wait_lock);
> + if (list_empty(&sem->wait_list))
> + adjustment += RWSEM_WAITING_BIAS;
> + list_add_tail(&waiter.list, &sem->wait_list);
> +
> + /* If there were already threads queued before us and there are no
> + * active writers, the lock must be read owned; so we try to wake
> + * any read locks that were queued ahead of us. */
> + if (adjustment == 0) {
> + if (sem->count > RWSEM_WAITING_BIAS)
> + sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
> + } else
> + count = rwsem_atomic_update(adjustment, sem);
> +
> + /* wait until we successfully acquire the lock */
> + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + while (true) {
> + if (!(sem->count & RWSEM_ACTIVE_MASK)) {
> + /* Try acquiring the write lock. */
> + count = RWSEM_ACTIVE_WRITE_BIAS;
> + if (!list_is_singular(&sem->wait_list))
> + count += RWSEM_WAITING_BIAS;
> +
> + if (sem->count == RWSEM_WAITING_BIAS &&
> + cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> + RWSEM_WAITING_BIAS)
> + break;
> + }
> +
> + raw_spin_unlock_irq(&sem->wait_lock);
> +
> + /* Block until there are no active lockers. */
> + do {
> + schedule();
> + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
> +
> + raw_spin_lock_irq(&sem->wait_lock);
> + }
> +
> + list_del(&waiter.list);
> + raw_spin_unlock_irq(&sem->wait_lock);
> + tsk->state = TASK_RUNNING;
> +
> + return sem;
> +}
> +
> /*
> * handle waking up a waiter on the semaphore
> * - up_read/up_write has decremented the active part of count if we come here
> @@ -289,5 +346,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
>
> EXPORT_SYMBOL(rwsem_down_read_failed);
> EXPORT_SYMBOL(rwsem_down_write_failed);
> +EXPORT_SYMBOL(rwsem_down_write_failed_s);
> EXPORT_SYMBOL(rwsem_wake);
> EXPORT_SYMBOL(rwsem_downgrade_wake);
>
>

2013-06-19 23:24:13

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Wed, 2013-06-19 at 16:11 -0700, Davidlohr Bueso wrote:
> On Mon, 2013-06-17 at 17:08 -0700, Tim Chen wrote:
> > On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> > > On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > > > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > > > so far: this patch + optimistic spinning with preemption disabled. Just
> > > > > like optimistic spinning, this patch by itself makes little to no
> > > > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > > > addition, I noticed extra throughput when disabling preemption in
> > > > > try_optimistic_spin().
> > > > >
> > > > > With i_mmap as a rwsem and these changes I could see performance
> > > > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > > > users, for fewer users, it made little to no difference.
> > > >
> > > > A pretty good number. what's the cpu number in your machine? :)
> > >
> > > 8-socket, 80 cores (ht off)
> > >
> > >
> >
> > David,
> >
> > I wonder if you are interested to try the experimental patch below.
> > It tries to avoid unnecessary writes to the sem->count when we are
> > going to fail the down_write by executing rwsem_down_write_failed_s
> > instead of rwsem_down_write_failed. It should further reduce the
> > cache line bouncing. It didn't make a difference for my
> > workload. Wonder if it may help yours more in addition to the
> > other two patches. Right now the patch is an ugly hack. I'll merge
> > rwsem_down_write_failed_s and rwsem_down_write_failed into one
> > function if this approach actually helps things.
> >
>
> I tried this on top of the patches we've already been dealing with. It
> actually did more harm than good. Only got a slight increase in the
> five_sec workload, for the rest either no effect, or negative. So far
> the best results are still with spin on owner + preempt disable + Alex's
> patches.
>

Thanks for trying it out. I'm a little disappointed, as I was expecting no
change in performance in the worst case.

Tim

2013-06-26 00:20:00

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
>
> > > vmstat for mutex implementation:
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > >
> > > vmstat for rw-sem implementation (3.10-rc4)
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> >
> > It appears the main difference is that the rwsem variant context-switches
> > about 36% more than the mutex version, right?
> >
> > I'm wondering how that's possible - the lock is mostly write-locked,
> > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > ought to have brought roughly the same lock-stealing behavior as mutexes
> > do, right?
> >
> > So the next analytical step would be to figure out why rwsem lock-stealing
> > is not behaving in an equivalent fashion on this workload. Do readers come
> > in frequently enough to disrupt write-lock-stealing perhaps?

Ingo,

I did some instrumentation on the write lock failure path. I found that
for the exim workload, there are no readers blocking on the rwsem when
write locking fails. Lock stealing is successful 9.1% of the
time, and the rest of the write lock failures cause the writer to go to
sleep. About 1.4% of the writers sleep more than once; the majority of the
writers sleep once.
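
(For reference, the instrumentation is conceptually nothing more than a
handful of per-cpu counters along the lines of the sketch below - the
struct and field names here are made up for illustration, this is not the
actual patch I used.)

#include <linux/percpu.h>

/* Sketch only - counters bumped from the write-lock slowpath. */
struct rwsem_wlock_stats {
        unsigned long blocked;           /* writers entering rwsem_down_write_failed() */
        unsigned long blocked_by_reader; /* a reader held the sem at that point        */
        unsigned long stolen;            /* lock stolen on the first attempt           */
        unsigned long slept_once;        /* acquired after exactly one sleep           */
        unsigned long slept_more;        /* acquired after more than one sleep         */
};
static DEFINE_PER_CPU(struct rwsem_wlock_stats, wlock_stats);

/*
 * e.g. this_cpu_inc(wlock_stats.blocked) at the top of
 * rwsem_down_write_failed(), this_cpu_inc(wlock_stats.slept_once) at the
 * matching exit path, etc., with the per-cpu totals summed up afterwards
 * and reported through a debugfs file.
 */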

It is weird that lock stealing is not successful more often.

Tim

> >
> > Context-switch call-graph profiling might shed some light on where the
> > extra context switches come from...
> >
> > Something like:
> >
> > perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1
> >
> > or a variant thereof might do the trick.
> >
>
> Ingo,
>
> It appears that we are having much more down write failure causing a process to
> block vs going to the slow path for the mutex case.
>
> Here's the profile data from
> perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1
>
> 3.10-rc4 (mutex implementation context switch profile)
>
> - 59.51% exim [kernel.kallsyms] [k] perf_trace_sched_switch
> - perf_trace_sched_switch
> - __schedule
> - 99.98% schedule
> + 33.07% schedule_timeout
> + 23.84% pipe_wait
> + 20.24% do_wait
> + 12.37% do_exit
> + 5.34% sigsuspend
> - 3.40% schedule_preempt_disabled
> __mutex_lock_common.clone.8
> __mutex_lock_slowpath
> - mutex_lock <---------low rate mutex blocking
> + 65.71% lock_anon_vma_root.clone.24
> + 19.03% anon_vma_lock.clone.21
> + 7.14% dup_mm
> + 5.36% unlink_file_vma
> + 1.71% ima_file_check
> + 0.64% generic_file_aio_write
> - 1.07% rwsem_down_write_failed
> call_rwsem_down_write_failed
> exit_shm
> do_exit
> do_group_exit
> SyS_exit_group
> system_call_fastpath
> - 27.61% smtpbm [kernel.kallsyms] [k] perf_trace_sched_switch
> - perf_trace_sched_switch
> - __schedule
> - schedule
> - schedule_timeout
> + 100.00% sk_wait_data
> + 0.46% swapper [kernel.kallsyms] [k] perf_trace_sched_switch
>
>
> ----------------------
> 3.10-rc4 implementation (rw-sem context switch profile)
>
> 81.91% exim [kernel.kallsyms] [k] perf_trace_sched_switch
> - perf_trace_sched_switch
> - __schedule
> - 99.99% schedule
> - 65.26% rwsem_down_write_failed <------High write lock blocking
> - call_rwsem_down_write_failed
> - 79.36% lock_anon_vma_root.clone.27
> + 52.64% unlink_anon_vmas
> + 47.36% anon_vma_clone
> + 12.16% anon_vma_fork
> + 8.00% anon_vma_free
> + 11.96% schedule_timeout
> + 7.66% do_exit
> + 7.61% do_wait
> + 5.49% pipe_wait
> + 1.82% sigsuspend
> 13.55% smtpbm [kernel.kallsyms] [k] perf_trace_sched_switch
> - perf_trace_sched_switch
> - __schedule
> - schedule
> - schedule_timeout
> 0.11% rcu_sched [kernel.kallsyms] [k] perf_trace_sched_switch
>
>
> Thanks.
>
> Tim


2013-06-26 09:51:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> >
> > > > vmstat for mutex implementation:
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > >
> > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > >
> > > It appears the main difference is that the rwsem variant context-switches
> > > about 36% more than the mutex version, right?
> > >
> > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > do, right?
> > >
> > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > in frequently enough to disrupt write-lock-stealing perhaps?
>
> Ingo,
>
> I did some instrumentation on the write lock failure path. I found that
> for the exim workload, there are no readers blocking for the rwsem when
> write locking failed. The lock stealing is successful for 9.1% of the
> time and the rest of the write lock failure caused the writer to go to
> sleep. About 1.4% of the writers sleep more than once. Majority of the
> writers sleep once.
>
> It is weird that lock stealing is not successful more often.

For this to be comparable to the mutex scalability numbers you'd have to
compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.

Are both techniques applied in the kernel you are running your tests on?

Thanks,

Ingo

2013-06-26 21:36:08

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Wed, 2013-06-26 at 11:51 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > >
> > > > > vmstat for mutex implementation:
> > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > > >
> > > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > > >
> > > > It appears the main difference is that the rwsem variant context-switches
> > > > about 36% more than the mutex version, right?
> > > >
> > > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > > do, right?
> > > >
> > > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > > in frequently enough to disrupt write-lock-stealing perhaps?
> >
> > Ingo,
> >
> > I did some instrumentation on the write lock failure path. I found that
> > for the exim workload, there are no readers blocking for the rwsem when
> > write locking failed. The lock stealing is successful for 9.1% of the
> > time and the rest of the write lock failure caused the writer to go to
> > sleep. About 1.4% of the writers sleep more than once. Majority of the
> > writers sleep once.
> >
> > It is weird that lock stealing is not successful more often.
>
> For this to be comparable to the mutex scalability numbers you'd have to
> compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.
>
> Are both techniques applied in the kernel you are running your tests on?
>

Ingo,

The previous experiment was done on a kernel without spinning.
I've redone the testing on two kernels for a 15 sec stretch of the
workload run, one with the adaptive (or optimistic)
spinning and the other without. Both have the patches from Alex to avoid
cmpxchg induced cache bouncing.

With the spinning, I sleep much less for lock acquisition (18.6% vs 91.58%).
However, the write lock acquisition blocked count doubles. That
offsets the gain from spinning, which may be why
I didn't see a gain for this particular workload.

                                             No Opt Spin    Opt Spin
Writer acquisition blocked count                 3448946     7359040
Blocked by reader                                  0.00%       0.55%
Lock acquired first attempt (lock stealing)        8.42%      16.92%
Lock acquired second attempt (1 sleep)            90.26%      17.60%
Lock acquired after more than 1 sleep              1.32%       1.00%
Lock acquired with optimistic spin                   N/A      64.48%

Tim


2013-06-27 00:24:59

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Wed, 2013-06-26 at 14:36 -0700, Tim Chen wrote:
> On Wed, 2013-06-26 at 11:51 +0200, Ingo Molnar wrote:
> > * Tim Chen <[email protected]> wrote:
> >
> > > On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > > > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > > >
> > > > > > vmstat for mutex implementation:
> > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > > > >
> > > > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > > > >
> > > > > It appears the main difference is that the rwsem variant context-switches
> > > > > about 36% more than the mutex version, right?
> > > > >
> > > > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > > > do, right?
> > > > >
> > > > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > > > in frequently enough to disrupt write-lock-stealing perhaps?
> > >
> > > Ingo,
> > >
> > > I did some instrumentation on the write lock failure path. I found that
> > > for the exim workload, there are no readers blocking for the rwsem when
> > > write locking failed. The lock stealing is successful for 9.1% of the
> > > time and the rest of the write lock failure caused the writer to go to
> > > sleep. About 1.4% of the writers sleep more than once. Majority of the
> > > writers sleep once.
> > >
> > > It is weird that lock stealing is not successful more often.
> >
> > For this to be comparable to the mutex scalability numbers you'd have to
> > compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.
> >
> > Are both techniques applied in the kernel you are running your tests on?
> >
>
> Ingo,
>
> The previous experiment was done on a kernel without spinning.
> I've redone the testing on two kernel for a 15 sec stretch of the
> workload run. One with the adaptive (or optimistic)
> spinning and the other without. Both have the patches from Alex to avoid
> cmpxchg induced cache bouncing.
>
> With the spinning, I sleep much less for lock acquisition (18.6% vs 91.58%).
> However, I've got doubling of write lock acquisition getting
> blocked. So that offset the gain from spinning which may be why
> I didn't see gain for this particular workload.
>
> No Opt Spin Opt Spin
> Writer acquisition blocked count 3448946 7359040
> Blocked by reader 0.00% 0.55%
> Lock acquired first attempt (lock stealing) 8.42% 16.92%
> Lock acquired second attempt (1 sleep) 90.26% 17.60%
> Lock acquired after more than 1 sleep 1.32% 1.00%
> Lock acquired with optimistic spin N/A 64.48%
>

Adding also the mutex statistics for the 3.10-rc4 kernel with the mutex
implementation of the lock for the anon_vma tree. I wonder if Ingo has any
insight on why the mutex performs better from these stats.

Mutex acquisition blocked count                 14380340
Lock acquired in slowpath (no sleep)               0.06%
Lock acquired in slowpath (1 sleep)                0.24%
Lock acquired in slowpath more than 1 sleep        0.98%
Lock acquired with optimistic spin                 99.6%

Thanks.

Tim

2013-06-27 08:36:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> On Wed, 2013-06-26 at 14:36 -0700, Tim Chen wrote:
> > On Wed, 2013-06-26 at 11:51 +0200, Ingo Molnar wrote:
> > > * Tim Chen <[email protected]> wrote:
> > >
> > > > On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > > > > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > > > >
> > > > > > > vmstat for mutex implementation:
> > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > > > > >
> > > > > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > > > > >
> > > > > > It appears the main difference is that the rwsem variant context-switches
> > > > > > about 36% more than the mutex version, right?
> > > > > >
> > > > > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > > > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > > > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > > > > do, right?
> > > > > >
> > > > > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > > > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > > > > in frequently enough to disrupt write-lock-stealing perhaps?
> > > >
> > > > Ingo,
> > > >
> > > > I did some instrumentation on the write lock failure path. I found that
> > > > for the exim workload, there are no readers blocking for the rwsem when
> > > > write locking failed. The lock stealing is successful for 9.1% of the
> > > > time and the rest of the write lock failure caused the writer to go to
> > > > sleep. About 1.4% of the writers sleep more than once. Majority of the
> > > > writers sleep once.
> > > >
> > > > It is weird that lock stealing is not successful more often.
> > >
> > > For this to be comparable to the mutex scalability numbers you'd have to
> > > compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.
> > >
> > > Are both techniques applied in the kernel you are running your tests on?
> > >
> >
> > Ingo,
> >
> > The previous experiment was done on a kernel without spinning.
> > I've redone the testing on two kernel for a 15 sec stretch of the
> > workload run. One with the adaptive (or optimistic)
> > spinning and the other without. Both have the patches from Alex to avoid
> > cmpxchg induced cache bouncing.
> >
> > With the spinning, I sleep much less for lock acquisition (18.6% vs 91.58%).
> > However, I've got doubling of write lock acquisition getting
> > blocked. So that offset the gain from spinning which may be why
> > I didn't see gain for this particular workload.
> >
> > No Opt Spin Opt Spin
> > Writer acquisition blocked count 3448946 7359040
> > Blocked by reader 0.00% 0.55%
> > Lock acquired first attempt (lock stealing) 8.42% 16.92%
> > Lock acquired second attempt (1 sleep) 90.26% 17.60%
> > Lock acquired after more than 1 sleep 1.32% 1.00%
> > Lock acquired with optimistic spin N/A 64.48%
> >
>
> Adding also the mutex statistics for the 3.10-rc4 kernel with mutex
> implemenation of lock for anon_vma tree. Wonder if Ingo has any
> insight on why mutex performs better from these stats.
>
> Mutex acquisition blocked count 14380340
> Lock acquired in slowpath (no sleep) 0.06%
> Lock acquired in slowpath (1 sleep) 0.24%
> Lock acquired in slowpath more than 1 sleep 0.98%
> Lock acquired with optimistic spin 99.6%

This is how I interpret the stats:

It does appear that in the mutex case we manage to acquire the lock via
spinning with a very high percentage - i.e. it essentially behaves as a
spinlock.

That is actually good news in a way, because it makes it rather simple how
rwsems should behave in this case: since they have no substantial
read-locking aspect in this workload, the down_write()/up_write()s should
essentially behave like spinlocks as well, right?

Yet in the rwsem-spinning case the stats show that we only acquire the
lock via spinning in 65% of the cases, plus we lock-steal in 16.9% of the
cases:

Because lock stealing is essentially a single-spin spinning as well:

> > Lock acquired first attempt (lock stealing) ...... 16.92%

So rwsems in this case behave like spinlocks in 65%+16.9% == 81.9% of the
time.

What remains is the sleeping component:

> > Lock acquired second attempt (1 sleep) ...... 17.60%

Yet the 17.6% sleep percentage is still much higher than the 1% in the
mutex case. Why doesn't spinning work - do we time out of spinning
differently?

Is there some other aspect that defeats optimistic spinning and forces the
slowpath and creates sleeping, scheduling and thus extra overhead?

For example after a failed lock-stealing, do we still try optimistic
spinning to write-acquire the rwsem, or go into the slowpath and thus
trigger excessive context-switches?

Thanks,

Ingo

2013-06-27 20:53:46

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Thu, 2013-06-27 at 10:36 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > On Wed, 2013-06-26 at 14:36 -0700, Tim Chen wrote:
> > > On Wed, 2013-06-26 at 11:51 +0200, Ingo Molnar wrote:
> > > > * Tim Chen <[email protected]> wrote:
> > > >
> > > > > On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > > > > > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > > > > >
> > > > > > > > vmstat for mutex implementation:
> > > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > > > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > > > > > >
> > > > > > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > > > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > > > > > >
> > > > > > > It appears the main difference is that the rwsem variant context-switches
> > > > > > > about 36% more than the mutex version, right?
> > > > > > >
> > > > > > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > > > > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > > > > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > > > > > do, right?
> > > > > > >
> > > > > > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > > > > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > > > > > in frequently enough to disrupt write-lock-stealing perhaps?
> > > > >
> > > > > Ingo,
> > > > >
> > > > > I did some instrumentation on the write lock failure path. I found that
> > > > > for the exim workload, there are no readers blocking for the rwsem when
> > > > > write locking failed. The lock stealing is successful for 9.1% of the
> > > > > time and the rest of the write lock failure caused the writer to go to
> > > > > sleep. About 1.4% of the writers sleep more than once. Majority of the
> > > > > writers sleep once.
> > > > >
> > > > > It is weird that lock stealing is not successful more often.
> > > >
> > > > For this to be comparable to the mutex scalability numbers you'd have to
> > > > compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.
> > > >
> > > > Are both techniques applied in the kernel you are running your tests on?
> > > >
> > >
> > > Ingo,
> > >
> > > The previous experiment was done on a kernel without spinning.
> > > I've redone the testing on two kernel for a 15 sec stretch of the
> > > workload run. One with the adaptive (or optimistic)
> > > spinning and the other without. Both have the patches from Alex to avoid
> > > cmpxchg induced cache bouncing.
> > >
> > > With the spinning, I sleep much less for lock acquisition (18.6% vs 91.58%).
> > > However, I've got doubling of write lock acquisition getting
> > > blocked. So that offset the gain from spinning which may be why
> > > I didn't see gain for this particular workload.
> > >
> > > No Opt Spin Opt Spin
> > > Writer acquisition blocked count 3448946 7359040
> > > Blocked by reader 0.00% 0.55%
> > > Lock acquired first attempt (lock stealing) 8.42% 16.92%
> > > Lock acquired second attempt (1 sleep) 90.26% 17.60%
> > > Lock acquired after more than 1 sleep 1.32% 1.00%
> > > Lock acquired with optimistic spin N/A 64.48%
> > >
> >
> > Adding also the mutex statistics for the 3.10-rc4 kernel with mutex
> > implemenation of lock for anon_vma tree. Wonder if Ingo has any
> > insight on why mutex performs better from these stats.
> >
> > Mutex acquisition blocked count 14380340
> > Lock acquired in slowpath (no sleep) 0.06%
> > Lock acquired in slowpath (1 sleep) 0.24%
> > Lock acquired in slowpath more than 1 sleep 0.98%
> > Lock acquired with optimistic spin 99.6%
>
> This is how I interpret the stats:
>
> It does appear that in the mutex case we manage to acquire via spinning
> with a very high percentage - i.e. it essentialy behaves as a spinlock.
>
> That is actually good news in a way, because it makes it rather simple how
> rwsems should behave in this case: since they have no substantial
> read-locking aspect in this workload, the down_write()/up_write()s should
> essentially behave like spinlocks as well, right?

Yes, it makes sense.

>
> Yet in the rwsem-spinning case the stats show that we only acquire the
> lock via spinning in 65% of the cases, plus we lock-steal in 16.9% of the
> cases:
>
> Because lock stealing is essentially a single-spin spinning as well:
>
> > > Lock acquired first attempt (lock stealing) ...... 16.92%
>
> So rwsems in this case behave like spinlocks in 65%+16.9% == 81.9% of the
> time.
>
> What remains is the sleeping component:
>
> > > Lock acquired second attempt (1 sleep) ...... 17.60%
>
> Yet the 17.6% sleep percentage is still much higher than the 1% in the
> mutex case. Why doesn't spinning work - do we time out of spinning
> differently?

I have some stats for the 18.6% of cases (including the 1% more-than-one-sleep
cases) that go to sleep after failing optimistic spinning.
There are 3 abort points in the rwsem_optimistic_spin code:

1. 11.8% is due to abort point #1, where we don't find an owner and
assume that the lock is probably reader-owned, as we've just tried
to acquire the lock previously for lock stealing. I think I will need
to actually check sem->count to make sure we have a reader-owned lock
before aborting the spin.

2. 6.8% is due to abort point #2, where the lock owner switches
to another writer or we need rescheduling.

3. A minuscule amount is due to abort point #3, where we don't have
an owner of the lock but need rescheduling.

int rwsem_optimistic_spin(struct rw_semaphore *sem)
{
        struct task_struct *owner;
        int ret = 0;

        /* sem->wait_lock should not be held when doing optimistic spinning */
        if (!rwsem_can_spin_on_owner(sem))
                return ret;     <------------------------------- abort (1)

        preempt_disable();
        for (;;) {
                owner = ACCESS_ONCE(sem->owner);
                if (owner && !rwsem_spin_on_owner(sem, owner))
                        break;  <--------------------------- abort (2)

                /* wait_lock will be acquired if write_lock is obtained */
                if (rwsem_try_write_lock(sem->count, true, sem)) {
                        ret = 1;
                        break;
                }

                /*
                 * When there's no owner, we might have preempted between the
                 * owner acquiring the lock and setting the owner field. If
                 * we're an RT task that will live-lock because we won't let
                 * the owner complete.
                 */
                if (!owner && (need_resched() || rt_task(current)))
                        break;  <---------------------------- abort (3)

                /*
                 * The cpu_relax() call is a compiler barrier which forces
                 * everything in this loop to be re-loaded. We don't need
                 * memory barriers as we'll eventually observe the right
                 * values at the cost of a few extra spins.
                 */
                arch_mutex_cpu_relax();

        }

        preempt_enable();
        return ret;
}

See the other thread for the complete patch of the rwsem optimistic spin code:
https://lkml.org/lkml/2013/6/26/692

Any suggestions on tweaking this are appreciated.

> Is there some other aspect that defeats optimistic spinning and forces the
> slowpath and creates sleeping, scheduling and thus extra overhead?
>
There are other aspects that are different from the mutex in my optimistic
spinning for rwsem:

1. Mutex spinning has the MCS lock.
I have disabled the MCS lock in the mutex and get the same profile and
performance for my tests. So this is probably not the reason for the
performance difference.

2. Preemption is disabled at the beginning of mutex acquisition.
I have tried moving the preemption disable of the rwsem from
the optimistic spin to the top of rwsem_down_write_failed.
However, I didn't see a change in performance.


> For example after a failed lock-stealing, do we still try optimistic
> spinning to write-acquire the rwsem, or go into the slowpath and thus
> trigger excessive context-switches?

I do try optimistic spinning after a failed lock stealing, but
not after we have gone to sleep.

Thanks,

Tim

2013-06-27 23:31:13

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Thu, 2013-06-27 at 13:53 -0700, Tim Chen wrote:
> On Thu, 2013-06-27 at 10:36 +0200, Ingo Molnar wrote:
> > * Tim Chen <[email protected]> wrote:
> >
> > > On Wed, 2013-06-26 at 14:36 -0700, Tim Chen wrote:
> > > > On Wed, 2013-06-26 at 11:51 +0200, Ingo Molnar wrote:
> > > > > * Tim Chen <[email protected]> wrote:
> > > > >
> > > > > > On Wed, 2013-06-19 at 09:53 -0700, Tim Chen wrote:
> > > > > > > On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > > > > > >
> > > > > > > > > vmstat for mutex implementation:
> > > > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > > > 38 0 0 130957920 47860 199956 0 0 0 56 236342 476975 14 72 14 0 0
> > > > > > > > > 41 0 0 130938560 47860 219900 0 0 0 0 236816 479676 14 72 14 0 0
> > > > > > > > >
> > > > > > > > > vmstat for rw-sem implementation (3.10-rc4)
> > > > > > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > > > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > > > > > > 40 0 0 130933984 43232 202584 0 0 0 0 321817 690741 13 71 16 0 0
> > > > > > > > > 39 0 0 130913904 43232 224812 0 0 0 0 322193 692949 13 71 16 0 0
> > > > > > > >
> > > > > > > > It appears the main difference is that the rwsem variant context-switches
> > > > > > > > about 36% more than the mutex version, right?
> > > > > > > >
> > > > > > > > I'm wondering how that's possible - the lock is mostly write-locked,
> > > > > > > > correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> > > > > > > > ought to have brought roughly the same lock-stealing behavior as mutexes
> > > > > > > > do, right?
> > > > > > > >
> > > > > > > > So the next analytical step would be to figure out why rwsem lock-stealing
> > > > > > > > is not behaving in an equivalent fashion on this workload. Do readers come
> > > > > > > > in frequently enough to disrupt write-lock-stealing perhaps?
> > > > > >
> > > > > > Ingo,
> > > > > >
> > > > > > I did some instrumentation on the write lock failure path. I found that
> > > > > > for the exim workload, there are no readers blocking for the rwsem when
> > > > > > write locking failed. The lock stealing is successful for 9.1% of the
> > > > > > time and the rest of the write lock failure caused the writer to go to
> > > > > > sleep. About 1.4% of the writers sleep more than once. Majority of the
> > > > > > writers sleep once.
> > > > > >
> > > > > > It is weird that lock stealing is not successful more often.
> > > > >
> > > > > For this to be comparable to the mutex scalability numbers you'd have to
> > > > > compare wlock-stealing _and_ adaptive spinning for failed-wlock rwsems.
> > > > >
> > > > > Are both techniques applied in the kernel you are running your tests on?
> > > > >
> > > >
> > > > Ingo,
> > > >
> > > > The previous experiment was done on a kernel without spinning.
> > > > I've redone the testing on two kernel for a 15 sec stretch of the
> > > > workload run. One with the adaptive (or optimistic)
> > > > spinning and the other without. Both have the patches from Alex to avoid
> > > > cmpxchg induced cache bouncing.
> > > >
> > > > With the spinning, I sleep much less for lock acquisition (18.6% vs 91.58%).
> > > > However, I've got doubling of write lock acquisition getting
> > > > blocked. So that offset the gain from spinning which may be why
> > > > I didn't see gain for this particular workload.
> > > >
> > > > No Opt Spin Opt Spin
> > > > Writer acquisition blocked count 3448946 7359040
> > > > Blocked by reader 0.00% 0.55%
> > > > Lock acquired first attempt (lock stealing) 8.42% 16.92%
> > > > Lock acquired second attempt (1 sleep) 90.26% 17.60%
> > > > Lock acquired after more than 1 sleep 1.32% 1.00%
> > > > Lock acquired with optimistic spin N/A 64.48%
> > > >
> > >
> > > Adding also the mutex statistics for the 3.10-rc4 kernel with mutex
> > > implemenation of lock for anon_vma tree. Wonder if Ingo has any
> > > insight on why mutex performs better from these stats.
> > >
> > > Mutex acquisition blocked count 14380340
> > > Lock acquired in slowpath (no sleep) 0.06%
> > > Lock acquired in slowpath (1 sleep) 0.24%
> > > Lock acquired in slowpath more than 1 sleep 0.98%
> > > Lock acquired with optimistic spin 99.6%
> >
> > This is how I interpret the stats:
> >
> > It does appear that in the mutex case we manage to acquire via spinning
> > with a very high percentage - i.e. it essentialy behaves as a spinlock.
> >
> > That is actually good news in a way, because it makes it rather simple how
> > rwsems should behave in this case: since they have no substantial
> > read-locking aspect in this workload, the down_write()/up_write()s should
> > essentially behave like spinlocks as well, right?
>
> Yes, it makes sense.
>
> >
> > Yet in the rwsem-spinning case the stats show that we only acquire the
> > lock via spinning in 65% of the cases, plus we lock-steal in 16.9% of the
> > cases:
> >
> > Because lock stealing is essentially a single-spin spinning as well:
> >
> > > > Lock acquired first attempt (lock stealing) ...... 16.92%
> >
> > So rwsems in this case behave like spinlocks in 65%+16.9% == 81.9% of the
> > time.
> >
> > What remains is the sleeping component:
> >
> > > > Lock acquired second attempt (1 sleep) ...... 17.60%
> >
> > Yet the 17.6% sleep percentage is still much higher than the 1% in the
> > mutex case. Why doesn't spinning work - do we time out of spinning
> > differently?
>
> I have some stats for the 18.6% cases (including 1% more than
> 1 sleep cases) that go to sleep and failed optimistic spinning.
> There are 3 abort points in the rwsem_optimistic_spin code:
>
> 1. 11.8% is due to abort point #1, where we don't find an owner and
> assumed that probably a reader owned lock as we've just tried
> to acquire lock previously for lock stealing. I think I will need
> to actually check the sem->count to make sure we have reader owned lock
> before aborting spin.

I tried some tweaking that checks sem->count for a reader-owned lock.
Even though it reduces the percentage of acquisitions that
need sleeping by 8.14% (from 18.6% to 10.46%), it increases the writer
acquisition blocked count by 11%. This change still doesn't boost
throughput and shows a tiny regression for the workload.

                                                Opt Spin     Opt Spin
                                                             (with tweak)
Writer acquisition blocked count                 7359040      8168006
Blocked by reader                                  0.55%        0.52%
Lock acquired first attempt (lock stealing)       16.92%       19.70%
Lock acquired second attempt (1 sleep)            17.60%        9.32%
Lock acquired after more than 1 sleep               1.00%        1.14%
Lock acquired with optimistic spin                64.48%       69.84%
Optimistic spin abort 1                           11.77%        1.14%
Optimistic spin abort 2                            6.81%        9.22%
Optimistic spin abort 3                            0.02%        0.10%

--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -221,16 +221,21 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
{
int retval;
struct task_struct *owner;
+ long count;

rcu_read_lock();
owner = ACCESS_ONCE(sem->owner);

- /* Spin only if active writer running */
+ /* Don't spin if active writer not running or reader holds lock */
if (owner)
retval = owner->on_cpu;
- else
- retval = false;
-
+ else {
+ count = ACCESS_ONCE(sem->count);
+ if (count > RWSEM_WAITING_BIAS)
+ retval = false;
+ else
+ retval = true;
+ }
rcu_read_unlock();
/*

Thanks.

Tim

> 2. 6.8% is due to abort point #2, where the mutex owner switches
> to another writer or we need rescheduling.
>
> 3. Minuscule amount due to abort point #3, where we don't have
> a owner of the lock but need rescheduling
>
> int rwsem_optimistic_spin(struct rw_semaphore *sem)
> {
> struct task_struct *owner;
> int ret = 0;
>
> /* sem->wait_lock should not be held when doing optimistic spinning */
> if (!rwsem_can_spin_on_owner(sem))
> return ret; <------------------------------- abort (1)
>
> preempt_disable();
> for (;;) {
> owner = ACCESS_ONCE(sem->owner);
> if (owner && !rwsem_spin_on_owner(sem, owner))
> break; <--------------------------- abort (2)
>
> /* wait_lock will be acquired if write_lock is obtained */
> if (rwsem_try_write_lock(sem->count, true, sem)) {
> ret = 1;
> break;
> }
>
> /*
> * When there's no owner, we might have preempted between the
> * owner acquiring the lock and setting the owner field. If
> * we're an RT task that will live-lock because we won't let
> * the owner complete.
> */
> if (!owner && (need_resched() || rt_task(current)))
> break; <---------------------------- abort (3)
>
> /*
> * The cpu_relax() call is a compiler barrier which forces
> * everything in this loop to be re-loaded. We don't need
> * memory barriers as we'll eventually observe the right
> * values at the cost of a few extra spins.
> */
> arch_mutex_cpu_relax();
>
> }
>
> preempt_enable();
> return ret;
>
> See the other thread for complete patch of rwsem optimistic spin code:
> https://lkml.org/lkml/2013/6/26/692
>
> Any suggestions on tweaking this is appreciated.
>
> > Is there some other aspect that defeats optimistic spinning and forces the
> > slowpath and creates sleeping, scheduling and thus extra overhead?
> >
> There are other aspects that are different from mutex in my optimistic
> spinning for rwsem:
>
> 1. Mutex spinning has MCS lock.
> I have disabled MCS lock in mutex and get same profile and
> performance for my tests. So this is probably not a reason for
> performance difference.
>
> 2. Preemption was disabled at the beginning of mutex acquisition.
> I have tried moving the preemption disable of rwsem from
> the optimistic spin to the top of rwsem_down_write_failed.
> However, I didn't see a change in performance.
>
>
> > For example after a failed lock-stealing, do we still try optimistic
> > spinning to write-acquire the rwsem, or go into the slowpath and thus
> > trigger excessive context-switches?
>
> I do try optimistic spinning after a failed lock stealing. However,
> not after we have gone to sleep.
>

2013-06-28 09:20:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> > Yet the 17.6% sleep percentage is still much higher than the 1% in the
> > mutex case. Why doesn't spinning work - do we time out of spinning
> > differently?
>
> I have some stats for the 18.6% cases (including 1% more than 1 sleep
> cases) that go to sleep and failed optimistic spinning. There are 3
> abort points in the rwsem_optimistic_spin code:
>
> 1. 11.8% is due to abort point #1, where we don't find an owner and
> assumed that probably a reader owned lock as we've just tried to acquire
> lock previously for lock stealing. I think I will need to actually
> check the sem->count to make sure we have reader owned lock before
> aborting spin.

That looks to be the biggest remaining effect.

> 2. 6.8% is due to abort point #2, where the mutex owner switches
> to another writer or we need rescheduling.
>
> 3. Minuscule amount due to abort point #3, where we don't have
> a owner of the lock but need rescheduling

The percentages here might go down if #1 is fixed. Excessive scheduling
creates wakeups and a higher rate of preemption as well, since waiting
writers are woken.

There's a chance that if you fix #1 you'll get to the mutex equivalency
Holy Grail! :-)

> See the other thread for complete patch of rwsem optimistic spin code:
> https://lkml.org/lkml/2013/6/26/692
>
> Any suggestions on tweaking this is appreciated.

I think you are on the right track: the goal is to eliminate these sleeps,
the mutex case proves that it's possible to just spin and not sleep much.

It would be even more complex to match it if the mutex workload showed
significant internal complexity - but it does not, it still just behaves
like spinlocks, right?

Thanks,

Ingo

2013-06-28 09:38:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> I tried some tweaking that checks sem->count for read owned lock. Even
> though it reduces the percentage of acquisitions that need sleeping by
> 8.14% (from 18.6% to 10.46%), it increases the writer acquisition
> blocked count by 11%. This change still doesn't boost throughput and has
> a tiny regression for the workload.
>
> Opt Spin Opt Spin
> (with tweak)
> Writer acquisition blocked count 7359040 8168006
> Blocked by reader 0.55% 0.52%
> Lock acquired first attempt (lock stealing) 16.92% 19.70%
> Lock acquired second attempt (1 sleep) 17.60% 9.32%
> Lock acquired after more than 1 sleep 1.00% 1.14%
> Lock acquired with optimistic spin 64.48% 69.84%
> Optimistic spin abort 1 11.77% 1.14%
> Optimistic spin abort 2 6.81% 9.22%
> Optimistic spin abort 3 0.02% 0.10%

So lock stealing+spinning now acquires the lock successfully ~90% of the
time; the remaining sleeps are:

> Lock acquired second attempt (1 sleep) ...... 9.32%

And these sleeps are mostly due to:

> Optimistic spin abort 2 ..... 9.22%

Right?

So this particular #2 abort point is:

| preempt_disable();
| for (;;) {
| owner = ACCESS_ONCE(sem->owner);
| if (owner && !rwsem_spin_on_owner(sem, owner))
| break; <--------------------------- abort (2)

The next step would be to investigate why we decide not to spin there - why
does rwsem_spin_on_owner() fail?

If I got all the patches right, rwsem_spin_on_owner() is this:

+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}

where owner_running() is similar to the mutex spinning code: it in the end
checks owner->on_cpu - like the mutex code.

If my analysis is correct so far then it might be useful to add two more
stats: did rwsem_spin_on_owner() fail because lock->owner == NULL [owner
released the rwsem], or because owner_running() failed [owner went to
sleep]?
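
Something like the sketch below would be enough to tell the two apart -
just a sketch, the per-cpu counter names are invented, it merely classifies
why the spin-on-owner loop ended:

static DEFINE_PER_CPU(unsigned long, rwsem_spin_stat_owner_released);
static DEFINE_PER_CPU(unsigned long, rwsem_spin_stat_owner_blocked);

static noinline
int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
{
        rcu_read_lock();
        while (owner_running(lock, owner)) {
                if (need_resched())
                        break;

                arch_mutex_cpu_relax();
        }
        rcu_read_unlock();

        if (lock->owner == NULL)
                this_cpu_inc(rwsem_spin_stat_owner_released);  /* owner released the rwsem */
        else
                this_cpu_inc(rwsem_spin_stat_owner_blocked);   /* owner changed, slept, or we need to resched */

        /*
         * We break out the loop above on need_resched() and when the
         * owner changed, which is a sign for heavy contention. Return
         * success only when lock->owner is NULL.
         */
        return lock->owner == NULL;
}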

Thanks,

Ingo

2013-06-28 21:04:21

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Fri, 2013-06-28 at 11:38 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > I tried some tweaking that checks sem->count for read owned lock. Even
> > though it reduces the percentage of acquisitions that need sleeping by
> > 8.14% (from 18.6% to 10.46%), it increases the writer acquisition
> > blocked count by 11%. This change still doesn't boost throughput and has
> > a tiny regression for the workload.
> >
> > Opt Spin Opt Spin
> > (with tweak)
> > Writer acquisition blocked count 7359040 8168006
> > Blocked by reader 0.55% 0.52%
> > Lock acquired first attempt (lock stealing) 16.92% 19.70%
> > Lock acquired second attempt (1 sleep) 17.60% 9.32%
> > Lock acquired after more than 1 sleep 1.00% 1.14%
> > Lock acquired with optimistic spin 64.48% 69.84%
> > Optimistic spin abort 1 11.77% 1.14%
> > Optimistic spin abort 2 6.81% 9.22%
> > Optimistic spin abort 3 0.02% 0.10%
>
> So lock stealing+spinning now acquires the lock successfully ~90% of the
> time, the remaining sleeps are:
>
> > Lock acquired second attempt (1 sleep) ...... 9.32%
>
> And the reason these sleeps are mostly due to:
>
> > Optimistic spin abort 2 ..... 9.22%
>
> Right?
>
> So this particular #2 abort point is:
>
> | preempt_disable();
> | for (;;) {
> | owner = ACCESS_ONCE(sem->owner);
> | if (owner && !rwsem_spin_on_owner(sem, owner))
> | break; <--------------------------- abort (2)
>
> Next step would be to investigate why we decide to not spin there, why
> does rwsem_spin_on_owner() fail?
>
> If I got all the patches right, rwsem_spin_on_owner() is this:
>
> +static noinline
> +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> +{
> + rcu_read_lock();
> + while (owner_running(lock, owner)) {
> + if (need_resched())
> + break;
> +
> + arch_mutex_cpu_relax();
> + }
> + rcu_read_unlock();
> +
> + /*
> + * We break out the loop above on need_resched() and when the
> + * owner changed, which is a sign for heavy contention. Return
> + * success only when lock->owner is NULL.
> + */
> + return lock->owner == NULL;
> +}
>
> where owner_running() is similar to the mutex spinning code: it in the end
> checks owner->on_cpu - like the mutex code.
>
> If my analysis is correct so far then it might be useful to add two more
> stats: did rwsem_spin_on_owner() fail because lock->owner == NULL [owner
> released the rwsem], or because owner_running() failed [owner went to
> sleep]?

Ingo,

I tabulated the cases where rwsem_spin_on_owner returns false and causes
us to stop spinning.

97.12% was due to the lock's owner switching to another writer
0.01% was due to the owner of the lock sleeping
2.87% was due to need_resched()

I made a change to allow us to continue to spin even when the lock's
owner switches to another writer. I did get the lock to be acquired
mostly (98%) via optimistic spin and lock stealing now, but my
benchmark's throughput actually got reduced by 30% (too many cycles
spent on useless spinning?). The lock statistics are below:

Writer acquisition blocked count 7538864
Blocked by reader 0.37%
Lock acquired first attempt (lock stealing) 18.45%
Lock acquired second attempt (1 sleep) 1.69%
Lock acquired after more than 1 sleep 0.25%
Lock acquired with optimistic spin 79.62%
Optimistic spin failure (abort point 1) 1.37%
Optimistic spin failure (abort point 2) 0.32%
Optimistic spin failure (abort point 3) 0.23%
(Opt spin abort point 2 breakdown) owner sleep 0.00%
(Opt spin abort point 2 breakdown) need_resched 0.32%
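
Roughly, the change amounts to something like the sketch below
(hypothetical function name, not the exact diff I tested): the spin
loop simply re-reads sem->owner instead of bailing out as soon as
ownership moves to another writer.

/*
 * Sketch of "keep spinning across writer-to-writer owner changes".
 * Assumes sem->owner is maintained as in the optimistic-spinning
 * patches discussed in this thread; the rcu_read_lock() keeps the
 * task_struct from being freed under us, as in owner_running().
 */
static int rwsem_spin_on_any_owner(struct rw_semaphore *sem)
{
	struct task_struct *owner;

	rcu_read_lock();
	for (;;) {
		if (need_resched())
			break;

		owner = ACCESS_ONCE(sem->owner);
		if (!owner)
			break;			/* released, or reader owned */

		if (!ACCESS_ONCE(owner->on_cpu))
			break;			/* current owner went to sleep */

		/* owner (old or new) is still running: keep spinning */
		arch_mutex_cpu_relax();
	}
	rcu_read_unlock();

	/* Only worth retrying the write lock if nobody owns it for write. */
	return ACCESS_ONCE(sem->owner) == NULL;
}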


Thanks.

Tim

2013-06-29 07:12:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> > If my analysis is correct so far then it might be useful to add two
> > more stats: did rwsem_spin_on_owner() fail because lock->owner == NULL
> > [owner released the rwsem], or because owner_running() failed [owner
> > went to sleep]?
>
> Ingo,
>
> I tabulated the cases where rwsem_spin_on_owner returns false and causes
> us to stop spinning.
>
> 97.12% was due to lock's owner switching to another writer
> 0.01% was due to the owner of the lock sleeping
> 2.87% was due to need_resched()
>
> I made a change to allow us to continue to spin even when lock's owner
> switch to another writer. I did get the lock to be acquired now mostly
> (98%) via optimistic spin and lock stealing, but my benchmark's
> throughput actually got reduced by 30% (too many cycles spent on useless
> spinning?).

Hm, I'm running out of quick ideas :-/ The writer-ends-spinning sequence
is pretty similar in the rwsem and in the mutex case. I'd have a look at
one more detail: is the wakeup of another writer in the rwsem case
singular, is only a single writer woken? I suspect the answer is yes ...

A quick glance suggests that the ordering of wakeups of waiters is the
same for mutexes and rwsems: FIFO, single waiter woken on slowpath-unlock.
So that shouldn't make a big difference.

If all last-ditch efforts to analyze it via counters fail then the way I'd
approach it next is brute-force instrumentation:

- First I'd create a workload 'steady state' that can be traced and
examined without worrying that it ends or switches to some other
workload.

- Then I'd create a relatively lightweight trace (maybe trace_printk() is
lightweight enough), and capture key mutex and rwsem events.

- I'd capture a 1-10 second trace in steady state, both with rwsems and
mutexes. I'd have a good look at which tasks take locks and schedule
how and why. I'd try to eliminate any asymmetries in behavior, i.e.
make rwsems behave like mutexes.

The risk and difficulty is that tracing can easily skew locking patterns,
so I'd first check whether, with such new tracepoints enabled, the asymmetry
in behavior and the regression are still present.
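
As an illustration (hypothetical helper and event names, nothing that
has been posted), the lightweight trace could be little more than a
trace_printk() wrapper called from the interesting points of the rwsem
and mutex slowpaths:

#include <linux/kernel.h>	/* trace_printk() */
#include <linux/sched.h>	/* current, task_pid_nr() */
#include <linux/smp.h>		/* raw_smp_processor_id() */

static inline void lock_trace(const char *event, void *lock)
{
	trace_printk("%s lock=%p pid=%d cpu=%d\n", event, lock,
		     task_pid_nr(current), raw_smp_processor_id());
}

/*
 * Illustrative call sites, e.g. in rwsem_down_write_failed() and
 * __mutex_lock_common():
 *
 *	lock_trace("rwsem-write-slowpath-enter", sem);
 *	lock_trace("rwsem-write-sleep", sem);
 *	lock_trace("rwsem-write-woken", sem);
 *	lock_trace("mutex-slowpath-enter", lock);
 *
 * The resulting events show up in /sys/kernel/debug/tracing/trace.
 */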

Thanks,

Ingo

2013-07-01 20:28:17

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Sat, 2013-06-29 at 09:12 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > > If my analysis is correct so far then it might be useful to add two
> > > more stats: did rwsem_spin_on_owner() fail because lock->owner == NULL
> > > [owner released the rwsem], or because owner_running() failed [owner
> > > went to sleep]?
> >
> > Ingo,
> >
> > I tabulated the cases where rwsem_spin_on_owner returns false and causes
> > us to stop spinning.
> >
> > 97.12% was due to lock's owner switching to another writer
> > 0.01% was due to the owner of the lock sleeping
> > 2.87% was due to need_resched()
> >
> > I made a change to allow us to continue to spin even when lock's owner
> > switch to another writer. I did get the lock to be acquired now mostly
> > (98%) via optimistic spin and lock stealing, but my benchmark's
> > throughput actually got reduced by 30% (too many cycles spent on useless
> > spinning?).
>
> Hm, I'm running out of quick ideas :-/ The writer-ends-spinning sequence
> is pretty similar in the rwsem and in the mutex case. I'd have a look at
> one more detail: is the wakeup of another writer in the rwsem case
> singular, is only a single writer woken? I suspect the answer is yes ...

Ingo, we can only wake one writer, right? In __rwsem_do_wake, that is
indeed the case. Or you are talking about something else?

>
> A quick glance suggests that the ordering of wakeups of waiters is the
> same for mutexes and rwsems: FIFO, single waiter woken on slowpath-unlock.
> So that shouldn't make a big difference.

> If all last-ditch efforts to analyze it via counters fail then the way I'd
> approach it next is brute-force instrumentation:
>
> - First I'd create a workload 'steady state' that can be traced and
> examined without worrying that that it ends or switches to some other
> workload.
>
> - Then I'd create a relatively lightweight trace (maybe trace_printk() is
> lightweight enough), and capture key mutex and rwsem events.
>
> - I'd capture a 1-10 seconds trace in steady state, both with rwsems and
> mutexes. I'd have a good look at which tasks take locks and schedule
> how and why. I'd try to eliminate any assymetries in behavior, i.e.
> make rwsems behave like mutexes.

You mean adding trace points to record the events? If you can be more
specific on what data to capture, that will be helpful. It will be a
holiday here in the US, so I may only get around to this the following week.

Thanks!

Tim
>
> The risk and difficulty is that tracing can easily skew locking patterns,
> so I'd first check whether with such new tracepoints enabled the assymetry
> in behavior and regression is still present.
>
> Thanks,
>
> Ingo

2013-07-02 06:45:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> On Sat, 2013-06-29 at 09:12 +0200, Ingo Molnar wrote:
> > * Tim Chen <[email protected]> wrote:
> >
> > > > If my analysis is correct so far then it might be useful to add two
> > > > more stats: did rwsem_spin_on_owner() fail because lock->owner == NULL
> > > > [owner released the rwsem], or because owner_running() failed [owner
> > > > went to sleep]?
> > >
> > > Ingo,
> > >
> > > I tabulated the cases where rwsem_spin_on_owner returns false and causes
> > > us to stop spinning.
> > >
> > > 97.12% was due to lock's owner switching to another writer
> > > 0.01% was due to the owner of the lock sleeping
> > > 2.87% was due to need_resched()
> > >
> > > I made a change to allow us to continue to spin even when lock's owner
> > > switch to another writer. I did get the lock to be acquired now mostly
> > > (98%) via optimistic spin and lock stealing, but my benchmark's
> > > throughput actually got reduced by 30% (too many cycles spent on useless
> > > spinning?).
> >
> > Hm, I'm running out of quick ideas :-/ The writer-ends-spinning sequence
> > is pretty similar in the rwsem and in the mutex case. I'd have a look at
> > one more detail: is the wakeup of another writer in the rwsem case
> > singular, is only a single writer woken? I suspect the answer is yes ...
>
> Ingo, we can only wake one writer, right? In __rwsem_do_wake, that is
> indeed the case. Or you are talking about something else?

Yeah, I was talking about that, and my understanding and reading of the
code says that too - I just wanted to make sure :-)

> >
> > A quick glance suggests that the ordering of wakeups of waiters is the
> > same for mutexes and rwsems: FIFO, single waiter woken on
> > slowpath-unlock. So that shouldn't make a big difference.
>
> > If all last-ditch efforts to analyze it via counters fail then the way
> > I'd approach it next is brute-force instrumentation:
> >
> > - First I'd create a workload 'steady state' that can be traced and
> > examined without worrying that that it ends or switches to some other
> > workload.
> >
> > - Then I'd create a relatively lightweight trace (maybe trace_printk() is
> > lightweight enough), and capture key mutex and rwsem events.
> >
> > - I'd capture a 1-10 seconds trace in steady state, both with rwsems and
> > mutexes. I'd have a good look at which tasks take locks and schedule
> > how and why. I'd try to eliminate any assymetries in behavior, i.e.
> > make rwsems behave like mutexes.
>
> You mean adding trace points to record the events? If you can be more
> specific on what data to capture, that will be helpful. It will be
> holidays here in US so I may get around to this the following week.

Yeah, adding the relevant tracepoints (or trace_printk(), which is much
simpler albeit a bit more expensive) - and capturing 1-second
steady-state traces via ftrace.

Then I'd compare the two traces and look at the biggest difference, and
try to zoom in on it to figure out why the difference occurs. More
trace_printk()s can be added as you suspect specific areas or want to
confirm various theories.

[ Assuming the phenomenon does not go away under tracing :-/ ]

[
Another brute-force approach is to add a dynamic debug knob to switch
between a spinlock and an rwsem implementation for that lock: I'd do it
by adding both types of locks to the data structure, initializing both
but using a /proc/sys/kernel/ knob to decide whether to use spin_*() or
rwsem facilities to utilize it. This way you could switch between the
two implementations without rebooting the system. In theory.
]
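
A rough sketch of that knob could look like the following (hypothetical
names, untested; both lock types are embedded and initialized, and a
sysctl selects which one the wrappers actually use):

/*
 * Sketch of the dual-lock debug knob described above - not a posted
 * patch.  Flipping the knob is only safe while the structure is
 * quiescent: "in theory", as noted above.
 */
#include <linux/rwsem.h>
#include <linux/spinlock.h>
#include <linux/sysctl.h>

static int sysctl_anon_vma_spinlock __read_mostly;	/* 0 = rwsem, 1 = spinlock */

struct anon_vma_dual_lock {
	struct rw_semaphore	rwsem;
	spinlock_t		lock;
};

static inline void dual_lock_init(struct anon_vma_dual_lock *l)
{
	init_rwsem(&l->rwsem);
	spin_lock_init(&l->lock);
}

static inline void dual_lock_write(struct anon_vma_dual_lock *l)
{
	if (sysctl_anon_vma_spinlock)
		spin_lock(&l->lock);
	else
		down_write(&l->rwsem);
}

static inline void dual_unlock_write(struct anon_vma_dual_lock *l)
{
	if (sysctl_anon_vma_spinlock)
		spin_unlock(&l->lock);
	else
		up_write(&l->rwsem);
}

/* /proc/sys/kernel/anon_vma_spinlock */
static struct ctl_table dual_lock_sysctl[] = {
	{
		.procname	= "anon_vma_spinlock",
		.data		= &sysctl_anon_vma_spinlock,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};
/* registered at boot with register_sysctl("kernel", dual_lock_sysctl); */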

Thanks,

Ingo

2013-07-16 17:53:15

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-07-02 at 08:45 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > On Sat, 2013-06-29 at 09:12 +0200, Ingo Molnar wrote:
> > > * Tim Chen <[email protected]> wrote:
> > >
> > > > > If my analysis is correct so far then it might be useful to add two
> > > > > more stats: did rwsem_spin_on_owner() fail because lock->owner == NULL
> > > > > [owner released the rwsem], or because owner_running() failed [owner
> > > > > went to sleep]?
> > > >
> > > > Ingo,
> > > >
> > > > I tabulated the cases where rwsem_spin_on_owner returns false and causes
> > > > us to stop spinning.
> > > >
> > > > 97.12% was due to lock's owner switching to another writer
> > > > 0.01% was due to the owner of the lock sleeping
> > > > 2.87% was due to need_resched()
> > > >
> > > > I made a change to allow us to continue to spin even when lock's owner
> > > > switch to another writer. I did get the lock to be acquired now mostly
> > > > (98%) via optimistic spin and lock stealing, but my benchmark's
> > > > throughput actually got reduced by 30% (too many cycles spent on useless
> > > > spinning?).
> > >
> > > Hm, I'm running out of quick ideas :-/ The writer-ends-spinning sequence
> > > is pretty similar in the rwsem and in the mutex case. I'd have a look at
> > > one more detail: is the wakeup of another writer in the rwsem case
> > > singular, is only a single writer woken? I suspect the answer is yes ...
> >
> > Ingo, we can only wake one writer, right? In __rwsem_do_wake, that is
> > indeed the case. Or you are talking about something else?
>
> Yeah, I was talking about that, and my understanding and reading of the
> code says that too - I just wanted to make sure :-)
>
> > >
> > > A quick glance suggests that the ordering of wakeups of waiters is the
> > > same for mutexes and rwsems: FIFO, single waiter woken on
> > > slowpath-unlock. So that shouldn't make a big difference.
> >

Ingo,

I tried MCS locking to order the writers but
it didn't make much difference on my particular workload.
After thinking about this some more, a likely explanation of the
performance difference between the mutex and rwsem implementations is:

1) Jobs acquiring a mutex put themselves on the wait list only after
optimistic spinning fails. That's only 2% of the time on my
test workload, so they access the wait list rarely.

2) Jobs acquiring the rw-sem for write *always* put themselves on the wait
list first, before trying lock stealing and optimistic spinning.
This creates a bottleneck at the wait list, and also more
cache bouncing.

One possible optimization is to delay putting the writer on the
wait list till after optimistic spinning, but we may need to keep
track of the number of writers waiting. We could add a WAIT_BIAS
to the count for each write waiter and remove the WAIT_BIAS each time a
writer job completes. This is tricky, as it changes the
semantics of the count field and will likely require a number of changes
to the rwsem code. Your thoughts on a better way to do this?
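
One possible reading of that bookkeeping, as a sketch only (hypothetical
helper names; whether the bias is dropped when the waiter acquires the
lock or when it releases it is exactly the kind of semantic detail that
needs working out):

/*
 * Hypothetical per-write-waiter accounting, not a posted patch.
 * Each writer that gives up spinning announces itself in the count;
 * the bias is dropped again once that writer is no longer waiting.
 */
static inline void rwsem_writer_queued(struct rw_semaphore *sem)
{
	/* one WAIT_BIAS per queued writer, not one for the whole list */
	rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
}

static inline void rwsem_writer_dequeued(struct rw_semaphore *sem)
{
	/* the waiter got the lock: drop its personal wait bias */
	rwsem_atomic_update(-RWSEM_WAITING_BIAS, sem);
}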

Thanks.

Tim

2013-07-23 09:45:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> Ingo,
>
> I tried MCS locking to order the writers but it didn't make much
> difference on my particular workload. After thinking about this some
> more, a likely explanation of the performance difference between mutex
> and rwsem performance is:
>
> 1) Jobs acquiring mutex put itself on the wait list only after
> optimistic spinning. That's only 2% of the time on my test workload so
> they access the wait list rarely.
>
> 2) Jobs acquiring rw-sem for write *always* put itself on the wait list
> first before trying lock stealing and optimistic spinning. This creates
> a bottleneck at the wait list, and also more cache bouncing.

Indeed ...

> One possible optimization is to delay putting the writer on the wait
> list till after optimistic spinning, but we may need to keep track of
> the number of writers waiting. We could add a WAIT_BIAS to count for
> each write waiter and remove the WAIT_BIAS each time a writer job
> completes. This is tricky as I'm changing the semantics of the count
> field and likely will require a number of changes to rwsem code. Your
> thoughts on a better way to do this?

Why not just try the delayed addition approach first? The spinning is time
limited AFAICS, so we don't _have to_ recognize those as writers per se,
only if the spinning fails and it wants to go on the waitlist. Am I
missing something?

It will change patterns, it might even change the fairness balance - but
is a legit change otherwise, especially if it helps performance.

Thanks,

Ingo

2013-07-23 09:51:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
> Why not just try the delayed addition approach first? The spinning is time
> limited AFAICS, so we don't _have to_ recognize those as writers per se,
> only if the spinning fails and it wants to go on the waitlist. Am I
> missing something?
>
> It will change patterns, it might even change the fairness balance - but
> is a legit change otherwise, especially if it helps performance.

Be very careful here. Some people (XFS) have very specific needs. Walken
and dchinner had a longish discussion on this a while back.

2013-07-23 09:53:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
>
> > Why not just try the delayed addition approach first? The spinning is
> > time limited AFAICS, so we don't _have to_ recognize those as writers
> > per se, only if the spinning fails and it wants to go on the waitlist.
> > Am I missing something?
> >
> > It will change patterns, it might even change the fairness balance -
> > but is a legit change otherwise, especially if it helps performance.
>
> Be very careful here. Some people (XFS) have very specific needs. Walken
> and dchinner had a longish discussion on this a while back.

Agreed - yet it's worth at least trying it out the quick way, to see the
main effect and to see whether that explains the performance asymmetry and
whether to invest more effort into it.

Thanks,

Ingo

2013-07-30 00:13:28

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-07-23 at 11:53 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
> >
> > > Why not just try the delayed addition approach first? The spinning is
> > > time limited AFAICS, so we don't _have to_ recognize those as writers
> > > per se, only if the spinning fails and it wants to go on the waitlist.
> > > Am I missing something?
> > >
> > > It will change patterns, it might even change the fairness balance -
> > > but is a legit change otherwise, especially if it helps performance.
> >
> > Be very careful here. Some people (XFS) have very specific needs. Walken
> > and dchinner had a longish discussion on this a while back.
>
> Agreed - yet it's worth at least trying it out the quick way, to see the
> main effect and to see whether that explains the performance assymetry and
> invest more effort into it.
>

Ingo & Peter,

Here's a patch that moves optimistic spinning for the writer lock
acquisition to before putting the writer on the wait list. It gave me
a 5% performance boost on my exim mail server workload.
It recovered a good portion of the 8% performance regression from
the mutex implementation.

I think we are on the right track. Let me know if there's anything
in the patch that may cause grief to XFS.

There is some further optimization possible. We go
to the failed path within __down_write if the count field is
non-zero. But in fact, if the old count field was RWSEM_WAITING_BIAS,
there's no one active and we could have stolen the
lock when we performed our atomic op on the count field
in __down_write. Yet we still take the failed path in the current
code.
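
For illustration (a sketch only, not part of the patch that follows),
using the asm-generic flavour of __down_write() the extra steal case
would look something like:

static inline void __down_write(struct rw_semaphore *sem)
{
	long tmp;

	tmp = atomic_long_add_return(RWSEM_ACTIVE_WRITE_BIAS,
				     (atomic_long_t *)&sem->count);
	if (likely(tmp == RWSEM_ACTIVE_WRITE_BIAS))
		return;		/* uncontended fast path */

	/*
	 * Sketch of the extra steal: the old count was RWSEM_WAITING_BIAS,
	 * i.e. waiters are queued but nobody holds the lock, so the atomic
	 * add above has already made us the write owner - no need to fail.
	 */
	if (tmp == RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS)
		return;

	rwsem_down_write_failed(sem);	/* someone is active: slow path */
}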

I will combine this change with Alex's rwsem patches
in a patchset later.

Your comments and thoughts are most welcomed.

Tim

Signed-off-by: Tim Chen <[email protected]>
---
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..58a4acb 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,10 @@ struct rw_semaphore {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+ struct task_struct *owner;
+ void *spin_mlock;
+#endif
};

extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
@@ -55,11 +59,21 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
# define __RWSEM_DEP_MAP_INIT(lockname)
#endif

+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+#define __RWSEM_INITIALIZER(name) \
+ { RWSEM_UNLOCKED_VALUE, \
+ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
+ LIST_HEAD_INIT((name).wait_list), \
+ __RWSEM_DEP_MAP_INIT(name) \
+ NULL, \
+ NULL }
+#else
#define __RWSEM_INITIALIZER(name) \
{ RWSEM_UNLOCKED_VALUE, \
__RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
LIST_HEAD_INIT((name).wait_list) \
__RWSEM_DEP_MAP_INIT(name) }
+#endif

#define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)
diff --git a/init/Kconfig b/init/Kconfig
index 9d3a788..d97225f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1595,6 +1595,16 @@ config TRACEPOINTS

source "arch/Kconfig"

+config RWSEM_SPIN_ON_WRITE_OWNER
+ bool "Optimistic spin write acquisition for writer owned rw-sem"
+ default n
+ depends on SMP
+ help
+ Allows a writer to perform optimistic spinning if another writer own
+ the read write semaphore. If the lock owner is running, it is likely
+ to release the lock soon. Spinning gives a greater chance for writer to
+ acquire a semaphore before putting it to sleep.
+
endmenu # General setup

config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..a32990a 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,26 @@

#include <linux/atomic.h>

+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+ sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+ sem->owner = NULL;
+}
+#else
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+}
+#endif
+
/*
* lock for reading
*/
@@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write);
@@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
{
int ret = __down_write_trylock(sem);

- if (ret == 1)
+ if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+ rwsem_set_owner(sem);
+ }
return ret;
}

@@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);

__up_write(sem);
+ rwsem_clear_owner(sem);
}

EXPORT_SYMBOL(up_write);
@@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
* dependency.
*/
__downgrade_write(sem);
+ rwsem_clear_owner(sem);
}

EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+ rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..1472ff3 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -8,6 +8,7 @@
*/
#include <linux/rwsem.h>
#include <linux/sched.h>
+#include <linux/sched/rt.h>
#include <linux/init.h>
#include <linux/export.h>

@@ -27,6 +28,10 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+ sem->owner = NULL;
+ sem->spin_mlock = NULL;
+#endif
}

EXPORT_SYMBOL(__init_rwsem);
@@ -194,48 +199,252 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
}

+static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+ if (!(count & RWSEM_ACTIVE_MASK)) {
+ /* Try acquiring the write lock. */
+ if (sem->count == RWSEM_WAITING_BIAS &&
+ cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
+ if (!list_is_singular(&sem->wait_list))
+ rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+ return 1;
+ }
+ }
+ return 0;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(rwsem) ((struct mspin_node **)&((rwsem)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+static inline int rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count;
+
+ count = ACCESS_ONCE(sem->count);
+retry:
+ if (count == RWSEM_WAITING_BIAS) {
+ count = cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+ RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS);
+ /* allow write lock stealing, try acquiring the write lock. */
+ if (count == RWSEM_WAITING_BIAS)
+ goto acquired;
+ else if (count == 0)
+ goto retry;
+ } else if (count == 0) {
+ count = cmpxchg(&sem->count, 0,
+ RWSEM_ACTIVE_WRITE_BIAS);
+ if (count == 0)
+ goto acquired;
+ else if (count == RWSEM_WAITING_BIAS)
+ goto retry;
+ }
+ return 0;
+
+acquired:
+ return 1;
+}
+
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ int retval;
+ struct task_struct *owner;
+
+ rcu_read_lock();
+ owner = ACCESS_ONCE(sem->owner);
+
+ /* Spin only if active writer running */
+ if (owner)
+ retval = owner->on_cpu;
+ else
+ retval = false;
+
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the sem owner may have just acquired
+ * it and not set the owner yet, or the sem has been released, or
+ * reader active.
+ */
+ return retval;
+}
+
+static inline bool owner_running(struct rw_semaphore *lock,
+ struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+int rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ struct task_struct *owner;
+ int ret = 0;
+
+ /* sem->wait_lock should not be held when doing optimistic spinning */
+ if (!rwsem_can_spin_on_owner(sem))
+ return ret;
+
+ preempt_disable();
+ for (;;) {
+ struct mspin_node node;
+
+ mspin_lock(MLOCK(sem), &node);
+ owner = ACCESS_ONCE(sem->owner);
+ if (owner && !rwsem_spin_on_owner(sem, owner)) {
+ mspin_unlock(MLOCK(sem), &node);
+ break;
+ }
+
+ /* wait_lock will be acquired if write_lock is obtained */
+ if (rwsem_try_write_lock_unqueued(sem)) {
+ mspin_unlock(MLOCK(sem), &node);
+ ret = 1;
+ break;
+ }
+ mspin_unlock(MLOCK(sem), &node);
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(current)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+
+ }
+
+ preempt_enable();
+ return ret;
+}
+#endif
+
/*
* wait until we successfully acquire the write lock
*/
struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
- long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
+ long count;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;
+ bool waiting = true;

+ count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+ /* do optimistic spinning */
+ if (rwsem_optimistic_spin(sem))
+ goto done;
+#endif
/* set up my own style of waitqueue */
waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_WRITE;

raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list))
- adjustment += RWSEM_WAITING_BIAS;
+ waiting = false;
list_add_tail(&waiter.list, &sem->wait_list);

/* we're now waiting on the lock, but no longer actively locking */
- count = rwsem_atomic_update(adjustment, sem);
+ if (waiting)
+ count = ACCESS_ONCE(sem->count);
+ else
+ count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);

/* If there were already threads queued before us and there are no
* active writers, the lock must be read owned; so we try to wake
* any read locks that were queued ahead of us. */
- if (count > RWSEM_WAITING_BIAS &&
- adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+ if ((count > RWSEM_WAITING_BIAS) && waiting)
sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);

/* wait until we successfully acquire the lock */
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (true) {
- if (!(count & RWSEM_ACTIVE_MASK)) {
- /* Try acquiring the write lock. */
- count = RWSEM_ACTIVE_WRITE_BIAS;
- if (!list_is_singular(&sem->wait_list))
- count += RWSEM_WAITING_BIAS;
-
- if (sem->count == RWSEM_WAITING_BIAS &&
- cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
- RWSEM_WAITING_BIAS)
- break;
- }
+ if (rwsem_try_write_lock(count, sem))
+ break;

raw_spin_unlock_irq(&sem->wait_lock);

@@ -250,6 +459,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)

list_del(&waiter.list);
raw_spin_unlock_irq(&sem->wait_lock);
+done:
tsk->state = TASK_RUNNING;

return sem;

2013-07-30 19:25:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> On Tue, 2013-07-23 at 11:53 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
> > >
> > > > Why not just try the delayed addition approach first? The spinning is
> > > > time limited AFAICS, so we don't _have to_ recognize those as writers
> > > > per se, only if the spinning fails and it wants to go on the waitlist.
> > > > Am I missing something?
> > > >
> > > > It will change patterns, it might even change the fairness balance -
> > > > but is a legit change otherwise, especially if it helps performance.
> > >
> > > Be very careful here. Some people (XFS) have very specific needs. Walken
> > > and dchinner had a longish discussion on this a while back.
> >
> > Agreed - yet it's worth at least trying it out the quick way, to see the
> > main effect and to see whether that explains the performance assymetry and
> > invest more effort into it.
> >
>
> Ingo & Peter,
>
> Here's a patch that moved optimistic spinning of the
> writer lock acquisition before putting the writer on the
> wait list. It did give me a 5% performance boost on my
> exim mail server workload. It recovered a good portion of
> the 8% performance regression from mutex implementation.

Very nice detective work!

> I think we are on the right track. Let me know if there's
> anything in the patch that may cause grief to XFS.
>
> There is some further optimization possible. We went to
> the failed path within __down_write if the count field is
> non zero. But in fact if the old count field was
> RWSEM_WAITING_BIAS, there's no one active and we could
> have stolen the lock when we perform our atomic op on the
> count field in __down_write. Yet we go to the failed path
> in the current code.
>
> I will combine this change and also Alex's patches on
> rwsem together in a patchset later.
>
> Your comments and thoughts are most welcomed.
>
> Tim
>
> Signed-off-by: Tim Chen <[email protected]>

> +config RWSEM_SPIN_ON_WRITE_OWNER
> + bool "Optimistic spin write acquisition for writer owned rw-sem"
> + default n
> + depends on SMP
> + help
> + Allows a writer to perform optimistic spinning if another writer own
> + the read write semaphore. If the lock owner is running, it is likely
> + to release the lock soon. Spinning gives a greater chance for writer to
> + acquire a semaphore before putting it to sleep.

The way you've worded this new Kconfig option makes it
sound as if it's not just for testing, and I'm not a big
believer in extra Kconfig degrees of freedom for
scalability features of core locking primitives like
rwsems, in production kernels ...

So the bad news is that such scalability optimizations
really need to work for everyone.

The good news is that I don't think there's anything
particularly controversial about making the rwsem write
side perform just as well as mutexes - it would in fact be
a very nice quality of implementation feature: it gives
people freedom to switch between mutexes and rwsems without
having to worry about scalability differences too much.

Once readers are mixed into the workload, can we still keep the
XFS assumptions intact - or would they be broken in any way?

We are spinning here so we have full awareness about the
state of the lock and we can react to a new reader in very
short order - so at a quick glance I don't see any
fundamental difficulty in being able to resolve it - beyond
the SMOP aspect that is ... :-)

Thanks,

Ingo

2013-07-30 20:00:00

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

cc'ing Dave Chinner for XFS

On Mon, 2013-07-29 at 17:13 -0700, Tim Chen wrote:
> On Tue, 2013-07-23 at 11:53 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
> > >
> > > > Why not just try the delayed addition approach first? The spinning is
> > > > time limited AFAICS, so we don't _have to_ recognize those as writers
> > > > per se, only if the spinning fails and it wants to go on the waitlist.
> > > > Am I missing something?
> > > >
> > > > It will change patterns, it might even change the fairness balance -
> > > > but is a legit change otherwise, especially if it helps performance.
> > >
> > > Be very careful here. Some people (XFS) have very specific needs. Walken
> > > and dchinner had a longish discussion on this a while back.
> >
> > Agreed - yet it's worth at least trying it out the quick way, to see the
> > main effect and to see whether that explains the performance assymetry and
> > invest more effort into it.
> >
>
> Ingo & Peter,
>
> Here's a patch that moved optimistic spinning of the writer lock
> acquisition before putting the writer on the wait list. It did give me
> a 5% performance boost on my exim mail server workload.
> It recovered a good portion of the 8% performance regression from
> mutex implementation.
>
> I think we are on the right track. Let me know if there's anything
> in the patch that may cause grief to XFS.
>
> There is some further optimization possible. We went
> to the failed path within __down_write if the count field is non
> zero. But in fact if the old count field was RWSEM_WAITING_BIAS,
> there's no one active and we could have stolen the
> lock when we perform our atomic op on the count field
> in __down_write. Yet we go to the failed path in the current
> code.
>
> I will combine this change and also Alex's patches on rwsem together
> in a patchset later.
>
> Your comments and thoughts are most welcomed.
>
> Tim
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 0616ffe..58a4acb 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -29,6 +29,10 @@ struct rw_semaphore {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> struct lockdep_map dep_map;
> #endif
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + struct task_struct *owner;
> + void *spin_mlock;
> +#endif
> };
>
> extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> @@ -55,11 +59,21 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
> # define __RWSEM_DEP_MAP_INIT(lockname)
> #endif
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +#define __RWSEM_INITIALIZER(name) \
> + { RWSEM_UNLOCKED_VALUE, \
> + __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
> + LIST_HEAD_INIT((name).wait_list), \
> + __RWSEM_DEP_MAP_INIT(name) \
> + NULL, \
> + NULL }
> +#else
> #define __RWSEM_INITIALIZER(name) \
> { RWSEM_UNLOCKED_VALUE, \
> __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
> LIST_HEAD_INIT((name).wait_list) \
> __RWSEM_DEP_MAP_INIT(name) }
> +#endif
>
> #define DECLARE_RWSEM(name) \
> struct rw_semaphore name = __RWSEM_INITIALIZER(name)
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d3a788..d97225f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1595,6 +1595,16 @@ config TRACEPOINTS
>
> source "arch/Kconfig"
>
> +config RWSEM_SPIN_ON_WRITE_OWNER
> + bool "Optimistic spin write acquisition for writer owned rw-sem"
> + default n
> + depends on SMP
> + help
> + Allows a writer to perform optimistic spinning if another writer own
> + the read write semaphore. If the lock owner is running, it is likely
> + to release the lock soon. Spinning gives a greater chance for writer to
> + acquire a semaphore before putting it to sleep.
> +
> endmenu # General setup
>
> config HAVE_GENERIC_DMA_COHERENT
> diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> index cfff143..a32990a 100644
> --- a/kernel/rwsem.c
> +++ b/kernel/rwsem.c
> @@ -12,6 +12,26 @@
>
> #include <linux/atomic.h>
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> + sem->owner = current;
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> + sem->owner = NULL;
> +}
> +#else
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +}
> +#endif
> +
> /*
> * lock for reading
> */
> @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
> rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(down_write);
> @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
> {
> int ret = __down_write_trylock(sem);
>
> - if (ret == 1)
> + if (ret == 1) {
> rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> + rwsem_set_owner(sem);
> + }
> return ret;
> }
>
> @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
> rwsem_release(&sem->dep_map, 1, _RET_IP_);
>
> __up_write(sem);
> + rwsem_clear_owner(sem);
> }
>
> EXPORT_SYMBOL(up_write);
> @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
> * dependency.
> */
> __downgrade_write(sem);
> + rwsem_clear_owner(sem);
> }
>
> EXPORT_SYMBOL(downgrade_write);
> @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
> rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(_down_write_nest_lock);
> @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
> rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(down_write_nested);
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 1d6e6e8..1472ff3 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -8,6 +8,7 @@
> */
> #include <linux/rwsem.h>
> #include <linux/sched.h>
> +#include <linux/sched/rt.h>
> #include <linux/init.h>
> #include <linux/export.h>
>
> @@ -27,6 +28,10 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
> sem->count = RWSEM_UNLOCKED_VALUE;
> raw_spin_lock_init(&sem->wait_lock);
> INIT_LIST_HEAD(&sem->wait_list);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + sem->owner = NULL;
> + sem->spin_mlock = NULL;
> +#endif
> }
>
> EXPORT_SYMBOL(__init_rwsem);
> @@ -194,48 +199,252 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
> return sem;
> }
>
> +static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
> +{
> + if (!(count & RWSEM_ACTIVE_MASK)) {
> + /* Try acquiring the write lock. */
> + if (sem->count == RWSEM_WAITING_BIAS &&
> + cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> + if (!list_is_singular(&sem->wait_list))
> + rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +
> +struct mspin_node {
> + struct mspin_node *next ;
> + int locked; /* 1 if lock acquired */
> +};
> +#define MLOCK(rwsem) ((struct mspin_node **)&((rwsem)->spin_mlock))
> +
> +static noinline
> +void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
> +{
> + struct mspin_node *prev;
> +
> + /* Init node */
> + node->locked = 0;
> + node->next = NULL;
> +
> + prev = xchg(lock, node);
> + if (likely(prev == NULL)) {
> + /* Lock acquired */
> + node->locked = 1;
> + return;
> + }
> + ACCESS_ONCE(prev->next) = node;
> + smp_wmb();
> + /* Wait until the lock holder passes the lock down */
> + while (!ACCESS_ONCE(node->locked))
> + arch_mutex_cpu_relax();
> +}
> +
> +static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
> +{
> + struct mspin_node *next = ACCESS_ONCE(node->next);
> +
> + if (likely(!next)) {
> + /*
> + * Release the lock by setting it to NULL
> + */
> + if (cmpxchg(lock, node, NULL) == node)
> + return;
> + /* Wait until the next pointer is set */
> + while (!(next = ACCESS_ONCE(node->next)))
> + arch_mutex_cpu_relax();
> + }
> + ACCESS_ONCE(next->locked) = 1;
> + smp_wmb();
> +}
> +
> +static inline int rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
> +{
> + long count;
> +
> + count = ACCESS_ONCE(sem->count);
> +retry:
> + if (count == RWSEM_WAITING_BIAS) {
> + count = cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> + RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS);
> + /* allow write lock stealing, try acquiring the write lock. */
> + if (count == RWSEM_WAITING_BIAS)
> + goto acquired;
> + else if (count == 0)
> + goto retry;
> + } else if (count == 0) {
> + count = cmpxchg(&sem->count, 0,
> + RWSEM_ACTIVE_WRITE_BIAS);
> + if (count == 0)
> + goto acquired;
> + else if (count == RWSEM_WAITING_BIAS)
> + goto retry;
> + }
> + return 0;
> +
> +acquired:
> + return 1;
> +}
> +
> +
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +{
> + int retval;
> + struct task_struct *owner;
> +
> + rcu_read_lock();
> + owner = ACCESS_ONCE(sem->owner);
> +
> + /* Spin only if active writer running */
> + if (owner)
> + retval = owner->on_cpu;
> + else
> + retval = false;
> +
> + rcu_read_unlock();
> + /*
> + * if lock->owner is not set, the sem owner may have just acquired
> + * it and not set the owner yet, or the sem has been released, or
> + * reader active.
> + */
> + return retval;
> +}
> +
> +static inline bool owner_running(struct rw_semaphore *lock,
> + struct task_struct *owner)
> +{
> + if (lock->owner != owner)
> + return false;
> +
> + /*
> + * Ensure we emit the owner->on_cpu, dereference _after_ checking
> + * lock->owner still matches owner, if that fails, owner might
> + * point to free()d memory, if it still matches, the rcu_read_lock()
> + * ensures the memory stays valid.
> + */
> + barrier();
> +
> + return owner->on_cpu;
> +}
> +
> +static noinline
> +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> +{
> + rcu_read_lock();
> + while (owner_running(lock, owner)) {
> + if (need_resched())
> + break;
> +
> + arch_mutex_cpu_relax();
> + }
> + rcu_read_unlock();
> +
> + /*
> + * We break out the loop above on need_resched() and when the
> + * owner changed, which is a sign for heavy contention. Return
> + * success only when lock->owner is NULL.
> + */
> + return lock->owner == NULL;
> +}
> +
> +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> +{
> + struct task_struct *owner;
> + int ret = 0;
> +
> + /* sem->wait_lock should not be held when doing optimistic spinning */
> + if (!rwsem_can_spin_on_owner(sem))
> + return ret;
> +
> + preempt_disable();
> + for (;;) {
> + struct mspin_node node;
> +
> + mspin_lock(MLOCK(sem), &node);
> + owner = ACCESS_ONCE(sem->owner);
> + if (owner && !rwsem_spin_on_owner(sem, owner)) {
> + mspin_unlock(MLOCK(sem), &node);
> + break;
> + }
> +
> + /* wait_lock will be acquired if write_lock is obtained */
> + if (rwsem_try_write_lock_unqueued(sem)) {
> + mspin_unlock(MLOCK(sem), &node);
> + ret = 1;
> + break;
> + }
> + mspin_unlock(MLOCK(sem), &node);
> +
> + /*
> + * When there's no owner, we might have preempted between the
> + * owner acquiring the lock and setting the owner field. If
> + * we're an RT task that will live-lock because we won't let
> + * the owner complete.
> + */
> + if (!owner && (need_resched() || rt_task(current)))
> + break;
> +
> + /*
> + * The cpu_relax() call is a compiler barrier which forces
> + * everything in this loop to be re-loaded. We don't need
> + * memory barriers as we'll eventually observe the right
> + * values at the cost of a few extra spins.
> + */
> + arch_mutex_cpu_relax();
> +
> + }
> +
> + preempt_enable();
> + return ret;
> +}
> +#endif
> +
> /*
> * wait until we successfully acquire the write lock
> */
> struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> {
> - long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
> + long count;
> struct rwsem_waiter waiter;
> struct task_struct *tsk = current;
> + bool waiting = true;
>
> + count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + /* do optimistic spinning */
> + if (rwsem_optimistic_spin(sem))
> + goto done;
> +#endif
> /* set up my own style of waitqueue */
> waiter.task = tsk;
> waiter.type = RWSEM_WAITING_FOR_WRITE;
>
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list))
> - adjustment += RWSEM_WAITING_BIAS;
> + waiting = false;
> list_add_tail(&waiter.list, &sem->wait_list);
>
> /* we're now waiting on the lock, but no longer actively locking */
> - count = rwsem_atomic_update(adjustment, sem);
> + if (waiting)
> + count = ACCESS_ONCE(sem->count);
> + else
> + count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
>
> /* If there were already threads queued before us and there are no
> * active writers, the lock must be read owned; so we try to wake
> * any read locks that were queued ahead of us. */
> - if (count > RWSEM_WAITING_BIAS &&
> - adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
> + if ((count > RWSEM_WAITING_BIAS) && waiting)
> sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
>
> /* wait until we successfully acquire the lock */
> set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> while (true) {
> - if (!(count & RWSEM_ACTIVE_MASK)) {
> - /* Try acquiring the write lock. */
> - count = RWSEM_ACTIVE_WRITE_BIAS;
> - if (!list_is_singular(&sem->wait_list))
> - count += RWSEM_WAITING_BIAS;
> -
> - if (sem->count == RWSEM_WAITING_BIAS &&
> - cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> - RWSEM_WAITING_BIAS)
> - break;
> - }
> + if (rwsem_try_write_lock(count, sem))
> + break;
>
> raw_spin_unlock_irq(&sem->wait_lock);
>
> @@ -250,6 +459,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>
> list_del(&waiter.list);
> raw_spin_unlock_irq(&sem->wait_lock);
> +done:
> tsk->state = TASK_RUNNING;
>
> return sem;
>
>

2013-07-30 20:34:11

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-07-30 at 12:59 -0700, Davidlohr Bueso wrote:
> cc'ing Dave Chinner for XFS
>

Davidlohr,

I also wonder if this change benefits your workload. I will
be interested to know your performance numbers.

Tim

2013-07-30 21:45:52

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-07-30 at 13:34 -0700, Tim Chen wrote:
> On Tue, 2013-07-30 at 12:59 -0700, Davidlohr Bueso wrote:
> > cc'ing Dave Chinner for XFS
> >
>
> Davidlohr,
>
> I also wonder it this change benefit your workload. Will
> be interested to know of your performance numbers.

Yep, I'm currently running some tests and will report when done.

>
> Tim
>

2013-08-05 22:08:24

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-07-30 at 21:24 +0200, Ingo Molnar wrote:

> >
> > Signed-off-by: Tim Chen <[email protected]>
>
> > +config RWSEM_SPIN_ON_WRITE_OWNER
> > + bool "Optimistic spin write acquisition for writer owned rw-sem"
> > + default n
> > + depends on SMP
> > + help
> > + Allows a writer to perform optimistic spinning if another writer own
> > + the read write semaphore. If the lock owner is running, it is likely
> > + to release the lock soon. Spinning gives a greater chance for writer to
> > + acquire a semaphore before putting it to sleep.
>
> The way you've worded this new Kconfig option makes it
> sound as if it's not just for testing, and I'm not a big
> believer in extra Kconfig degrees of freedom for
> scalability features of core locking primitives like
> rwsems, in production kernels ...
>
> So the bad news is that such scalability optimizations
> really need to work for everyone.
>
> The good news is that I don't think there's anything
> particularly controversial about making the rwsem write
> side perform just as well as mutexes - it would in fact be
> a very nice quality of implementation feature: it gives
> people freedom to switch between mutexes and rwsems without
> having to worry about scalability differences too much.
>

Sorry for replying to your email late; I was pulled onto
some other tasks.

Ingo, any objection if I make the optimistic writer spin the
default for SMP, without an extra config option? This will make
the rw_semaphore structure grow a bit to accommodate the
owner and spin_mlock fields.

Thanks.

Tim
> Once readers are mixed into the workload can we keep the
> XFS assumptions, if they are broken in any way?
>
> We are spinning here so we have full awareness about the
> state of the lock and we can react to a new reader in very
> short order - so at a quick glance I don't see any
> fundamental difficulty in being able to resolve it - beyond
> the SMOP aspect that is ... :-)
>
> Thanks,
>
> Ingo

2013-08-06 23:55:33

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-07-29 at 17:13 -0700, Tim Chen wrote:
> On Tue, 2013-07-23 at 11:53 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Tue, Jul 23, 2013 at 11:45:13AM +0200, Ingo Molnar wrote:
> > >
> > > > Why not just try the delayed addition approach first? The spinning is
> > > > time limited AFAICS, so we don't _have to_ recognize those as writers
> > > > per se, only if the spinning fails and it wants to go on the waitlist.
> > > > Am I missing something?
> > > >
> > > > It will change patterns, it might even change the fairness balance -
> > > > but is a legit change otherwise, especially if it helps performance.
> > >
> > > Be very careful here. Some people (XFS) have very specific needs. Walken
> > > and dchinner had a longish discussion on this a while back.
> >
> > Agreed - yet it's worth at least trying it out the quick way, to see the
> > main effect and to see whether that explains the performance assymetry and
> > invest more effort into it.
> >
>
> Ingo & Peter,
>
> Here's a patch that moved optimistic spinning of the writer lock
> acquisition before putting the writer on the wait list. It did give me
> a 5% performance boost on my exim mail server workload.
> It recovered a good portion of the 8% performance regression from
> mutex implementation.
>
> I think we are on the right track. Let me know if there's anything
> in the patch that may cause grief to XFS.
>
> There is some further optimization possible. We went
> to the failed path within __down_write if the count field is non
> zero. But in fact if the old count field was RWSEM_WAITING_BIAS,
> there's no one active and we could have stolen the
> lock when we perform our atomic op on the count field
> in __down_write. Yet we go to the failed path in the current
> code.
>
> I will combine this change and also Alex's patches on rwsem together
> in a patchset later.
>
> Your comments and thoughts are most welcomed.

I got good numbers, recovering the performance drop I noticed with the
i_mmap_mutex to rwsem patches. Looking forward to a more upstreamable
patchset that deals with this work, including the previous patches.

One thing that's bugging me about this series, though, is the huge amount
of duplicated code being introduced to rwsems from mutexes. We could share
common functionality such as MCS locking (perhaps in a new file under
lib/), can_spin_on_owner() and owner_running(), perhaps moving those
functions into scheduler code, where AFAIK they originally were.

Thanks,
Davidlohr

>
> Tim
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 0616ffe..58a4acb 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -29,6 +29,10 @@ struct rw_semaphore {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> struct lockdep_map dep_map;
> #endif
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + struct task_struct *owner;
> + void *spin_mlock;
> +#endif
> };
>
> extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> @@ -55,11 +59,21 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
> # define __RWSEM_DEP_MAP_INIT(lockname)
> #endif
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +#define __RWSEM_INITIALIZER(name) \
> + { RWSEM_UNLOCKED_VALUE, \
> + __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
> + LIST_HEAD_INIT((name).wait_list), \
> + __RWSEM_DEP_MAP_INIT(name) \
> + NULL, \
> + NULL }
> +#else
> #define __RWSEM_INITIALIZER(name) \
> { RWSEM_UNLOCKED_VALUE, \
> __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
> LIST_HEAD_INIT((name).wait_list) \
> __RWSEM_DEP_MAP_INIT(name) }
> +#endif
>
> #define DECLARE_RWSEM(name) \
> struct rw_semaphore name = __RWSEM_INITIALIZER(name)
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d3a788..d97225f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1595,6 +1595,16 @@ config TRACEPOINTS
>
> source "arch/Kconfig"
>
> +config RWSEM_SPIN_ON_WRITE_OWNER
> + bool "Optimistic spin write acquisition for writer owned rw-sem"
> + default n
> + depends on SMP
> + help
> + Allows a writer to perform optimistic spinning if another writer own
> + the read write semaphore. If the lock owner is running, it is likely
> + to release the lock soon. Spinning gives a greater chance for writer to
> + acquire a semaphore before putting it to sleep.
> +
> endmenu # General setup
>
> config HAVE_GENERIC_DMA_COHERENT
> diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> index cfff143..a32990a 100644
> --- a/kernel/rwsem.c
> +++ b/kernel/rwsem.c
> @@ -12,6 +12,26 @@
>
> #include <linux/atomic.h>
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> + sem->owner = current;
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> + sem->owner = NULL;
> +}
> +#else
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +}
> +#endif
> +
> /*
> * lock for reading
> */
> @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
> rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(down_write);
> @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
> {
> int ret = __down_write_trylock(sem);
>
> - if (ret == 1)
> + if (ret == 1) {
> rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> + rwsem_set_owner(sem);
> + }
> return ret;
> }
>
> @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
> rwsem_release(&sem->dep_map, 1, _RET_IP_);
>
> __up_write(sem);
> + rwsem_clear_owner(sem);
> }
>
> EXPORT_SYMBOL(up_write);
> @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
> * dependency.
> */
> __downgrade_write(sem);
> + rwsem_clear_owner(sem);
> }
>
> EXPORT_SYMBOL(downgrade_write);
> @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
> rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(_down_write_nest_lock);
> @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
> rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
>
> LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> + rwsem_set_owner(sem);
> }
>
> EXPORT_SYMBOL(down_write_nested);
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 1d6e6e8..1472ff3 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -8,6 +8,7 @@
> */
> #include <linux/rwsem.h>
> #include <linux/sched.h>
> +#include <linux/sched/rt.h>
> #include <linux/init.h>
> #include <linux/export.h>
>
> @@ -27,6 +28,10 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
> sem->count = RWSEM_UNLOCKED_VALUE;
> raw_spin_lock_init(&sem->wait_lock);
> INIT_LIST_HEAD(&sem->wait_list);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + sem->owner = NULL;
> + sem->spin_mlock = NULL;
> +#endif
> }
>
> EXPORT_SYMBOL(__init_rwsem);
> @@ -194,48 +199,252 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
> return sem;
> }
>
> +static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
> +{
> + if (!(count & RWSEM_ACTIVE_MASK)) {
> + /* Try acquiring the write lock. */
> + if (sem->count == RWSEM_WAITING_BIAS &&
> + cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> + if (!list_is_singular(&sem->wait_list))
> + rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +
> +struct mspin_node {
> +	struct mspin_node *next;
> + int locked; /* 1 if lock acquired */
> +};
> +#define MLOCK(rwsem) ((struct mspin_node **)&((rwsem)->spin_mlock))
> +
> +static noinline
> +void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
> +{
> + struct mspin_node *prev;
> +
> + /* Init node */
> + node->locked = 0;
> + node->next = NULL;
> +
> + prev = xchg(lock, node);
> + if (likely(prev == NULL)) {
> + /* Lock acquired */
> + node->locked = 1;
> + return;
> + }
> + ACCESS_ONCE(prev->next) = node;
> + smp_wmb();
> + /* Wait until the lock holder passes the lock down */
> + while (!ACCESS_ONCE(node->locked))
> + arch_mutex_cpu_relax();
> +}
> +
> +static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
> +{
> + struct mspin_node *next = ACCESS_ONCE(node->next);
> +
> + if (likely(!next)) {
> + /*
> + * Release the lock by setting it to NULL
> + */
> + if (cmpxchg(lock, node, NULL) == node)
> + return;
> + /* Wait until the next pointer is set */
> + while (!(next = ACCESS_ONCE(node->next)))
> + arch_mutex_cpu_relax();
> + }
> + ACCESS_ONCE(next->locked) = 1;
> + smp_wmb();
> +}
> +
> +static inline int rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
> +{
> + long count;
> +
> + count = ACCESS_ONCE(sem->count);
> +retry:
> + if (count == RWSEM_WAITING_BIAS) {
> + count = cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> + RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS);
> + /* allow write lock stealing, try acquiring the write lock. */
> + if (count == RWSEM_WAITING_BIAS)
> + goto acquired;
> + else if (count == 0)
> + goto retry;
> + } else if (count == 0) {
> + count = cmpxchg(&sem->count, 0,
> + RWSEM_ACTIVE_WRITE_BIAS);
> + if (count == 0)
> + goto acquired;
> + else if (count == RWSEM_WAITING_BIAS)
> + goto retry;
> + }
> + return 0;
> +
> +acquired:
> + return 1;
> +}
> +
> +
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +{
> + int retval;
> + struct task_struct *owner;
> +
> + rcu_read_lock();
> + owner = ACCESS_ONCE(sem->owner);
> +
> + /* Spin only if active writer running */
> + if (owner)
> + retval = owner->on_cpu;
> + else
> + retval = false;
> +
> + rcu_read_unlock();
> + /*
> +	 * If sem->owner is not set, the owner may have just acquired the
> +	 * sem and not yet set the owner field, the sem may have been
> +	 * released, or a reader may be active.
> + */
> + return retval;
> +}
> +
> +static inline bool owner_running(struct rw_semaphore *lock,
> + struct task_struct *owner)
> +{
> + if (lock->owner != owner)
> + return false;
> +
> + /*
> +	 * Ensure we emit the owner->on_cpu dereference _after_ checking
> +	 * that lock->owner still matches owner. If that fails, owner might
> +	 * point to free()d memory; if it still matches, the rcu_read_lock()
> +	 * ensures the memory stays valid.
> + */
> + barrier();
> +
> + return owner->on_cpu;
> +}
> +
> +static noinline
> +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> +{
> + rcu_read_lock();
> + while (owner_running(lock, owner)) {
> + if (need_resched())
> + break;
> +
> + arch_mutex_cpu_relax();
> + }
> + rcu_read_unlock();
> +
> + /*
> + * We break out the loop above on need_resched() and when the
> + * owner changed, which is a sign for heavy contention. Return
> + * success only when lock->owner is NULL.
> + */
> + return lock->owner == NULL;
> +}
> +
> +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> +{
> + struct task_struct *owner;
> + int ret = 0;
> +
> + /* sem->wait_lock should not be held when doing optimistic spinning */
> + if (!rwsem_can_spin_on_owner(sem))
> + return ret;
> +
> + preempt_disable();
> + for (;;) {
> + struct mspin_node node;
> +
> + mspin_lock(MLOCK(sem), &node);
> + owner = ACCESS_ONCE(sem->owner);
> + if (owner && !rwsem_spin_on_owner(sem, owner)) {
> + mspin_unlock(MLOCK(sem), &node);
> + break;
> + }
> +
> + /* wait_lock will be acquired if write_lock is obtained */
> + if (rwsem_try_write_lock_unqueued(sem)) {
> + mspin_unlock(MLOCK(sem), &node);
> + ret = 1;
> + break;
> + }
> + mspin_unlock(MLOCK(sem), &node);
> +
> + /*
> +		 * When there's no owner, we might have preempted the owner
> +		 * between acquiring the lock and setting the owner field.
> +		 * If we're an RT task, that will live-lock because we won't
> +		 * let the owner complete.
> + */
> + if (!owner && (need_resched() || rt_task(current)))
> + break;
> +
> + /*
> + * The cpu_relax() call is a compiler barrier which forces
> + * everything in this loop to be re-loaded. We don't need
> + * memory barriers as we'll eventually observe the right
> + * values at the cost of a few extra spins.
> + */
> + arch_mutex_cpu_relax();
> +
> + }
> +
> + preempt_enable();
> + return ret;
> +}
> +#endif
> +
> /*
> * wait until we successfully acquire the write lock
> */
> struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> {
> - long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
> + long count;
> struct rwsem_waiter waiter;
> struct task_struct *tsk = current;
> + bool waiting = true;
>
> + count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> + /* do optimistic spinning */
> + if (rwsem_optimistic_spin(sem))
> + goto done;
> +#endif
> /* set up my own style of waitqueue */
> waiter.task = tsk;
> waiter.type = RWSEM_WAITING_FOR_WRITE;
>
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list))
> - adjustment += RWSEM_WAITING_BIAS;
> + waiting = false;
> list_add_tail(&waiter.list, &sem->wait_list);
>
> /* we're now waiting on the lock, but no longer actively locking */
> - count = rwsem_atomic_update(adjustment, sem);
> + if (waiting)
> + count = ACCESS_ONCE(sem->count);
> + else
> + count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
>
> /* If there were already threads queued before us and there are no
> * active writers, the lock must be read owned; so we try to wake
> * any read locks that were queued ahead of us. */
> - if (count > RWSEM_WAITING_BIAS &&
> - adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
> + if ((count > RWSEM_WAITING_BIAS) && waiting)
> sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
>
> /* wait until we successfully acquire the lock */
> set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> while (true) {
> - if (!(count & RWSEM_ACTIVE_MASK)) {
> - /* Try acquiring the write lock. */
> - count = RWSEM_ACTIVE_WRITE_BIAS;
> - if (!list_is_singular(&sem->wait_list))
> - count += RWSEM_WAITING_BIAS;
> -
> - if (sem->count == RWSEM_WAITING_BIAS &&
> - cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> - RWSEM_WAITING_BIAS)
> - break;
> - }
> + if (rwsem_try_write_lock(count, sem))
> + break;
>
> raw_spin_unlock_irq(&sem->wait_lock);
>
> @@ -250,6 +459,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>
> list_del(&waiter.list);
> raw_spin_unlock_irq(&sem->wait_lock);
> +done:
> tsk->state = TASK_RUNNING;
>
> return sem;
>
>

2013-08-07 00:56:29

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Tue, 2013-08-06 at 16:55 -0700, Davidlohr Bueso wrote:

> I got good numbers, recovering the performance drop I noticed with the
> i_mmap_mutex to rwsem patches.

That's good. I recall that the earlier version of the patch not only
recovered the performance drop, but also provided some boost when switching
from i_mmap_mutex to rwsem for aim7. Do you see a similar boost with this
version?

> Looking forward to a more upstreamable
> patchset that deals with this work, including the previous patches.
>
> One thing that's bugging me about this series though is the huge amount
> of duplicated code being introduced to rwsems from mutexes. We can share
> common functionality such as mcs locking (perhaps in a new file under
> lib/), can_spin_on_owner() and owner_running(), perhaps moving those
> functions into scheduler code, where AFAIK they were originally.

I think MCS locking is worth breaking out as its own library. Once that is
done, the remaining duplication is minimal. It is easier to keep the rest
separate, since there is some rwsem-specific logic that may require tweaking
in can_spin_on_owner and owner_running.
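
Purely as an illustration, here is a minimal sketch of what such a shared
MCS lock helper could look like, lifted almost directly from the
mspin_lock()/mspin_unlock() code in the patch above. The header name and
the mcs_* identifiers are placeholders rather than a proposed API, and it
relies on the same kernel primitives the patch already uses:

/* include/linux/mcs_spinlock.h (hypothetical) */
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;			/* 1 if lock acquired */
};

static inline void mcs_spin_lock(struct mcs_spinlock **lock,
				 struct mcs_spinlock *node)
{
	struct mcs_spinlock *prev;

	/* Init our node and queue ourselves at the tail of the waiter list */
	node->locked = 0;
	node->next = NULL;

	prev = xchg(lock, node);
	if (likely(prev == NULL)) {
		/* Queue was empty: lock acquired immediately */
		node->locked = 1;
		return;
	}
	ACCESS_ONCE(prev->next) = node;
	smp_wmb();
	/* Wait until the predecessor passes the lock down */
	while (!ACCESS_ONCE(node->locked))
		arch_mutex_cpu_relax();
}

static inline void mcs_spin_unlock(struct mcs_spinlock **lock,
				   struct mcs_spinlock *node)
{
	struct mcs_spinlock *next = ACCESS_ONCE(node->next);

	if (likely(!next)) {
		/* No successor visible: release by clearing the tail */
		if (cmpxchg(lock, node, NULL) == node)
			return;
		/* A successor is queueing; wait for it to link itself in */
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();
	}
	/* Hand the lock to the successor */
	ACCESS_ONCE(next->locked) = 1;
	smp_wmb();
}

Both rwsem.c and mutex.c could then keep a struct mcs_spinlock * tail
pointer plus an on-stack node instead of each carrying its own copy.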

Thanks.

Tim

2013-08-12 18:52:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree


* Tim Chen <[email protected]> wrote:

> On Tue, 2013-08-06 at 16:55 -0700, Davidlohr Bueso wrote:
>
> > I got good numbers, recovering the performance drop I noticed with the
> > i_mmap_mutex to rwsem patches.
>
> That's good. I recall that the earlier version of the patch not only
> recovered the performance drop, but also provided some boost when switching
> from i_mmap_mutex to rwsem for aim7. Do you see a similar boost with this
> version?
>
> > Looking forward to a more upstreamable
> > patchset that deals with this work, including the previous patches.
> >
> > One thing that's bugging me about this series though is the huge amount
> > of duplicated code being introduced to rwsems from mutexes. We can share
> > common functionality such as mcs locking (perhaps in a new file under
> > lib/), can_spin_on_owner() and owner_running(), perhaps moving those
> > functions into scheduler code, where AFAIK they were originally.
>
> I think MCS locking is worth breaking out as its own library. Once that is
> done, the remaining duplication is minimal. It is easier to keep the rest
> separate, since there is some rwsem-specific logic that may require tweaking
> in can_spin_on_owner and owner_running.

That's the approach I would strongly suggest for these patches: first factor
out the MCS locking, then change the rwsem behavior.

I'd suggest doing the librarization via inlines or similar, so that we don't
touch the current (pretty good) mutex.o code generation, i.e. the code is
shared only at the source level.

Done that way, we could also apply the librarization first without having to
worry about performance aspects. Having the code shared will also make sure
that an improvement to the mutex slowpaths automatically carries over into
rwsems and vice versa.
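
Roughly something like the following; this is just a sketch with made-up
names, assuming CONFIG_SMP as in the patch. The owner-spinning helpers could
be static inlines in a shared header, parameterized by the location of the
owner field, so each lock type keeps its own compiled slowpath:

/* include/linux/lock_spin_on_owner.h (hypothetical) */
static inline bool lock_owner_running(struct task_struct **owner_p,
				      struct task_struct *owner)
{
	if (ACCESS_ONCE(*owner_p) != owner)
		return false;
	/*
	 * Read owner->on_cpu only after the owner field was seen to still
	 * match owner; the caller's rcu_read_lock() keeps the task_struct
	 * memory valid.
	 */
	barrier();
	return owner->on_cpu;
}

static inline bool lock_spin_on_owner(struct task_struct **owner_p,
				      struct task_struct *owner)
{
	rcu_read_lock();
	while (lock_owner_running(owner_p, owner)) {
		if (need_resched())
			break;
		arch_mutex_cpu_relax();
	}
	rcu_read_unlock();

	/* Worth retrying the lock only if the owner field was cleared */
	return ACCESS_ONCE(*owner_p) == NULL;
}

The mutex slowpath would pass &lock->owner and the rwsem slowpath would pass
&sem->owner, so mutex.o code generation stays untouched.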

Thanks,

Ingo

2013-08-12 20:10:52

by Tim Chen

[permalink] [raw]
Subject: Re: Performance regression from switching lock to rw-sem for anon-vma tree

On Mon, 2013-08-12 at 20:52 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > On Tue, 2013-08-06 at 16:55 -0700, Davidlohr Bueso wrote:
> >
> > > I got good numbers, recovering the performance drop I noticed with the
> > > i_mmap_mutex to rwsem patches.
> >
> > That's good. I recall that the earlier version of the patch not only
> > recovered the performance drop, but also provided some boost when
> > switching from i_mmap_mutex to rwsem for aim7. Do you see a similar boost
> > with this version?
> >
> > > Looking forward to a more upstreamable
> > > patchset that deals with this work, including the previous patches.
> > >
> > > One thing that's bugging me about this series though is the huge amount
> > > of duplicated code being introduced to rwsems from mutexes. We can share
> > > common functionality such as mcs locking (perhaps in a new file under
> > > lib/), can_spin_on_owner() and owner_running(), perhaps moving those
> > > functions into scheduler code, where AFAIK they were originally.
> >
> > I think MCS locking is worth breaking out as its own library. Once that
> > is done, the remaining duplication is minimal. It is easier to keep the
> > rest separate, since there is some rwsem-specific logic that may require
> > tweaking in can_spin_on_owner and owner_running.
>
> That's the approach I would strongly suggest for these patches: first
> factor out the MCS locking, then change the rwsem behavior.
>
> I'd suggest doing the librarization via inlines or similar, so that we
> don't touch the current (pretty good) mutex.o code generation, i.e. the
> code is shared only at the source level.
>
> Done that way, we could also apply the librarization first without having
> to worry about performance aspects. Having the code shared will also make
> sure that an improvement to the mutex slowpaths automatically carries over
> into rwsems and vice versa.

Ingo and Davidlohr,

Thanks for the feedback. I'll spin a new set of patches incorporating your
suggestions later.

Tim