2013-10-02 22:38:19

by Tim Chen

Subject: [PATCH v8 0/9] rwsem performance optimizations

For version 8 of the patchset, we included the patch from Waiman
to streamline wakeup operations and also optimize the MCS lock
used in rwsem and mutex.

In this patchset, we introduce three categories of optimizations to the
read-write semaphore. The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by pre-reading sem->count and avoiding
the cmpxchg when possible.
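Roughly, the shape of that change is the following (an illustrative
sketch of the idea, not the patch itself):

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
	/*
	 * Plain read first: if the lock is visibly held, skip the
	 * cache-line-bouncing cmpxchg altogether.
	 */
	if (sem->count != RWSEM_UNLOCKED_VALUE)
		return 0;

	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
		       RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
}

The plain read keeps the cache line shared while the semaphore is
contended, and the exclusive-ownership cmpxchg is only paid for when the
lock actually looks free.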

The next four patches from Tim, Davidlohr and Jason
introduce optimistic spinning logic, similar to that in the
mutex code, for the writer lock acquisition of rwsem. This addresses the
general 'mutexes outperform writer-rwsems' situation that has been
seen in more than one case. Users now need not worry about performance
issues when choosing between these two locking mechanisms. We have
also factored the MCS lock originally in the mutex code out into its
own file, performed micro-optimizations, and corrected the memory
barriers so it can be used for general lock/unlock of critical
sections.
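The spinning follows the shape of the mutex code: as long as the lock
owner is running on a CPU, a would-be writer keeps trying instead of
sleeping. The outline below is only a sketch of that idea; sem->owner is
the field this series adds, while owner_running() and
rwsem_try_write_lock() are illustrative stand-ins for the real helpers:

/* Outline of writer optimistic spinning; helper names are illustrative. */
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
	struct task_struct *owner;

	while (!need_resched()) {
		owner = ACCESS_ONCE(sem->owner);
		if (owner && !owner_running(sem, owner))
			break;		/* owner went to sleep: stop spinning */
		if (rwsem_try_write_lock(sem))
			return true;	/* got the lock without touching the wait queue */
		arch_mutex_cpu_relax();
	}
	return false;			/* fall back to sleeping on the wait queue */
}

As with the mutex spinner, the writer gives up as soon as the owner
schedules out, since the lock is then unlikely to be released soon.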

The last patch from Waiman helps streamline the wakeup operation
by avoiding multiple threads all doing wakeup operations when only
one wakeup thread is enough. This significantly reduces lock
contention from multiple wakeup threads.
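Conceptually, only one releaser at a time needs to grab the wait-queue
spinlock and wake waiters. A sketch of that idea, using a hypothetical
wake_in_progress flag (this is not Waiman's actual implementation):

/* Sketch only: let a single thread perform the wakeups, others skip. */
unsigned long flags;

if (atomic_cmpxchg(&sem->wake_in_progress, 0, 1) == 0) {
	raw_spin_lock_irqsave(&sem->wait_lock, flags);
	sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
	atomic_set(&sem->wake_in_progress, 0);
}
/* else: another CPU is already waking waiters, no need to contend */

Whoever loses the cmpxchg simply proceeds, knowing a wakeup is already
in flight.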

Tim got the following improvements for the exim mail server
workload on a 40-core system:

Alex+Tim's patchset: +4.8%
Alex+Tim+Waiman's patchset: +5.3%

Without these optimizations, Davidlohr Bueso saw an 8% regression in
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to an rwsem. Tests were on an 8-socket, 80-core system. With Alex's
and Tim's patches, he got significant improvements in the aim7
suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

More Aim7 numbers will be posted when Davidlohr has a chance
to test the complete patchset including Waiman's patch.

Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney
for helping to review this patchset.

Tim

Changelog:

v8:
1. Added Waiman's patch to avoid multiple wakeup thread lock contention.
2. Micro-optimizations of the MCS lock.
3. Corrected the MCS lock barriers to prevent critical sections from
leaking.

v7:
1. Renamed mcslock.h to mcs_spinlock.h and renamed MCS-related fields
with an mcs prefix.
2. Properly defined the type of the *mcs_lock field instead of leaving it as void *.
3. Added a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at the wait queue. This provides a 5% boost to the
exim workload and a 2% to 8% boost to aim7.
2. Put MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the
operations default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganized the changes to down_write_trylock and do_wake into 4 patches and fixed
a bug referencing &sem->count where sem->count was intended.
2. Fixed an unsafe sem->owner dereference in rwsem_can_spin_on_owner;
the spinning option defaults to on for more seasoning but can be turned off should it prove detrimental.
3. Various patch comment updates.

Alex Shi (4):
rwsem: check the lock before cmpxchg in down_write_trylock
rwsem: remove 'out' label in do_wake
rwsem: remove try_reader_grant label in do_wake
rwsem/wake: check the lock before doing an atomic update

Jason Low (2):
MCS Lock: optimizations and extra comments
MCS Lock: Barrier corrections

Tim Chen (2):
MCS Lock: Restructure the MCS lock defines and locking code into its
own file
rwsem: do optimistic spinning for writer lock acquisition

Waiman Long (1):
rwsem: reduce spinlock contention in wakeup code path

include/asm-generic/rwsem.h | 8 +-
include/linux/mcs_spinlock.h | 82 ++++++++++++++
include/linux/mutex.h | 5 +-
include/linux/rwsem.h | 9 ++-
kernel/mutex.c | 60 +---------
kernel/rwsem.c | 19 +++-
lib/rwsem.c | 255 +++++++++++++++++++++++++++++++++++++-----
7 files changed, 349 insertions(+), 89 deletions(-)
create mode 100644 include/linux/mcs_spinlock.h

--
1.7.4.4


2013-10-03 07:32:18

by Ingo Molnar

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


* Tim Chen <[email protected]> wrote:

> For version 8 of the patchset, we included the patch from Waiman to
> streamline wakeup operations and also optimize the MCS lock used in
> rwsem and mutex.

I'd be feeling a lot easier about this patch series if you also had
performance figures that show how mmap_sem is affected.

These:

> Tim got the following improvement for exim mail server
> workload on 40 core system:
>
> Alex+Tim's patchset: +4.8%
> Alex+Tim+Waiman's patchset: +5.3%

appear to be mostly related to the anon_vma->rwsem. But once that lock is
changed to an rwlock_t, this measurement falls away.

Peter Zijlstra suggested the following testcase:

===============================>
In fact, try something like this from userspace:

n-threads:

pthread_mutex_lock(&mutex);
foo = mmap();
pthread_mutex_unlock(&mutex);

/* work */

pthread_mutex_lock(&mutex);
munmap(foo);
pthread_mutex_unlock(&mutex);

vs

n-threads:

foo = mmap();
/* work */
munmap(foo);

I've had reports that the former was significantly faster than the
latter.
<===============================

this could be put into a standalone testcase, or you could add it as a new
subcommand of 'perf bench', which already has some pthread code, see for
example in tools/perf/bench/sched-messaging.c. Adding:

perf bench mm threads

or so would be a natural thing to have.

Thanks,

Ingo

2013-10-07 22:58:01

by Tim Chen

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > For version 8 of the patchset, we included the patch from Waiman to
> > streamline wakeup operations and also optimize the MCS lock used in
> > rwsem and mutex.
>
> I'd be feeling a lot easier about this patch series if you also had
> performance figures that show how mmap_sem is affected.
>
> These:
>
> > Tim got the following improvement for exim mail server
> > workload on 40 core system:
> >
> > Alex+Tim's patchset: +4.8%
> > Alex+Tim+Waiman's patchset: +5.3%
>
> appear to be mostly related to the anon_vma->rwsem. But once that lock is
> changed to an rwlock_t, this measurement falls away.
>
> Peter Zijlstra suggested the following testcase:
>
> ===============================>
> In fact, try something like this from userspace:
>
> n-threads:
>
> pthread_mutex_lock(&mutex);
> foo = mmap();
> pthread_mutex_unlock(&mutex);
>
> /* work */
>
> pthread_mutex_lock(&mutex);
> munmap(foo);
> pthread_mutex_unlock(&mutex);
>
> vs
>
> n-threads:
>
> foo = mmap();
> /* work */
> munmap(foo);


Ingo,

I ran the vanilla kernel, the kernel with all rwsem patches and the
kernel with all patches except the optimistic spin one.
I am listing two presentations of the data. Please note that
there is about 5% run-run variation.

% change in performance vs vanilla kernel
#threads all without optspin
mmap only
1 1.9% 1.6%
5 43.8% 2.6%
10 22.7% -3.0%
20 -12.0% -4.5%
40 -26.9% -2.0%
mmap with mutex acquisition
1 -2.1% -3.0%
5 -1.9% 1.0%
10 4.2% 12.5%
20 -4.1% 0.6%
40 -2.8% -1.9%

The optimistic spin case does very well at low to moderate contentions,
but worse when there are very heavy contentions for the pure mmap case.
For the case with pthread mutex, there's not much change from vanilla
kernel.

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads vanilla all without optspin
1 3.0% -1.0% -1.7%
5 7.2% -26.8% 5.5%
10 5.2% -10.6% 22.1%
20 6.8% 16.4% 12.5%
40 -0.2% 32.7% 0.0%

In general, vanilla and no-optspin case perform better with
pthread-mutex. For the case with optspin, mmap with
pthread-mutex is worse at low to moderate contention and better
at high contention.

Tim

>
> I've had reports that the former was significantly faster than the
> latter.
> <===============================
>
> this could be put into a standalone testcase, or you could add it as a new
> subcommand of 'perf bench', which already has some pthread code, see for
> example in tools/perf/bench/sched-messaging.c. Adding:
>
> perf bench mm threads
>
> or so would be a natural thing to have.
>
> Thanks,
>
> Ingo

2013-10-09 06:15:57

by Ingo Molnar

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


* Tim Chen <[email protected]> wrote:

> Ingo,
>
> I ran the vanilla kernel, the kernel with all rwsem patches and the
> kernel with all patches except the optimistic spin one. I am listing
> two presentations of the data. Please note that there is about 5%
> run-run variation.
>
> % change in performance vs vanilla kernel
> #threads all without optspin
> mmap only
> 1 1.9% 1.6%
> 5 43.8% 2.6%
> 10 22.7% -3.0%
> 20 -12.0% -4.5%
> 40 -26.9% -2.0%
> mmap with mutex acquisition
> 1 -2.1% -3.0%
> 5 -1.9% 1.0%
> 10 4.2% 12.5%
> 20 -4.1% 0.6%
> 40 -2.8% -1.9%

Silly question: how do the two methods of starting N threads compare to
each other? Do they have identical runtimes? I think PeterZ's point was
that the pthread_mutex case, despite adding extra serialization, actually
runs faster in some circumstances.

Also, mind posting the testcase? What 'work' do the threads do - clear
some memory area? How big is the memory area?

I'd expect this to be about mmap()s large enough that page fault
processing becomes mmap_sem bound; the serialization via pthread_mutex()
sets up a 'train' of threads in one case, while the parallel startup
runs straight into mmap_sem contention in the regular case.

So I'd expect this to be a rather sensitive workload and you'd have to
actively engineer it to hit the effect PeterZ mentioned. I could imagine
MPI workloads to run into such patterns - but not deterministically.

Only once you've convinced yourself that you are hitting that kind of
effect reliably on the vanilla kernel, could/should the effects of an
improved rwsem implementation be measured.

Thanks,

Ingo

2013-10-09 07:29:14

by Peter Zijlstra

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Wed, Oct 09, 2013 at 08:15:51AM +0200, Ingo Molnar wrote:
> So I'd expect this to be a rather sensitive workload and you'd have to
> actively engineer it to hit the effect PeterZ mentioned. I could imagine
> MPI workloads to run into such patterns - but not deterministically.

The workload that I got the report from was a virus scanner, it would
spawn nr_cpus threads and {mmap file, scan content, munmap} through your
filesystem.

Now if I only could remember who reported this.. :/

2013-10-09 16:34:07

by Tim Chen

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > Ingo,
> >
> > I ran the vanilla kernel, the kernel with all rwsem patches and the
> > kernel with all patches except the optimistic spin one. I am listing
> > two presentations of the data. Please note that there is about 5%
> > run-run variation.
> >
> > % change in performance vs vanilla kernel
> > #threads all without optspin
> > mmap only
> > 1 1.9% 1.6%
> > 5 43.8% 2.6%
> > 10 22.7% -3.0%
> > 20 -12.0% -4.5%
> > 40 -26.9% -2.0%
> > mmap with mutex acquisition
> > 1 -2.1% -3.0%
> > 5 -1.9% 1.0%
> > 10 4.2% 12.5%
> > 20 -4.1% 0.6%
> > 40 -2.8% -1.9%
>
> Silly question: how do the two methods of starting N threads compare to
> each other?

They both started N pthreads and ran for a fixed time.
The change in throughput of pure mmap with mutex vs. pure mmap is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads        vanilla    all rwsem patches    without optspin
1 3.0% -1.0% -1.7%
5 7.2% -26.8% 5.5%
10 5.2% -10.6% 22.1%
20 6.8% 16.4% 12.5%
40 -0.2% 32.7% 0.0%

So with mutex, the vanilla kernel and the one without optspin both
run faster. This is consistent with what Peter reported. With
optspin, the picture is more mixed, with lower throughput at low to
moderate number of threads and higher throughput with high number
of threads.

> Do they have identical runtimes?

Yes, they both have identical runtimes. I looked at the number
of mmap and munmap operations I could push through.

> I think PeterZ's point was
> that the pthread_mutex case, despite adding extra serialization, actually
> runs faster in some circumstances.

Yes, I also see the pthread mutex run faster for the vanilla kernel
from the data above.

>
> Also, mind posting the testcase? What 'work' do the threads do - clear
> some memory area?

The test case does a simple mmap and munmap of 1MB of memory per iteration.

> How big is the memory area?

1MB

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
	while (1) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);

		(*iterations)++;
	}
}

and adding mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
	while (1) {
		pthread_mutex_lock(&mutex);
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);
		pthread_mutex_unlock(&mutex);

		(*iterations)++;
	}
}

and run as a pthread.
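(For reference, a minimal driver along the lines of the will-it-scale
harness might look roughly like the code below, assuming testcase() from
above is in the same file. The thread count, the fixed sleep, and the
tolerated racy reads of the per-thread counters are assumptions of this
sketch, not the actual harness code.)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 40

static unsigned long long counts[NR_THREADS];

static void *worker(void *arg)
{
	testcase((unsigned long long *)arg);	/* loops forever, bumping its counter */
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned long long total = 0;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, &counts[i]);

	sleep(30);			/* fixed run time */
	for (i = 0; i < NR_THREADS; i++)
		total += counts[i];	/* racy read, good enough for a rough rate */
	printf("%llu iterations in 30s\n", total);
	return 0;
}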
>
> I'd expect this to be about mmap()s large enough that page fault
> processing becomes mmap_sem bound; the serialization via pthread_mutex()
> sets up a 'train' of threads in one case, while the parallel startup
> runs straight into mmap_sem contention in the regular case.
>
> So I'd expect this to be a rather sensitive workload and you'd have to
> actively engineer it to hit the effect PeterZ mentioned. I could imagine
> MPI workloads to run into such patterns - but not deterministically.
>
> Only once you've convinced yourself that you are hitting that kind of
> effect reliably on the vanilla kernel, could/should the effects of an
> improved rwsem implementation be measured.
>
> Thanks,
>
> Ingo

2013-10-10 03:14:53

by Linus Torvalds

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under
mmap_sem, but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

addr = get_unmapped_area(file, addr, len, pgoff, flags);
addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate
it outside the lock.
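(A rough illustration of that preallocation idea - pseudo-code only, not
an actual kernel change; vma_merge() arguments are elided:)

struct vm_area_struct *new_vma;

/* allocate before taking mmap_sem, so the allocation is not under the lock */
new_vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

down_write(&mm->mmap_sem);
vma = vma_merge(...);			/* try to merge with a neighbour */
if (!vma) {
	vma = new_vma;			/* no merge: use the preallocated one */
	new_vma = NULL;
	/* initialize and link vma ... */
}
up_write(&mm->mmap_sem);

if (new_vma)
	kmem_cache_free(vm_area_cachep, new_vma);	/* merged, not needed */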

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it).

I think the main issue is that nobody has ever cared deeply enough to
see how far this could be pushed. I suspect there is some low-hanging
fruit for anybody who is willing to handle the pain..

Linus

2013-10-10 05:04:07

by Davidlohr Bueso

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > The workload that I got the report from was a virus scanner, it would
> > spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> > filesystem.
>
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
>
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
>
> The part that really needs the lock is
>
> addr = get_unmapped_area(file, addr, len, pgoff, flags);
> addr = mmap_region(file, addr, len, vm_flags, pgoff);
>
> but we hold it over all the other stuff too.
>

True. By looking at the callers, we're always doing:

down_write(&mm->mmap_sem);
do_mmap_pgoff()
...
up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff().

While I know you hate two level locking, one way to go about this is to
take the lock inside do_mmap_pgoff() after the initial checks (flags,
page align, etc.) and return with the lock held, leaving the caller to
unlock it.
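(Sketching that calling convention - illustrative only, not actual
kernel code:)

unsigned long do_mmap_pgoff(...)
{
	/* cheap checks first: flags, page alignment, limits - no lock needed */
	...
	down_write(&mm->mmap_sem);	/* taken here, after the initial checks */
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	addr = mmap_region(file, addr, len, vm_flags, pgoff);
	return addr;			/* returns with mmap_sem still held */
}

/* caller */
addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
up_write(&mm->mmap_sem);		/* caller is responsible for dropping the lock */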

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
>
> vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
>
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.
>

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr

2013-10-10 07:54:50

by Ingo Molnar

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


* Tim Chen <[email protected]> wrote:

> The change in throughput of pure mmap with mutex vs. pure mmap is below:
>
> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads        vanilla    all rwsem patches    without optspin
> 1 3.0% -1.0% -1.7%
> 5 7.2% -26.8% 5.5%
> 10 5.2% -10.6% 22.1%
> 20 6.8% 16.4% 12.5%
> 40 -0.2% 32.7% 0.0%
>
> So with mutex, the vanilla kernel and the one without optspin both run
> faster. This is consistent with what Peter reported. With optspin, the
> picture is more mixed, with lower throughput at low to moderate number
> of threads and higher throughput with high number of threads.

So, going back to your original table:

> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads vanilla all without optspin
> 1 3.0% -1.0% -1.7%
> 5 7.2% -26.8% 5.5%
> 10 5.2% -10.6% 22.1%
> 20 6.8% 16.4% 12.5%
> 40 -0.2% 32.7% 0.0%
>
> In general, vanilla and no-optspin case perform better with
> pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> worse at low to moderate contention and better at high contention.

it appears that 'without optspin' appears to be a pretty good choice - if
it wasn't for that '1 thread' number, which, if I correctly assume is the
uncontended case, is one of the most common usecases ...

How can the single-threaded case get slower? None of the patches should
really cause noticeable overhead in the non-contended case. That looks
weird.

It would also be nice to see the 2, 3, 4 thread numbers - those are the
most common contention scenarios in practice - where do we see the first
improvement in performance?

Also, it would be nice to include a noise/stddev figure; it's really hard
to tell whether -1.7% is statistically significant.

Thanks,

Ingo

2013-10-16 00:09:40

by Tim Chen

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > The change in throughput of pure mmap with mutex vs. pure mmap is below:
> >
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla    all rwsem patches    without optspin
> > 1 3.0% -1.0% -1.7%
> > 5 7.2% -26.8% 5.5%
> > 10 5.2% -10.6% 22.1%
> > 20 6.8% 16.4% 12.5%
> > 40 -0.2% 32.7% 0.0%
> >
> > So with mutex, the vanilla kernel and the one without optspin both run
> > faster. This is consistent with what Peter reported. With optspin, the
> > picture is more mixed, with lower throughput at low to moderate number
> > of threads and higher throughput with high number of threads.
>
> So, going back to your original table:
>
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads vanilla all without optspin
> > 1 3.0% -1.0% -1.7%
> > 5 7.2% -26.8% 5.5%
> > 10 5.2% -10.6% 22.1%
> > 20 6.8% 16.4% 12.5%
> > 40 -0.2% 32.7% 0.0%
> >
> > In general, vanilla and no-optspin case perform better with
> > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > worse at low to moderate contention and better at high contention.
>
> it appears that 'without optspin' appears to be a pretty good choice - if
> it wasn't for that '1 thread' number, which, if I correctly assume is the
> uncontended case, is one of the most common usecases ...
>
> How can the single-threaded case get slower? None of the patches should
> really cause noticeable overhead in the non-contended case. That looks
> weird.
>
> It would also be nice to see the 2, 3, 4 thread numbers - those are the
> most common contention scenarios in practice - where do we see the first
> improvement in performance?
>
> Also, it would be nice to include a noise/stddev figure; it's really hard
> to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should enhance
performance to real workloads after all.

In my previous tests, I was doing mmap followed immediately by
munmap without doing anything to the memory. No real workload
will behave that way and it is not the scenario that we
should optimize for. A much better approximation of
real usages will be doing mmap, then touching
the memory being mmapped, followed by munmap.

This changes the dynamics of the rwsem as we are now dominated
by read acquisitions of mmap sem due to the page faults, instead
of having only write acquisitions from mmap. In this case, any delay
in write acquisitions will be costly as we will be
blocking a lot of readers. This is where optimistic spinning on
write acquisitions of mmap sem can provide a very significant boost
to the throughput.

I change the test case to the following with writes to
the mmaped memory:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
	int i;

	while (1) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		for (i = 0; i < MEMSIZE; i += 8) {
			c[i] = 0xa;
		}
		munmap(c, MEMSIZE);

		(*iterations)++;
	}
}

I compare the throughput where I have the complete rwsem
patchset against vanilla and the case where I take out the
optimistic spin patch. I have increased the run time
by 10x from my previous experiments and do 10 runs for
each case. The standard deviation is ~1.5%, so any change
under 1.5% is not statistically significant.

% change in throughput vs the vanilla kernel.
Threads all No-optspin
1 +0.4% -0.1%
2 +2.0% +0.2%
3 +1.1% +1.5%
4 -0.5% -1.4%
5 -0.1% -0.1%
10 +2.2% -1.2%
20 +237.3% -2.3%
40 +548.1% +0.3%

For threads 1 to 5, we essentially
have about the same performance as the vanilla case.
We are getting a boost in throughput by 237% for 20 threads
and 548% for 40 threads. Now when we take out
the optimistic spin, we have mostly similar throughput as
the vanilla kernel for this test.

When I looked at the profile of the vanilla
kernel for the 40-thread case, I saw 80% of
cpu time spent contending for the spin lock of the rwsem
wait queue, in rwsem_down_read_failed during page faults.
When I applied the rwsem patchset with optimistic spin,
this lock contention went down to only 2% of cpu time.

Now when I test the case where we acquire mutex in the
user space before mmap, I got the following data versus
vanilla kernel. There's little contention on mmap sem
acquisition in this case.

n all No-optspin
1 +0.8% -1.2%
2 +1.0% -0.5%
3 +1.8% +0.2%
4 +1.5% -0.4%
5 +1.1% +0.4%
10 +1.5% -0.3%
20 +1.4% -0.2%
40 +1.3% +0.4%

Thanks.

Tim


2013-10-16 06:55:34

by Ingo Molnar

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


* Tim Chen <[email protected]> wrote:

> On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > * Tim Chen <[email protected]> wrote:
> >
> > > The change in throughput of pure mmap with mutex vs. pure mmap is below:
> > >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads        vanilla    all rwsem patches    without optspin
> > > 1 3.0% -1.0% -1.7%
> > > 5 7.2% -26.8% 5.5%
> > > 10 5.2% -10.6% 22.1%
> > > 20 6.8% 16.4% 12.5%
> > > 40 -0.2% 32.7% 0.0%
> > >
> > > So with mutex, the vanilla kernel and the one without optspin both run
> > > faster. This is consistent with what Peter reported. With optspin, the
> > > picture is more mixed, with lower throughput at low to moderate number
> > > of threads and higher throughput with high number of threads.
> >
> > So, going back to your original table:
> >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads vanilla all without optspin
> > > 1 3.0% -1.0% -1.7%
> > > 5 7.2% -26.8% 5.5%
> > > 10 5.2% -10.6% 22.1%
> > > 20 6.8% 16.4% 12.5%
> > > 40 -0.2% 32.7% 0.0%
> > >
> > > In general, vanilla and no-optspin case perform better with
> > > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > > worse at low to moderate contention and better at high contention.
> >
> > it appears that 'without optspin' appears to be a pretty good choice - if
> > it wasn't for that '1 thread' number, which, if I correctly assume is the
> > uncontended case, is one of the most common usecases ...
> >
> > How can the single-threaded case get slower? None of the patches should
> > really cause noticeable overhead in the non-contended case. That looks
> > weird.
> >
> > It would also be nice to see the 2, 3, 4 thread numbers - those are the
> > most common contention scenarios in practice - where do we see the first
> > improvement in performance?
> >
> > > Also, it would be nice to include a noise/stddev figure; it's really hard
> > to tell whether -1.7% is statistically significant.
>
> Ingo,
>
> I think that the optimistic spin changes to rwsem should enhance
> performance to real workloads after all.
>
> In my previous tests, I was doing mmap followed immediately by
> munmap without doing anything to the memory. No real workload
> will behave that way and it is not the scenario that we
> should optimize for. A much better approximation of
> real usages will be doing mmap, then touching
> the memory being mmapped, followed by munmap.

That's why I asked for a working testcase to be posted ;-) Not just
pseudocode - send the real .c thing please.

> This changes the dynamics of the rwsem as we are now dominated by read
> acquisitions of mmap sem due to the page faults, instead of having only
> write acquisitions from mmap. [...]

Absolutely, the page fault read case is the #1 optimization target of
rwsems.

> [...] In this case, any delay in write acquisitions will be costly as we
> will be blocking a lot of readers. This is where optimistic spinning on
> write acquisitions of mmap sem can provide a very significant boost to
> the throughput.
>
> I change the test case to the following with writes to
> the mmaped memory:
>
> #define MEMSIZE (1 * 1024 * 1024)
>
> char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
>
> void testcase(unsigned long long *iterations)
> {
> int i;
>
> while (1) {
> char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> assert(c != MAP_FAILED);
> for (i=0; i<MEMSIZE; i+=8) {
> c[i] = 0xa;
> }
> munmap(c, MEMSIZE);
>
> (*iterations)++;
> }
> }

It would be _really_ nice to stick this into tools/perf/bench/ as:

perf bench mem pagefaults

or so, with a number of parallelism and workload patterns. See
tools/perf/bench/numa.c for a couple of workload generators - although
those are not page fault intense.

So that future generations can run all these tests too and such.

> I compare the throughput where I have the complete rwsem patchset
> against vanilla and the case where I take out the optimistic spin patch.
> I have increased the run time by 10x from my previous experiments and do
> 10 runs for each case. The standard deviation is ~1.5%, so any change
> under 1.5% is not statistically significant.
>
> % change in throughput vs the vanilla kernel.
> Threads all No-optspin
> 1 +0.4% -0.1%
> 2 +2.0% +0.2%
> 3 +1.1% +1.5%
> 4 -0.5% -1.4%
> 5 -0.1% -0.1%
> 10 +2.2% -1.2%
> 20 +237.3% -2.3%
> 40 +548.1% +0.3%

The tail is impressive. The early parts are important as well, but it's
really hard to tell the significance of the early portion without having
a stddev column.

( "perf stat --repeat N" will give you sttdev output, in handy percentage
form. )

> Now when I test the case where we acquire mutex in the
> user space before mmap, I got the following data versus
> vanilla kernel. There's little contention on mmap sem
> acquisition in this case.
>
> n all No-optspin
> 1 +0.8% -1.2%
> 2 +1.0% -0.5%
> 3 +1.8% +0.2%
> 4 +1.5% -0.4%
> 5 +1.1% +0.4%
> 10 +1.5% -0.3%
> 20 +1.4% -0.2%
> 40 +1.3% +0.4%
>
> Thanks.

A bit hard to see as there's no comparison _between_ the pthread_mutex and
plain-parallel versions. No contention isn't a great result if performance
suffers because it's all serialized.

Thanks,

Ingo

2013-10-16 18:28:50

by Tim Chen

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations

On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen <[email protected]> wrote:
>
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen <[email protected]> wrote:
> > >
> > > > The change in throughput of pure mmap with mutex vs. pure mmap is below:
> > > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads        vanilla    all rwsem patches    without optspin
> > > > 1 3.0% -1.0% -1.7%
> > > > 5 7.2% -26.8% 5.5%
> > > > 10 5.2% -10.6% 22.1%
> > > > 20 6.8% 16.4% 12.5%
> > > > 40 -0.2% 32.7% 0.0%
> > > >
> > > > So with mutex, the vanilla kernel and the one without optspin both run
> > > > faster. This is consistent with what Peter reported. With optspin, the
> > > > picture is more mixed, with lower throughput at low to moderate number
> > > > of threads and higher throughput with high number of threads.
> > >
> > > So, going back to your original table:
> > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads vanilla all without optspin
> > > > 1 3.0% -1.0% -1.7%
> > > > 5 7.2% -26.8% 5.5%
> > > > 10 5.2% -10.6% 22.1%
> > > > 20 6.8% 16.4% 12.5%
> > > > 40 -0.2% 32.7% 0.0%
> > > >
> > > > In general, vanilla and no-optspin case perform better with
> > > > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > > > worse at low to moderate contention and better at high contention.
> > >
> > > it appears that 'without optspin' appears to be a pretty good choice - if
> > > it wasn't for that '1 thread' number, which, if I correctly assume is the
> > > uncontended case, is one of the most common usecases ...
> > >
> > > How can the single-threaded case get slower? None of the patches should
> > > really cause noticeable overhead in the non-contended case. That looks
> > > weird.
> > >
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are the
> > > most common contention scenarios in practice - where do we see the first
> > > improvement in performance?
> > >
> > > Also, it would be nice to include a noise/stddev figure; it's really hard
> > > to tell whether -1.7% is statistically significant.
> >
> > Ingo,
> >
> > I think that the optimistic spin changes to rwsem should enhance
> > performance to real workloads after all.
> >
> > In my previous tests, I was doing mmap followed immediately by
> > munmap without doing anything to the memory. No real workload
> > will behave that way and it is not the scenario that we
> > should optimize for. A much better approximation of
> > real usages will be doing mmap, then touching
> > the memory being mmapped, followed by munmap.
>
> That's why I asked for a working testcase to be posted ;-) Not just
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test. I'll try
to port the tests to perf bench to make it easier for other people to
run the tests.

>
> > This changes the dynamics of the rwsem as we are now dominated by read
> > acquisitions of mmap sem due to the page faults, instead of having only
> > write acquisitions from mmap. [...]
>
> Absolutely, the page fault read case is the #1 optimization target of
> rwsems.
>
> > [...] In this case, any delay in write acquisitions will be costly as we
> > will be blocking a lot of readers. This is where optimistic spinning on
> > write acquisitions of mmap sem can provide a very significant boost to
> > the throughput.
> >
> > I change the test case to the following with writes to
> > the mmaped memory:
> >
> > #define MEMSIZE (1 * 1024 * 1024)
> >
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> >
> > void testcase(unsigned long long *iterations)
> > {
> > int i;
> >
> > while (1) {
> > char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > assert(c != MAP_FAILED);
> > for (i=0; i<MEMSIZE; i+=8) {
> > c[i] = 0xa;
> > }
> > munmap(c, MEMSIZE);
> >
> > (*iterations)++;
> > }
> > }
>
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
> perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.

Okay, will do.

>
> > I compare the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin patch.
> > I have increased the run time by 10x from my previous experiments and do
> > 10 runs for each case. The standard deviation is ~1.5%, so any change
> > under 1.5% is not statistically significant.
> >
> > % change in throughput vs the vanilla kernel.
> > Threads all No-optspin
> > 1 +0.4% -0.1%
> > 2 +2.0% +0.2%
> > 3 +1.1% +1.5%
> > 4 -0.5% -1.4%
> > 5 -0.1% -0.1%
> > 10 +2.2% -1.2%
> > 20 +237.3% -2.3%
> > 40 +548.1% +0.3%
>
> The tail is impressive. The early parts are important as well, but it's
> really hard to tell the significance of the early portion without having
> a stddev column.

Here's the data with sdv column:

n all sdv No-optspin sdv
1 +0.4% 0.9% -0.1% 0.8%
2 +2.0% 0.8% +0.2% 1.2%
3 +1.1% 0.8% +1.5% 0.6%
4 -0.5% 0.9% -1.4% 1.1%
5 -0.1% 1.1% -0.1% 1.1%
10 +2.2% 0.8% -1.2% 1.0%
20 +237.3% 0.7% -2.3% 1.3%
40 +548.1% 0.8% +0.3% 1.2%


> ( "perf stat --repeat N" will give you sttdev output, in handy percentage
> form. )
>
> > Now when I test the case where we acquire mutex in the
> > user space before mmap, I got the following data versus
> > vanilla kernel. There's little contention on mmap sem
> > acquisition in this case.
> >
> > n all No-optspin
> > 1 +0.8% -1.2%
> > 2 +1.0% -0.5%
> > 3 +1.8% +0.2%
> > 4 +1.5% -0.4%
> > 5 +1.1% +0.4%
> > 10 +1.5% -0.3%
> > 20 +1.4% -0.2%
> > 40 +1.3% +0.4%

Adding std-dev to above data:

n all sdv No-optspin sdv
1 +0.8% 1.0% -1.2% 1.2%
2 +1.0% 1.0% -0.5% 1.0%
3 +1.8% 0.7% +0.2% 0.8%
4 +1.5% 0.8% -0.4% 0.7%
5 +1.1% 1.1% +0.4% 0.3%
10 +1.5% 0.7% -0.3% 0.7%
20 +1.4% 0.8% -0.2% 1.0%
40 +1.3% 0.7% +0.4% 0.5%

> >
> > Thanks.
>
> A bit hard to see as there's no comparison _between_ the pthread_mutex and
> plain-parallel versions. No contention isn't a great result if performance
> suffers because it's all serialized.

Now the data for the pthread-mutex testcase vs. the plain-parallel
testcase, with std-dev:

n vanilla sdv Rwsem-all sdv No-optspin sdv
1 +0.5% 0.9% +1.4% 0.9% -0.7% 1.0%
2 -39.3% 1.0% -38.7% 1.1% -39.6% 1.1%
3 -52.6% 1.2% -51.8% 0.7% -52.5% 0.7%
4 -59.8% 0.8% -59.2% 1.0% -59.9% 0.9%
5 -63.5% 1.4% -63.1% 1.4% -63.4% 1.0%
10 -66.1% 1.3% -65.6% 1.3% -66.2% 1.3%
20 +178.3% 0.9% +182.3% 1.0% +177.7% 1.1%
40 +604.8% 1.1% +614.0% 1.0% +607.9% 0.9%

The version with the full rwsem patchset performs best across the thread counts.
Serialization actually hurts at smaller numbers of threads, even on the
current vanilla kernel.

I'll rerun the tests once I have ported them to perf bench. It may take
me a couple of days.

Thanks.

Tim

2013-10-16 21:55:49

by Tim Chen

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


>
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
> perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.
>
> > I compare the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin patch.
> > I have increased the run time by 10x from my previous experiments and do
> > 10 runs for each case. The standard deviation is ~1.5%, so any change
> > under 1.5% is not statistically significant.
> >
> > % change in throughput vs the vanilla kernel.
> > Threads all No-optspin
> > 1 +0.4% -0.1%
> > 2 +2.0% +0.2%
> > 3 +1.1% +1.5%
> > 4 -0.5% -1.4%
> > 5 -0.1% -0.1%
> > 10 +2.2% -1.2%
> > 20 +237.3% -2.3%
> > 40 +548.1% +0.3%
>
> The tail is impressive. The early parts are important as well, but it's
> really hard to tell the significance of the early portion without having
> a stddev column.
>
> ( "perf stat --repeat N" will give you sttdev output, in handy percentage
> form. )

Quick naive question, as I haven't hacked perf bench before:
perf stat gives statistics of the performance counters or events.
How do I get it to compute the stats of
the throughput reported by perf bench?

Something like

perf stat -r 10 -- perf bench mm memset --iterations 10

doesn't quite give what I need.

Pointers appreciated.

Tim

2013-10-18 06:52:12

by Ingo Molnar

Subject: Re: [PATCH v8 0/9] rwsem performance optimizations


* Tim Chen <[email protected]> wrote:

>
> >
> > It would be _really_ nice to stick this into tools/perf/bench/ as:
> >
> > perf bench mem pagefaults
> >
> > or so, with a number of parallelism and workload patterns. See
> > tools/perf/bench/numa.c for a couple of workload generators - although
> > those are not page fault intense.
> >
> > So that future generations can run all these tests too and such.
> >
> > > I compare the throughput where I have the complete rwsem patchset
> > > against vanilla and the case where I take out the optimistic spin patch.
> > > I have increased the run time by 10x from my previous experiments and do
> > > 10 runs for each case. The standard deviation is ~1.5%, so any change
> > > under 1.5% is not statistically significant.
> > >
> > > % change in throughput vs the vanilla kernel.
> > > Threads all No-optspin
> > > 1 +0.4% -0.1%
> > > 2 +2.0% +0.2%
> > > 3 +1.1% +1.5%
> > > 4 -0.5% -1.4%
> > > 5 -0.1% -0.1%
> > > 10 +2.2% -1.2%
> > > 20 +237.3% -2.3%
> > > 40 +548.1% +0.3%
> >
> > The tail is impressive. The early parts are important as well, but it's
> > really hard to tell the significance of the early portion without having
> > a stddev column.
> >
> > ( "perf stat --repeat N" will give you sttdev output, in handy percentage
> > form. )
>
> Quick naive question as I haven't hacked perf bench before.

Btw., please use tip:master, I've got a few cleanups in there that should
make it easier to hack.

> Now perf stat gives the statistics of the performance counter or events.
> How do I get it to compute the stats of
> the throughput reported by perf bench?

What I do is that I measure the execution time, via:

perf stat --null --repeat 10 perf bench ...

instead of relying on benchmark output.

> Something like
>
> perf stat -r 10 -- perf bench mm memset --iterations 10
>
> doesn't quite give what I need.

Yeah. So, perf bench also has a 'simple' output format:

comet:~/tip> perf bench -f simple sched pipe
10.378

We could extend 'perf stat' with an option to not measure time, but to
take any numeric data output from the executed task and use that as the
measurement result.

If you'd be interested in such a feature I can give it a try.

Thanks,

Ingo