2011-06-15 00:29:21

by Tim Chen

[permalink] [raw]
Subject: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

It seems that the recent change to make the anon_vma->lock into a
mutex (commit 2b575eb6) causes a 52% regression in throughput (2.6.39 vs
3.0-rc2) on the exim mail server workload in the MOSBENCH test suite.

Our test setup is a 4-socket Westmere-EX system with 10 cores per
socket. 40 clients are created on the test machine and send email
to the exim server residing on the same machine.

Exim forks off child processes to handle the incoming mail, and the
process exits after the mail delivery completes. We see quite a bit of
acquisition of the anon_vma->lock as a result.

On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
However, after the switch of the lock to mutex on 3.0-rc2, the mutex
acquisition jumps to 18.6% of cpu. This seems to be the main cause of
the 52% throughput regression.

Other workloads which have a lot of forks/exits may be similarly
affected by this regression. Workloads which are vm lock intensive
could be affected too.

I've listed the profile of 3.0-rc2 and 2.6.39 below for comparison.

Thanks.

Tim


---------------------------
3.0-rc2 profile:

- 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
- 99.54% anon_vma_lock.clone.10
+ 38.99% anon_vma_clone
+ 37.56% unlink_anon_vmas
+ 11.92% anon_vma_fork
+ 11.53% anon_vma_free
+ 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
- do_raw_spin_lock
- 94.11% _raw_spin_lock
+ 47.32% __mutex_lock_common.clone.5
+ 14.23% __mutex_unlock_slowpath
+ 4.06% handle_pte_fault
+ 3.81% __do_fault
+ 3.16% unmap_vmas
+ 2.46% lock_flocks
+ 2.43% copy_pte_range
+ 2.28% __task_rq_lock
+ 1.30% __percpu_counter_add
+ 1.30% dput
+ 1.27% add_partial
+ 1.24% free_pcppages_bulk
+ 1.07% d_alloc
+ 1.07% get_page_from_freelist
+ 1.02% complete_walk
+ 0.89% dget
+ 0.71% new_inode
+ 0.61% __mod_timer
+ 0.58% dup_fd
+ 0.50% double_rq_lock
+ 3.66% _raw_spin_lock_irq
+ 0.87% _raw_spin_lock_bh
+ 2.90% exim [kernel.kallsyms] [k] page_fault
+ 2.25% exim [kernel.kallsyms] [k] mutex_unlock


-----------------------------------

2.6.39 profile:
+ 4.84% exim [kernel.kallsyms] [k] page_fault
+ 3.83% exim [kernel.kallsyms] [k] clear_page_c
- 3.25% exim [kernel.kallsyms] [k] do_raw_spin_lock
- do_raw_spin_lock
- 91.86% _raw_spin_lock
+ 14.16% unlink_anon_vmas
+ 12.54% unlink_file_vma
+ 7.30% anon_vma_clone_batch
+ 6.17% dup_mm
+ 5.77% __do_fault
+ 5.77% __pte_alloc
+ 5.31% lock_flocks
...
+ 3.22% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.27% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 2.02% exim [kernel.kallsyms] [k] copy_page_c
+ 1.63% exim [kernel.kallsyms] [k] __list_del_entry
+ 1.58% exim [kernel.kallsyms] [k] get_page_from_freelist



2011-06-15 00:37:21

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.
>
This patch makes the mutex in Tim's workload take a bit less CPU time
(4% down) but it doesn't really fix the regression. When spinning for a
value it's always better to read it first before attempting to write it.
This saves expensive operations on the interconnect.

So it's not really a fix for this, but may be a slight improvement for
other workloads.

-Andi

From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
From: Andi Kleen <[email protected]>
Date: Tue, 14 Jun 2011 16:27:54 -0700
Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg

Under heavy contention it's better to read first before trying
to do an atomic operation on the interconnect.

This gives a few percent improvement for the mutex CPU time
under heavy contention and likely saves some power too.

Signed-off-by: Andi Kleen <[email protected]>
---
kernel/mutex.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d607ed5..1abffa9 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if (atomic_read(&lock->count) == 1 &&
+ atomic_cmpxchg(&lock->count, 1, 0) == 1) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
--
1.7.4.4

2011-06-15 01:22:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, Jun 14, 2011 at 5:29 PM, Tim Chen <[email protected]> wrote:
>
> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.

Argh. That's nasty.

Even the 3.25% is horrible. We scale so well in other situations that
it's really sad how the anon_vma lock is now one of our worst issues.

Anyway, please check me if I'm wrong, but won't the "anon_vma->root"
be the same for all the anon_vma's that are associated with one
particular vma?

The reason I ask is because when I look at anon_vma_clone(), we do that

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                ...
                anon_vma_chain_link(dst, avc, pavc->anon_vma);
        }

and then we do that anon_vma_lock()/unlock() dance on each of those
pavc->anon_vma's. But if the anon_vma->root is always the same, then
that would mean that we could do the lock just once, and hold it over
the loop.
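
IOW, something along these lines (a completely untested sketch just to
illustrate the idea - it open-codes the anon_vma_chain_link() body and
assumes the shared lock really is anon_vma->root->mutex; I'm writing
the field names from memory, so don't trust the details):

        struct anon_vma *root = NULL;

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                struct anon_vma *anon_vma = pavc->anon_vma;

                avc = anon_vma_chain_alloc();
                if (unlikely(!avc)) {
                        if (root)
                                mutex_unlock(&root->mutex);
                        goto enomem_failure;
                }

                /* take the (hopefully shared) root lock once, not per entry */
                if (!root) {
                        root = anon_vma->root;
                        mutex_lock(&root->mutex);
                }
                /* the open question: can the root ever differ on this list? */
                WARN_ON_ONCE(anon_vma->root != root);

                /* body of anon_vma_chain_link(), minus its lock/unlock */
                avc->vma = dst;
                avc->anon_vma = anon_vma;
                list_add(&avc->same_vma, &dst->anon_vma_chain);
                list_add_tail(&avc->same_anon_vma, &anon_vma->head);
        }
        if (root)
                mutex_unlock(&root->mutex);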

Because I think the real problem with that anon_vma locking is that it
gets called so _much_. We'd be better off holding the lock for a
longer time, and just not do the lock/unlock thing so often. The
contention would go down simply because we wouldn't waste our time
with those atomic lock/unlock instructions as much.

Gaah. I knew exactly how the anon_vma locking worked a few months ago,
but it's complicated enough that I've swapped out all the details. So
I'm not at all sure that the anon_vma->root will be the same for every
anon_vma on the same_vma list.

Somebody hit me over the head with a clue-bat. Anybody?

Linus

2011-06-15 01:27:00

by Shaohua Li

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 08:29 +0800, Tim Chen wrote:
> It seems that the recent change to make the anon_vma->lock into a
> mutex (commit 2b575eb6) causes a 52% regression in throughput (2.6.39 vs
> 3.0-rc2) on the exim mail server workload in the MOSBENCH test suite.
>
> Our test setup is a 4-socket Westmere-EX system with 10 cores per
> socket. 40 clients are created on the test machine and send email
> to the exim server residing on the same machine.
>
> Exim forks off child processes to handle the incoming mail, and the
> process exits after the mail delivery completes. We see quite a bit of
> acquisition of the anon_vma->lock as a result.
>
> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.
>
> Other workloads which have a lot of forks/exits may be similarly
> affected by this regression. Workloads which are vm lock intensive
> could be affected too.
>
> I've listed the profile of 3.0-rc2 and 2.6.39 below for comparison.
>
> Thanks.
>
> Tim
>
>
> ---------------------------
> 3.0-rc2 profile:
>
> - 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
> - __mutex_lock_common.clone.5
> - 99.99% __mutex_lock_slowpath
> - mutex_lock
> - 99.54% anon_vma_lock.clone.10
> + 38.99% anon_vma_clone
> + 37.56% unlink_anon_vmas
> + 11.92% anon_vma_fork
> + 11.53% anon_vma_free
> + 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> - 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
> - do_raw_spin_lock
> - 94.11% _raw_spin_lock
> + 47.32% __mutex_lock_common.clone.5
> + 14.23% __mutex_unlock_slowpath
> + 4.06% handle_pte_fault
> + 3.81% __do_fault
> + 3.16% unmap_vmas
> + 2.46% lock_flocks
> + 2.43% copy_pte_range
> + 2.28% __task_rq_lock
> + 1.30% __percpu_counter_add
> + 1.30% dput
> + 1.27% add_partial
> + 1.24% free_pcppages_bulk
> + 1.07% d_alloc
> + 1.07% get_page_from_freelist
> + 1.02% complete_walk
> + 0.89% dget
> + 0.71% new_inode
> + 0.61% __mod_timer
> + 0.58% dup_fd
> + 0.50% double_rq_lock
> + 3.66% _raw_spin_lock_irq
> + 0.87% _raw_spin_lock_bh
> + 2.90% exim [kernel.kallsyms] [k] page_fault
> + 2.25% exim [kernel.kallsyms] [k] mutex_unlock
>
>
> -----------------------------------
>
> 2.6.39 profile:
> + 4.84% exim [kernel.kallsyms] [k] page_fault
> + 3.83% exim [kernel.kallsyms] [k] clear_page_c
> - 3.25% exim [kernel.kallsyms] [k] do_raw_spin_lock
> - do_raw_spin_lock
> - 91.86% _raw_spin_lock
> + 14.16% unlink_anon_vmas
> + 12.54% unlink_file_vma
> + 7.30% anon_vma_clone_batch
What are you testing? I didn't see Andi's batch anon->lock-for-fork
patches merged in 2.6.39.

Thanks,
Shaohua

2011-06-15 03:43:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, Jun 14, 2011 at 6:21 PM, Linus Torvalds
<[email protected]> wrote:
>
> Anyway, please check me if I'm wrong, but won't the "anon_vma->root"
> be the same for all the anon_vma's that are associated with one
> particular vma?
>
> The reason I ask [...]

So here's a trial patch that moves the anon_vma locking one level up
in the anon_vma_clone() call chain. It actually does allow the root to
change, but has a WARN_ON_ONCE() if that ever happens.

I *suspect* this will help the locking numbers a bit, but I'd like to
note that it does this *only* for the anon_vma_clone() case, and the
exact same thing should be done for the exit case too (ie the
unlink_anon_vmas()). So if it does work it's still just one step on
the way, and there would be more work along the same lines to possibly
improve the locking further.

The patch is "tested" in the sense that I booted the kernel and am
running it right now (and compiled a kernel with it). But that's not a
whole lot of actual real life testing, so caveat emptor.

And I won't really even guarantee that the main locking problem is a
long chain of "same_vma" anon_vmas, which is what this handles with
just a single lock. So who knows - maybe it doesn't help at all. I
suspect it's worth testing, though.

Linus


Attachments:
patch.diff (1.72 kB)

2011-06-15 10:38:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> MOSBENCH test suite.

Argh, I'm trying to get this thing to run, but its all snake poo..

/me takes up a heavy club and goes snake hunting, should make a pretty
hat or something.

2011-06-15 10:59:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:36 +0200, Peter Zijlstra wrote:
> On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> > MOSBENCH test suite.
>
> Argh, I'm trying to get this thing to run, but its all snake poo..
>
> /me takes up a heavy club and goes snake hunting, should make a pretty
> hat or something.

Sweet, I've got meself a snake-skin hat!

The first thing that stood out when running it was:

31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
24 root -2 19 0 0 0 S 11.7 0.0 0:04.15 rcuc3
34 root -2 19 0 0 0 S 11.7 0.0 0:04.10 rcuc5
39 root -2 19 0 0 0 S 11.7 0.0 0:06.38 rcuc6
44 root -2 19 0 0 0 S 11.7 0.0 0:04.53 rcuc7
49 root -2 19 0 0 0 S 11.7 0.0 0:04.11 rcuc8
79 root -2 19 0 0 0 S 11.7 0.0 0:03.91 rcuc14
89 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc16
110 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc20
120 root -2 19 0 0 0 S 11.7 0.0 0:03.82 rcuc22
13 root -2 19 0 0 0 S 10.7 0.0 0:04.37 rcuc1
19 root -2 19 0 0 0 S 10.7 0.0 0:04.19 rcuc2
29 root -2 19 0 0 0 S 10.7 0.0 0:04.12 rcuc4
54 root -2 19 0 0 0 S 10.7 0.0 0:04.11 rcuc9
59 root -2 19 0 0 0 S 10.7 0.0 0:04.40 rcuc10
64 root -2 19 0 0 0 R 10.7 0.0 0:04.17 rcuc11
69 root -2 19 0 0 0 R 10.7 0.0 0:04.23 rcuc12
84 root -2 19 0 0 0 S 10.7 0.0 0:03.90 rcuc15
95 root -2 19 0 0 0 S 10.7 0.0 0:03.99 rcuc17
100 root -2 19 0 0 0 S 10.7 0.0 0:03.88 rcuc18
105 root -2 19 0 0 0 S 10.7 0.0 0:04.14 rcuc19
125 root -2 19 0 0 0 S 10.7 0.0 0:03.79 rcuc23
74 root -2 19 0 0 0 S 9.7 0.0 0:04.33 rcuc13
115 root -2 19 0 0 0 R 9.7 0.0 0:03.82 rcuc21

Which is an impressive amount of RCU usage..

2011-06-15 11:43:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:58 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 12:36 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> > > MOSBENCH test suite.
> >
> > Argh, I'm trying to get this thing to run, but its all snake poo..
> >
> > /me takes up a heavy club and goes snake hunting, should make a pretty
> > hat or something.
>
> Sweet, I've got meself a snake-skin hat!
>
> The first thing that stood out when running it was:
>
> 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> 24 root -2 19 0 0 0 S 11.7 0.0 0:04.15 rcuc3
> 34 root -2 19 0 0 0 S 11.7 0.0 0:04.10 rcuc5
> 39 root -2 19 0 0 0 S 11.7 0.0 0:06.38 rcuc6
> 44 root -2 19 0 0 0 S 11.7 0.0 0:04.53 rcuc7
> 49 root -2 19 0 0 0 S 11.7 0.0 0:04.11 rcuc8
> 79 root -2 19 0 0 0 S 11.7 0.0 0:03.91 rcuc14
> 89 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc16
> 110 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc20
> 120 root -2 19 0 0 0 S 11.7 0.0 0:03.82 rcuc22
> 13 root -2 19 0 0 0 S 10.7 0.0 0:04.37 rcuc1
> 19 root -2 19 0 0 0 S 10.7 0.0 0:04.19 rcuc2
> 29 root -2 19 0 0 0 S 10.7 0.0 0:04.12 rcuc4
> 54 root -2 19 0 0 0 S 10.7 0.0 0:04.11 rcuc9
> 59 root -2 19 0 0 0 S 10.7 0.0 0:04.40 rcuc10
> 64 root -2 19 0 0 0 R 10.7 0.0 0:04.17 rcuc11
> 69 root -2 19 0 0 0 R 10.7 0.0 0:04.23 rcuc12
> 84 root -2 19 0 0 0 S 10.7 0.0 0:03.90 rcuc15
> 95 root -2 19 0 0 0 S 10.7 0.0 0:03.99 rcuc17
> 100 root -2 19 0 0 0 S 10.7 0.0 0:03.88 rcuc18
> 105 root -2 19 0 0 0 S 10.7 0.0 0:04.14 rcuc19
> 125 root -2 19 0 0 0 S 10.7 0.0 0:03.79 rcuc23
> 74 root -2 19 0 0 0 S 9.7 0.0 0:04.33 rcuc13
> 115 root -2 19 0 0 0 R 9.7 0.0 0:03.82 rcuc21
>
> Which is an impressive amount of RCU usage..

FWIW, Alex Shi's patch:

http://lkml.kernel.org/r/1308029185.15392.147.camel@sli10-conroe

Improves the situation to:

3745 root 20 0 26664 1460 1212 S 18.5 0.0 0:01.28 exim
39 root -2 19 0 0 0 S 4.9 0.0 0:02.83 rcuc6
105 root -2 19 0 0 0 S 4.9 0.0 0:02.79 rcuc19
7 root -2 19 0 0 0 S 3.9 0.0 0:02.70 rcuc0
13 root -2 19 0 0 0 S 3.9 0.0 0:02.54 rcuc1
19 root -2 19 0 0 0 S 3.9 0.0 0:02.76 rcuc2
24 root -2 19 0 0 0 S 3.9 0.0 0:02.75 rcuc3
...

And throughput increases like:

-tip 260.092 messages/sec/core
-tip+sirq-rcu 271.078 messages/sec/core

2011-06-15 11:53:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 09:26 +0800, Shaohua Li wrote:

> On Wed, 2011-06-15 at 08:29 +0800, Tim Chen wrote:
> > + 7.30% anon_vma_clone_batch

> What are you testing? I didn't see Andi's batch anon->lock-for-fork
> patches merged in 2.6.39.

Good spot, that certainly isn't plain .39.

It looks like those (http://marc.info/?l=linux-mm&m=130533041726258) are
similar to Linus' patch, except Linus takes the hard line that the root
lock should stay the same. Let me try Linus' patch first to see if this
workload can trigger his WARN.

/me mutters something about patches in attachments and rebuilds.

OK, the WARN doesn't trigger, but it also doesn't improve things (quite
the opposite in fact):

-tip 260.092 messages/sec/core
+sirq-rcu 271.078 messages/sec/core
+linus 262.435 messages/sec/core

So Linus' patch makes the throughput drop from 271 to 262, weird.

/me goes re-test without the sirq-rcu bits mixed in just to make sure.

2011-06-15 12:50:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 13:52 +0200, Peter Zijlstra wrote:
> /me goes re-test without the sirq-rcu bits mixed in just to make sure.

I switched from PREEMPT=n to PREEMPT_VOLUNTARY=y, which seemed to make a
difference:

.39 257.651 messages/sec/core
-tip 254.976 messages/sec/core
+linus 258.03 messages/sec/core
+sirq 265.951 messages/sec/core

2011-06-15 16:19:48

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> Good spot, that certainly isn't plain .39.

True.

>
> It looks like those (http://marc.info/?l=linux-mm&m=130533041726258) are
> similar to Linus' patch, except Linus takes the hard line that the root

One difference from Linus' patch is that it does batching for both fork
and exit. I suspect Linus' patch could probably do that too.

But the fork+exit patch only improved things slightly; it's very unlikely
to recover the 52%. Maybe the overall locking here needs to be revisited.

And in general it looks like blind conversion from spinlock to mutex
is a bad idea right now.

-Andi

2011-06-15 16:42:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:

> And in general it looks like blind conversion from spinlock to mutex
> is a bad idea right now.

For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.

I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair
like the ticket locks are. It could be significant with larger socket
count since their cacheline arbitration is more sucky.
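
To make that concrete, the fairness difference is roughly the following
(a toy userspace sketch with C11 atomics and made-up names, not the
kernel's actual ticket-lock implementation): a ticket lock hands out
strictly ordered tickets and each waiter just spins reading until its
number comes up, whereas the mutex spin path has everybody racing
cmpxchg on ->count, so whoever happens to win the cacheline wins the
lock.

#include <stdatomic.h>

struct ticket_lock {
        atomic_uint next;       /* next ticket to hand out */
        atomic_uint owner;      /* ticket currently being served */
};

/* usage: struct ticket_lock l = { 0, 0 }; ticket_lock_acquire(&l); ... */

static void ticket_lock_acquire(struct ticket_lock *lock)
{
        /* arrival order fixes acquisition order: strictly FIFO */
        unsigned int me = atomic_fetch_add_explicit(&lock->next, 1,
                                                    memory_order_relaxed);

        /* waiters only read while spinning; no stores on the cacheline */
        while (atomic_load_explicit(&lock->owner, memory_order_acquire) != me)
                ;       /* cpu_relax()/pause would go here */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
        /* hand the lock to the next ticket in line */
        unsigned int next = atomic_load_explicit(&lock->owner,
                                                 memory_order_relaxed) + 1;

        atomic_store_explicit(&lock->owner, next, memory_order_release);
}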


2011-06-15 16:48:29

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 06:45:37PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:
>
> > And in general it looks like blind conversion from spinlock to mutex
> > is a bad idea right now.
>
> For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.

With only one other guy active a lot of things are quite a bit easier.
Basically 2S is a trivial case here.

> I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair

The Intel 4S systems are fair, but ticketing still helps significantly
because it has a lot nicer interconnect behaviour.

> like the ticket locks are. It could be significant with larger socket
> count since their cacheline arbitration is more sucky.

It gets a bit better with the patch I sent earlier to read the count
first, but yes it's a problem. However I'm not sure that even
with that fixed mutexes will be as good as plain ticket locks.

Also certainly it's no short term fix for 3.0. Right now
we still have this terrible regression.

-Andi
--
[email protected] -- Speaking for myself only

2011-06-15 18:42:46

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 18:45 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:
>
> > And in general it looks like blind conversion from spinlock to mutex
> > is a bad idea right now.
>
> For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.
>
> I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair
> like the ticket locks are. It could be significant with larger socket
> count since their cacheline arbitration is more sucky.
>
>
>

Peter,

Wonder if you can provide the profile on your run so I can compare with
what I got on 4 sockets?

Thanks.

Tim

2011-06-15 19:11:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
>
> The first thing that stood out when running it was:
>
> 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
...
>
> Which is an impressive amount of RCU usage..

Gaah. Can we just revert that crazy "use threads for RCU" thing already?

It's wrong. It's clearly expensive. It's using threads FOR NO GOOD
REASON, since the only reason for using them is config options that
nobody even uses, for chrissake!

And it results in real problems. For example, if you use "perf record"
to see what the hell is up, the use of kernel threads for RCU
callbacks means that the RCU cost is never even seen. I don't know how
Tim did his profiling to figure out the costs, and I don't know how he
decided that the spinlock to semaphore conversion was the culprit, but
it is entirely possible that Tim didn't actually bisect the problem,
but instead used "perf record" on the exim task, saw that the
semaphore costs had gone up, and decided that it must be the
conversion.

And sure, maybe 50% of it was the conversion, and maybe 50% of it the
RCU changes - and "perf record" just never showed the RCU component.
We already know that it causes huge slowdowns on some other loads. We
just don't know.

So using anonymous kernel threads is actually a real downside. It
makes it much less obvious what is going on. We saw that exact same
thing with the generic worker thread conversions: things that used to
have clear performance issues ("oh, the iwl-phy0 thread is using 3% of
CPU time because it is polling for IO, and I can see that in 'top'")
turned into much-harder-to-see issues ("oh, kwork0 is using 3% CPU
time according to 'top' - I have no idea why").

Now, with RCU using softirq's, clearly the costs of RCU can sometimes
be mis-attributed because it turns out that the softirq is run from
some other thread. But statistically, if you end up having a heavy
softirq load, it _usually_ ends up being triggered in the context of
whoever causes that load. Not always, and not reliably, but I suspect
it ends up being easier to see.

And quite frankly, just look at commit a26ac2455ffc: it sure as hell
isn't making anything simpler. It adds several hundred lines of code,
and it's already been implicated in one major performance regression,
and is a possible reason for this one.

So Ingo, Paul: can we *please* just revert it, and agree that if you
want to re-instate it, the code should be

(a) only done for the case where it matters (ie for the RCUBOOST case)

(b) tested better for performance issues (and maybe shared with the
tinyrcu case that also uses threads?)

Please? It's more than a revert of that one commit - there's tons of
commits on top of that to actually do the boosting etc (and fixing
some of the fallout). But really, by now I'd prefer to just revert it
all, rather than see if it can be fixed up.. According to Peter,
Shaohua Li's patch that largely fixes the performance issue for the
other load (by moving *some* of the RCU stuff back to softirq context)
helps, but still leaves the rcu threads with a lot of CPU time.

Considering that most of the RCU callbacks are not very CPU intensive,
I bet that there's a *ton* of them, and that the context switch
overhead is quite noticeable. And quite frankly, even if Shaohua Li's
patch largely fixes the performance issue, it does so by making the
RCU situation EVEN MORE COMPLEX, with RCU now using *both* threads and
softirq.

That's just crazy. If you really want to do both the threads and
softirq thing for the magical RCU_BOOST case, go ahead, but please
don't do crazy things for the sane configurations.

Linus

2011-06-15 19:25:41

by Andrew Morton

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 15 Jun 2011 12:11:19 -0700
Linus Torvalds <[email protected]> wrote:

> So using anonymous kernel threads is actually a real downside. It
> makes it much less obvious what is going on. We saw that exact same
> thing with the generic worker thread conversions: things that used to
> have clear performance issues ("oh, the iwl-phy0 thread is using 3% of
> CPU time because it is polling for IO, and I can see that in 'top'")
> turned into much-harder-to-see issues ("oh, kwork0 is using 3% CPU
> time according to 'top' - I have no idea why").

Yes, this is an issue with the memcg async reclaim patches. One
implementation uses a per-memcg kswapd and you can then actually see
what it's doing, and see when it goes nuts (as kswapd threads like to
do). The other implementation uses worker threads and you have no clue
what's going on.

It could be that if more things move away from dedicated threads and
into worker threads, we'll need to build a separate accounting system
so we can see how much time worker threads are spending on a
per-handler basis. Which means a new top-like tool, etc.

That's all pretty nasty and is a tradeoff which should be considered
when making thread-vs-worker decisions.

2011-06-15 20:13:33

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Linus Torvalds <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > The first thing that stood out when running it was:
> >
> > 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> > 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> ...
> >
> > Which is an impressive amount of RCU usage..
>
> Gaah. Can we just revert that crazy "use threads for RCU" thing already?

I have this fix queued up currently:

09223371deac: rcu: Use softirq to address performance regression

and that's ready for pulling and should fix this regression:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core-urgent-for-linus

The revert itself looks quite hairy, i just attempted it: it affects
half a dozen other followup commits. We might be better off using the
(tested) commit above and then shutting down all kthread based
processing and always use a softirq ... or something like that.

if you think that's risky and we should do the revert then i'll
rebase the core/urgent branch and we'll do the revert.

Paul, Peter, what do you think?

Thanks,

Ingo

------------------>
Paul E. McKenney (1):
rcu: Simplify curing of load woes

Shaohua Li (1):
rcu: Use softirq to address performance regression


Documentation/filesystems/proc.txt | 1 +
include/linux/interrupt.h | 1 +
include/trace/events/irq.h | 3 +-
kernel/rcutree.c | 88 ++++++++++++++++-------------------
kernel/rcutree.h | 1 +
kernel/rcutree_plugin.h | 20 ++++----
kernel/softirq.c | 2 +-
tools/perf/util/trace-event-parse.c | 1 +
8 files changed, 57 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index f481780..db3b1ab 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -843,6 +843,7 @@ Provides counts of softirq handlers serviced since boot time, for each cpu.
TASKLET: 0 0 0 290
SCHED: 27035 26983 26971 26746
HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250


1.3 IDE devices in /proc/ide
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 6c12989..f6efed0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -414,6 +414,7 @@ enum
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
+ RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */

NR_SOFTIRQS
};
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index ae045ca..1c09820 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -20,7 +20,8 @@ struct softirq_action;
softirq_name(BLOCK_IOPOLL), \
softirq_name(TASKLET), \
softirq_name(SCHED), \
- softirq_name(HRTIMER))
+ softirq_name(HRTIMER), \
+ softirq_name(RCU))

/**
* irq_handler_entry - called immediately before the irq action handler
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 89419ff..ae5c9ea 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -100,6 +100,7 @@ static char rcu_kthreads_spawnable;

static void rcu_node_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
static void invoke_rcu_cpu_kthread(void);
+static void __invoke_rcu_cpu_kthread(void);

#define RCU_KTHREAD_PRIO 1 /* RT priority for per-CPU kthreads. */

@@ -1442,13 +1443,21 @@ __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
}

/* If there are callbacks ready, invoke them. */
- rcu_do_batch(rsp, rdp);
+ if (cpu_has_callbacks_ready_to_invoke(rdp))
+ __invoke_rcu_cpu_kthread();
+}
+
+static void rcu_kthread_do_work(void)
+{
+ rcu_do_batch(&rcu_sched_state, &__get_cpu_var(rcu_sched_data));
+ rcu_do_batch(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
+ rcu_preempt_do_callbacks();
}

/*
* Do softirq processing for the current CPU.
*/
-static void rcu_process_callbacks(void)
+static void rcu_process_callbacks(struct softirq_action *unused)
{
__rcu_process_callbacks(&rcu_sched_state,
&__get_cpu_var(rcu_sched_data));
@@ -1465,7 +1474,7 @@ static void rcu_process_callbacks(void)
* the current CPU with interrupts disabled, the rcu_cpu_kthread_task
* cannot disappear out from under us.
*/
-static void invoke_rcu_cpu_kthread(void)
+static void __invoke_rcu_cpu_kthread(void)
{
unsigned long flags;

@@ -1479,6 +1488,11 @@ static void invoke_rcu_cpu_kthread(void)
local_irq_restore(flags);
}

+static void invoke_rcu_cpu_kthread(void)
+{
+ raise_softirq(RCU_SOFTIRQ);
+}
+
/*
* Wake up the specified per-rcu_node-structure kthread.
* Because the per-rcu_node kthreads are immortal, we don't need
@@ -1613,7 +1627,7 @@ static int rcu_cpu_kthread(void *arg)
*workp = 0;
local_irq_restore(flags);
if (work)
- rcu_process_callbacks();
+ rcu_kthread_do_work();
local_bh_enable();
if (*workp != 0)
spincnt++;
@@ -1635,6 +1649,20 @@ static int rcu_cpu_kthread(void *arg)
* to manipulate rcu_cpu_kthread_task. There might be another CPU
* attempting to access it during boot, but the locking in kthread_bind()
* will enforce sufficient ordering.
+ *
+ * Please note that we cannot simply refuse to wake up the per-CPU
+ * kthread because kthreads are created in TASK_UNINTERRUPTIBLE state,
+ * which can result in softlockup complaints if the task ends up being
+ * idle for more than a couple of minutes.
+ *
+ * However, please note also that we cannot bind the per-CPU kthread to its
+ * CPU until that CPU is fully online. We also cannot wait until the
+ * CPU is fully online before we create its per-CPU kthread, as this would
+ * deadlock the system when CPU notifiers tried waiting for grace
+ * periods. So we bind the per-CPU kthread to its CPU only if the CPU
+ * is online. If its CPU is not yet fully online, then the code in
+ * rcu_cpu_kthread() will wait until it is fully online, and then do
+ * the binding.
*/
static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
@@ -1647,12 +1675,14 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
if (IS_ERR(t))
return PTR_ERR(t);
- kthread_bind(t, cpu);
+ if (cpu_online(cpu))
+ kthread_bind(t, cpu);
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
- per_cpu(rcu_cpu_kthread_task, cpu) = t;
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ per_cpu(rcu_cpu_kthread_task, cpu) = t;
+ wake_up_process(t); /* Get to TASK_INTERRUPTIBLE quickly. */
return 0;
}

@@ -1759,12 +1789,11 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = 99;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
}
return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

-static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
-
/*
* Spawn all kthreads -- called as soon as the scheduler is running.
*/
@@ -1772,30 +1801,18 @@ static int __init rcu_spawn_kthreads(void)
{
int cpu;
struct rcu_node *rnp;
- struct task_struct *t;

rcu_kthreads_spawnable = 1;
for_each_possible_cpu(cpu) {
per_cpu(rcu_cpu_has_work, cpu) = 0;
- if (cpu_online(cpu)) {
+ if (cpu_online(cpu))
(void)rcu_spawn_one_cpu_kthread(cpu);
- t = per_cpu(rcu_cpu_kthread_task, cpu);
- if (t)
- wake_up_process(t);
- }
}
rnp = rcu_get_root(rcu_state);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- if (rnp->node_kthread_task)
- wake_up_process(rnp->node_kthread_task);
if (NUM_RCU_NODES > 1) {
- rcu_for_each_leaf_node(rcu_state, rnp) {
+ rcu_for_each_leaf_node(rcu_state, rnp)
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- t = rnp->node_kthread_task;
- if (t)
- wake_up_process(t);
- rcu_wake_one_boost_kthread(rnp);
- }
}
return 0;
}
@@ -2221,31 +2238,6 @@ static void __cpuinit rcu_prepare_kthreads(int cpu)
}

/*
- * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
- * but the RCU threads are woken on demand, and if demand is low this
- * could be a while triggering the hung task watchdog.
- *
- * In order to avoid this, poke all tasks once the CPU is fully
- * up and running.
- */
-static void __cpuinit rcu_online_kthreads(int cpu)
-{
- struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
- struct rcu_node *rnp = rdp->mynode;
- struct task_struct *t;
-
- t = per_cpu(rcu_cpu_kthread_task, cpu);
- if (t)
- wake_up_process(t);
-
- t = rnp->node_kthread_task;
- if (t)
- wake_up_process(t);
-
- rcu_wake_one_boost_kthread(rnp);
-}
-
-/*
* Handle CPU online/offline notification events.
*/
static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
@@ -2262,7 +2254,6 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
rcu_prepare_kthreads(cpu);
break;
case CPU_ONLINE:
- rcu_online_kthreads(cpu);
case CPU_DOWN_FAILED:
rcu_node_kthread_setaffinity(rnp, -1);
rcu_cpu_kthread_setrt(cpu, 1);
@@ -2410,6 +2401,7 @@ void __init rcu_init(void)
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
__rcu_init_preempt();
+ open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

/*
* We don't need protection against CPU-hotplug here because
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 7b9a08b..0fed6b9 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -439,6 +439,7 @@ static void rcu_preempt_offline_cpu(int cpu);
#endif /* #ifdef CONFIG_HOTPLUG_CPU */
static void rcu_preempt_check_callbacks(int cpu);
static void rcu_preempt_process_callbacks(void);
+static void rcu_preempt_do_callbacks(void);
void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU)
static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c8bff30..38d09c5 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -602,6 +602,11 @@ static void rcu_preempt_process_callbacks(void)
&__get_cpu_var(rcu_preempt_data));
}

+static void rcu_preempt_do_callbacks(void)
+{
+ rcu_do_batch(&rcu_preempt_state, &__get_cpu_var(rcu_preempt_data));
+}
+
/*
* Queue a preemptible-RCU callback for invocation after a grace period.
*/
@@ -997,6 +1002,10 @@ static void rcu_preempt_process_callbacks(void)
{
}

+static void rcu_preempt_do_callbacks(void)
+{
+}
+
/*
* Wait for an rcu-preempt grace period, but make it happen quickly.
* But because preemptible RCU does not exist, map to rcu-sched.
@@ -1299,15 +1308,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
return 0;
}

-static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
-{
- if (rnp->boost_kthread_task)
- wake_up_process(rnp->boost_kthread_task);
-}
-
#else /* #ifdef CONFIG_RCU_BOOST */

static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
@@ -1331,10 +1335,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

-static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
-{
-}
-
#endif /* #else #ifdef CONFIG_RCU_BOOST */

#ifndef CONFIG_SMP
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 1396017..40cf63d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -58,7 +58,7 @@ DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

char *softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
- "TASKLET", "SCHED", "HRTIMER"
+ "TASKLET", "SCHED", "HRTIMER", "RCU"
};

/*
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 1e88485..0a7ed5b 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2187,6 +2187,7 @@ static const struct flag flags[] = {
{ "TASKLET_SOFTIRQ", 6 },
{ "SCHED_SOFTIRQ", 7 },
{ "HRTIMER_SOFTIRQ", 8 },
+ { "RCU_SOFTIRQ", 9 },

{ "HRTIMER_NORESTART", 0 },
{ "HRTIMER_RESTART", 1 },

2011-06-15 20:12:33

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:11 -0700, Linus Torvalds wrote:

>
> And it results in real problems. For example, if you use "perf record"
> to see what the hell is up, the use of kernel threads for RCU
> callbacks means that the RCU cost is never even seen. I don't know how
> Tim did his profiling to figure out the costs, and I don't know how he
> decided that the spinlock to semaphore conversion was the culprit, but
> it is entirely possible that Tim didn't actually bisect the problem,
> but instead used "perf record" on the exim task, saw that the
> semaphore costs had gone up, and decided that it must be the
> conversion.
>

Yes, I was using perf to do the profiling. I thought that the mutex
conversion was the most likely culprit based on the change in profile.

Tim

2011-06-15 20:17:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andrew Morton <[email protected]> wrote:

> It could be that if more things move away from dedicated threads
> and into worker threads, we'll need to build a separate accounting
> system so we can see how much time worker threads are spending on a
> per-handler basis. Which means a new top-like tool, etc.

perf record -g will go a long way towards such a tool already - but i
think it would be useful to create a more top-alike view as well.

Thanks,

Ingo

2011-06-15 20:17:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Tim Chen <[email protected]> wrote:

> On Wed, 2011-06-15 at 12:11 -0700, Linus Torvalds wrote:
>
> >
> > And it results in real problems. For example, if you use "perf record"
> > to see what the hell is up, the use of kernel threads for RCU
> > callbacks means that the RCU cost is never even seen. I don't know how
> > Tim did his profiling to figure out the costs, and I don't know how he
> > decided that the spinlock to semaphore conversion was the culprit, but
> > it is entirely possible that Tim didn't actually bisect the problem,
> > but instead used "perf record" on the exim task, saw that the
> > semaphore costs had gone up, and decided that it must be the
> > conversion.
> >
>
> Yes, I was using perf to do the profiling. I thought that the mutex
> conversion was the most likely culprit based on the change in
> profile.

have you used callgraph profiling (perf record -g) or flat profiling?
Flat profiling can be misleading when there's proxy work done.

Thanks,

Ingo

2011-06-15 20:20:31

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 22:17 +0200, Ingo Molnar wrote:

>
> have you used callgraph profiling (perf record -g) or flat profiling?
> Flat profiling can be misleading when there's proxy work done.
>

I've used callgraph profiling to see the call stack.

Tim

2011-06-15 20:31:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 10:12:16PM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > The first thing that stood out when running it was:
> > >
> > > 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> > > 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> > ...
> > >
> > > Which is an impressive amount of RCU usage..
> >
> > Gaah. Can we just revert that crazy "use threads for RCU" thing already?
>
> I have this fix queued up currently:
>
> 09223371deac: rcu: Use softirq to address performance regression
>
> and that's ready for pulling and should fix this regression:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core-urgent-for-linus
>
> The revert itself looks quite hairy, i just attempted it: it affects
> half a dozen other followup commits. We might be better off using the
> (tested) commit above and then shutting down all kthread based
> processing and always use a softirq ... or something like that.
>
> if you think that's risky and we should do the revert then i'll
> rebase the core/urgent branch and we'll do the revert.
>
> Paul, Peter, what do you think?

It would be much lower risk to make the current code always use softirq
if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.

But if we must do the revert, we can do the revert. A small matter of
software and all that.

Thanx, Paul


2011-06-15 20:33:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 11:43 -0700, Tim Chen wrote:

> Wonder if you can provide the profile on your run so I can compare with
> what I got on 4 sockets?

Sure, so this is on a Westmere-EP (2 sockets, 6 cores/socket, 2
threads/core); what I did was:

perf record -r2 -gf make bench
perf report > foo.txt
bzip2 -9 foo.txt

Both files are about 0.5M; the tip one has the sirq-rcu patch and Linus'
patch applied (I could do one without if wanted).

http://programming.kicks-ass.net/sekrit/tip.txt.bz2
http://programming.kicks-ass.net/sekrit/39.txt.bz2

However, looking at them, the weird thing is, they're both dominated by
(taken from 39.txt):

7.44% exim [kernel.kallsyms] [k] format_decode
|
--- format_decode
|
|--93.07%-- vsnprintf
| |
| |--98.83%-- seq_printf
| | show_stat
| | seq_read
| | proc_reg_read
| | vfs_read
| | sys_read
| | system_call
| | __GI___libc_read
| | |
| | |--99.47%-- (nil)
| | --0.53%-- [...]
| |
| --1.17%-- snprintf
| proc_flush_task
| release_task
| wait_consider_task
| do_wait
| sys_wait4
| system_call
| |
| |--93.15%-- __libc_wait
| |
| --6.85%-- __waitpid
|
|--6.84%-- seq_printf
| show_stat
| seq_read
| proc_reg_read
| vfs_read
| sys_read
| system_call
| __GI___libc_read
| |
| |--99.56%-- (nil)
| --0.44%-- [...]
--0.10%-- [...]


I've no idea why it's doing that. I've had massive trouble getting this
MOSBENCH crap working in the first place since it's all in Python, but
what I basically did was rip out everything !exim in config.py and put
cores = [24]. In hosts.py I also ripped out everything !exim, cleared out
the clients list and made 'tom' my localhost (removing that perflock
thing).

After that things more or less ran, I saw exim, and it's giving me those
msgs/sec/core numbers like:

# perf record -r2 -gfo 39.perf.data make bench
python config.py
Starting results in: results/20110615-221914
*** Starting configuration 1/1 (benchmark-exim) ***
Starting Host.host-westmere...
sending westmere: /./
del.ing westmere: out/log/EximLoad.trial-2.host-westmere
del.ing westmere: out/log/EximLoad.trial-1.host-westmere
del.ing westmere: out/log/EximLoad.trial-0.host-westmere
del.ing westmere: out/log/
del.ing westmere: out/EximDaemon.host-westmere.configure
del.ing westmere: out/
sending westmere: /home/root/
sending westmere: /home/root/test/mosbench/
Starting Host.host-westmere... done
Starting HostInfo.host-westmere...
Starting HostInfo.host-westmere... done
Starting FileSystem.host-westmere.fstype-tmpfs-separate...
Starting FileSystem.host-westmere.fstype-tmpfs-separate... done
Starting SetCPUs.host-westmere...
FATAL: Module oprofile not found.
FATAL: Module oprofile not found.
Kernel doesn't support oprofile
CPUs 0-23 are online
CPUs 0-23 are online
Starting SetCPUs.host-westmere... done
Starting EximDaemon.host-westmere...
Starting EximDaemon.host-westmere... done
Waiting on EximLoad.trial-0.host-westmere...
[EximLoad.trial-0.host-westmere] => 86983 messages (15.0032 secs, 241.568 messages/sec/core)
Waiting on EximLoad.trial-0.host-westmere... done
Waiting on EximLoad.trial-1.host-westmere...
[EximLoad.trial-1.host-westmere] => 86770 messages (15.004 secs, 240.964 messages/sec/core)
Waiting on EximLoad.trial-1.host-westmere... done
Waiting on EximLoad.trial-2.host-westmere...
[EximLoad.trial-2.host-westmere] => 86987 messages (15.0035 secs, 241.574 messages/sec/core)
Waiting on EximLoad.trial-2.host-westmere... done
Stopping EximDaemon.host-westmere...
Stopping EximDaemon.host-westmere... done
Stopping HostInfo.host-westmere...
Stopping HostInfo.host-westmere... done
Stopping Host.host-westmere...
copying westmere: ./
copying westmere: EximDaemon.host-westmere.configure
copying westmere: log/
copying westmere: log/EximLoad.trial-0.host-westmere
copying westmere: log/EximLoad.trial-1.host-westmere
copying westmere: log/EximLoad.trial-2.host-westmere
Stopping Host.host-westmere... done
Stopping ResultPath...
Results in: results/20110615-221914/benchmark-exim
Stopping ResultPath... done
All results in: results/20110615-221914
[ perf record: Woken up 3774 times to write data ]
[ perf record: Captured and wrote 979.494 MB 39.perf.data (~42794760 samples) ]
CPUs 0-23 are online

2011-06-15 20:47:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



"Paul E. McKenney" <[email protected]> wrote:
>
>It would be much lower risk to make the current code always use softirq
>if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.

I don't care if it's a real revert or not, but I want the threads gone. Entirely. Not just the patch that uses softirqs for some things, and threads for the callbacks. No, I don't want the threads to show up or exist at all.

And to be sure, I'd like the code to set up and use the threads to actually compile away statically, so that there clearly isn't some way it's partially enabled.

Linus

2011-06-15 20:54:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 01:47:33PM -0700, Linus Torvalds wrote:
>
>
> "Paul E. McKenney" <[email protected]> wrote:
> >
> >It would be much lower risk to make the current code always use softirq
> >if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.
>
> I don't care if it's a real revert or not, but I want the threads gone. Entirely. Not just the patch that uses softirqs for some things, and threads for the callbacks. No, I don't want the threads to show up or exist at all.
>
> And to be sure, I'd like the code to set up and use the threads to actually compile away statically, so that there clearly isn't some way it's partially enabled.

Yes, the kthread creation will happen only if RCU_BOOST=y. Otherwise,
there will be no RCU kthreads at all.

Thanx, Paul

2011-06-15 20:55:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Ingo Molnar <[email protected]> wrote:
>
>perf record -g will go a long way towards such a tool already - but i
>think it would be useful to create a more top-alike view as well.

perf record -g doesn't help when the issue is that we're recording a single process and the actual work is being done in another unrelated process that is just being woken up.

Sure, you can do a system wide recording, but that shows all kinds of unrelated noise and requires root permissions.

So those rcu threads really need to go.

Linus

2011-06-15 20:57:31

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> 7.44% exim [kernel.kallsyms] [k] format_decode
> |
> --- format_decode


This is a glibc issue. exim calls libdb, and libdb asks sysconf for the
number of CPUs to tune its locking, and glibc reads /proc/stat. And
/proc/stat is incredibly slow.

I would blame glibc, but in this case it's really the kernel to blame
for not providing a proper interface.

This was my motivation for the sysconf() syscall I submitted some time ago.
https://lkml.org/lkml/2011/5/13/455

Anyways a quick workaround is to use this LD_PRELOAD:
http://halobates.de/smallsrc/sysconf.c
But it's not 100% equivalent.
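
For illustration only, a minimal sketch of what such an LD_PRELOAD override might look like (assumptions: it keys off _SC_NPROCESSORS_ONLN and uses sched_getaffinity() as the fast path; this is not the actual sysconf.c from the URL above):

/* Hedged sketch, not Andi's sysconf.c: answer _SC_NPROCESSORS_ONLN
 * from sched_getaffinity() and fall through to the real sysconf()
 * for everything else.
 *
 *   gcc -shared -fPIC -o sysconf.so sysconf-sketch.c -ldl
 *   LD_PRELOAD=./sysconf.so exim ...
 */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <dlfcn.h>

long sysconf(int name)
{
	static long (*real_sysconf)(int);

	if (name == _SC_NPROCESSORS_ONLN) {
		cpu_set_t set;

		/* Count the CPUs in our affinity mask; cheap, but can
		 * differ from the true online count when affinity is
		 * restricted -- hence "not 100% equivalent". */
		if (sched_getaffinity(0, sizeof(set), &set) == 0)
			return CPU_COUNT(&set);
	}
	if (!real_sysconf)
		real_sysconf = dlsym(RTLD_NEXT, "sysconf");
	return real_sysconf(name);
}

Because it counts CPUs in the affinity mask rather than parsing /proc/stat, the result can differ when affinity is restricted, which is one reason such a preload is not fully equivalent.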

-Andi


2011-06-15 21:06:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Ingo Molnar <[email protected]> wrote:
>
>I have this fix queued up currently:
>
> 09223371deac: rcu: Use softirq to address performance regression

I really don't think that is even close to enough.

It still does all the callbacks in the threads, and according to Peter, about half the rcu time in the threads remained..

Linus

2011-06-15 21:11:42

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 13:57 -0700, Andi Kleen wrote:
> > 7.44% exim [kernel.kallsyms] [k] format_decode
> > |
> > --- format_decode
>
>
> This is a glibc issue. exim calls libdb, and libdb asks sysconf for the
> number of CPUs to tune its locking, and glibc reads /proc/stat. And
> /proc/stat is incredibly slow.
>
> I would blame glibc, but in this case it's really the kernel to blame
> for not providing a proper interface.
>
> This was my motivation for the sysconf() syscall I submitted some time ago.
> https://lkml.org/lkml/2011/5/13/455
>
> Anyways a quick workaround is to use this LD_PRELOAD:
> http://halobates.de/smallsrc/sysconf.c
> But it's not 100% equivalent.
>

Thanks to Andi for providing the info. We've used this workaround in
our testing so it will not mask true kernel scaling bottlenecks.

Tim

2011-06-15 21:15:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 02:05:46PM -0700, Linus Torvalds wrote:
>
>
> Ingo Molnar <[email protected]> wrote:
> >
> >I have this fix queued up currently:
> >
> > 09223371deac: rcu: Use softirq to address performance regression
>
> I really don't think that is even close to enough.
>
> It still does all the callbacks in the threads, and according to Peter, about half the rcu time in the threads remained..

I am putting together a patch that gets rid of the kthreads in the
!RCU_BOOST case. The time will still be consumed, but in softirq
context, though of course with many fewer context switches.

Thanx, Paul

2011-06-15 21:28:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



"Paul E. McKenney" <[email protected]> wrote:
>
> The time will still be consumed, but in softirq
>context, though of course with many fewer context switches.

So the problem with threads goes way beyond just the context switches, or even just the problem with tracing.

Threads change the batching behavior, for example. That is *especially* true with background threads or with realtime threads. Both end up having high priority -either because they are realtime, or because they've been sleeping and thus have been building up extra priority that way.

So when you wake up such a thread, suddenly you get preemption behaviour, or you get the semaphores deciding to break out of their busy loops due to need_resched being set etc.

In contrast, softirqs don't have those kinds of side effects. They have a much smaller effect on system behavior and just run the code we ask them to.

Linus

2011-06-15 21:38:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 14:12 -0700, Tim Chen wrote:
> Thanks to Andi for providing the info. We've used this workaround in
> our testing so it will not mask true kernel scaling bottlenecks.


http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2

tip+sirq+linus is still slightly faster than .39 here, although removing
that sysconf() wreckage closed the gap considerably (needing to know the
number of cpus to optimize locking sounds like a trainwreck all of its
own, needing it _that_ often instead of just once at startup is even
worse).

2011-06-15 21:52:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Peter Zijlstra <[email protected]> wrote:
>removing
>that sysconf() wreckage closed the gap considerably (needing to know
>the
>number of cpus to optimize locking sounds like a trainwreck all of its
>own, needing it _that_ often instead of just once at startup is even
>worse).

Yeah, I think it's ridiculous to say that glibc is not doing something stupid and that it's a problem with kernel interfaces.

Do the proc file parsing once and cache the result in a static variable. Doing it over and over again is just crazy.

Adding new system calls because glibc is crazy is insane.

Linus

2011-06-15 22:15:39

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> (needing to know the
> number of cpus to optimize locking sounds like a trainwreck all of its
> own, needing it _that_ often instead of just once at startup is even
> worse).

libdb does it only once per startup, it's just that it gets restarted
for every child (it's a library, not a server).

Really the kernel just needs to provide a faster way to get that.
Requiring /proc/stat for that is just insane.

I'll resend the sysconf patchkit I guess :-)

-Andi

2011-06-15 22:19:15

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> Yeah, I think it's ridiculous to say that glibc is not doing something stupid and that it's a problem with kernel interfaces.
>
> Do the proc file parsing once and cache the result in a static variable. Doing it over and over again is just crazy.
>
> Adding new system calls because glibc is crazy is insane.

Caching doesn't help because the library gets reinitialized in every
child (it may already do caching, not fully sure about this; it does it
for other sysconfs at least).

I don't think glibc is crazy in this. It has no other choice.

-Andi

2011-06-16 00:17:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 3:19 PM, Andi Kleen <[email protected]> wrote:
>
> Caching doesn't help because the library gets reinitialized in every child
> (it may already do caching, not fully sure for this; it does it for other
> sysconfs at least)

Why the hell do you continue to make excuses for glibc that are
*clearly*not*true*?

Stop this insanity, Andi. Do you realize that this kind of crazy
behavior just makes me convinced that there is no way in hell I should
*ever* take your sysconf patch, since all your analysis for it is
totally worthless?

JUST LOOK AT THE NUMBERS, for chrissake!

When format_decode is 7% of the whole workload, and the top 15
functions of the profile look like this:

6.40% exim [kernel.kallsyms] [k] format_decode
5.26% exim [kernel.kallsyms] [k] page_fault
5.05% exim [kernel.kallsyms] [k] vsnprintf
3.55% exim [kernel.kallsyms] [k] number
3.00% exim [kernel.kallsyms] [k] copy_page_c
2.88% exim [kernel.kallsyms] [k] read_hpet
2.38% exim libc-2.13.90.so [.] __GI_vfprintf
1.92% exim [kernel.kallsyms] [k] kstat_irqs
1.53% exim [kernel.kallsyms] [k] find_vma
1.47% exim [kernel.kallsyms] [k] _raw_spin_lock
1.40% exim [kernel.kallsyms] [k] seq_printf
1.34% exim [kernel.kallsyms] [k] radix_tree_lookup
1.21% exim [kernel.kallsyms] [k] page_cache_get_speculative
1.20% exim [kernel.kallsyms] [k] clear_page_c
1.05% exim [kernel.kallsyms] [k] do_page_fault

I can pretty much guarantee that it doesn't do just one /proc/stat
read per fork() just to get the number of CPU's.

/proc/stat may be slow, but it's not slower than doing real work -
unless you call it millions of times.

And you didn't actually look at glibc sources, did you? Because if you
had, you would ALSO have seen that you are totally full of sh*t. Glibc
at no point caches anything.

So repeat after me: stop making excuses and lying about glibc. It's
crap. End of story.

> I don't think glibc is crazy in this. It has no other choice.

Stop this insanity, Andi. Why do you lie or just make up arguments? WHY?

There is very clearly no caching going on. And since exim doesn't even
execve, it just forks, it's very clear that it could cache things just
ONCE, so your argument that caching wouldn't be possible at that level
is also bogus.

I can certainly agree that /proc/stat isn't wonderful (it used to be
better), but that's no excuse for just totally making up excuses for
just plain bad *stupid* behavior in user space. And it certainly
doesn't excuse just making shit up!

Linus

2011-06-16 01:07:57

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 23:37 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 14:12 -0700, Tim Chen wrote:
> > Thanks to Andi for providing the info. We've used this workaround in
> > our testing so it will not mask true kernel scaling bottlenecks.
>
>
> http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
> http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2
>
> tip+sirq+linus is still slightly faster than .39 here, although removing
> that sysconf() wreckage closed the gap considerably (needing to know the
> number of cpus to optimize locking sounds like a trainwreck all of its
> own, needing it _that_ often instead of just once at startup is even
> worse).
>

Peter,

Fengguang's readahead fixes for tmpfs removed another bottleneck before
anon_vma->lock became dominant (https://lkml.org/lkml/2011/4/26/143).
We found this issue when we were testing exim earlier.
The patchset was merged in 3.0-rc2 but not in plain 2.6.39, so with it
applied on 2.6.39 we should get a better comparison with 3.0-rc2.

Thanks.

Tim

2011-06-16 01:50:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 2:37 PM, Peter Zijlstra <[email protected]> wrote:
>
> http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
> http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2
>
> tip+sirq+linus is still slightly faster than .39 here,

Hmm. Your profile doesn't show the mutex slowpath at all, so there's a
big difference to the one Tim quoted parts of.

In fact, your profile looks fine. The load clearly spends tons of time
in page faulting and in timing things (that read_hpet thing is
disgusting), but with that in mind, the profile doesn't look scary.
Yes, the 2% spinlock time is bad, but you've clearly not hit the real
lock contention case. The mutex lock shows up, but _way_ below the
spinlock, and the slowpath never shows at all. You end up having
mutex_spin_on_owner at 0.09%, it's not really visible.

Clearly going from your two-socket 12-core thing to Tim's four-socket
40-core case is a big jump. But maybe it really was about RCU, and
even the limited softirq patch that moves the grace period stuff etc
back to softirqs ends up helping.

Tim, have you tried running your bigger load with that patch? You
could try my patch on top too just to match Peter's tree, but I doubt
that's the big first-order issue.

Linus

2011-06-16 07:04:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Linus Torvalds <[email protected]> wrote:

>
>
> Ingo Molnar <[email protected]> wrote:
> >
> > I have this fix queued up currently:
> >
> > 09223371deac: rcu: Use softirq to address performance regression
>
> I really don't think that is even close to enough.

Yeah.

> It still does all the callbacks in the threads, and according to
> Peter, about half the rcu time in the threads remained..

You are right - things that are a few percent on a 24 core machine
will definitely go exponentially worse on larger boxen. We'll get rid
of the kthreads entirely.

The funny thing about this workload is that context-switches are
really a fastpath here and we are using anonymous IRQ-triggered
softirqs embedded in random task contexts as a workaround for that.

[ I think we'll have to revisit this issue and do it properly:
quiescent state is mostly defined by context-switches here, so we
could do the RCU callbacks from the task that turns a CPU
quiescent, right in the scheduler context-switch path - perhaps
with an option for SCHED_FIFO tasks to *not* do GC.

That could possibly be more cache-efficient than softirq execution,
as we'll process a still-hot pool of callbacks instead of doing
them only once per timer tick. It will also make the RCU GC
behavior HZ independent. ]

In any case the proxy kthread model clearly sucked, no argument about
that.

Thanks,

Ingo

2011-06-16 17:16:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 09:03:35AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> >
> >
> > Ingo Molnar <[email protected]> wrote:
> > >
> > > I have this fix queued up currently:
> > >
> > > 09223371deac: rcu: Use softirq to address performance regression
> >
> > I really don't think that is even close to enough.
>
> Yeah.
>
> > It still does all the callbacks in the threads, and according to
> > Peter, about half the rcu time in the threads remained..
>
> You are right - things that are a few percent on a 24 core machine
> will definitely go exponentially worse on larger boxen. We'll get rid
> of the kthreads entirely.

I did indeed at one time have access to larger test systems than I
do now, and I clearly need to fix that. :-/

> The funny thing about this workload is that context-switches are
> really a fastpath here and we are using anonymous IRQ-triggered
> softirqs embedded in random task contexts as a workaround for that.

The other thing that the IRQ-triggered softirqs do is to get the callbacks
invoked in cases where a CPU-bound user thread is never context switching.
Of course, one alternative might be to set_need_resched() to force entry
into the scheduler as needed.

> [ I think we'll have to revisit this issue and do it properly:
> quiescent state is mostly defined by context-switches here, so we
> could do the RCU callbacks from the task that turns a CPU
> quiescent, right in the scheduler context-switch path - perhaps
> with an option for SCHED_FIFO tasks to *not* do GC.

I considered this approach for TINY_RCU, but dropped it in favor of
reducing the interlocking between the scheduler and RCU callbacks.
Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU callback
invocation, then there will need to be some override for CPUs with lots
of SCHED_FIFO load, probably similar to RCU's current blimit stuff.

> That could possibly be more cache-efficient than softirq execution,
> as we'll process a still-hot pool of callbacks instead of doing
> them only once per timer tick. It will also make the RCU GC
> behavior HZ independent. ]

Well, the callbacks will normally be cache-cold in any case due to the
grace-period delay, but on the other hand, both tick-independence and
the ability to shield a given CPU from RCU callback execution might be
quite useful. The tick currently does the following for RCU:

1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
quiescent state).

2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
rcu_bh quiescent state).

3. Kicks the current CPU's RCU core processing as needed in
response to actions from other CPUs.

Frederic's work avoiding ticks in long-running user-mode tasks
might take care of #1, and it should be possible to make use of
the current dyntick-idle APIs to deal with #2. Replacing #3
efficiently will take some thought.

> In any case the proxy kthread model clearly sucked, no argument about
> that.

Indeed, I lost track of the global nature of real-time scheduling. :-(

Whatever does the boosting will need to have process context and
can be subject to delays, so that pretty much needs to be a kthread.
But it will context-switch quite rarely, so should not be a problem.

Thanx, Paul

2011-06-16 20:15:32

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> /proc/stat may be slow, but it's not slower than doing real work -
> unless you call it millions of times.


I haven't analyzed it in detail, but I suspect it's some cache line
bounce, which can slow things down quite a lot. Also the total number
of invocations is quite high (hundreds of messages per core * 32 cores).

Ok, even with cache line bouncing it's suspicious.

> And you didn't actually look at glibc sources, did you?

I did, but I gave up fully following that code path because it's so
convoluted :-/

Ok if you want I can implement caching in the LD_PRELOAD and see
if it changes things.

> There is very clearly no caching going on. And since exim doesn't even
> execve, it just forks, it's very clear that it could cache things just
> ONCE, so your argument that caching wouldn't be possible at that level
> is also bogus.

So you mean caching it at startup time? Otherwise the parent would
need to do sysconf() at least, which it doesn't do (the exim source
doesn't really know anything about libdb internals).

That would add /proc/stat overhead to every program execution. Is that
what you are proposing?

-Andi

2011-06-16 20:27:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Paul E. McKenney <[email protected]> wrote:

> > The funny thing about this workload is that context-switches are
> > really a fastpath here and we are using anonymous IRQ-triggered
> > softirqs embedded in random task contexts as a workaround for
> > that.
>
> The other thing that the IRQ-triggered softirqs do is to get the
> callbacks invoked in cases where a CPU-bound user thread is never
> context switching.

Yeah - but this workload didnt have that.

> Of course, one alternative might be to set_need_resched() to force
> entry into the scheduler as needed.

No need for that: we can just do the callback not in softirq but in
regular syscall context in that case, in the return-to-userspace
notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
facility)
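
For reference, a rough sketch of how that facility is wired up (the register/callback plumbing follows include/linux/user-return-notifier.h; the callback body is a hypothetical placeholder, not an actual RCU hook):

/* Sketch only: hook into the return-to-userspace path via
 * USER_RETURN_NOTIFIER. The callback body is hypothetical. */
#include <linux/user-return-notifier.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct user_return_notifier, rcu_urn);

static void rcu_user_return(struct user_return_notifier *urn)
{
	/* Hypothetical: drive this CPU's RCU core processing here,
	 * instead of raising RCU_SOFTIRQ from the tick. */
}

/* Must be called with preemption disabled; this sets
 * TIF_USER_RETURN_NOTIFY so the callback fires on the next
 * return to user space on this CPU. */
static void rcu_arm_user_return(void)
{
	struct user_return_notifier *urn = &__get_cpu_var(rcu_urn);

	urn->on_user_return = rcu_user_return;
	user_return_notifier_register(urn);
}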

Abusing a facility like setting need_resched artificially will
generally cause trouble.

> > [ I think we'll have to revisit this issue and do it properly:
> > quiescent state is mostly defined by context-switches here, so we
> > could do the RCU callbacks from the task that turns a CPU
> > quiescent, right in the scheduler context-switch path - perhaps
> > with an option for SCHED_FIFO tasks to *not* do GC.
>
> I considered this approach for TINY_RCU, but dropped it in favor of
> reducing the interlocking between the scheduler and RCU callbacks.
> Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU
> callback invocation, then there will need to be some override for
> CPUs with lots of SCHED_FIFO load, probably similar to RCU's
> current blimit stuff.

I wouldnt complicate it much for SCHED_FIFO: SCHED_FIFO tasks are
special and should never run long.

> > That could possibly be more cache-efficient than softirq execution,
> > as we'll process a still-hot pool of callbacks instead of doing
> > them only once per timer tick. It will also make the RCU GC
> > behavior HZ independent. ]
>
> Well, the callbacks will normally be cache-cold in any case due to
> the grace-period delay, [...]

The workloads that are the most critical in this regard tend to be
context switch intense, so the grace period expiry latency should be
pretty short.

Or at least significantly shorter than today's HZ frequency, right?
HZ would still provide an upper bound for the latency.

Btw., the current worst-case grace period latency is in reality more
like two timer ticks: one for the current CPU to expire and another
for the longest "other CPU" expiry, right? Average expiry (for
IRQ-poor workloads) would be 1.5 timer ticks. (if i got my stat
calculations right!)

> [...] but on the other hand, both tick-independence and the ability
> to shield a given CPU from RCU callback execution might be quite
> useful. [...]

Yeah.

> [...] The tick currently does the following for RCU:
>
> 1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
> quiescent state).
>
> 2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
> rcu_bh quiescent state).
>
> 3. Kicks the current CPU's RCU core processing as needed in
> response to actions from other CPUs.
>
> Frederic's work avoiding ticks in long-running user-mode tasks
> might take care of #1, and it should be possible to make use of the
> current dyntick-idle APIs to deal with #2. Replacing #3
> efficiently will take some thought.

What is the longest delay the scheduler tick can take typically - 40
msecs? That would then be the worst-case grace period latency for
workloads that neither do context switches nor trigger IRQs, right?

> > In any case the proxy kthread model clearly sucked, no argument
> > about that.
>
> Indeed, I lost track of the global nature of real-time scheduling.
> :-(

Btw., i think that test was pretty bad: running exim as SCHED_FIFO??

But it does not excuse the kthread model.

> Whatever does the boosting will need to have process context and
> can be subject to delays, so that pretty much needs to be a
> kthread. But it will context-switch quite rarely, so should not be
> a problem.

So user-return notifiers ought to be the ideal platform for that,
right? We don't even have to touch the scheduler: anything that
schedules will eventually return to user-space, at which point the
RCU GC magic can run.

And user-return-notifiers can be triggered from IRQs as well.

That allows us to get rid of softirqs altogether and maybe even speed
the whole thing up and allow it to be isolated better.

Thanks,

Ingo

2011-06-16 20:25:42

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 18:50 -0700, Linus Torvalds wrote:

>
> Tim, have you tried running your bigger load with that patch? You
> could try my patch on top too just to match Peter's tree, but I doubt
> that's the big first-order issue.
>
> Linus

I ran exim with different kernel versions. Using the 2.6.39-vanilla
kernel as a baseline, the results are as follows:

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%) (note: tmpfs readahead patchset is merged in 3.0-rc2)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus 115.7% (+15.7%)
3.0-rc2+linus+softirq 86.2% (-17.3%)

So Linus' patch certainly helped things over vanilla 3.0-rc2, but throughput is still
less than 2.6.39 with the readahead patch set. The softirq patch I used was Ingo's
combined patch from Shaohua and Paul. It seems odd that it makes things worse. I will
recheck this data later, probably with just that patch and without Linus' patch.

I also notice that the run to run variations have increased quite a bit for 3.0-rc2.
I'm using 6 runs per kernel. Perhaps a side effect of converting the anon_vma->lock to mutex?

(Max-Min)/avg
2.6.39(vanilla) 3%
2.6.39+ra-patch 3%
3.0-rc2(vanilla) 20%
3.0-rc2+linus 36%
3.0-rc2+linus+softirq 40%

Thanks.

Tim

2011-06-16 21:04:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:14 PM, Andi Kleen <[email protected]> wrote:
>
> I haven't analyzed it in detail, but I suspect it's some cache line
> bounce, which can slow things down quite a lot. Also the total number
> of invocations is quite high (hundreds of messages per core * 32 cores).

The fact is, glibc is just total crap.

I tried to send uli a patch to just add caching. No go. I sent
*another* patch to at least make glibc use a sane interface (and the
cache if it needs to fall back on /proc/stat for some legacy reason).
We'll see what happens.

Paul Eggert suggested "caching for one second" - by just calling
"gettimeofday()" to see how old the cache is. That would work too.

The point I'm making is that it really is a glibc problem. Glibc is
doing stupid expensive things, and not trying to correct for the fact
that it's expensive.

> I did, but I gave up fully following that code path because it's so
> convoluted :-/

I do agree that glibc sources are incomprehensible, with multiple
layers of abstraction (sysdeps, "posix", helper functions etc etc).

In this case it was really trivial to find the culprit with a simple

git grep /proc/stat

though. The code is crap. It's insane. It's using
/sys/devices/system/cpu for _SC_NPROCESSORS_CONF, which is at least a
reasonable interface to use. But it does it in odd ways, and actually
counts the CPU's by doing a readdir call. And it doesn't cache the
result, even though that particular result had better be 100% stable -
it has nothing to do with "online" vs "offline" etc.

But then for _SC_NPROCESSORS_ONLN, it doesn't actually use
/sys/devices/system/cpu at all, but the /proc/stat interface. Which is
slow, mostly because it has all the crazy interrupt stuff in it, but
also because it has lots of legacy stuff.

I wrote a _much_ cleaner routine (loosely based on what we do in
tools/perf) to just parse /sys/devices/system/cpu/online. I didn't
even time it, but I can almost guarantee that it's an order of
magnitude faster than /proc/stat. And if that doesn't work, you can
fall back on a cached version of the /proc/stat parsing, since if
those files don't exist, you can forget about CPU hotplug.
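
To make that concrete, a sketch of such a parser (assuming the usual range-list format, e.g. "0-23" or "0-3,8-11"; this is not the routine that was actually sent to glibc):

/* Sketch: count online CPUs from /sys/devices/system/cpu/online,
 * which holds a comma-separated list of "N" or "N-M" ranges. */
#include <stdio.h>
#include <stdlib.h>

static int count_online_cpus(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/online", "r");
	char buf[256];
	int count = 0;

	if (!f)
		return -1;	/* caller falls back to /proc/stat */
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Walk the comma-separated list of "N" or "N-M" ranges. */
	for (char *p = buf; *p && *p != '\n'; ) {
		char *end;
		long start = strtol(p, &end, 10);
		long stop = start;

		if (*end == '-')
			stop = strtol(end + 1, &end, 10);
		count += stop - start + 1;
		p = (*end == ',') ? end + 1 : end;
	}
	return count;
}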

> So you mean caching it at startup time? Otherwise the parent would
> need to do sysconf() at least , which it doesn't do (the exim source doesn't
> really know anything about libdb internals)

Even if you do it in the children, it will help. At least it would be
run just _once_ per fork.

But actually looking at glibc just shows that they are simply doing
stupid things. And I absolutely _refuse_ to add new interfaces to the
kernel only because glibc is being a moron.

Linus

2011-06-16 21:10:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:26 PM, Tim Chen <[email protected]> wrote:
>
> I ran exim with different kernel versions.  Using the 2.6.39-vanilla
> kernel as a baseline, the results are as follows:
>
>                         Throughput
> 2.6.39(vanilla)         100.0%
> 2.6.39+ra-patch         166.7%  (+66.7%)   (note: tmpfs readahead patchset is merged in 3.0-rc2)
> 3.0-rc2(vanilla)         68.0%  (-32%)
> 3.0-rc2+linus           115.7%  (+15.7%)
> 3.0-rc2+linus+softirq    86.2%  (-17.3%)

Ok, so batching the semaphore operations makes more of a difference
than I would have expected.

I guess I'll cook up an improved patch that does it for the vma exit
case too, and see if that just makes the semaphores be a non-issue.

> I also notice that the run to run variations have increased quite a bit for 3.0-rc2.
> I'm using 6 runs per kernel.  Perhaps a side effect of converting the anon_vma->lock to mutex?

So the thing about using the mutexes is that heavy contention on a
spinlock is very stable: it may be *slow*, but it's reliable, nicely
queued, and has very few surprises.

On a mutex, heavy contention results in very subtle behavior, with the
adaptive spinning often - but certainly not always - making the mutex
act as a spinlock, but once you have lots of contention the adaptive
spinning breaks down. And then you have lots of random interactions
with the scheduler and 'need_resched' etc.

The only valid answer to lock contention is invariably always just
"don't do that then". We've been pretty good at getting rid of
problematic locks, but this one clearly isn't one of the ones we've
fixed ;)

Linus

2011-06-16 21:01:36

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > The funny thing about this workload is that context-switches are
> > > really a fastpath here and we are using anonymous IRQ-triggered
> > > softirqs embedded in random task contexts as a workaround for
> > > that.
> >
> > The other thing that the IRQ-triggered softirqs do is to get the
> > callbacks invoked in cases where a CPU-bound user thread is never
> > context switching.
>
> Yeah - but this workload didnt have that.
>
> > Of course, one alternative might be to set_need_resched() to force
> > entry into the scheduler as needed.
>
> No need for that: we can just do the callback not in softirq but in
> regular syscall context in that case, in the return-to-userspace
> notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> facility)
>
> Abusing a facility like setting need_resched artificially will
> generally cause trouble.

If the task enqueued callbacks in the kernel, thus started a new grace period,
it might return to userspace before every CPUs have completed that grace period,
and you need that full completion to happen before invoking the callbacks.

I think you need to keep the tick in such case because you can't count on
the other CPUs to handle that completion as they may be all idle.

So when you resume to userspace and you started a GP, either you find another
CPU to handle the GP completion and callbacks executions, or you keep the tick
until you are done.

2011-06-16 21:03:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.

That's not necessarily true. Consider a router which only routes and never runs
user space. You would starve it. Granted, it's somewhat obscure, but it's possible.

-Andi
--
[email protected] -- Speaking for myself only

2011-06-16 21:28:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:47 PM, Linus Torvalds
<[email protected]> wrote:
>
> I guess I'll cook up an improved patch that does it for the vma exit
> case too, and see if that just makes the semaphores be a non-issue.

Ok, I bet it doesn't make them a non-issue, but if doing this in
anon_vma_clone() helped a lot, then doing the exact same pattern in
unlink_anon_vmas() hopefully helps some more.

This patch is UNTESTED! It replaces my previous one (it's really just
an extension of it), and while I actually test-booted that previous
one I did *not* do it for this one. So please look out. But it's using
the exact same pattern, so there should be no real surprises.

Does it improve things further on your load?

(Btw, I'm not at all certain about that "we can get an empty
anon_vma_chain" comment. I left it - and the test for a NULL anon_vma
- in the code, but I think it's bogus. If we've linked in the
anon_vma_chain, it will have an anon_vma associated with it, I'm
pretty sure)

VM people, please do comment on both that "empty anon_vma_chain"
issue, and on whether we can ever have two different anon_vma roots in
the 'same_vma' list. I have that WARN_ON_ONCE() there in both paths, I
just wonder whether we should just inconditionally take the first
entry in the list and lock it outside the whole loop instead?

Peter? Hugh?

Linus

2011-06-16 22:10:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:05 PM, Linus Torvalds
<[email protected]> wrote:
>
> This patch is UNTESTED!

It was also UNATTACHED!

Now it's attached.

Linus


Attachments:
patch.diff (3.35 kB)

2011-06-16 21:33:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:06 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jun 16, 2011 at 2:05 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> This patch is UNTESTED!
>
> It was also UNATTACHED!

Hmm. And it doesn't work. We deadlock when we free the anon_vma
because the *freeing* path wants to take the anon_vma lock. See that
horrid code in anon_vma_free().

So now we hold the root lock over the whole series of frees, and get an
instant deadlock.

We also can happen to free the root anon_vma before we release the
lock in it, which is another slight problem ;)

So the unlink_anon_vmas() case is actually much more complicated than
the clone case.

In other words, just forget that second patch. I'll have to think about it.

Linus

2011-06-16 22:00:41

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> Ok, I bet it doesn't make them a non-issue, but if doing this in
> anon_vma_clone() helped a lot, then doing the exact same pattern in
> unlink_anon_vmas() hopefully helps some more.
Hi Linus,

I did essentially the same thing (just in a less elegant, more paranoid
way) in my old batching patches:

http://comments.gmane.org/gmane.linux.kernel.mm/63130

They made it a bit better, but overall it's still much worse than 2.6.35
(.36 introduced the root chain locking). I guess we'll see whether it
can recover the mutex regression, but I'm not optimistic.

-Andi

2011-06-16 22:24:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 22:25 +0200, Ingo Molnar wrote:

> > Whatever does the boosting will need to have process context and
> > can be subject to delays, so that pretty much needs to be a
> > kthread. But it will context-switch quite rarely, so should not be
> > a problem.
>
> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.
>
> And user-return-notifiers can be triggered from IRQs as well.
>
> That allows us to get rid of softirqs altogether and maybe even speed
> the whole thing up and allow it to be isolated better.

I'm a little worried of relying on things returning to userspace.

One could imagine something like a router appliance where userspace is
essentially asleep forever and everything happens in the kernel
(networking via softirq, maybe NFS kernel server, ...)

Cheers,
Ben.

2011-06-16 22:39:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Benjamin Herrenschmidt <[email protected]> wrote:

> On Thu, 2011-06-16 at 22:25 +0200, Ingo Molnar wrote:
>
> > > Whatever does the boosting will need to have process context
> > > and can be subject to delays, so that pretty much needs to be a
> > > kthread. But it will context-switch quite rarely, so should not
> > > be a problem.
> >
> > So user-return notifiers ought to be the ideal platform for that,
> > right? We don't even have to touch the scheduler: anything that
> > schedules will eventually return to user-space, at which point
> > the RCU GC magic can run.
> >
> > And user-return-notifiers can be triggered from IRQs as well.
> >
> > That allows us to get rid of softirqs altogether and maybe even
> > speed the whole thing up and allow it to be isolated better.
>
> I'm a little worried of relying on things returning to userspace.
>
> One could imagine something like a router appliance where userspace
> is essentially asleep forever and everything happens in the kernel
> (networking via softirq, maybe NFS kernel server, ...)

There's a crazy solution for that: the idle thread could process RCU
callbacks carefully, as if it was running user-space code.

/me runs

Ok, joke aside: this is simply a special case where the idle thread
generates RCU work via hardirqs. The idle thread is arguably special
and could be handled in a special way: a helper thread that executes
only in this case?

Thanks,

Ingo

2011-06-16 22:47:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> There's a crazy solution for that: the idle thread could process RCU
> callbacks carefully, as if it was running user-space code.

In Ben's kernel NFS server case the system may not be idle.

-Andi

2011-06-16 22:59:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > There's a crazy solution for that: the idle thread could process
> > RCU callbacks carefully, as if it was running user-space code.
>
> In Ben's kernel NFS server case the system may not be idle.

An always-100%-busy NFS server is very unlikely, but even in the
hypothetical case a kernel NFS server is really performing system
calls from a kernel thread in essence. If it doesn't do it explicitly
then its main loop can easily include a "check RCU callbacks" call.

Thanks,

Ingo

2011-06-16 23:03:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Frederic Weisbecker <[email protected]> wrote:

> On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > > The funny thing about this workload is that context-switches are
> > > > really a fastpath here and we are using anonymous IRQ-triggered
> > > > softirqs embedded in random task contexts as a workaround for
> > > > that.
> > >
> > > The other thing that the IRQ-triggered softirqs do is to get the
> > > callbacks invoked in cases where a CPU-bound user thread is never
> > > context switching.
> >
> > Yeah - but this workload didnt have that.
> >
> > > Of course, one alternative might be to set_need_resched() to force
> > > entry into the scheduler as needed.
> >
> > No need for that: we can just do the callback not in softirq but in
> > regular syscall context in that case, in the return-to-userspace
> > notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> > facility)
> >
> > Abusing a facility like setting need_resched artificially will
> > generally cause trouble.
>
> If the task enqueued callbacks in the kernel, thus started a new
> grace period, it might return to userspace before every CPUs have
> completed that grace period, and you need that full completion to
> happen before invoking the callbacks.
>
> I think you need to keep the tick in such case because you can't
> count on the other CPUs to handle that completion as they may be
> all idle.
>
> So when you resume to userspace and you started a GP, either you
> find another CPU to handle the GP completion and callbacks
> executions, or you keep the tick until you are done.

We'll have a scheduler tick in any case, which will act as a
worst-case RCU tick.

My main point is that we need to check whether this solution improves
performance over the current softirq code. I think there's a real
chance that it improves things like VFS workloads, because it
provides (much!) lower grace period latencies hence provides
fundamentally better cache locality.

If a workload pays the cost of frequent scheduling then it might as
well use a beneficial side-effect of that scheduling: high-freq grace
periods ...

If it improves performance we can figure out all the loose ends. If
it doesnt then the loose ends are not worth worrying about.

Thanks,

Ingo

2011-06-16 23:37:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > The funny thing about this workload is that context-switches are
> > > really a fastpath here and we are using anonymous IRQ-triggered
> > > softirqs embedded in random task contexts as a workaround for
> > > that.
> >
> > The other thing that the IRQ-triggered softirqs do is to get the
> > callbacks invoked in cases where a CPU-bound user thread is never
> > context switching.
>
> Yeah - but this workload didnt have that.

Ah, understood -- I was thinking about the general case as well as this
particular workload.

> > Of course, one alternative might be to set_need_resched() to force
> > entry into the scheduler as needed.
>
> No need for that: we can just do the callback not in softirq but in
> regular syscall context in that case, in the return-to-userspace
> notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> facility)

Good point, I should add this -- there was a similar RCU hook into
the trap and syscall paths in DYNIX/ptx. At first glance, this looks
x86-specific -- or am I missing something?

> Abusing a facility like setting need_resched artificially will
> generally cause trouble.

Yep, from an RCU viewpoint, it would be best to use it only to force a
quiescent state.

> > > [ I think we'll have to revisit this issue and do it properly:
> > > quiescent state is mostly defined by context-switches here, so we
> > > could do the RCU callbacks from the task that turns a CPU
> > > quiescent, right in the scheduler context-switch path - perhaps
> > > with an option for SCHED_FIFO tasks to *not* do GC.
> >
> > I considered this approach for TINY_RCU, but dropped it in favor of
> > reducing the interlocking between the scheduler and RCU callbacks.
> > Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU
> > callback invocation, then there will need to be some override for
> > CPUs with lots of SCHED_FIFO load, probably similar to RCU's
> > current blimit stuff.
>
> I wouldnt complicate it much for SCHED_FIFO: SCHED_FIFO tasks are
> special and should never run long.

Agreed, but if someone creates a badly behaved SCHED_FIFO task, RCU
still needs to make forward progress. Otherwise, a user task can OOM
the system.

> > > That could possibly be more cache-efficient than softirq execution,
> > > as we'll process a still-hot pool of callbacks instead of doing
> > > them only once per timer tick. It will also make the RCU GC
> > > behavior HZ independent. ]
> >
> > Well, the callbacks will normally be cache-cold in any case due to
> > the grace-period delay, [...]
>
> The workloads that are the most critical in this regard tend to be
> context switch intense, so the grace period expiry latency should be
> pretty short.

And TINY_RCU does in fact do core RCU processing at context-switch
time from within rcu_preempt_note_context_switch(), which is called
from rcu_note_context_switch(). I never tried that in TREE_RCU because
of the heavier weight of the processing. But it is easy to test --
six-line change, see patch below. (Untested, probably doesn't compile.)

> Or at least significantly shorter than today's HZ frequency, right?
> HZ would still provide an upper bound for the latency.

Yes, this should shorten the average grace-period length, at some cost
in overhead. (Probably a -lot- less than the RT kthread change, but hey!)

> Btw., the current worst-case grace period latency is in reality more
> like two timer ticks: one for the current CPU to expire and another
> for the longest "other CPU" expiry, right? Average expiry (for
> IRQ-poor workloads) would be 1.5 timer ticks. (if i got my stat
> calculations right!)

Yep, typical best-case grace period latency is indeed less than two ticks,
but it depends on exactly what you are measuring. It costs another
tick on the average to get the callbacks invoked (assuming no backlog),
but you can save a tick for back-to-back grace periods.

But this assumes CONFIG_NO_HZ=n with current RCU -- it takes about six
scheduler-clock ticks to detect dyntick-idle CPUs. Certain CPU-hotplug
races can leave RCU confused about whether or not a given CPU is part
of the current grace period, in which case RCU also takes about six
scheduler-clock ticks to get itself unconfused. If a process spends too
much time in the kernel without scheduling, RCU will send it a resched
IPI after about six scheduler-clock ticks. And of course a very
long RCU read-side critical section will extend the grace period, as
it absolutely must.

> > [...] but on the other hand, both tick-independence and the ability
> > to shield a given CPU from RCU callback execution might be quite
> > useful. [...]
>
> Yeah.
>
> > [...] The tick currently does the following for RCU:
> >
> > 1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
> > quiescent state).
> >
> > 2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
> > rcu_bh quiescent state).
> >
> > 3. Kicks the current CPU's RCU core processing as needed in
> > response to actions from other CPUs.
> >
> > Frederic's work avoiding ticks in long-running user-mode tasks
> > might take care of #1, and it should be possible to make use of the
> > current dyntick-idle APIs to deal with #2. Replacing #3
> > efficiently will take some thought.
>
> What is the longest delay the scheduler tick can take typically - 40
> msecs? That would then be the worst-case grace period latency for
> workloads that neither do context switches nor trigger IRQs, right?

40 milliseconds would be 25Hz. I haven't run that myself, though.
The lowest-HZ systems I use regularly are 250HZ, for 4-millisecond
scheduler-tick period. So that gets you a grace-period latency from
6 to about 20 milliseconds, depending on what is happening and exactly
what you are measuring.

> > > In any case the proxy kthread model clearly sucked, no argument
> > > about that.
> >
> > Indeed, I lost track of the global nature of real-time scheduling.
> > :-(
>
> Btw., i think that test was pretty bad: running exim as SCHED_FIFO??
>
> But it does not excuse the kthread model.

"It seemed like a good idea at the time." ;-)

> > Whatever does the boosting will need to have process context and
> > can be subject to delays, so that pretty much needs to be a
> > kthread. But it will context-switch quite rarely, so should not be
> > a problem.
>
> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.

The RCU-bh variant requires some other mechanism, as it must to work
even if some crazy piece of network infrastructure is hit so hard by a
DoS attack that it never executes in user mode at all.

> And user-return-notifiers can be triggered from IRQs as well.
>
> That allows us to get rid of softirqs altogether and maybe even speed
> the whole thing up and allow it to be isolated better.

I am a bit concerned about how this would work with extreme workloads.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 405a5fd..f58ce53 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -159,8 +159,14 @@ void rcu_bh_qs(int cpu)
  */
 void rcu_note_context_switch(int cpu)
 {
+	unsigned long flags;
+
 	rcu_sched_qs(cpu);
 	rcu_preempt_note_context_switch(cpu);
+	local_irq_save(flags);
+	if (rcu_pending(cpu))
+		invoke_rcu_core();
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);

2011-06-17 00:24:09

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> The fact is, glibc is just total crap.
>
> I tried to send uli a patch to just add caching. No go. I sent
> *another* patch to at least make glibc use a sane interface (and the
> cache if it needs to fall back on /proc/stat for some legacy reason).
> We'll see what happens.

FWIW a rerun with this modified LD_PRELOAD that does caching seems
to have the same performance as the version that does sched_getaffinity.

So you're right. Caching indeed helps and my assumption that the child
would only do it once was incorrect.

The only problem I see with it is that it doesn't handle CPU hotplug,
but Paul's suggestion would fix that too.

> Paul Eggbert suggested "caching for one second" - by just calling
> "gettimtofday()" to see how old the cache is. That would work too.
>

Maybe we need a "standard LD_PRELOAD library to improve glibc" @)

-Andi


Attachments:
sysconf-caching.c (471.00 B)
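
The attached sysconf-caching.c is not reproduced here; as a rough illustration of the caching idea combined with the "refresh about once per second" suggestion, a sketch (assumptions only, it may well differ from the attachment):

/* Sketch only, not the attached sysconf-caching.c: cache the CPU count
 * and refresh it at most about once per second, so hotplug events are
 * eventually noticed without hitting /proc/stat on every sysconf()
 * call. Not thread-safe; just enough to illustrate the idea. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/time.h>
#include <unistd.h>

long sysconf(int name)
{
	static long (*real_sysconf)(int);
	static long cached_ncpus;
	static time_t cached_at;

	if (!real_sysconf)
		real_sysconf = dlsym(RTLD_NEXT, "sysconf");

	if (name == _SC_NPROCESSORS_ONLN) {
		struct timeval now;

		gettimeofday(&now, NULL);
		if (!cached_ncpus || now.tv_sec - cached_at >= 1) {
			/* Slow path (glibc's /proc/stat parse) runs at
			 * most about once per second. */
			cached_ncpus = real_sysconf(name);
			cached_at = now.tv_sec;
		}
		return cached_ncpus;
	}
	return real_sysconf(name);
}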

2011-06-17 00:45:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > There's a crazy solution for that: the idle thread could process
> > > RCU callbacks carefully, as if it was running user-space code.
> >
> > In Ben's kernel NFS server case the system may not be idle.
>
> An always-100%-busy NFS server is very unlikely, but even in the
> hypothetical case a kernel NFS server is really performing system
> calls from a kernel thread in essence. If it doesn't do it explicitly
> then its main loop can easily include a "check RCU callbacks" call.

As long as they make sure to call it in a clean environment: no locks
held and so on. But I am a bit worried about the possibility of someone
forgetting to put one of these where it is needed -- it would work just
fine for most workloads, but could fail only for rare workloads.

That said, invoking RCU core/callback processing from the scheduler
context certainly sounds like an interesting way to speed up grace
periods.

Thanx, Paul

2011-06-17 04:05:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:26 PM, Linus Torvalds
<[email protected]> wrote:
>
> So the unlink_anon_vmas() case is actually much more complicated than
> the clone case.
>
> In other words, just forget that second patch. I'll have to think about it.

Ok, I'm still thinking. I have an approach that I think will handle it
fairly cleanly, but that involves walking the same_vma list twice:
once to actually unlink the anon_vma's under the lock, and then a
second pass that does the rest. It should work.

But in the meantime I cleaned up the patch I already sent out a bit,
because the lock/unlock sequence will be the same, so I abstracted it
out a bit and added a couple of comments.

So Tim, I'd like you to test out my first patch (that only does the
anon_vma_clone() case) once again, but now in the cleaned-up version.
Does this patch really make a big improvement for you? If so, this
first step is probably worth doing regardless of the more complicated
second step, but I'd want to really make sure it's ok, and that the
performance improvement you saw is consistent and not a fluke.

Linus


Attachments:
patch.diff (4.02 kB)

2011-06-17 09:14:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > I tried to send uli a patch to just add caching. No go. I sent
> > *another* patch to at least make glibc use a sane interface (and
> > the cache if it needs to fall back on /proc/stat for some legacy
> > reason). We'll see what happens.
>
> FWIW a rerun with this modified LD_PRELOAD that does caching seems
> to have the same performance as the version that does
> sched_getaffinity.
>
> So you're right. Caching indeed helps and my assumption that the
> child would only do it once was incorrect.

You should have known that your assumption was wrong not just from a
quick look at the strace output or a quick look at the glibc sources,
but also because i pointed out the caching angle to you in the
sysconf() discussion:

http://lkml.org/lkml/2011/5/14/9

repeatedly:

http://lkml.org/lkml/2011/5/17/149

and Denys Vlasenko pointed out the caching angle as well:

http://lkml.org/lkml/2011/5/17/183

But you kept pushing for your new syscall for upstream integration,
ignoring all contrary evidence and ignoring all contrary feedback,
without even *once* checking where and how it would integrate into
glibc ...

Thanks,

Ingo

2011-06-17 09:44:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Paul E. McKenney <[email protected]> wrote:

> On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
> >
> > * Andi Kleen <[email protected]> wrote:
> >
> > > > There's a crazy solution for that: the idle thread could process
> > > > RCU callbacks carefully, as if it was running user-space code.
> > >
> > > In Ben's kernel NFS server case the system may not be idle.
> >
> > An always-100%-busy NFS server is very unlikely, but even in the
> > hypothetical case a kernel NFS server is really performing system
> > calls from a kernel thread in essence. If it doesn't do it explicitly
> > then its main loop can easily include a "check RCU callbacks" call.
>
> As long as they make sure to call it in a clean environment: no
> locks held and so on. But I am a bit worried about the possibility
> of someone forgetting to put one of these where it is needed -- it
> would work just fine for most workloads, but could fail only for
> rare workloads.

Yeah, some sort of worst-case-tick mechanism would guarantee that we
won't remain without RCU GC.

> That said, invoking RCU core/callback processing from the scheduler
> context certainly sounds like an interesting way to speed up grace
> periods.

It also moves whatever priority logic is needed closer to the
scheduler that has to touch those data structures anyway.

RCU, at least partially, is a scheduler driven garbage collector even
today: beyond context switch quiescent states the main practical role
of the per CPU timer tick itself is scheduling. So having it close to
when we do context-switches anyway looks pretty natural - worth
trying.

It might not work out in practice, but at first sight it would
simplify a few things i think.

Thanks,

Ingo

2011-06-17 11:29:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 20:58 -0700, Linus Torvalds wrote:
> Ok, I'm still thinking. I have an approach that I think will handle it
> fairly cleanly, but that involves walking the same_vma list twice:
> once to actually unlink the anon_vma's under the lock, and then a
> second pass that does the rest. It should work.

Something like so? Compiles and runs the benchmark in question.

---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -324,36 +324,52 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
+static int anon_vma_unlink(struct anon_vma_chain *anon_vma_chain, struct anon_vma *anon_vma)
{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
list_del(&anon_vma_chain->same_anon_vma);

/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
+ if (list_empty(&anon_vma->head))
+ return 1;

- if (empty)
- put_anon_vma(anon_vma);
+ return 0;
}

void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (anon_vma_unlink(avc, anon_vma))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to anon_vma_free()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 11:55:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 13:28 +0200, Peter Zijlstra wrote:
> +static int anon_vma_unlink(struct anon_vma_chain *anon_vma_chain,
> struct anon_vma *anon_vma)
> {
> list_del(&anon_vma_chain->same_anon_vma);
>
> /* We must garbage collect the anon_vma if it's empty */
> + if (list_empty(&anon_vma->head))
> + return 1;
>
> + return 0;
> }

Is of course a little pathetic, so let's kill it.

---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -324,36 +324,42 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
-{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
- list_del(&anon_vma_chain->same_anon_vma);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
-
- if (empty)
- put_anon_vma(anon_vma);
-}
-
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&avc->same_anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (list_empty(&anon_vma->head))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 15:19:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 01:02:47AM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
> > >
> > > * Paul E. McKenney <[email protected]> wrote:
> > >
> > > > > The funny thing about this workload is that context-switches are
> > > > > really a fastpath here and we are using anonymous IRQ-triggered
> > > > > softirqs embedded in random task contexts as a workaround for
> > > > > that.
> > > >
> > > > The other thing that the IRQ-triggered softirqs do is to get the
> > > > callbacks invoked in cases where a CPU-bound user thread is never
> > > > context switching.
> > >
> > > Yeah - but this workload didnt have that.
> > >
> > > > Of course, one alternative might be to set_need_resched() to force
> > > > entry into the scheduler as needed.
> > >
> > > No need for that: we can just do the callback not in softirq but in
> > > regular syscall context in that case, in the return-to-userspace
> > > notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> > > facility)
> > >
> > > Abusing a facility like setting need_resched artificially will
> > > generally cause trouble.
> >
> > If the task enqueued callbacks in the kernel, thus started a new
> > grace period, it might return to userspace before every CPUs have
> > completed that grace period, and you need that full completion to
> > happen before invoking the callbacks.
> >
> > I think you need to keep the tick in such case because you can't
> > count on the other CPUs to handle that completion as they may be
> > all idle.
> >
> > So when you resume to userspace and you started a GP, either you
> > find another CPU to handle the GP completion and callbacks
> > executions, or you keep the tick until you are done.
>
> We'll have a scheduler tick in any case, which will act as a
> worst-case RCU tick.
>
> My main point is that we need to check whether this solution improves
> performance over the current softirq code. I think there's a real
> chance that it improves things like VFS workloads, because it
> provides (much!) lower grace period latencies hence provides
> fundamentally better cache locality.
>
> If a workload pays the cost of frequent scheduling then it might as
> well use a beneficial side-effect of that scheduling: high-freq grace
> periods ...
>
> If it improves performance we can figure out all the loose ends. If
> it doesnt then the loose ends are not worth worrying about.

Yeah I see your point, seems worth trying.

2011-06-17 16:49:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> Something like so? Compiles and runs the benchmark in question.

Yup.

Except I really think that test for a NULL anon_vma should go away.

If an avc entry has a NULL anon_vma, something is seriously wrong. The
comment about anon_vma_fork failure is definitely just bogus: the
anon_vma is allocated before the avc entry, so there's no way an avc
can have a NULL anon_vma from there.

But yes, your patch is cleaner than the one I was playing around with
(your "remove if not list empty" is prettier than what I was toying
with - having a separate flag in the avc)

Tim, can you test Peter's (second - the cleaned up one) patch on top
of mine, and see if that helps things further?

The only thing I don't love about the batching is that we now do hold
the lock over some situations where we _could_ have allowed
concurrency (notably some avc allocations), but I think it's a good
trade-off. And walking the list twice at unlink_anon_vmas() should be
basically free.

Linus

2011-06-17 16:49:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> Something like so? Compiles and runs the benchmark in question.

Oh, and can you do this with a commit log and sign-off, and I'll put
it in my "anon_vma-locking" branch that I have. I'm not going to
actually merge that branch into mainline until I've seen a few more
acks or more testing by Tim.

But if Tim's numbers hold up (-32% to +15% performance by just the
first one, and +15% isn't actually an improvement since tmpfs
read-ahead should have gotten us to +66%), I think we have to do this
just to avoid the performance regression.

Linus

2011-06-17 16:48:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:43:33AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
> > >
> > > * Andi Kleen <[email protected]> wrote:
> > >
> > > > > There's a crazy solution for that: the idle thread could process
> > > > > RCU callbacks carefully, as if it was running user-space code.
> > > >
> > > > In Ben's kernel NFS server case the system may not be idle.
> > >
> > > An always-100%-busy NFS server is very unlikely, but even in the
> > > hypothetical case a kernel NFS server is really performing system
> > > calls from a kernel thread in essence. If it doesn't do it explicitly
> > > then its main loop can easily include a "check RCU callbacks" call.
> >
> > As long as they make sure to call it in a clean environment: no
> > locks held and so on. But I am a bit worried about the possibility
> > of someone forgetting to put one of these where it is needed -- it
> > would work just fine for most workloads, but could fail only for
> > rare workloads.
>
> Yeah, some sort of worst-case-tick mechanism would guarantee that we
> won't remain without RCU GC.

Agreed!

> > That said, invoking RCU core/callback processing from the scheduler
> > context certainly sounds like an interesting way to speed up grace
> > periods.
>
> It also moves whatever priority logic is needed closer to the
> scheduler that has to touch those data structures anyway.
>
> RCU, at least partially, is a scheduler driven garbage collector even
> today: beyond context switch quiescent states the main practical role
> of the per CPU timer tick itself is scheduling. So having it close to
> when we do context-switches anyway looks pretty natural - worth
> trying.
>
> It might not work out in practice, but at first sight it would
> simplify a few things i think.

OK, please see below for a patch that not only builds, but actually
passes minimal testing. ;-)

Possible expectations and outcomes:

1. Reduced grace-period latencies on !NO_HZ systems and
on NO_HZ systems where each CPU goes non-idle frequently.
(My unscientific testing shows little or no benefit, but
then again, I was running rcutorture, which specializes
in long read-side critical sections. And I was running
NO_HZ, which is less likely to show benefit.)

I would not expect direct call to have any benefit over
softirq invocation -- sub-microsecond softirq overhead
won't matter to multi-millisecond grace-period latencies.

2. Better cache locality. I am a bit skeptical, but there is
some chance that this might reduce cache misses on task_struct.
Direct call might well do better than softirq here.

3. Easier conversion to user-space operation. I was figuring
on using POSIX signals for scheduler_tick() and for softirq,
so wasn't worried about it, but might well be simpler.

Anything else?

On eliminating softirq, I must admit that I am a bit worried about
invoking the RCU core code from scheduler_tick(), but there are some
changes that I could make that would reduce force_quiescent_state()
worst-case latency. At the expense of lengthening grace periods,
unfortunately, but the reduction will be needed for RT on larger systems
anyway, so might as well try it.

Thanx, Paul

------------------------------------------------------------------------

rcu: Experimental change driving RCU core from scheduler

This change causes RCU's context-switch code to check to see if the RCU
core needs anything from the current CPU, and, if so, invoke the RCU
core code. One possible improvement from this experiment is reduced
grace-period latency on systems that either have NO_HZ=n or that have
all CPUs frequently going non-idle. The invocation of the RCU core code
is currently via softirq, though a plausible next step in the experiment
might be to instead use a direct function call.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 405a5fd..6b7a43a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -87,6 +87,10 @@ static struct rcu_state *rcu_state;
int rcu_scheduler_active __read_mostly;
EXPORT_SYMBOL_GPL(rcu_scheduler_active);

+static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
+static int rcu_pending(int cpu);
+static void rcu_process_callbacks(struct softirq_action *unused);
+
#ifdef CONFIG_RCU_BOOST

/*
@@ -159,8 +163,14 @@ void rcu_bh_qs(int cpu)
*/
void rcu_note_context_switch(int cpu)
{
+ unsigned long flags;
+
rcu_sched_qs(cpu);
rcu_preempt_note_context_switch(cpu);
+ local_irq_save(flags);
+ if (rcu_pending(cpu))
+ invoke_rcu_core();
+ local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

@@ -182,9 +192,6 @@ module_param(qlowmark, int, 0);
int rcu_cpu_stall_suppress __read_mostly;
module_param(rcu_cpu_stall_suppress, int, 0644);

-static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
-
/*
* Return the number of RCU-sched batches processed thus far for debug & stats.
*/

2011-06-17 17:30:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 9:46 AM, Linus Torvalds
<[email protected]> wrote:
>
> Oh, and can you do this with a commit log and sign-off, and I'll put
> it in my "anon_vma-locking" branch that I have. I'm not going to
> actually merge that branch into mainline until I've seen a few more
> acks or more testing by Tim.

Attached is the tentative commit I have, which is yours but with the
tests for anon_vma being NULL removed, and a made-up commit log. It
works for me, but needs more testing and eyeballs looking at it.

Tim? This is on top of my previous patch, replacing Peter's two patches.

Linus


Attachments:
patch.diff (2.84 kB)

2011-06-17 17:41:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 17 Jun 2011, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Something like so? Compiles and runs the benchmark in question.
>
> Yup.
>
> Except I really think that test for a NULL anon_vma should go away.
>
> If an avc entry has a NULL anon_vma, something is seriously wrong. The
> comment about anon_vma_fork failure is definitely just bogus: the
> anon_vma is allocated before the avc entry, so there's no way an avc
> can have a NULL anon_vma from there.
>
> But yes, your patch is cleaner than the one I was playing around with
> (your "remove if not list empty" is prettier than what I was toying
> with - having a separate flag in the avc)
>
> Tim, can you test Peter's (second - the cleaned up one) patch on top
> of mine, and see if that helps things further?
>
> The only thing I don't love about the batching is that we now do hold
> the lock over some situations where we _could_ have allowed
> concurrency (notably some avc allocations), but I think it's a good
> trade-off. And walking the list twice at unlink_anon_vmas() should be
> basically free.

Applying load with those two patches applied (combined patch shown at
the bottom, in case you can tell me I misunderstood what to apply,
and have got the wrong combination on), lockdep very soon protested.

I've not given it _any_ thought, and won't be able to come back to
it for a couple of hours: chucked over the wall for your delectation.

Hugh

[ 65.981291] =================================
[ 65.981354] [ INFO: inconsistent lock state ]
[ 65.981393] 3.0.0-rc3 #2
[ 65.981418] ---------------------------------
[ 65.981456] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 65.981513] cp/1335 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 65.981556] (&anon_vma->mutex){+.+.?.}, at: [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.981644] {RECLAIM_FS-ON-W} state was registered at:
[ 65.981688] [<7817954f>] mark_held_locks+0x46/0x67
[ 65.981738] [<78179a1c>] lockdep_trace_alloc+0x7d/0x96
[ 65.981791] [<781c8572>] kmem_cache_alloc+0x21/0xe1
[ 65.981842] [<781baae9>] anon_vma_clone+0x38/0x124
[ 65.981892] [<781babf7>] anon_vma_fork+0x22/0xf3
[ 65.981940] [<7814f971>] dup_mmap+0x1b7/0x302
[ 65.981986] [<78150063>] dup_mm+0xa5/0x150
[ 65.982030] [<7815075c>] copy_process+0x62e/0xbeb
[ 65.982079] [<78150e1d>] do_fork+0xd5/0x1fc
[ 65.982123] [<7812ca7c>] sys_clone+0x1c/0x21
[ 65.982169] [<78500c31>] ptregs_clone+0x15/0x24
[ 65.982218] irq event stamp: 4625633
[ 65.982251] hardirqs last enabled at (4625633): [<784fe18a>] mutex_trylock+0xe7/0x118
[ 65.982323] hardirqs last disabled at (4625632): [<784fe0f0>] mutex_trylock+0x4d/0x118
[ 65.982394] softirqs last enabled at (4624962): [<781568a6>] __do_softirq+0xf5/0x104
[ 65.982467] softirqs last disabled at (4624835): [<78128007>] do_softirq+0x56/0xa7
[ 65.982537]
[ 65.982538] other info that might help us debug this:
[ 65.982595] Possible unsafe locking scenario:
[ 65.982596]
[ 65.982649] CPU0
[ 65.982672] ----
[ 65.982696] lock(&anon_vma->mutex);
[ 65.982738] <Interrupt>
[ 65.982762] lock(&anon_vma->mutex);
[ 65.982805]
[ 65.982806] *** DEADLOCK ***
[ 65.982807]
[ 65.982864] no locks held by cp/1335.
[ 65.982896]
[ 65.982897] stack backtrace:
[ 65.982939] Pid: 1335, comm: cp Not tainted 3.0.0-rc3 #2
[ 65.984010] Call Trace:
[ 65.984010] [<784fd0d6>] ? printk+0xf/0x11
[ 65.984010] [<78177ef2>] print_usage_bug+0x152/0x15f
[ 65.984010] [<78177fa0>] mark_lock_irq+0xa1/0x1e9
[ 65.984010] [<78176c8d>] ? print_irq_inversion_bug+0x16e/0x16e
[ 65.984010] [<781782f3>] mark_lock+0x20b/0x2d9
[ 65.984010] [<781784bf>] mark_irqflags+0xfe/0x115
[ 65.984010] [<781789cb>] __lock_acquire+0x4f5/0x6ba
[ 65.984010] [<78178f72>] lock_acquire+0x4a/0x60
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<784feaf4>] mutex_lock_nested+0x45/0x297
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba48f>] ? page_lock_anon_vma+0xb2/0x130
[ 65.984010] [<781ba6c0>] page_referenced_anon+0x12/0x189
[ 65.984010] [<781ba8ba>] page_referenced+0x83/0xaf
[ 65.984010] [<781a8301>] shrink_active_list+0x186/0x240
[ 65.984010] [<781a8e1f>] shrink_zone+0x158/0x1ce
[ 65.984010] [<781a949b>] shrink_zones+0x94/0xe4
[ 65.984010] [<781a9545>] do_try_to_free_pages+0x5a/0x1db
[ 65.984010] [<781a301d>] ? get_page_from_freelist+0x2c4/0x2e1
[ 65.984010] [<781a9865>] try_to_free_pages+0x6c/0x73
[ 65.984010] [<781a3526>] __alloc_pages_nodemask+0x3aa/0x563
[ 65.984010] [<781a4d03>] __do_page_cache_readahead+0xee/0x1cd
[ 65.984010] [<781a4fa4>] ra_submit+0x19/0x1b
[ 65.984010] [<781a51b4>] ondemand_readahead+0x20e/0x219
[ 65.984010] [<781a525b>] page_cache_sync_readahead+0x3e/0x4b
[ 65.984010] [<7819e0ad>] do_generic_file_read.clone.0+0xd1/0x420
[ 65.984010] [<78178c14>] ? lock_release_non_nested+0x84/0x243
[ 65.984010] [<7819ef25>] generic_file_aio_read+0x1c0/0x1f4
[ 65.984010] [<78178ed7>] ? __lock_release+0x104/0x10f
[ 65.984010] [<781b16d0>] ? might_fault+0x45/0x84
[ 65.984010] [<781d3650>] do_sync_read+0x91/0xc5
[ 65.984010] [<781d71a8>] ? cp_new_stat64+0xd8/0xed
[ 65.984010] [<781d3cac>] vfs_read+0x8d/0xf5
[ 65.984010] [<781d3d50>] sys_read+0x3c/0x63
[ 65.984010] [<78500b50>] sysenter_do_call+0x12/0x36

From this combined patch applied to 3.0-rc3:

--- 3.0-rc3/mm/rmap.c 2011-05-29 18:42:37.465882779 -0700
+++ linux/mm/rmap.c 2011-06-17 10:19:10.592857382 -0700
@@ -200,6 +200,32 @@ int anon_vma_prepare(struct vm_area_stru
return -ENOMEM;
}

+/*
+ * This is a useful helper function for locking the anon_vma root as
+ * we traverse the vma->anon_vma_chain, looping over anon_vma's that
+ * have the same vma.
+ *
+ * Such anon_vma's should have the same root, so you'd expect to see
+ * just a single mutex_lock for the whole traversal.
+ */
+static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct anon_vma *anon_vma)
+{
+ struct anon_vma *new_root = anon_vma->root;
+ if (new_root != root) {
+ if (WARN_ON_ONCE(root))
+ mutex_unlock(&root->mutex);
+ root = new_root;
+ mutex_lock(&root->mutex);
+ }
+ return root;
+}
+
+static inline void unlock_anon_vma_root(struct anon_vma *root)
+{
+ if (root)
+ mutex_unlock(&root->mutex);
+}
+
static void anon_vma_chain_link(struct vm_area_struct *vma,
struct anon_vma_chain *avc,
struct anon_vma *anon_vma)
@@ -208,13 +234,11 @@ static void anon_vma_chain_link(struct v
avc->anon_vma = anon_vma;
list_add(&avc->same_vma, &vma->anon_vma_chain);

- anon_vma_lock(anon_vma);
/*
* It's critical to add new vmas to the tail of the anon_vma,
* see comment in huge_memory.c:__split_huge_page().
*/
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
- anon_vma_unlock(anon_vma);
}

/*
@@ -224,16 +248,23 @@ static void anon_vma_chain_link(struct v
int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;
+ struct anon_vma *root = NULL;

list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma;
+
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;
- anon_vma_chain_link(dst, avc, pavc->anon_vma);
+ anon_vma = pavc->anon_vma;
+ root = lock_anon_vma_root(root, anon_vma);
+ anon_vma_chain_link(dst, avc, anon_vma);
}
+ unlock_anon_vma_root(root);
return 0;

enomem_failure:
+ unlock_anon_vma_root(root);
unlink_anon_vmas(dst);
return -ENOMEM;
}
@@ -280,7 +311,9 @@ int anon_vma_fork(struct vm_area_struct
get_anon_vma(anon_vma->root);
/* Mark this anon_vma as the one where our new (COWed) pages go. */
vma->anon_vma = anon_vma;
+ anon_vma_lock(anon_vma);
anon_vma_chain_link(vma, avc, anon_vma);
+ anon_vma_unlock(anon_vma);

return 0;

@@ -291,36 +324,42 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
-{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
- list_del(&anon_vma_chain->same_anon_vma);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
-
- if (empty)
- put_anon_vma(anon_vma);
-}
-
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&avc->same_anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (list_empty(&anon_vma->head))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 17:58:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 10:41 -0700, Hugh Dickins wrote:
> > The only thing I don't love about the batching is that we now do hold
> > the lock over some situations where we _could_ have allowed
> > concurrency (notably some avc allocations), but I think it's a good
> > trade-off. And walking the list twice at unlink_anon_vmas() should be
> > basically free.
>
> Applying load with those two patches applied (combined patch shown at
> the bottom, in case you can tell me I misunderstood what to apply,
> and have got the wrong combination on), lockdep very soon protested.

Gah, of course. It's exactly the case Linus mentioned not loving. We can
get reclaim recursion due to the avc allocation: we hold anon_vma->mutex
while doing a (blocking) allocation, and reclaim can end up trying to
obtain said lock while trying to free some memory.

Bugger. /me goes investigate

2011-06-17 18:02:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 10:41 AM, Hugh Dickins <[email protected]> wrote:
>
> Applying load with those two patches applied (combined patch shown at
> the bottom, in case you can tell me I misunderstood what to apply,
> and have got the wrong combination on), lockdep very soon protested.

Combined patch looks good, it's just the one without the NULL ptr tests removed.

And yup, that makes sense. Since we now hold the anon_vma lock over an
allocation, the memory allocation might want to start to free things.

> I've not given it _any_ thought, and won't be able to come back to
> it for a couple of hours: chucked over the wall for your delectation.

It's a mis-feature of "page_referenced()": we can call it without
holding any page locks ("is_locked=0"), and that function will then do
a try_lock() on the page, and just consider it referenced if it
failed.

HOWEVER, the code to then do the same thing for the anon_vma lock
doesn't do the same trylock model, because it used to be a spinlock:
so there was "no need" (since you can never do a non-atomic allocation
from within a spinlock).

So this is arguably a bug in memory reclaim, but since the
spinlock->mutex conversion had been pretty mindless, nobody noticed
until the mutex region grew.

So I do think that "page_referenced_anon()" should do a trylock, and
return "referenced" if the trylock fails. Comments?

That said, we have a few other mutexes that are just not allowed to be
held over an allocation. page_referenced_file() has that
mapping->i_mmap_mutex lock, for example. So maybe the rule just has to
be "you cannot hold anon_vma lock over an allocation". Which would be
sad: one of the whole _points_ of turning it from a spinlock to a
mutex would be that it relaxes the locking rules a lot (and not just
the preemptibility)

Linus

2011-06-17 18:14:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 11:01 -0700, Linus Torvalds wrote:

> So I do think that "page_referenced_anon()" should do a trylock, and
> return "referenced" if the trylock fails. Comments?

The only problem I can immediately see with that is when a single
process' anon memory is dominant, then such an allocation will never
succeed in freeing these pages because the one lock will pin pretty much
all anon. Then again, there's always a few file pages to drop.

That said, it's rather unlikely, and iirc people were working on removing
direct reclaim, or at least relying less on it.


2011-06-17 18:22:18

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 20:58 -0700, Linus Torvalds wrote:

>
> So Tim, I'd like you to test out my first patch (that only does the
> anon_vma_clone() case) once again, but now in the cleaned-up version.
> Does this patch really make a big improvement for you? If so, this
> first step is probably worth doing regardless of the more complicated
> second step, but I'd want to really make sure it's ok, and that the
> performance improvement you saw is consistent and not a fluke.
>
> Linus

Linus,

For this patch, I've run it 10 times and got an average throughput of
104.9% compared with 2.6.39 vanilla baseline. Wide variations are seen
run to run and the difference between max and min throughput is 52% of
average value.

So to recap,

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus 115.7% (+15.7%)
3.0-rc2+linus+softirq 86.2% (-17.3%)
3.0-rc2+linus (v2) 104.9% (+4.9%)

The time spent in the anon_vma mutex seems to directly affect
throughput.

In one run on your patch, I got a low throughput of 90.1% vs 2.6.39
throughput. The mutex_lock occupied 15.6% of cpu.

In another run, I got a high throughput of 120.8% vs 2.6.39 throughput.
The mutex lock occupied 7.5% of cpu.

I've attached the profiles of the two runs and a 3.0-rc2 vanilla run for
your reference.

I will follow up later with numbers that have Peter's patch added.

Thanks.

Tim

----------Profiles Below-------------------------

3.0-rc2+linus(v2) run 1 (90.1% throughput vs 2.6.39)

- 15.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
+ 75.52% anon_vma_lock.clone.10
+ 23.88% anon_vma_clone
- 4.38% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 82.83% cpupri_set
+ 6.75% try_to_wake_up
+ 5.35% release_pages
+ 1.72% pagevec_lru_move_fn
+ 0.93% get_page_from_freelist
+ 0.51% lock_timer_base.clone.20
+ 3.22% exim [kernel.kallsyms] [k] page_fault
+ 2.62% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.30% exim [kernel.kallsyms] [k] mutex_unlock
+ 2.02% exim [kernel.kallsyms] [k] unmap_vmas


3.0-rc2_linus(v2) run 2 (120.8% throughput vs 2.6.39)

- 7.53% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
+ 75.99% anon_vma_lock.clone.10
+ 22.68% anon_vma_clone
+ 0.70% unlink_file_vma
- 4.15% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 83.37% cpupri_set
+ 7.06% release_pages
+ 2.74% pagevec_lru_move_fn
+ 2.18% try_to_wake_up
+ 0.99% get_page_from_freelist
+ 0.59% lock_timer_base.clone.20
+ 0.58% lock_hrtimer_base.clone.16
+ 4.06% exim [kernel.kallsyms] [k] page_fault
+ 2.33% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.22% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.05% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 1.98% exim [kernel.kallsyms] [k] mutex_unlock


3.0-rc2 vanilla run

- 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
- 99.54% anon_vma_lock.clone.10
+ 38.99% anon_vma_clone
+ 37.56% unlink_anon_vmas
+ 11.92% anon_vma_fork
+ 11.53% anon_vma_free
- 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 87.25% cpupri_set
+ 4.75% release_pages
+ 3.68% try_to_wake_up
+ 1.17% pagevec_lru_move_fn
+ 0.71% get_page_from_freelist
+ 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.90% exim [kernel.kallsyms] [k] page_fault
+ 2.25% exim [kernel.kallsyms] [k] mutex_unlock
+ 1.82% exim [kernel.kallsyms] [k] unmap_vmas
+ 1.62% exim [kernel.kallsyms] [k] copy_page_c

2011-06-17 18:28:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 20:18 +0200, Peter Zijlstra wrote:
> On Fri, 2011-06-17 at 11:01 -0700, Linus Torvalds wrote:
>
> > So I do think that "page_referenced_anon()" should do a trylock, and
> > return "referenced" if the trylock fails. Comments?
>
> The only problem I can immediately see with that is when a single
> process' anon memory is dominant, then such an allocation will never
> succeed in freeing these pages because the one lock will pin pretty much
> all anon. Then again, there's always a few file pages to drop.
>
> That said, it's rather unlikely, and iirc people were working on removing
> direct reclaim, or at least relying less on it.
>
>

something like so I guess, completely untested etc..

Also, there's a page_lock_anon_vma() user in split_huge_page(), which is
used in mm/swap_state.c:add_to_swap(), which is also in the reclaim
path; not quite sure what to do there.

Aside from the THP thing there's a user in memory-failure.c, which looks
to be broken as it is because it's calling things under tasklist_lock
which isn't preemptible, but it looks like we can simply swap the
tasklist_lock vs page_lock_anon_vma.

---
kernel/Makefile | 1 +
mm/rmap.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 2d64cfc..f6d05de 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
+obj-m += test.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/mm/rmap.c b/mm/rmap.c
index 0eb463e..40cd399 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -400,6 +400,41 @@ out:
return anon_vma;
}

+struct anon_vma *page_trylock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma = NULL;
+ struct anon_vma *root_anon_vma;
+ unsigned long anon_mapping;
+
+ rcu_read_lock();
+ anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+ if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
+ goto out;
+ if (!page_mapped(page))
+ goto out;
+
+ anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+ root_anon_vma = ACCESS_ONCE(anon_vma->root);
+ if (!mutex_trylock(&root_anon_vma->mutex)) {
+ anon_vma = NULL;
+ goto out;
+ }
+
+ /*
+ * If the page is still mapped, then this anon_vma is still
+ * its anon_vma, and holding the mutex ensures that it will
+ * not go away, see anon_vma_free().
+ */
+ if (!page_mapped(page)) {
+ mutex_unlock(&root_anon_vma->mutex);
+ anon_vma = NULL;
+ }
+
+out:
+ rcu_read_unlock();
+ return anon_vma;
+}
+
/*
* Similar to page_get_anon_vma() except it locks the anon_vma.
*
@@ -694,7 +729,7 @@ static int page_referenced_anon(struct page *page,
struct anon_vma_chain *avc;
int referenced = 0;

- anon_vma = page_lock_anon_vma(page);
+ anon_vma = page_trylock_anon_vma(page);
if (!anon_vma)
return referenced;

@@ -1396,7 +1431,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;

- anon_vma = page_lock_anon_vma(page);
+ anon_vma = page_trylock_anon_vma(page);
if (!anon_vma)
return ret;


2011-06-17 18:39:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:32 AM, Peter Zijlstra <[email protected]> wrote:
>
> something like so I guess, completely untested etc..

Having gone over it a bit more, I actually think I prefer to just
special-case the allocation instead.

We already have to drop the anon_vma lock for the "out of memory"
case, and a slight re-organization of clone_anon_vma() makes it easy
to just first try a NOIO allocation with the lock still held, and then
if that fails do the "drop lock, retry, and hard-fail" case.

IOW, something like the attached (on top of the patches already posted
except for your memory reclaim thing)

Hugh, does this fix the lockdep issue?

Linus


Attachments:
patch.diff (1.96 kB)
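[That attachment isn't inlined here either. A rough sketch of the shape
Linus describes, inside the already-batched anon_vma_clone() loop; the
exact GFP flags and the gfp_t argument to anon_vma_chain_alloc() are
assumptions on my part, not necessarily what the attached diff does:]

	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
		struct anon_vma *anon_vma;

		/* Fast path: non-blocking allocation with the root mutex held. */
		avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
		if (unlikely(!avc)) {
			/*
			 * Slow path: drop the lock, retry with a blocking
			 * allocation, and hard-fail only if that also fails.
			 */
			unlock_anon_vma_root(root);
			root = NULL;
			avc = anon_vma_chain_alloc(GFP_KERNEL);
			if (!avc)
				goto enomem_failure;
		}
		anon_vma = pavc->anon_vma;
		root = lock_anon_vma_root(root, anon_vma);
		anon_vma_chain_link(dst, avc, anon_vma);
	}

[With the lock already dropped on the slow path, the enomem_failure label
no longer needs to call unlock_anon_vma_root() before unwinding with
unlink_anon_vmas().]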

2011-06-17 18:48:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:39 AM, Linus Torvalds
<[email protected]> wrote:
>
> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.

Just to explain my thinking: the thing I disliked most about doing an
allocation while holding the lock wasn't that I thought we would
deadlock on page reclaim. I don't claim that kind of far-sight.

No, the thing I disliked was the lack of concurrency if we're low on
memory and actually have to wait. I'm ok with
holding the mutex over a few more CPU cycles, but anything longer
might actually hurt throughput. So the patch I just sent out should
fix both the page reclaim deadlock, and avoid any problems with delays
due to holding the critical lock over an expensive allocation.

Linus

2011-06-17 19:06:03

by Ray Lee

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:22 AM, Tim Chen <[email protected]> wrote:
> For this patch, I've run it 10 times and got an average throughput of
> 104.9% compared with 2.6.39 vanilla baseline.  Wide variations are seen
> run to run and the difference between max and min throughput is 52% of
> average value.

If the variations are that drastic, it's useful to include the
standard deviation of the mean for each kernel as well. I know it's
only 10 runs, but it's often useful to know if a slow but reliable
process is being replaced by a sometimes faster but more erratic one,
for example.

Just a thought.

~r.
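
[For reference, the mean and sample standard deviation over a set of runs
is a few lines of user-space C; the throughput values below are
placeholders, not Tim's actual per-run numbers:]

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Placeholder throughputs, normalized to the 2.6.39 baseline (%). */
	double runs[] = { 104.9, 90.1, 120.8, 99.5, 101.2,
			  115.3, 97.8, 108.4, 95.6, 116.4 };
	int n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0.0, sq = 0.0, mean, stddev;
	int i;

	for (i = 0; i < n; i++)
		sum += runs[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		sq += (runs[i] - mean) * (runs[i] - mean);
	stddev = sqrt(sq / (n - 1));	/* sample standard deviation */

	printf("mean %.1f%%  stddev %.1f%%\n", mean, stddev);
	return 0;
}

[Build with something like "gcc -O2 stats.c -lm"; reporting the standard
deviation alongside the mean makes it easier to tell a reliably slower
kernel from an erratic one, which is Ray's point.]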

2011-06-17 19:08:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> > However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> > acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> > the 52% throughput regression.
> >
> This patch makes the mutex in Tim's workload take a bit less CPU time
> (4% down) but it doesn't really fix the regression. When spinning for a
> value it's always better to read it first before attempting to write it.
> This saves expensive operations on the interconnect.
>
> So it's not really a fix for this, but may be a slight improvement for
> other workloads.
>
> -Andi
>
> From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
> From: Andi Kleen <[email protected]>
> Date: Tue, 14 Jun 2011 16:27:54 -0700
> Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg
>
> Under heavy contention it's better to read first before trying to
> do an atomic operation on the interconnect.
>
> This gives a few percent improvement for the mutex CPU time under
> heavy contention and likely saves some power too.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> kernel/mutex.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index d607ed5..1abffa9 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> if (owner && !mutex_spin_on_owner(lock, owner))
> break;
>
> - if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> + if (atomic_read(&lock->count) == 1 &&
> + atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> lock_acquired(&lock->dep_map, ip);
> mutex_set_owner(lock);
> preempt_enable();


What is *sorely* missing from your changelog (again) is the
explanation of *why* it improves performance in the contended case:
because the cacheline is brought into shared-read MESI state which
the CMPXCHG might not dirty if the CMPXCHG fails in the contended
case.

Firstly, this reduces the cost of hard bounces somewhat because the
owner CPU still has the cacheline in read-shared state, which it can
invalidate from the other CPU's cache on unlock in a somewhat cheaper
way than if it were forced to bounce the cacheline back.

Secondly, in the contended case more than 2 CPUs are looping on the
same cacheline and bringing it in shared and doing the cmpxchg loop
will not bounce the cacheline around (if CMPXCHG is smart enough to
not dirty the cacheline even in the failed case - this is CPU model
dependent) - it will spread to all interested CPUs in read-shared
state. This is most likely the dominant factor in Tim's tests.

Had you considered and described all that then you'd inevitably have
been led to consider the much more important issue that is missing
from your patch: the analysis of what happens to the cacheline under
*light* contention, which is by far the most important case ...

In the lightly contended case it's ultimately bad to bring in the
cacheline as read-shared first, because the cmpxchg will have to go
out to the MESI interconnect *again*: this time to flush the
cacheline from the previous owner CPU's cache.

So i don't think we want your patch, without some really good
supporting facts and analysis that show that the lightly contended
case does not suffer.

Thanks,

Ingo

2011-06-17 19:41:55

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 09:46:00AM -0700, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Something like so? Compiles and runs the benchmark in question.
>
> Oh, and can you do this with a commit log and sign-off, and I'll put
> it in my "anon_vma-locking" branch that I have. I'm not going to
> actually merge that branch into mainline until I've seen a few more
> acks or more testing by Tim.
>
> But if Tim's numbers hold up (-32% to +15% performance by just the
> first one, and +15% isn't actually an improvement since tmpfs
> read-ahead should have gotten us to +66%), I think we have to do this
> just to avoid the performance regression.

You could also add the mutex "optimize caching protocol"
patch I posted earlier to that branch.

It didn't actually improve Tim's throughput number, but it made the CPU
consumption of the mutex go down.

-Andi

---
From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
From: Andi Kleen <[email protected]>
Date: Tue, 14 Jun 2011 16:27:54 -0700
Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg

Under heavy contention it's better to read first before trying
to do an atomic operation on the interconnect.

This gives a few percent improvement for the mutex CPU time
under heavy contention and likely saves some power too.

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d607ed5..1abffa9 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if (atomic_read(&lock->count) == 1 &&
+ atomic_cmpxchg(&lock->count, 1, 0) == 1) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();



--
[email protected] -- Speaking for myself only

2011-06-17 19:49:25

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] mm, memory-failure: Fix spinlock vs mutex order

On Fri, 2011-06-17 at 20:32 +0200, Peter Zijlstra wrote:
> Aside from the THP thing there's a user in memory-failure.c, which looks
> to be broken as it is because it's calling things under tasklist_lock
> which isn't preemptible, but it looks like we can simply swap the
> tasklist_lock vs page_lock_anon_vma.
>

I thought about maybe using rcu, but then thought the thing is probably
wanting to exclude new tasks as it wants to kill all mm users.

---
Subject: mm, memory-failure: Fix spinlock vs mutex order

We cannot take a mutex while holding a spinlock, so flip the order as
it's documented to be random.

Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/memory-failure.c | 21 ++++++---------------
mm/rmap.c | 5 ++---
2 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eac0ba5..740c4f5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -391,10 +391,11 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct anon_vma *av;

- read_lock(&tasklist_lock);
av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
- goto out;
+ return;
+
+ read_lock(&tasklist_lock);
for_each_process (tsk) {
struct anon_vma_chain *vmac;

@@ -408,9 +409,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
- page_unlock_anon_vma(av);
-out:
read_unlock(&tasklist_lock);
+ page_unlock_anon_vma(av);
}

/*
@@ -424,17 +424,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct prio_tree_iter iter;
struct address_space *mapping = page->mapping;

- /*
- * A note on the locking order between the two locks.
- * We don't rely on this particular order.
- * If you have some other code that needs a different order
- * feel free to switch them around. Or add a reverse link
- * from mm_struct to task_struct, then this could be all
- * done without taking tasklist_lock and looping over all tasks.
- */
-
- read_lock(&tasklist_lock);
mutex_lock(&mapping->i_mmap_mutex);
+ read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

@@ -454,8 +445,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
- mutex_unlock(&mapping->i_mmap_mutex);
read_unlock(&tasklist_lock);
+ mutex_unlock(&mapping->i_mmap_mutex);
}

/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 0eb463e..5e51855 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -38,9 +38,8 @@
* in arch-dependent flush_dcache_mmap_lock,
* within inode_wb_list_lock in __sync_single_inode)
*
- * (code doesn't rely on that order so it could be switched around)
- * ->tasklist_lock
- * anon_vma->mutex (memory_failure, collect_procs_anon)
+ * anon_vma->mutex,mapping->i_mutex (memory_failure, collect_procs_anon)
+ * ->tasklist_lock
* pte map lock
*/


2011-06-17 20:05:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: Fix spinlock vs mutex order

> I thought about maybe using rcu, but then thought the thing is probably
> wanting to exclude new tasks as it wants to kill all mm users.

Probably both would work.

Looks good to me. hwpoison patches are usually directly merged by Andrew
these days.

Acked-by: Andi Kleen <[email protected]>

-Andi

2011-06-17 20:19:28

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 11:39 -0700, Linus Torvalds wrote:

> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.
>
> We already have to drop the anon_vma lock for the "out of memory"
> case, and a slight re-organization of clone_anon_vma() makes it easy
> to just first try a NOIO allocation with the lock still held, and then
> if that fails do the "drop lock, retry, and hard-fail" case.
>
> IOW, something like the attached (on top of the patches already posted
> except for your memory reclaim thing)
>

Linus,

I've applied this patch, plus the other two patches on batching anon_vma
clone and anon_vma unlink. This improved throughput further. I now see
average throughput at 140.2% vs 2.6.39-vanilla over 10 runs. The mutex
lock has also gone down to 3.7% of cpu in my profile. Certainly a great
deal of improvement.

To summarize,

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus (v1) 115.7% (+15.7%) (anon_vma clone v1)
3.0-rc2+linus+softirq 86.2% (-17.3%)
3.0-rc2+linus (v2) 104.9% (+4.9%) (anon_vma clone v2)
3.0-rc2+linus (v3) 140.3% (+40.3%) (anon_vma clone v2 + unlink + chain_alloc_tweak)


(Max-Min)/avg Standard Dev
2.6.39(vanilla) 3% 1.1%
2.6.39+ra-patch 3% 1.2%
3.0-rc2(vanilla) 20% 7.3%
3.0-rc2+linus 36% 12.2%
3.0-rc2+linus+softirq 40% 15.2%
3.0-rc2+linus (v2) 53% 14.8%
3.0-rc2+linus (v3) 27% 8.1%

Thanks.

Tim


------------Profile attached--------------

Profile from latest run 3.0-rc2+linus (v3):

- 5.44% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 87.81% cpupri_set
+ 5.67% release_pages
+ 1.71% pagevec_lru_move_fn
+ 1.31% try_to_wake_up
+ 0.85% get_page_from_freelist
+ 4.15% exim [kernel.kallsyms] [k] page_fault
- 3.76% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.97% __mutex_lock_slowpath
- mutex_lock
+ 55.46% lock_anon_vma_root.clone.13
+ 41.94% anon_vma_lock.clone.10
+ 1.14% dup_mm
+ 1.02% unlink_file_vma
+ 2.44% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.06% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 1.91% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 1.89% exim [kernel.kallsyms] [k] copy_page_c


2011-06-17 22:20:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 17 Jun 2011, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 11:32 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > something like so I guess, completely untested etc..
>
> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.
>
> We already have to drop the anon_vma lock for the "out of memory"
> case, and a slight re-organization of clone_anon_vma() makes it easy
> to just first try a NOIO allocation with the lock still held, and then
> if that fails do the "drop lock, retry, and hard-fail" case.
>
> IOW, something like the attached (on top of the patches already posted
> except for your memory reclaim thing)
>
> Hugh, does this fix the lockdep issue?

Yes, that fixed the lockdep issue, and ran nicely under load for an hour.

I agree that it's better to do this GFP_NOWAIT and fallback,
than trylock the anon_vma.

And I'm happy that you've still got that WARN_ON_ONCE(root) in: I do not
have a fluid mental model of the anon_vma_chains and get lost there; and
though it's obvious that we must have the same anon_vma->root going
down the same_anon_vma list, I could not put my finger on a killer
demonstration for why the same has to be true of the same_vma list.

But I've not seen your WARN_ON_ONCE fire, and it's hard to imagine
how there could be more than one root in the whole bundle of lists.

Hugh

2011-06-18 04:48:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 3:20 PM, Hugh Dickins <[email protected]> wrote:
>
> Yes, that fixed the lockdep issue, and ran nicely under load for an hour.

Ok, nobody screamed or complained, so the thing is now merged and pushed out.

Linus

2011-06-18 08:09:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> On Fri, Jun 17, 2011 at 09:46:00AM -0700, Linus Torvalds wrote:
> > On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Something like so? Compiles and runs the benchmark in question.
> >
> > Oh, and can you do this with a commit log and sign-off, and I'll put
> > it in my "anon_vma-locking" branch that I have. I'm not going to
> > actually merge that branch into mainline until I've seen a few more
> > acks or more testing by Tim.
> >
> > But if Tim's numbers hold up (-32% to +15% performance by just the
> > first one, and +15% isn't actually an improvement since tmpfs
> > read-ahead should have gotten us to +66%), I think we have to do this
> > just to avoid the performance regression.
>
> You could also add the mutex "optimize caching protocol"
> patch I posted earlier to that branch.
>
> It didn't actually improve Tim's throughput number, but it made the
> CPU consumption of the mutex go down.

Why have you ignored the negative feedback for that patch:

http://marc.info/[email protected]

and why have you resent this patch without addressing that feedback?

Thanks,

Ingo