2011-06-15 00:29:21

by Tim Chen

[permalink] [raw]
Subject: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

It seems that the recent change to make the anon_vma->lock into a
mutex (commit 2b575eb6) causes a 52% regression in throughput (2.6.39 vs
3.0-rc2) on the exim mail server workload in the MOSBENCH test suite.

Our test setup is a 4-socket Westmere-EX system with 10 cores per
socket. 40 clients are created on the test machine and send email
to the exim server residing on the same machine.

Exim forks off child processes to handle the incoming mail, and the
process exits after the mail delivery completes. We see quite a bit of
acquisition of the anon_vma->lock as a result.

On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
However, after the switch of the lock to mutex on 3.0-rc2, the mutex
acquisition jumps to 18.6% of cpu. This seems to be the main cause of
the 52% throughput regression.

Other workloads which have a lot of forks/exits may be similarly
affected by this regression. Workloads which are vm lock intensive
could be affected too.

I've listed the profile of 3.0-rc2 and 2.6.39 below for comparison.

Thanks.

Tim


---------------------------
3.0-rc2 profile:

- 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
- 99.54% anon_vma_lock.clone.10
+ 38.99% anon_vma_clone
+ 37.56% unlink_anon_vmas
+ 11.92% anon_vma_fork
+ 11.53% anon_vma_free
+ 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
- do_raw_spin_lock
- 94.11% _raw_spin_lock
+ 47.32% __mutex_lock_common.clone.5
+ 14.23% __mutex_unlock_slowpath
+ 4.06% handle_pte_fault
+ 3.81% __do_fault
+ 3.16% unmap_vmas
+ 2.46% lock_flocks
+ 2.43% copy_pte_range
+ 2.28% __task_rq_lock
+ 1.30% __percpu_counter_add
+ 1.30% dput
+ 1.27% add_partial
+ 1.24% free_pcppages_bulk
+ 1.07% d_alloc
+ 1.07% get_page_from_freelist
+ 1.02% complete_walk
+ 0.89% dget
+ 0.71% new_inode
+ 0.61% __mod_timer
+ 0.58% dup_fd
+ 0.50% double_rq_lock
+ 3.66% _raw_spin_lock_irq
+ 0.87% _raw_spin_lock_bh
+ 2.90% exim [kernel.kallsyms] [k] page_fault
+ 2.25% exim [kernel.kallsyms] [k] mutex_unlock


-----------------------------------

2.6.39 profile:
+ 4.84% exim [kernel.kallsyms] [k] page_fault
+ 3.83% exim [kernel.kallsyms] [k] clear_page_c
- 3.25% exim [kernel.kallsyms] [k] do_raw_spin_lock
- do_raw_spin_lock
- 91.86% _raw_spin_lock
+ 14.16% unlink_anon_vmas
+ 12.54% unlink_file_vma
+ 7.30% anon_vma_clone_batch
+ 6.17% dup_mm
+ 5.77% __do_fault
+ 5.77% __pte_alloc
+ 5.31% lock_flocks
...
+ 3.22% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.27% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 2.02% exim [kernel.kallsyms] [k] copy_page_c
+ 1.63% exim [kernel.kallsyms] [k] __list_del_entry
+ 1.58% exim [kernel.kallsyms] [k] get_page_from_freelist



2011-06-15 00:37:21

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.
>
This patch makes the mutex in Tim's workload take a bit less CPU time
(4% down) but it doesn't really fix the regression. When spinning for a
value it's always better to read it first before attempting to write it.
This saves expensive operations on the interconnect.

So it's not really a fix for this, but may be a slight improvement for
other workloads.

-Andi

From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
From: Andi Kleen <[email protected]>
Date: Tue, 14 Jun 2011 16:27:54 -0700
Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg

Under heavy contention it's better to read first before trying
to do an atomic operation on the interconnect.

This gives a few percent improvement for the mutex CPU time
under heavy contention and likely saves some power too.

Signed-off-by: Andi Kleen <[email protected]>
---
kernel/mutex.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d607ed5..1abffa9 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if (atomic_read(&lock->count) == 1 &&
+ atomic_cmpxchg(&lock->count, 1, 0) == 1) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
--
1.7.4.4

2011-06-15 01:22:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, Jun 14, 2011 at 5:29 PM, Tim Chen <[email protected]> wrote:
>
> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.

Argh. That's nasty.

Even the 3.25% is horrible. We scale so well in other situations that
it's really sad how the anon_vma lock is now one of our worst issues.

Anyway, please check me if I'm wrong, but won't the "anon_vma->root"
be the same for all the anon_vma's that are associated with one
particular vma?

The reason I ask is because when I look at anon_vma_clone(), we do that

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                ...
                anon_vma_chain_link(dst, avc, pavc->anon_vma);
        }

and then we do that anon_vma_lock()/unlock() dance on each of those
pavc->anon_vma's. But if the anon_vma->root is always the same, then
that would mean that we could do the lock just once, and hold it over
the loop.
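
IOW, something along these lines (a completely untested sketch just to
illustrate the idea - it open-codes the anon_vma_chain_link() body and
assumes the shared lock really is anon_vma->root->mutex; I'm writing
the field names from memory, so don't trust the details):

        struct anon_vma *root = NULL;

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                struct anon_vma *anon_vma = pavc->anon_vma;

                avc = anon_vma_chain_alloc();
                if (unlikely(!avc)) {
                        if (root)
                                mutex_unlock(&root->mutex);
                        goto enomem_failure;
                }

                /* take the (hopefully shared) root lock once, not per entry */
                if (!root) {
                        root = anon_vma->root;
                        mutex_lock(&root->mutex);
                }
                /* the open question: can the root ever differ on this list? */
                WARN_ON_ONCE(anon_vma->root != root);

                /* body of anon_vma_chain_link(), minus its lock/unlock */
                avc->vma = dst;
                avc->anon_vma = anon_vma;
                list_add(&avc->same_vma, &dst->anon_vma_chain);
                list_add_tail(&avc->same_anon_vma, &anon_vma->head);
        }
        if (root)
                mutex_unlock(&root->mutex);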

Because I think the real problem with that anon_vma locking is that it
gets called so _much_. We'd be better off holding the lock for a
longer time, and just not do the lock/unlock thing so often. The
contention would go down simply because we wouldn't waste our time
with those atomic lock/unlock instructions as much.

Gaah. I knew exactly how the anon_vma locking worked a few months ago,
but it's complicated enough that I've swapped out all the details. So
I'm not at all sure that the anon_vma->root will be the same for every
anon_vma on the same_vma list.

Somebody hit me over the head with a clue-bat. Anybody?

Linus

2011-06-15 01:27:00

by Shaohua Li

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 08:29 +0800, Tim Chen wrote:
> It seems that the recent change to make the anon_vma->lock into a
> mutex (commit 2b575eb6) causes a 52% regression in throughput (2.6.39 vs
> 3.0-rc2) on the exim mail server workload in the MOSBENCH test suite.
>
> Our test setup is a 4-socket Westmere-EX system with 10 cores per
> socket. 40 clients are created on the test machine and send email
> to the exim server residing on the same machine.
>
> Exim forks off child processes to handle the incoming mail, and the
> process exits after the mail delivery completes. We see quite a bit of
> acquisition of the anon_vma->lock as a result.
>
> On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> the 52% throughput regression.
>
> Other workloads which have a lot of forks/exits may be similarly
> affected by this regression. Workloads which are vm lock intensive
> could be affected too.
>
> I've listed the profile of 3.0-rc2 and 2.6.39 below for comparison.
>
> Thanks.
>
> Tim
>
>
> ---------------------------
> 3.0-rc2 profile:
>
> - 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
> - __mutex_lock_common.clone.5
> - 99.99% __mutex_lock_slowpath
> - mutex_lock
> - 99.54% anon_vma_lock.clone.10
> + 38.99% anon_vma_clone
> + 37.56% unlink_anon_vmas
> + 11.92% anon_vma_fork
> + 11.53% anon_vma_free
> + 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> - 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
> - do_raw_spin_lock
> - 94.11% _raw_spin_lock
> + 47.32% __mutex_lock_common.clone.5
> + 14.23% __mutex_unlock_slowpath
> + 4.06% handle_pte_fault
> + 3.81% __do_fault
> + 3.16% unmap_vmas
> + 2.46% lock_flocks
> + 2.43% copy_pte_range
> + 2.28% __task_rq_lock
> + 1.30% __percpu_counter_add
> + 1.30% dput
> + 1.27% add_partial
> + 1.24% free_pcppages_bulk
> + 1.07% d_alloc
> + 1.07% get_page_from_freelist
> + 1.02% complete_walk
> + 0.89% dget
> + 0.71% new_inode
> + 0.61% __mod_timer
> + 0.58% dup_fd
> + 0.50% double_rq_lock
> + 3.66% _raw_spin_lock_irq
> + 0.87% _raw_spin_lock_bh
> + 2.90% exim [kernel.kallsyms] [k] page_fault
> + 2.25% exim [kernel.kallsyms] [k] mutex_unlock
>
>
> -----------------------------------
>
> 2.6.39 profile:
> + 4.84% exim [kernel.kallsyms] [k] page_fault
> + 3.83% exim [kernel.kallsyms] [k] clear_page_c
> - 3.25% exim [kernel.kallsyms] [k] do_raw_spin_lock
> - do_raw_spin_lock
> - 91.86% _raw_spin_lock
> + 14.16% unlink_anon_vmas
> + 12.54% unlink_file_vma
> + 7.30% anon_vma_clone_batch
What are you testing? I didn't see Andi's batch anon->lock-for-fork
patches merged in 2.6.39.

Thanks,
Shaohua

2011-06-15 03:43:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, Jun 14, 2011 at 6:21 PM, Linus Torvalds
<[email protected]> wrote:
>
> Anyway, please check me if I'm wrong, but won't the "anon_vma->root"
> be the same for all the anon_vma's that are associated with one
> particular vma?
>
> The reason I ask [...]

So here's a trial patch that moves the anon_vma locking one level up
in the anon_vma_clone() call chain. It actually does allow the root to
change, but has a WARN_ON_ONCE() if that ever happens.

I *suspect* this will help the locking numbers a bit, but I'd like to
note that it does this *only* for the anon_vma_clone() case, and the
exact same thing should be done for the exit case too (ie the
unlink_anon_vmas()). So if it does work it's still just one step on
the way, and there would be more work along the same lines to possibly
improve the locking further.

The patch is "tested" in the sense that I booted the kernel and am
running it right now (and compiled a kernel with it). But that's not a
whole lot of actual real life testing, so caveat emptor.

And I won't really even guarantee that the main locking problem is a
long chain of "same_vma" anon_vmas, which is what this handles with
just a single lock. So who knows - maybe it doesn't help at all. I
suspect it's worth testing, though.

Linus


Attachments:
patch.diff (1.72 kB)

2011-06-15 10:38:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> MOSBENCH test suite.

Argh, I'm trying to get this thing to run, but its all snake poo..

/me takes up a heavy club and goes snake hunting, should make a pretty
hat or something.

2011-06-15 10:59:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:36 +0200, Peter Zijlstra wrote:
> On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> > MOSBENCH test suite.
>
> Argh, I'm trying to get this thing to run, but its all snake poo..
>
> /me takes up a heavy club and goes snake hunting, should make a pretty
> hat or something.

Sweet, I've got meself a snake-skin hat!

The first thing that stood out when running it was:

31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
24 root -2 19 0 0 0 S 11.7 0.0 0:04.15 rcuc3
34 root -2 19 0 0 0 S 11.7 0.0 0:04.10 rcuc5
39 root -2 19 0 0 0 S 11.7 0.0 0:06.38 rcuc6
44 root -2 19 0 0 0 S 11.7 0.0 0:04.53 rcuc7
49 root -2 19 0 0 0 S 11.7 0.0 0:04.11 rcuc8
79 root -2 19 0 0 0 S 11.7 0.0 0:03.91 rcuc14
89 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc16
110 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc20
120 root -2 19 0 0 0 S 11.7 0.0 0:03.82 rcuc22
13 root -2 19 0 0 0 S 10.7 0.0 0:04.37 rcuc1
19 root -2 19 0 0 0 S 10.7 0.0 0:04.19 rcuc2
29 root -2 19 0 0 0 S 10.7 0.0 0:04.12 rcuc4
54 root -2 19 0 0 0 S 10.7 0.0 0:04.11 rcuc9
59 root -2 19 0 0 0 S 10.7 0.0 0:04.40 rcuc10
64 root -2 19 0 0 0 R 10.7 0.0 0:04.17 rcuc11
69 root -2 19 0 0 0 R 10.7 0.0 0:04.23 rcuc12
84 root -2 19 0 0 0 S 10.7 0.0 0:03.90 rcuc15
95 root -2 19 0 0 0 S 10.7 0.0 0:03.99 rcuc17
100 root -2 19 0 0 0 S 10.7 0.0 0:03.88 rcuc18
105 root -2 19 0 0 0 S 10.7 0.0 0:04.14 rcuc19
125 root -2 19 0 0 0 S 10.7 0.0 0:03.79 rcuc23
74 root -2 19 0 0 0 S 9.7 0.0 0:04.33 rcuc13
115 root -2 19 0 0 0 R 9.7 0.0 0:03.82 rcuc21

Which is an impressive amount of RCU usage..

2011-06-15 11:43:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:58 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 12:36 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-06-14 at 17:29 -0700, Tim Chen wrote:
> > > MOSBENCH test suite.
> >
> > Argh, I'm trying to get this thing to run, but its all snake poo..
> >
> > /me takes up a heavy club and goes snake hunting, should make a pretty
> > hat or something.
>
> Sweet, I've got meself a snake-skin hat!
>
> The first thing that stood out when running it was:
>
> 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> 24 root -2 19 0 0 0 S 11.7 0.0 0:04.15 rcuc3
> 34 root -2 19 0 0 0 S 11.7 0.0 0:04.10 rcuc5
> 39 root -2 19 0 0 0 S 11.7 0.0 0:06.38 rcuc6
> 44 root -2 19 0 0 0 S 11.7 0.0 0:04.53 rcuc7
> 49 root -2 19 0 0 0 S 11.7 0.0 0:04.11 rcuc8
> 79 root -2 19 0 0 0 S 11.7 0.0 0:03.91 rcuc14
> 89 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc16
> 110 root -2 19 0 0 0 S 11.7 0.0 0:03.90 rcuc20
> 120 root -2 19 0 0 0 S 11.7 0.0 0:03.82 rcuc22
> 13 root -2 19 0 0 0 S 10.7 0.0 0:04.37 rcuc1
> 19 root -2 19 0 0 0 S 10.7 0.0 0:04.19 rcuc2
> 29 root -2 19 0 0 0 S 10.7 0.0 0:04.12 rcuc4
> 54 root -2 19 0 0 0 S 10.7 0.0 0:04.11 rcuc9
> 59 root -2 19 0 0 0 S 10.7 0.0 0:04.40 rcuc10
> 64 root -2 19 0 0 0 R 10.7 0.0 0:04.17 rcuc11
> 69 root -2 19 0 0 0 R 10.7 0.0 0:04.23 rcuc12
> 84 root -2 19 0 0 0 S 10.7 0.0 0:03.90 rcuc15
> 95 root -2 19 0 0 0 S 10.7 0.0 0:03.99 rcuc17
> 100 root -2 19 0 0 0 S 10.7 0.0 0:03.88 rcuc18
> 105 root -2 19 0 0 0 S 10.7 0.0 0:04.14 rcuc19
> 125 root -2 19 0 0 0 S 10.7 0.0 0:03.79 rcuc23
> 74 root -2 19 0 0 0 S 9.7 0.0 0:04.33 rcuc13
> 115 root -2 19 0 0 0 R 9.7 0.0 0:03.82 rcuc21
>
> Which is an impressive amount of RCU usage..

FWIW, Alex Shi's patch:

http://lkml.kernel.org/r/1308029185.15392.147.camel@sli10-conroe

Improves the situation to:

3745 root 20 0 26664 1460 1212 S 18.5 0.0 0:01.28 exim
39 root -2 19 0 0 0 S 4.9 0.0 0:02.83 rcuc6
105 root -2 19 0 0 0 S 4.9 0.0 0:02.79 rcuc19
7 root -2 19 0 0 0 S 3.9 0.0 0:02.70 rcuc0
13 root -2 19 0 0 0 S 3.9 0.0 0:02.54 rcuc1
19 root -2 19 0 0 0 S 3.9 0.0 0:02.76 rcuc2
24 root -2 19 0 0 0 S 3.9 0.0 0:02.75 rcuc3
...

And throughput increases like:

-tip 260.092 messages/sec/core
-tip+sirq-rcu 271.078 messages/sec/core

2011-06-15 11:53:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 09:26 +0800, Shaohua Li wrote:

> On Wed, 2011-06-15 at 08:29 +0800, Tim Chen wrote:
> > + 7.30% anon_vma_clone_batch

> What are you testing? I didn't see Andi's batch anon->lock-for-fork
> patches merged in 2.6.39.

Good spot, that certainly isn't plain .39.

It looks like those (http://marc.info/?l=linux-mm&m=130533041726258) are
similar to Linus' patch, except Linus takes the hard line that the root
lock should stay the same. Let me try Linus' patch first to see if this
workload can trigger his WARN.

/me mutters something about patches in attachments and rebuilds.

OK, the WARN doesn't trigger, but it also doesn't improve things (quite
the opposite in fact):

-tip 260.092 messages/sec/core
+sirq-rcu 271.078 messages/sec/core
+linus 262.435 messages/sec/core

So Linus' patch makes the throughput drop from 271 to 262, weird.

/me goes re-test without the sirq-rcu bits mixed in just to make sure.

2011-06-15 12:50:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 13:52 +0200, Peter Zijlstra wrote:
> /me goes re-test without the sirq-rcu bits mixed in just to make sure.

I switched from PREEMPT=n to PREEMPT_VOLUNTARY=y, which seemed to make a
difference:

.39 257.651 messages/sec/core
-tip 254.976 messages/sec/core
+linus 258.03 messages/sec/core
+sirq 265.951 messages/sec/core

2011-06-15 16:19:48

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> Good spot, that certainly isn't plain .39.

True.

>
> It looks like those (http://marc.info/?l=linux-mm&m=130533041726258) are
> similar to Linus' patch, except Linus takes the hard line that the root

One difference from Linus' patch is that it does batching for both fork
and exit. I suspect Linus' patch could probably do that too.

But the fork+exit patch only improved things slightly; it's very unlikely
to recover the 52%. Maybe the overall locking here needs to be revisited.

And in general it looks like blind conversion from spinlock to mutex
is a bad idea right now.

-Andi

2011-06-15 16:42:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:

> And in general it looks like blind conversion from spinlock to mutex
> is a bad idea right now.

For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.

I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair
like the ticket locks are. It could be significant with larger socket
count since their cacheline arbitration is more sucky.
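
To make that concrete, the fairness difference is roughly the following
(a toy userspace sketch with C11 atomics and made-up names, not the
kernel's actual ticket-lock implementation): a ticket lock hands out
strictly ordered tickets and each waiter just spins reading until its
number comes up, whereas the mutex spin path has everybody racing
cmpxchg on ->count, so whoever happens to win the cacheline wins the
lock.

#include <stdatomic.h>

struct ticket_lock {
        atomic_uint next;       /* next ticket to hand out */
        atomic_uint owner;      /* ticket currently being served */
};

/* usage: struct ticket_lock l = { 0, 0 }; ticket_lock_acquire(&l); ... */

static void ticket_lock_acquire(struct ticket_lock *lock)
{
        /* arrival order fixes acquisition order: strictly FIFO */
        unsigned int me = atomic_fetch_add_explicit(&lock->next, 1,
                                                    memory_order_relaxed);

        /* waiters only read while spinning; no stores on the cacheline */
        while (atomic_load_explicit(&lock->owner, memory_order_acquire) != me)
                ;       /* cpu_relax()/pause would go here */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
        /* hand the lock to the next ticket in line */
        unsigned int next = atomic_load_explicit(&lock->owner,
                                                 memory_order_relaxed) + 1;

        atomic_store_explicit(&lock->owner, next, memory_order_release);
}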


2011-06-15 16:48:29

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 06:45:37PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:
>
> > And in general it looks like blind conversion from spinlock to mutex
> > is a bad idea right now.
>
> For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.

With only one other guy active a lot of things are quite a bit easier.
Basically 2S is a trivial case here.

> I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair

The Intel 4S systems are fair, but ticketing still helps significantly
because it has a lot nicer interconnect behaviour.

> like the ticket locks are. It could be significant with larger socket
> count since their cacheline arbitration is more sucky.

It gets a bit better with the patch I sent earlier to read the count
first, but yes it's a problem. However I'm not sure that even
with that fixed mutexes will be as good as plain ticket locks.

Also certainly it's no short term fix for 3.0. Right now
we still have this terrible regression.

-Andi
--
[email protected] -- Speaking for myself only

2011-06-15 18:42:46

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 18:45 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 09:18 -0700, Andi Kleen wrote:
>
> > And in general it looks like blind conversion from spinlock to mutex
> > is a bad idea right now.
>
> For 4 socket machines, maybe. On 2 sockets I cannot reproduce anything.
>
> I wonder if it's the fairness thing: the mutex spinners aren't FIFO-fair
> like the ticket locks are. It could be significant with larger socket
> count since their cacheline arbitration is more sucky.
>
>
>

Peter,

Wonder if you can provide the profile on your run so I can compare with
what I got on 4 sockets?

Thanks.

Tim

2011-06-15 19:11:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
>
> The first thing that stood out when running it was:
>
> 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
...
>
> Which is an impressive amount of RCU usage..

Gaah. Can we just revert that crazy "use threads for RCU" thing already?

It's wrong. It's clearly expensive. It's using threads FOR NO GOOD
REASON, since the only reason for using them is config options that
nobody even uses, for chrissake!

And it results in real problems. For example, if you use "perf record"
to see what the hell is up, the use of kernel threads for RCU
callbacks means that the RCU cost is never even seen. I don't know how
Tim did his profiling to figure out the costs, and I don't know how he
decided that the spinlock to semaphore conversion was the culprit, but
it is entirely possible that Tim didn't actually bisect the problem,
but instead used "perf record" on the exim task, saw that the
semaphore costs had gone up, and decided that it must be the
conversion.

And sure, maybe 50% of it was the conversion, and maybe 50% of it the
RCU changes - and "perf record" just never showed the RCU component.
We already know that it causes huge slowdowns on some other loads. We
just don't know.

So using anonymous kernel threads is actually a real downside. It
makes it much less obvious what is going on. We saw that exact same
thing with the generic worker thread conversions: things that used to
have clear performance issues ("oh, the iwl-phy0 thread is using 3% of
CPU time because it is polling for IO, and I can see that in 'top'")
turned into much-harder-to-see issues ("oh, kwork0 is using 3% CPU
time according to 'top' - I have no idea why").

Now, with RCU using softirq's, clearly the costs of RCU can sometimes
be mis-attributed because it turns out that the softirq is run from
some other thread. But statistically, if you end up having a heavy
softirq load, it _usually_ ends up being triggered in the context of
whoever causes that load. Not always, and not reliably, but I suspect
it ends up being easier to see.

And quite frankly, just look at commit a26ac2455ffc: it sure as hell
isn't making anything simpler. It adds several hundred lines of code,
and it's already been implicated in one major performance regression,
and is a possible reason for this one.

So Ingo, Paul: can we *please* just revert it, and agree that if you
want to re-instate it, the code should be

(a) only done for the case where it matters (ie for the RCUBOOST case)

(b) tested better for performance issues (and maybe shared with the
tinyrcu case that also uses threads?)

Please? It's more than a revert of that one commit - there's tons of
commits on top of that to actually do the boosting etc (and fixing
some of the fallout). But really, by now I'd prefer to just revert it
all, rather than see if it can be fixed up.. According to Peter,
Shaohua Li's patch that largely fixes the performance issue for the
other load (by moving *some* of the RCU stuff back to softirq context)
helps, but still leaves the rcu threads with a lot of CPU time.

Considering that most of the RCU callbacks are not very CPU intensive,
I bet that there's a *ton* of them, and that the context switch
overhead is quite noticeable. And quite frankly, even if Shaohua Li's
patch largely fixes the performance issue, it does so by making the
RCU situation EVEN MORE COMPLEX, with RCU now using *both* threads and
softirq.

That's just crazy. If you really want to do both the threads and
softirq thing for the magical RCU_BOOST case, go ahead, but please
don't do crazy things for the sane configurations.

Linus

2011-06-15 19:25:41

by Andrew Morton

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 15 Jun 2011 12:11:19 -0700
Linus Torvalds <[email protected]> wrote:

> So using anonymous kernel threads is actually a real downside. It
> makes it much less obvious what is going on. We saw that exact same
> thing with the generic worker thread conversions: things that used to
> have clear performance issues ("oh, the iwl-phy0 thread is using 3% of
> CPU time because it is polling for IO, and I can see that in 'top'")
> turned into much-harder-to-see issues ("oh, kwork0 is using 3% CPU
> time according to 'top' - I have no idea why").

Yes, this is an issue with the memcg async reclaim patches. One
implementation uses a per-memcg kswapd and you can then actually see
what it's doing, and see when it goes nuts (as kswapd threads like to
do). The other implementation uses worker threads and you have no clue
what's going on.

It could be that if more things move away from dedicated threads and
into worker threads, we'll need to build a separate accounting system
so we can see how much time worker threads are spending on a
per-handler basis. Which means a new top-like tool, etc.

That's all pretty nasty and is a tradeoff which should be considered
when making thread-vs-worker decisions.

2011-06-15 20:13:33

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Linus Torvalds <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > The first thing that stood out when running it was:
> >
> > 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> > 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> ...
> >
> > Which is an impressive amount of RCU usage..
>
> Gaah. Can we just revert that crazy "use threads for RCU" thing already?

I have this fix queued up currently:

09223371deac: rcu: Use softirq to address performance regression

and that's ready for pulling and should fix this regression:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core-urgent-for-linus

The revert itself looks quite hairy, i just attempted it: it affects
half a dozen other followup commits. We might be better off using the
(tested) commit above and then shutting down all kthread based
processing and always use a softirq ... or something like that.

if you think that's risky and we should do the revert then i'll
rebase the core/urgent branch and we'll do the revert.

Paul, Peter, what do you think?

Thanks,

Ingo

------------------>
Paul E. McKenney (1):
rcu: Simplify curing of load woes

Shaohua Li (1):
rcu: Use softirq to address performance regression


Documentation/filesystems/proc.txt | 1 +
include/linux/interrupt.h | 1 +
include/trace/events/irq.h | 3 +-
kernel/rcutree.c | 88 ++++++++++++++++-------------------
kernel/rcutree.h | 1 +
kernel/rcutree_plugin.h | 20 ++++----
kernel/softirq.c | 2 +-
tools/perf/util/trace-event-parse.c | 1 +
8 files changed, 57 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index f481780..db3b1ab 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -843,6 +843,7 @@ Provides counts of softirq handlers serviced since boot time, for each cpu.
TASKLET: 0 0 0 290
SCHED: 27035 26983 26971 26746
HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250


1.3 IDE devices in /proc/ide
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 6c12989..f6efed0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -414,6 +414,7 @@ enum
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
+ RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */

NR_SOFTIRQS
};
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index ae045ca..1c09820 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -20,7 +20,8 @@ struct softirq_action;
softirq_name(BLOCK_IOPOLL), \
softirq_name(TASKLET), \
softirq_name(SCHED), \
- softirq_name(HRTIMER))
+ softirq_name(HRTIMER), \
+ softirq_name(RCU))

/**
* irq_handler_entry - called immediately before the irq action handler
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 89419ff..ae5c9ea 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -100,6 +100,7 @@ static char rcu_kthreads_spawnable;

static void rcu_node_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
static void invoke_rcu_cpu_kthread(void);
+static void __invoke_rcu_cpu_kthread(void);

#define RCU_KTHREAD_PRIO 1 /* RT priority for per-CPU kthreads. */

@@ -1442,13 +1443,21 @@ __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
}

/* If there are callbacks ready, invoke them. */
- rcu_do_batch(rsp, rdp);
+ if (cpu_has_callbacks_ready_to_invoke(rdp))
+ __invoke_rcu_cpu_kthread();
+}
+
+static void rcu_kthread_do_work(void)
+{
+ rcu_do_batch(&rcu_sched_state, &__get_cpu_var(rcu_sched_data));
+ rcu_do_batch(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
+ rcu_preempt_do_callbacks();
}

/*
* Do softirq processing for the current CPU.
*/
-static void rcu_process_callbacks(void)
+static void rcu_process_callbacks(struct softirq_action *unused)
{
__rcu_process_callbacks(&rcu_sched_state,
&__get_cpu_var(rcu_sched_data));
@@ -1465,7 +1474,7 @@ static void rcu_process_callbacks(void)
* the current CPU with interrupts disabled, the rcu_cpu_kthread_task
* cannot disappear out from under us.
*/
-static void invoke_rcu_cpu_kthread(void)
+static void __invoke_rcu_cpu_kthread(void)
{
unsigned long flags;

@@ -1479,6 +1488,11 @@ static void invoke_rcu_cpu_kthread(void)
local_irq_restore(flags);
}

+static void invoke_rcu_cpu_kthread(void)
+{
+ raise_softirq(RCU_SOFTIRQ);
+}
+
/*
* Wake up the specified per-rcu_node-structure kthread.
* Because the per-rcu_node kthreads are immortal, we don't need
@@ -1613,7 +1627,7 @@ static int rcu_cpu_kthread(void *arg)
*workp = 0;
local_irq_restore(flags);
if (work)
- rcu_process_callbacks();
+ rcu_kthread_do_work();
local_bh_enable();
if (*workp != 0)
spincnt++;
@@ -1635,6 +1649,20 @@ static int rcu_cpu_kthread(void *arg)
* to manipulate rcu_cpu_kthread_task. There might be another CPU
* attempting to access it during boot, but the locking in kthread_bind()
* will enforce sufficient ordering.
+ *
+ * Please note that we cannot simply refuse to wake up the per-CPU
+ * kthread because kthreads are created in TASK_UNINTERRUPTIBLE state,
+ * which can result in softlockup complaints if the task ends up being
+ * idle for more than a couple of minutes.
+ *
+ * However, please note also that we cannot bind the per-CPU kthread to its
+ * CPU until that CPU is fully online. We also cannot wait until the
+ * CPU is fully online before we create its per-CPU kthread, as this would
+ * deadlock the system when CPU notifiers tried waiting for grace
+ * periods. So we bind the per-CPU kthread to its CPU only if the CPU
+ * is online. If its CPU is not yet fully online, then the code in
+ * rcu_cpu_kthread() will wait until it is fully online, and then do
+ * the binding.
*/
static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
@@ -1647,12 +1675,14 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
if (IS_ERR(t))
return PTR_ERR(t);
- kthread_bind(t, cpu);
+ if (cpu_online(cpu))
+ kthread_bind(t, cpu);
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
- per_cpu(rcu_cpu_kthread_task, cpu) = t;
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ per_cpu(rcu_cpu_kthread_task, cpu) = t;
+ wake_up_process(t); /* Get to TASK_INTERRUPTIBLE quickly. */
return 0;
}

@@ -1759,12 +1789,11 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = 99;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
}
return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

-static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
-
/*
* Spawn all kthreads -- called as soon as the scheduler is running.
*/
@@ -1772,30 +1801,18 @@ static int __init rcu_spawn_kthreads(void)
{
int cpu;
struct rcu_node *rnp;
- struct task_struct *t;

rcu_kthreads_spawnable = 1;
for_each_possible_cpu(cpu) {
per_cpu(rcu_cpu_has_work, cpu) = 0;
- if (cpu_online(cpu)) {
+ if (cpu_online(cpu))
(void)rcu_spawn_one_cpu_kthread(cpu);
- t = per_cpu(rcu_cpu_kthread_task, cpu);
- if (t)
- wake_up_process(t);
- }
}
rnp = rcu_get_root(rcu_state);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- if (rnp->node_kthread_task)
- wake_up_process(rnp->node_kthread_task);
if (NUM_RCU_NODES > 1) {
- rcu_for_each_leaf_node(rcu_state, rnp) {
+ rcu_for_each_leaf_node(rcu_state, rnp)
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- t = rnp->node_kthread_task;
- if (t)
- wake_up_process(t);
- rcu_wake_one_boost_kthread(rnp);
- }
}
return 0;
}
@@ -2221,31 +2238,6 @@ static void __cpuinit rcu_prepare_kthreads(int cpu)
}

/*
- * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
- * but the RCU threads are woken on demand, and if demand is low this
- * could be a while triggering the hung task watchdog.
- *
- * In order to avoid this, poke all tasks once the CPU is fully
- * up and running.
- */
-static void __cpuinit rcu_online_kthreads(int cpu)
-{
- struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
- struct rcu_node *rnp = rdp->mynode;
- struct task_struct *t;
-
- t = per_cpu(rcu_cpu_kthread_task, cpu);
- if (t)
- wake_up_process(t);
-
- t = rnp->node_kthread_task;
- if (t)
- wake_up_process(t);
-
- rcu_wake_one_boost_kthread(rnp);
-}
-
-/*
* Handle CPU online/offline notification events.
*/
static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
@@ -2262,7 +2254,6 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
rcu_prepare_kthreads(cpu);
break;
case CPU_ONLINE:
- rcu_online_kthreads(cpu);
case CPU_DOWN_FAILED:
rcu_node_kthread_setaffinity(rnp, -1);
rcu_cpu_kthread_setrt(cpu, 1);
@@ -2410,6 +2401,7 @@ void __init rcu_init(void)
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
__rcu_init_preempt();
+ open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

/*
* We don't need protection against CPU-hotplug here because
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 7b9a08b..0fed6b9 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -439,6 +439,7 @@ static void rcu_preempt_offline_cpu(int cpu);
#endif /* #ifdef CONFIG_HOTPLUG_CPU */
static void rcu_preempt_check_callbacks(int cpu);
static void rcu_preempt_process_callbacks(void);
+static void rcu_preempt_do_callbacks(void);
void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU)
static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c8bff30..38d09c5 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -602,6 +602,11 @@ static void rcu_preempt_process_callbacks(void)
&__get_cpu_var(rcu_preempt_data));
}

+static void rcu_preempt_do_callbacks(void)
+{
+ rcu_do_batch(&rcu_preempt_state, &__get_cpu_var(rcu_preempt_data));
+}
+
/*
* Queue a preemptible-RCU callback for invocation after a grace period.
*/
@@ -997,6 +1002,10 @@ static void rcu_preempt_process_callbacks(void)
{
}

+static void rcu_preempt_do_callbacks(void)
+{
+}
+
/*
* Wait for an rcu-preempt grace period, but make it happen quickly.
* But because preemptible RCU does not exist, map to rcu-sched.
@@ -1299,15 +1308,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
return 0;
}

-static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
-{
- if (rnp->boost_kthread_task)
- wake_up_process(rnp->boost_kthread_task);
-}
-
#else /* #ifdef CONFIG_RCU_BOOST */

static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
@@ -1331,10 +1335,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

-static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
-{
-}
-
#endif /* #else #ifdef CONFIG_RCU_BOOST */

#ifndef CONFIG_SMP
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 1396017..40cf63d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -58,7 +58,7 @@ DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

char *softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
- "TASKLET", "SCHED", "HRTIMER"
+ "TASKLET", "SCHED", "HRTIMER", "RCU"
};

/*
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 1e88485..0a7ed5b 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2187,6 +2187,7 @@ static const struct flag flags[] = {
{ "TASKLET_SOFTIRQ", 6 },
{ "SCHED_SOFTIRQ", 7 },
{ "HRTIMER_SOFTIRQ", 8 },
+ { "RCU_SOFTIRQ", 9 },

{ "HRTIMER_NORESTART", 0 },
{ "HRTIMER_RESTART", 1 },

2011-06-15 20:12:33

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 12:11 -0700, Linus Torvalds wrote:

>
> And it results in real problems. For example, if you use "perf record"
> to see what the hell is up, the use of kernel threads for RCU
> callbacks means that the RCU cost is never even seen. I don't know how
> Tim did his profiling to figure out the costs, and I don't know how he
> decided that the spinlock to semaphore conversion was the culprit, but
> it is entirely possible that Tim didn't actually bisect the problem,
> but instead used "perf record" on the exim task, saw that the
> semaphore costs had gone up, and decided that it must be the
> conversion.
>

Yes, I was using perf to do the profiling. I thought that the mutex
conversion was the most likely culprit based on the change in profile.

Tim

2011-06-15 20:17:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andrew Morton <[email protected]> wrote:

> It could be that if more things move away from dedicated threads
> and into worker threads, we'll need to build a separate accounting
> system so we can see how much time worker threads are spending on a
> per-handler basis. Which means a new top-like tool, etc.

perf record -g will go a long way towards such a tool already - but i
think it would be useful to create a more top-alike view as well.

Thanks,

Ingo

2011-06-15 20:17:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Tim Chen <[email protected]> wrote:

> On Wed, 2011-06-15 at 12:11 -0700, Linus Torvalds wrote:
>
> >
> > And it results in real problems. For example, if you use "perf record"
> > to see what the hell is up, the use of kernel threads for RCU
> > callbacks means that the RCU cost is never even seen. I don't know how
> > Tim did his profiling to figure out the costs, and I don't know how he
> > decided that the spinlock to semaphore conversion was the culprit, but
> > it is entirely possible that Tim didn't actually bisect the problem,
> > but instead used "perf record" on the exim task, saw that the
> > semaphore costs had gone up, and decided that it must be the
> > conversion.
> >
>
> Yes, I was using perf to do the profiling. I thought that the mutex
> conversion was the most likely culprit based on the change in
> profile.

have you used callgraph profiling (perf record -g) or flat profiling?
Flat profiling can be misleading when there's proxy work done.

Thanks,

Ingo

2011-06-15 20:20:31

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 22:17 +0200, Ingo Molnar wrote:

>
> have you used callgraph profiling (perf record -g) or flat profiling?
> Flat profiling can be misleading when there's proxy work done.
>

I've used callgraph profiling to see the call stack.

Tim

2011-06-15 20:31:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 10:12:16PM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Wed, Jun 15, 2011 at 3:58 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > The first thing that stood out when running it was:
> > >
> > > 31694 root 20 0 26660 1460 1212 S 17.5 0.0 0:01.97 exim
> > > 7 root -2 19 0 0 0 S 12.7 0.0 0:06.14 rcuc0
> > ...
> > >
> > > Which is an impressive amount of RCU usage..
> >
> > Gaah. Can we just revert that crazy "use threads for RCU" thing already?
>
> I have this fix queued up currently:
>
> 09223371deac: rcu: Use softirq to address performance regression
>
> and that's ready for pulling and should fix this regression:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core-urgent-for-linus
>
> The revert itself looks quite hairy, i just attempted it: it affects
> half a dozen other followup commits. We might be better off using the
> (tested) commit above and then shutting down all kthread based
> processing and always use a softirq ... or something like that.
>
> if you think that's risky and we should do the revert then i'll
> rebase the core/urgent branch and we'll do the revert.
>
> Paul, Peter, what do you think?

It would be much lower risk to make the current code always use softirq
if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.

But if we must do the revert, we can do the revert. A small matter of
software and all that.

Thanx, Paul


2011-06-15 20:33:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 11:43 -0700, Tim Chen wrote:

> Wonder if you can provide the profile on your run so I can compare with
> what I got on 4 sockets?

Sure, so this is on a Westmere-EP (2 sockets, 6 cores/socket, 2
threads/core); what I did was:

perf record -r2 -gf make bench
perf report > foo.txt
bzip2 -9 foo.txt

Both files are about 0.5M; the tip one has the sirq-rcu patch and Linus'
patch applied (I could do one without if wanted).

http://programming.kicks-ass.net/sekrit/tip.txt.bz2
http://programming.kicks-ass.net/sekrit/39.txt.bz2

However, looking at them, the weird thing is, they're both dominated by
(taken from 39.txt):

7.44% exim [kernel.kallsyms] [k] format_decode
|
--- format_decode
|
|--93.07%-- vsnprintf
| |
| |--98.83%-- seq_printf
| | show_stat
| | seq_read
| | proc_reg_read
| | vfs_read
| | sys_read
| | system_call
| | __GI___libc_read
| | |
| | |--99.47%-- (nil)
| | --0.53%-- [...]
| |
| --1.17%-- snprintf
| proc_flush_task
| release_task
| wait_consider_task
| do_wait
| sys_wait4
| system_call
| |
| |--93.15%-- __libc_wait
| |
| --6.85%-- __waitpid
|
|--6.84%-- seq_printf
| show_stat
| seq_read
| proc_reg_read
| vfs_read
| sys_read
| system_call
| __GI___libc_read
| |
| |--99.56%-- (nil)
| --0.44%-- [...]
--0.10%-- [...]


I've no idea why it's doing that. I've had massive trouble getting this
MOSBENCH crap working in the first place since it's all in Python, but
what I basically did was rip out everything !exim in config.py and put
cores = [24]. In hosts.py I also ripped out everything !exim, cleared out
the clients list and made 'tom' my localhost (removing that perflock
thing).

After that things more or less ran, I saw exim, and it's giving me those
msgs/sec/core numbers like:

# perf record -r2 -gfo 39.perf.data make bench
python config.py
Starting results in: results/20110615-221914
*** Starting configuration 1/1 (benchmark-exim) ***
Starting Host.host-westmere...
sending westmere: /./
del.ing westmere: out/log/EximLoad.trial-2.host-westmere
del.ing westmere: out/log/EximLoad.trial-1.host-westmere
del.ing westmere: out/log/EximLoad.trial-0.host-westmere
del.ing westmere: out/log/
del.ing westmere: out/EximDaemon.host-westmere.configure
del.ing westmere: out/
sending westmere: /home/root/
sending westmere: /home/root/test/mosbench/
Starting Host.host-westmere... done
Starting HostInfo.host-westmere...
Starting HostInfo.host-westmere... done
Starting FileSystem.host-westmere.fstype-tmpfs-separate...
Starting FileSystem.host-westmere.fstype-tmpfs-separate... done
Starting SetCPUs.host-westmere...
FATAL: Module oprofile not found.
FATAL: Module oprofile not found.
Kernel doesn't support oprofile
CPUs 0-23 are online
CPUs 0-23 are online
Starting SetCPUs.host-westmere... done
Starting EximDaemon.host-westmere...
Starting EximDaemon.host-westmere... done
Waiting on EximLoad.trial-0.host-westmere...
[EximLoad.trial-0.host-westmere] => 86983 messages (15.0032 secs, 241.568 messages/sec/core)
Waiting on EximLoad.trial-0.host-westmere... done
Waiting on EximLoad.trial-1.host-westmere...
[EximLoad.trial-1.host-westmere] => 86770 messages (15.004 secs, 240.964 messages/sec/core)
Waiting on EximLoad.trial-1.host-westmere... done
Waiting on EximLoad.trial-2.host-westmere...
[EximLoad.trial-2.host-westmere] => 86987 messages (15.0035 secs, 241.574 messages/sec/core)
Waiting on EximLoad.trial-2.host-westmere... done
Stopping EximDaemon.host-westmere...
Stopping EximDaemon.host-westmere... done
Stopping HostInfo.host-westmere...
Stopping HostInfo.host-westmere... done
Stopping Host.host-westmere...
copying westmere: ./
copying westmere: EximDaemon.host-westmere.configure
copying westmere: log/
copying westmere: log/EximLoad.trial-0.host-westmere
copying westmere: log/EximLoad.trial-1.host-westmere
copying westmere: log/EximLoad.trial-2.host-westmere
Stopping Host.host-westmere... done
Stopping ResultPath...
Results in: results/20110615-221914/benchmark-exim
Stopping ResultPath... done
All results in: results/20110615-221914
[ perf record: Woken up 3774 times to write data ]
[ perf record: Captured and wrote 979.494 MB 39.perf.data (~42794760 samples) ]
CPUs 0-23 are online

2011-06-15 20:47:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



"Paul E. McKenney" <[email protected]> wrote:
>
>It would be much lower risk to make the current code always use softirq
>if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.

I don't care if it's a real revert or not, but I want the threads gone. Entirely. Not just the patch that uses softirqs for some things, and threads for the callbacks. No, I don't want the threads to show up or exist at all.

And to be sure, I'd like the code to set up and use the threads to actually compile away statically, so that there clearly isn't some way it's partially enabled.

Linus

2011-06-15 20:54:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 01:47:33PM -0700, Linus Torvalds wrote:
>
>
> "Paul E. McKenney" <[email protected]> wrote:
> >
> >It would be much lower risk to make the current code always use softirq
> >if !RCU_BOOST -- last time I attempted the revert, it was quite hairy.
>
> I don't care if it's a real revert or not, but I want the threads gone. Entirely. Not just the patch that uses softirqs for some things, and threads for the callbacks. No, I don't want the threads to show up or exist at all.
>
> And to be sure, I'd like the code to set up and use the threads to actually compile away statically, so that there clearly isn't some way it's partially enabled.

Yes, the kthread creation will happen only if RCU_BOOST=y. Otherwise,
there will be no RCU kthreads at all.

Thanx, Paul

2011-06-15 20:55:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Ingo Molnar <[email protected]> wrote:
>
>perf record -g will go a long way towards such a tool already - but i
>think it would be useful to create a more top-alike view as well.

perf record -g doesn't help when the issue is that we're recording a single process and the actual work is being done in another unrelated process that is just being woken up.

Sure, you can do a system wide recording, but that shows all kinds of unrelated noise and requires root permissions.

So those rcu threads really need to go.

Linus

2011-06-15 20:57:31

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> 7.44% exim [kernel.kallsyms] [k] format_decode
> |
> --- format_decode


This is a glibc issue. exim calls libdb, and libdb asks sysconf for the
number of CPUs to tune its locking, and glibc reads /proc/stat. And
/proc/stat is incredibly slow.

I would blame glibc, but in this case it's really the kernel to blame
for not providing a proper interface.

This was my motivation for the sysconf() syscall I submitted some time ago.
https://lkml.org/lkml/2011/5/13/455

Anyways a quick workaround is to use this LD_PRELOAD:
http://halobates.de/smallsrc/sysconf.c
But it's not 100% equivalent.
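
For illustration only, a minimal sketch of what such an LD_PRELOAD override might look like (assumptions: it keys off _SC_NPROCESSORS_ONLN and uses sched_getaffinity() as the fast path; this is not the actual sysconf.c from the URL above):

/* Hedged sketch, not Andi's sysconf.c: answer _SC_NPROCESSORS_ONLN
 * from sched_getaffinity() and fall through to the real sysconf()
 * for everything else.
 *
 *   gcc -shared -fPIC -o sysconf.so sysconf-sketch.c -ldl
 *   LD_PRELOAD=./sysconf.so exim ...
 */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <dlfcn.h>

long sysconf(int name)
{
	static long (*real_sysconf)(int);

	if (name == _SC_NPROCESSORS_ONLN) {
		cpu_set_t set;

		/* Count the CPUs in our affinity mask; cheap, but can
		 * differ from the true online count when affinity is
		 * restricted -- hence "not 100% equivalent". */
		if (sched_getaffinity(0, sizeof(set), &set) == 0)
			return CPU_COUNT(&set);
	}
	if (!real_sysconf)
		real_sysconf = dlsym(RTLD_NEXT, "sysconf");
	return real_sysconf(name);
}

Because it counts CPUs in the affinity mask rather than parsing /proc/stat, the result can differ when affinity is restricted, which is one reason such a preload is not fully equivalent.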

-Andi


2011-06-15 21:06:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Ingo Molnar <[email protected]> wrote:
>
>I have this fix queued up currently:
>
> 09223371deac: rcu: Use softirq to address performance regression

I really don't think that is even close to enough.

It still does all the callbacks in the threads, and according to Peter, about half the rcu time in the threads remained..

Linus

2011-06-15 21:11:42

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 13:57 -0700, Andi Kleen wrote:
> > 7.44% exim [kernel.kallsyms] [k] format_decode
> > |
> > --- format_decode
>
>
> This is a glibc issue. exim calls libdb, and libdb asks sysconf for the
> number of CPUs to tune its locking, and glibc reads /proc/stat. And
> /proc/stat is incredibly slow.
>
> I would blame glibc, but in this case it's really the kernel to blame
> for not providing a proper interface.
>
> This was my motivation for the sysconf() syscall I submitted some time ago.
> https://lkml.org/lkml/2011/5/13/455
>
> Anyways a quick workaround is to use this LD_PRELOAD:
> http://halobates.de/smallsrc/sysconf.c
> But it's not 100% equivalent.
>

Thanks to Andi for providing the info. We've used this workaround in
our testing so it will not mask true kernel scaling bottlenecks.

Tim

2011-06-15 21:15:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 02:05:46PM -0700, Linus Torvalds wrote:
>
>
> Ingo Molnar <[email protected]> wrote:
> >
> >I have this fix queued up currently:
> >
> > 09223371deac: rcu: Use softirq to address performance regression
>
> I really don't think that is even close to enough.
>
> It still does all the callbacks in the threads, and according to Peter, about half the rcu time in the threads remained..

I am putting together a patch that gets rid of the kthreads in the
!RCU_BOOST case. The time will still be consumed, but in softirq
context, though of course with many fewer context switches.

Thanx, Paul

2011-06-15 21:28:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



"Paul E. McKenney" <[email protected]> wrote:
>
> The time will still be consumed, but in softirq
>context, though of course with many fewer context switches.

So the problem with threads goes way beyond just the context switches, or even just the problem with tracing.

Threads change the batching behavior, for example. That is *especially* true with background threads or with realtime threads. Both end up having high priority -either because they are realtime, or because they've been sleeping and thus have been building up extra priority that way.

So when you wake up such a thread, suddenly you get preemption behaviour, or you get the semaphores deciding to break out of their busy loops due to need_resched being set etc.

In contrast, softirqs don't have those kinds of side effects. They have a much smaller effect on system behavior and just run the code we ask them to.

Linus

2011-06-15 21:38:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 14:12 -0700, Tim Chen wrote:
> Thanks to Andi for providing the info. We've used this workaround in
> our testing so it will not mask true kernel scaling bottlenecks.


http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2

tip+sirq+linus is still slightly faster than .39 here, although removing
that sysconf() wreckage closed the gap considerably (needing to know the
number of cpus to optimize locking sounds like a trainwreck all of its
own, needing it _that_ often instead of just once at startup is even
worse).

2011-06-15 21:52:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex



Peter Zijlstra <[email protected]> wrote:
>removing
>that sysconf() wreckage closed the gap considerably (needing to know
>the
>number of cpus to optimize locking sounds like a trainwreck all of its
>own, needing it _that_ often instead of just once at startup is even
>worse).

Yeah, I think it's ridiculous to say that glibc is not doing something stupid and that it's a problem with kernel interfaces.

Do the proc file parsing once and cache the result in a static variable. Doing it over and over again is just crazy.

Adding new system calls because glibc is crazy is insane.

Linus

2011-06-15 22:15:39

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> (needing to know the
> number of cpus to optimize locking sounds like a trainwreck all of its
> own, needing it _that_ often instead of just once at startup is even
> worse).

libdb does it only once per startup, it's just that it gets restarted
for every child (it's a library, not a server).

Really the kernel just needs to provide a faster way to get that.
Requiring /proc/stat for that is just insane.

I'll resend the sysconf patchkit I guess :-)

-Andi

2011-06-15 22:19:15

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> Yeah, I think it's ridiculous to say that glibc is not doing something stupid and that it's a problem with kernel interfaces.
>
> Do the proc file parsing once and cache the result in a static variable. Doing it over and over again is just crazy.
>
> Adding new system calls because glibc is crazy is insane.

Caching doesn't help because the library gets reinitialized in every
child (it may already do caching, not fully sure about this; it does it
for other sysconfs at least).

I don't think glibc is crazy in this. It has no other choice.

-Andi

2011-06-16 00:17:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 3:19 PM, Andi Kleen <[email protected]> wrote:
>
> Caching doesn't help because the library gets reinitialized in every child
> (it may already do caching, not fully sure for this; it does it for other
> sysconfs at least)

Why the hell do you continue to make excuses for glibc that are
*clearly*not*true*?

Stop this insanity, Andi. Do you realize that this kind of crazy
behavior just makes me convinced that there is no way in hell I should
*ever* take your sysconf patch, since all your analysis for it is
totally worthless?

JUST LOOK AT THE NUMBERS, for chrissake!

When format_decode is 7% of the whole workload, and the top 15
functions of the profile look like this:

6.40% exim [kernel.kallsyms] [k] format_decode
5.26% exim [kernel.kallsyms] [k] page_fault
5.05% exim [kernel.kallsyms] [k] vsnprintf
3.55% exim [kernel.kallsyms] [k] number
3.00% exim [kernel.kallsyms] [k] copy_page_c
2.88% exim [kernel.kallsyms] [k] read_hpet
2.38% exim libc-2.13.90.so [.] __GI_vfprintf
1.92% exim [kernel.kallsyms] [k] kstat_irqs
1.53% exim [kernel.kallsyms] [k] find_vma
1.47% exim [kernel.kallsyms] [k] _raw_spin_lock
1.40% exim [kernel.kallsyms] [k] seq_printf
1.34% exim [kernel.kallsyms] [k] radix_tree_lookup
1.21% exim [kernel.kallsyms] [k] page_cache_get_speculative
1.20% exim [kernel.kallsyms] [k] clear_page_c
1.05% exim [kernel.kallsyms] [k] do_page_fault

I can pretty much guarantee that it doesn't do just one /proc/stat
read per fork() just to get the number of CPU's.

/proc/stat may be slow, but it's not slower than doing real work -
unless you call it millions of times.

And you didn't actually look at glibc sources, did you? Because if you
had, you would ALSO have seen that you are totally full of sh*t. Glibc
at no point caches anything.

So repeat after me: stop making excuses and lying about glibc. It's
crap. End of story.

> I don't think glibc is crazy in this. It has no other choice.

Stop this insanity, Andi. Why do you lie or just make up arguments? WHY?

There is very clearly no caching going on. And since exim doesn't even
execve, it just forks, it's very clear that it could cache things just
ONCE, so your argument that caching wouldn't be possible at that level
is also bogus.

I can certainly agree that /proc/stat isn't wonderful (it used to be
better), but that's no excuse for just totally making up excuses for
just plain bad *stupid* behavior in user space. And it certainly
doesn't excuse just making shit up!

Linus

2011-06-16 01:07:57

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 23:37 +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-15 at 14:12 -0700, Tim Chen wrote:
> > Thanks to Andi for providing the info. We've used this workaround in
> > our testing so it will not mask true kernel scaling bottlenecks.
>
>
> http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
> http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2
>
> tip+sirq+linus is still slightly faster than .39 here, although removing
> that sysconf() wreckage closed the gap considerably (needing to know the
> number of cpus to optimize locking sounds like a trainwreck all of its
> own, needing it _that_ often instead of just once at startup is even
> worse).
>

Peter,

Fengguang's readahead fixes for tmpfs removed another bottleneck before
anon_vma->lock became dominant (https://lkml.org/lkml/2011/4/26/143).
We found this issue when we were testing exim earlier.
The patchset was merged in 3.0-rc2 but not in plain 2.6.39, so with it
applied on 2.6.39 we should get a better comparison with 3.0-rc2.

Thanks.

Tim

2011-06-16 01:50:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, Jun 15, 2011 at 2:37 PM, Peter Zijlstra <[email protected]> wrote:
>
> http://programming.kicks-ass.net/sekrit/39-2.txt.bz2
> http://programming.kicks-ass.net/sekrit/tip-2.txt.bz2
>
> tip+sirq+linus is still slightly faster than .39 here,

Hmm. Your profile doesn't show the mutex slowpath at all, so there's a
big difference to the one Tim quoted parts of.

In fact, your profile looks fine. The load clearly spends tons of time
in page faulting and in timing things (that read_hpet thing is
disgusting), but with that in mind, the profile doesn't look scary.
Yes, the 2% spinlock time is bad, but you've clearly not hit the real
lock contention case. The mutex lock shows up, but _way_ below the
spinlock, and the slowpath never shows at all. You end up having
mutex_spin_on_owner at 0.09%, it's not really visible.

Clearly going from your two-socket 12-core thing to Tim's four-socket
40-core case is a big jump. But maybe it really was about RCU, and
even the limited softirq patch that moves the grace period stuff etc
back to softirqs ends up helping.

Tim, have you tried running your bigger load with that patch? You
could try my patch on top too just to match Peter's tree, but I doubt
that's the big first-order issue.

Linus

2011-06-16 07:04:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Linus Torvalds <[email protected]> wrote:

>
>
> Ingo Molnar <[email protected]> wrote:
> >
> > I have this fix queued up currently:
> >
> > 09223371deac: rcu: Use softirq to address performance regression
>
> I really don't think that is even close to enough.

Yeah.

> It still does all the callbacks in the threads, and according to
> Peter, about half the rcu time in the threads remained..

You are right - things that are a few percent on a 24 core machine
will definitely go exponentially worse on larger boxen. We'll get rid
of the kthreads entirely.

The funny thing about this workload is that context-switches are
really a fastpath here and we are using anonymous IRQ-triggered
softirqs embedded in random task contexts as a workaround for that.

[ I think we'll have to revisit this issue and do it properly:
quiescent state is mostly defined by context-switches here, so we
could do the RCU callbacks from the task that turns a CPU
quiescent, right in the scheduler context-switch path - perhaps
with an option for SCHED_FIFO tasks to *not* do GC.

That could possibly be more cache-efficient than softirq execution,
as we'll process a still-hot pool of callbacks instead of doing
them only once per timer tick. It will also make the RCU GC
behavior HZ independent. ]

In any case the proxy kthread model clearly sucked, no argument about
that.

Thanks,

Ingo

2011-06-16 17:16:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 09:03:35AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> >
> >
> > Ingo Molnar <[email protected]> wrote:
> > >
> > > I have this fix queued up currently:
> > >
> > > 09223371deac: rcu: Use softirq to address performance regression
> >
> > I really don't think that is even close to enough.
>
> Yeah.
>
> > It still does all the callbacks in the threads, and according to
> > Peter, about half the rcu time in the threads remained..
>
> You are right - things that are a few percent on a 24 core machine
> will definitely go exponentially worse on larger boxen. We'll get rid
> of the kthreads entirely.

I did indeed at one time have access to larger test systems than I
do now, and I clearly need to fix that. :-/

> The funny thing about this workload is that context-switches are
> really a fastpath here and we are using anonymous IRQ-triggered
> softirqs embedded in random task contexts as a workaround for that.

The other thing that the IRQ-triggered softirqs do is to get the callbacks
invoked in cases where a CPU-bound user thread is never context switching.
Of course, one alternative might be to set_need_resched() to force entry
into the scheduler as needed.

> [ I think we'll have to revisit this issue and do it properly:
> quiescent state is mostly defined by context-switches here, so we
> could do the RCU callbacks from the task that turns a CPU
> quiescent, right in the scheduler context-switch path - perhaps
> with an option for SCHED_FIFO tasks to *not* do GC.

I considered this approach for TINY_RCU, but dropped it in favor of
reducing the interlocking between the scheduler and RCU callbacks.
Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU callback
invocation, then there will need to be some override for CPUs with lots
of SCHED_FIFO load, probably similar to RCU's current blimit stuff.

> That could possibly be more cache-efficient than softirq execution,
> as we'll process a still-hot pool of callbacks instead of doing
> them only once per timer tick. It will also make the RCU GC
> behavior HZ independent. ]

Well, the callbacks will normally be cache-cold in any case due to the
grace-period delay, but on the other hand, both tick-independence and
the ability to shield a given CPU from RCU callback execution might be
quite useful. The tick currently does the following for RCU:

1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
quiescent state).

2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
rcu_bh quiescent state).

3. Kicks the current CPU's RCU core processing as needed in
response to actions from other CPUs.

Frederic's work avoiding ticks in long-running user-mode tasks
might take care of #1, and it should be possible to make use of
the current dyntick-idle APIs to deal with #2. Replacing #3
efficiently will take some thought.

> In any case the proxy kthread model clearly sucked, no argument about
> that.

Indeed, I lost track of the global nature of real-time scheduling. :-(

Whatever does the boosting will need to have process context and
can be subject to delays, so that pretty much needs to be a kthread.
But it will context-switch quite rarely, so should not be a problem.

Thanx, Paul

2011-06-16 20:15:32

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> /proc/stat may be slow, but it's not slower than doing real work -
> unless you call it millions of times.


I haven't analyzed it in detail, but I suspect it's some cache line
bounce, which can slow things down quite a lot. Also the total number
of invocations is quite high (hundreds of messages per core * 32 cores).

Ok, even with cache line bouncing it's suspicious.

> And you didn't actually look at glibc sources, did you?

I did, but I gave up fully following that code path because it's so
convoluted :-/

Ok if you want I can implement caching in the LD_PRELOAD and see
if it changes things.

> There is very clearly no caching going on. And since exim doesn't even
> execve, it just forks, it's very clear that it could cache things just
> ONCE, so your argument that caching wouldn't be possible at that level
> is also bogus.

So you mean caching it at startup time? Otherwise the parent would
need to do sysconf() at least, which it doesn't do (the exim source
doesn't really know anything about libdb internals).

That would add /proc/stat overhead to every program execution. Is that
what you are proposing?

-Andi

2011-06-16 20:27:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Paul E. McKenney <[email protected]> wrote:

> > The funny thing about this workload is that context-switches are
> > really a fastpath here and we are using anonymous IRQ-triggered
> > softirqs embedded in random task contexts as a workaround for
> > that.
>
> The other thing that the IRQ-triggered softirqs do is to get the
> callbacks invoked in cases where a CPU-bound user thread is never
> context switching.

Yeah - but this workload didnt have that.

> Of course, one alternative might be to set_need_resched() to force
> entry into the scheduler as needed.

No need for that: we can just do the callback not in softirq but in
regular syscall context in that case, in the return-to-userspace
notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
facility)
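
For reference, a rough sketch of how that facility is wired up (the register/callback plumbing follows include/linux/user-return-notifier.h; the callback body is a hypothetical placeholder, not an actual RCU hook):

/* Sketch only: hook into the return-to-userspace path via
 * USER_RETURN_NOTIFIER. The callback body is hypothetical. */
#include <linux/user-return-notifier.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct user_return_notifier, rcu_urn);

static void rcu_user_return(struct user_return_notifier *urn)
{
	/* Hypothetical: drive this CPU's RCU core processing here,
	 * instead of raising RCU_SOFTIRQ from the tick. */
}

/* Must be called with preemption disabled; this sets
 * TIF_USER_RETURN_NOTIFY so the callback fires on the next
 * return to user space on this CPU. */
static void rcu_arm_user_return(void)
{
	struct user_return_notifier *urn = &__get_cpu_var(rcu_urn);

	urn->on_user_return = rcu_user_return;
	user_return_notifier_register(urn);
}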

Abusing a facility like setting need_resched artificially will
generally cause trouble.

> > [ I think we'll have to revisit this issue and do it properly:
> > quiescent state is mostly defined by context-switches here, so we
> > could do the RCU callbacks from the task that turns a CPU
> > quiescent, right in the scheduler context-switch path - perhaps
> > with an option for SCHED_FIFO tasks to *not* do GC.
>
> I considered this approach for TINY_RCU, but dropped it in favor of
> reducing the interlocking between the scheduler and RCU callbacks.
> Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU
> callback invocation, then there will need to be some override for
> CPUs with lots of SCHED_FIFO load, probably similar to RCU's
> current blimit stuff.

I wouldnt complicate it much for SCHED_FIFO: SCHED_FIFO tasks are
special and should never run long.

> > That could possibly be more cache-efficient than softirq execution,
> > as we'll process a still-hot pool of callbacks instead of doing
> > them only once per timer tick. It will also make the RCU GC
> > behavior HZ independent. ]
>
> Well, the callbacks will normally be cache-cold in any case due to
> the grace-period delay, [...]

The workloads that are the most critical in this regard tend to be
context switch intense, so the grace period expiry latency should be
pretty short.

Or at least significantly shorter than today's HZ frequency, right?
HZ would still provide an upper bound for the latency.

Btw., the current worst-case grace period latency is in reality more
like two timer ticks: one for the current CPU to expire and another
for the longest "other CPU" expiry, right? Average expiry (for
IRQ-poor workloads) would be 1.5 timer ticks. (if i got my stat
calculations right!)

> [...] but on the other hand, both tick-independence and the ability
> to shield a given CPU from RCU callback execution might be quite
> useful. [...]

Yeah.

> [...] The tick currently does the following for RCU:
>
> 1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
> quiescent state).
>
> 2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
> rcu_bh quiescent state).
>
> 3. Kicks the current CPU's RCU core processing as needed in
> response to actions from other CPUs.
>
> Frederic's work avoiding ticks in long-running user-mode tasks
> might take care of #1, and it should be possible to make use of the
> current dyntick-idle APIs to deal with #2. Replacing #3
> efficiently will take some thought.

What is the longest delay the scheduler tick can take typically - 40
msecs? That would then be the worst-case grace period latency for
workloads that neither do context switches nor trigger IRQs, right?

> > In any case the proxy kthread model clearly sucked, no argument
> > about that.
>
> Indeed, I lost track of the global nature of real-time scheduling.
> :-(

Btw., i think that test was pretty bad: running exim as SCHED_FIFO??

But it does not excuse the kthread model.

> Whatever does the boosting will need to have process context and
> can be subject to delays, so that pretty much needs to be a
> kthread. But it will context-switch quite rarely, so should not be
> a problem.

So user-return notifiers ought to be the ideal platform for that,
right? We don't even have to touch the scheduler: anything that
schedules will eventually return to user-space, at which point the
RCU GC magic can run.

And user-return-notifiers can be triggered from IRQs as well.

That allows us to get rid of softirqs altogether and maybe even speed
the whole thing up and allow it to be isolated better.

Thanks,

Ingo

2011-06-16 20:25:42

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Wed, 2011-06-15 at 18:50 -0700, Linus Torvalds wrote:

>
> Tim, have you tried running your bigger load with that patch? You
> could try my patch on top too just to match Peter's tree, but I doubt
> that's the big first-order issue.
>
> Linus

I ran exim with different kernel versions. Using the 2.6.39-vanilla
kernel as a baseline, the results are as follows:

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%) (note: tmpfs readahead patchset is merged in 3.0-rc2)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus 115.7% (+15.7%)
3.0-rc2+linus+softirq 86.2% (-17.3%)

So Linus' patch certainly helped things over vanilla 3.0-rc2, but throughput is still
less than 2.6.39 with the readahead patch set. The softirq patch I used was Ingo's
combined patch from Shaohua and Paul. It seems odd that it makes things worse. I will
recheck this data later, probably with just that patch and without Linus' patch.

I also notice that the run to run variations have increased quite a bit for 3.0-rc2.
I'm using 6 runs per kernel. Perhaps a side effect of converting the anon_vma->lock to mutex?

(Max-Min)/avg
2.6.39(vanilla) 3%
2.6.39+ra-patch 3%
3.0-rc2(vanilla) 20%
3.0-rc2+linus 36%
3.0-rc2+linus+softirq 40%

Thanks.

Tim

2011-06-16 21:04:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:14 PM, Andi Kleen <[email protected]> wrote:
>
> I haven't analyzed it in detail, but I suspect it's some cache line
> bounce, which can slow things down quite a lot. Also the total number
> of invocations is quite high (hundreds of messages per core * 32 cores).

The fact is, glibc is just total crap.

I tried to send uli a patch to just add caching. No go. I sent
*another* patch to at least make glibc use a sane interface (and the
cache if it needs to fall back on /proc/stat for some legacy reason).
We'll see what happens.

Paul Eggert suggested "caching for one second" - by just calling
"gettimeofday()" to see how old the cache is. That would work too.

The point I'm making is that it really is a glibc problem. Glibc is
doing stupid expensive things, and not trying to correct for the fact
that it's expensive.

> I did, but I gave up fully following that code path because it's so
> convoluted :-/

I do agree that glibc sources are incomprehensible, with multiple
layers of abstraction (sysdeps, "posix", helper functions etc etc).

In this case it was really trivial to find the culprit with a simple

git grep /proc/stat

though. The code is crap. It's insane. It's using
/sys/devices/system/cpu for _SC_NPROCESSORS_CONF, which is at least a
reasonable interface to use. But it does it in odd ways, and actually
counts the CPU's by doing a readdir call. And it doesn't cache the
result, even though that particular result had better be 100% stable -
it has nothing to do with "online" vs "offline" etc.

But then for _SC_NPROCESSORS_ONLN, it doesn't actually use
/sys/devices/system/cpu at all, but the /proc/stat interface. Which is
slow, mostly because it has all the crazy interrupt stuff in it, but
also because it has lots of legacy stuff.

I wrote a _much_ cleaner routine (loosely based on what we do in
tools/perf) to just parse /sys/devices/system/cpu/online. I didn't
even time it, but I can almost guarantee that it's an order of
magnitude faster than /proc/stat. And if that doesn't work, you can
fall back on a cached version of the /proc/stat parsing, since if
those files don't exist, you can forget about CPU hotplug.
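
To make that concrete, a sketch of such a parser (assuming the usual range-list format, e.g. "0-23" or "0-3,8-11"; this is not the routine that was actually sent to glibc):

/* Sketch: count online CPUs from /sys/devices/system/cpu/online,
 * which holds a comma-separated list of "N" or "N-M" ranges. */
#include <stdio.h>
#include <stdlib.h>

static int count_online_cpus(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/online", "r");
	char buf[256];
	int count = 0;

	if (!f)
		return -1;	/* caller falls back to /proc/stat */
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Walk the comma-separated list of "N" or "N-M" ranges. */
	for (char *p = buf; *p && *p != '\n'; ) {
		char *end;
		long start = strtol(p, &end, 10);
		long stop = start;

		if (*end == '-')
			stop = strtol(end + 1, &end, 10);
		count += stop - start + 1;
		p = (*end == ',') ? end + 1 : end;
	}
	return count;
}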

> So you mean caching it at startup time? Otherwise the parent would
> need to do sysconf() at least , which it doesn't do (the exim source doesn't
> really know anything about libdb internals)

Even if you do it in the children, it will help. At least it would be
run just _once_ per fork.

But actually looking at glibc just shows that they are simply doing
stupid things. And I absolutely _refuse_ to add new interfaces to the
kernel only because glibc is being a moron.

Linus

2011-06-16 21:10:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:26 PM, Tim Chen <[email protected]> wrote:
>
> I ran exim with different kernel versions.  Using the 2.6.39-vanilla
> kernel as a baseline, the results are as follows:
>
>                         Throughput
> 2.6.39(vanilla)         100.0%
> 2.6.39+ra-patch         166.7%  (+66.7%)   (note: tmpfs readahead patchset is merged in 3.0-rc2)
> 3.0-rc2(vanilla)         68.0%  (-32%)
> 3.0-rc2+linus           115.7%  (+15.7%)
> 3.0-rc2+linus+softirq    86.2%  (-17.3%)

Ok, so batching the semaphore operations makes more of a difference
than I would have expected.

I guess I'll cook up an improved patch that does it for the vma exit
case too, and see if that just makes the semaphores be a non-issue.

> I also notice that the run to run variations have increased quite a bit for 3.0-rc2.
> I'm using 6 runs per kernel.  Perhaps a side effect of converting the anon_vma->lock to mutex?

So the thing about using the mutexes is that heavy contention on a
spinlock is very stable: it may be *slow*, but it's reliable, nicely
queued, and has very few surprises.

On a mutex, heavy contention results in very subtle behavior, with the
adaptive spinning often - but certainly not always - making the mutex
act as a spinlock, but once you have lots of contention the adaptive
spinning breaks down. And then you have lots of random interactions
with the scheduler and 'need_resched' etc.

The only valid answer to lock contention is invariably always just
"don't do that then". We've been pretty good at getting rid of
problematic locks, but this one clearly isn't one of the ones we've
fixed ;)

Linus

2011-06-16 21:01:36

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > The funny thing about this workload is that context-switches are
> > > really a fastpath here and we are using anonymous IRQ-triggered
> > > softirqs embedded in random task contexts as a workaround for
> > > that.
> >
> > The other thing that the IRQ-triggered softirqs do is to get the
> > callbacks invoked in cases where a CPU-bound user thread is never
> > context switching.
>
> Yeah - but this workload didnt have that.
>
> > Of course, one alternative might be to set_need_resched() to force
> > entry into the scheduler as needed.
>
> No need for that: we can just do the callback not in softirq but in
> regular syscall context in that case, in the return-to-userspace
> notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> facility)
>
> Abusing a facility like setting need_resched artificially will
> generally cause trouble.

If the task enqueued callbacks in the kernel, thus started a new grace period,
it might return to userspace before every CPUs have completed that grace period,
and you need that full completion to happen before invoking the callbacks.

I think you need to keep the tick in such case because you can't count on
the other CPUs to handle that completion as they may be all idle.

So when you resume to userspace and you started a GP, either you find another
CPU to handle the GP completion and callbacks executions, or you keep the tick
until you are done.

2011-06-16 21:03:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.

That's not necessarily true. Consider a router which only routes and never runs
user space. You would starve it. Granted, it's somewhat obscure, but it's possible.

-Andi
--
[email protected] -- Speaking for myself only

2011-06-16 21:28:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 1:47 PM, Linus Torvalds
<[email protected]> wrote:
>
> I guess I'll cook up an improved patch that does it for the vma exit
> case too, and see if that just makes the semaphores be a non-issue.

Ok, I bet it doesn't make them a non-issue, but if doing this in
anon_vma_clone() helped a lot, then doing the exact same pattern in
unlink_anon_vmas() hopefully helps some more.

This patch is UNTESTED! It replaces my previous one (it's really just
an extension of it), and while I actually test-booted that previous
one I did *not* do it for this one. So please look out. But it's using
the exact same pattern, so there should be no real surprises.

Does it improve things further on your load?

(Btw, I'm not at all certain about that "we can get an empty
anon_vma_chain" comment. I left it - and the test for a NULL anon_vma
- in the code, but I think it's bogus. If we've linked in the
anon_vma_chain, it will have an anon_vma associated with it, I'm
pretty sure)

VM people, please do comment on both that "empty anon_vma_chain"
issue, and on whether we can ever have two different anon_vma roots in
the 'same_vma' list. I have that WARN_ON_ONCE() there in both paths, I
just wonder whether we should just inconditionally take the first
entry in the list and lock it outside the whole loop instead?

Peter? Hugh?

Linus

2011-06-16 22:10:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:05 PM, Linus Torvalds
<[email protected]> wrote:
>
> This patch is UNTESTED!

It was also UNATTACHED!

Now it's attached.

Linus


Attachments:
patch.diff (3.35 kB)

2011-06-16 21:33:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:06 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jun 16, 2011 at 2:05 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> This patch is UNTESTED!
>
> It was also UNATTACHED!

Hmm. And it doesn't work. We deadlock when we free the anon_vma
because the *freeing* path wants to take the anon_vma lock. See that
horrid code in anon_vma_free().

So now we hold the root lock over the whole series of frees, and get an
instant deadlock.

We also can happen to free the root anon_vma before we release the
lock in it, which is another slight problem ;)

So the unlink_anon_vmas() case is actually much more complicated than
the clone case.

In other words, just forget that second patch. I'll have to think about it.

Linus

2011-06-16 22:00:41

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> Ok, I bet it doesn't make them a non-issue, but if doing this in
> anon_vma_clone() helped a lot, then doing the exact same pattern in
> unlink_anon_vmas() hopefully helps some more.
Hi Linus,

I did essentially the same thing (just in a less elegant, more paranoid
way) in my old batching patches:

http://comments.gmane.org/gmane.linux.kernel.mm/63130

They made it a bit better, but overall it's still much worse than 2.6.35
(.36 introduced the root chain locking). I guess we'll see whether it
can recover the mutex regression, but I'm not optimistic.

-Andi

2011-06-16 22:24:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 22:25 +0200, Ingo Molnar wrote:

> > Whatever does the boosting will need to have process context and
> > can be subject to delays, so that pretty much needs to be a
> > kthread. But it will context-switch quite rarely, so should not be
> > a problem.
>
> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.
>
> And user-return-notifiers can be triggered from IRQs as well.
>
> That allows us to get rid of softirqs altogether and maybe even speed
> the whole thing up and allow it to be isolated better.

I'm a little worried of relying on things returning to userspace.

One could imagine something like a router appliance where userspace is
essentially asleep forever and everything happens in the kernel
(networking via softirq, maybe NFS kernel server, ...)

Cheers,
Ben.

2011-06-16 22:39:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Benjamin Herrenschmidt <[email protected]> wrote:

> On Thu, 2011-06-16 at 22:25 +0200, Ingo Molnar wrote:
>
> > > Whatever does the boosting will need to have process context
> > > and can be subject to delays, so that pretty much needs to be a
> > > kthread. But it will context-switch quite rarely, so should not
> > > be a problem.
> >
> > So user-return notifiers ought to be the ideal platform for that,
> > right? We don't even have to touch the scheduler: anything that
> > schedules will eventually return to user-space, at which point
> > the RCU GC magic can run.
> >
> > And user-return-notifiers can be triggered from IRQs as well.
> >
> > That allows us to get rid of softirqs altogether and maybe even
> > speed the whole thing up and allow it to be isolated better.
>
> I'm a little worried of relying on things returning to userspace.
>
> One could imagine something like a router appliance where userspace
> is essentially asleep forever and everything happens in the kernel
> (networking via softirq, maybe NFS kernel server, ...)

There's a crazy solution for that: the idle thread could process RCU
callbacks carefully, as if it was running user-space code.

/me runs

Ok, joke aside: this is simply a special case where the idle thread
generates RCU work via hardirqs. The idle thread is arguably special
and could be handled in a special way: a helper thread that executes
only in this case?

Thanks,

Ingo

2011-06-16 22:47:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> There's a crazy solution for that: the idle thread could process RCU
> callbacks carefully, as if it was running user-space code.

In Ben's kernel NFS server case the system may not be idle.

-Andi

2011-06-16 22:59:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > There's a crazy solution for that: the idle thread could process
> > RCU callbacks carefully, as if it was running user-space code.
>
> In Ben's kernel NFS server case the system may not be idle.

An always-100%-busy NFS server is very unlikely, but even in the
hypothetical case a kernel NFS server is really performing system
calls from a kernel thread in essence. If it doesn't do it explicitly
then its main loop can easily include a "check RCU callbacks" call.

Thanks,

Ingo

2011-06-16 23:03:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Frederic Weisbecker <[email protected]> wrote:

> On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > > The funny thing about this workload is that context-switches are
> > > > really a fastpath here and we are using anonymous IRQ-triggered
> > > > softirqs embedded in random task contexts as a workaround for
> > > > that.
> > >
> > > The other thing that the IRQ-triggered softirqs do is to get the
> > > callbacks invoked in cases where a CPU-bound user thread is never
> > > context switching.
> >
> > Yeah - but this workload didnt have that.
> >
> > > Of course, one alternative might be to set_need_resched() to force
> > > entry into the scheduler as needed.
> >
> > No need for that: we can just do the callback not in softirq but in
> > regular syscall context in that case, in the return-to-userspace
> > notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> > facility)
> >
> > Abusing a facility like setting need_resched artificially will
> > generally cause trouble.
>
> If the task enqueued callbacks in the kernel, thus started a new
> grace period, it might return to userspace before every CPUs have
> completed that grace period, and you need that full completion to
> happen before invoking the callbacks.
>
> I think you need to keep the tick in such case because you can't
> count on the other CPUs to handle that completion as they may be
> all idle.
>
> So when you resume to userspace and you started a GP, either you
> find another CPU to handle the GP completion and callbacks
> executions, or you keep the tick until you are done.

We'll have a scheduler tick in any case, which will act as a
worst-case RCU tick.

My main point is that we need to check whether this solution improves
performance over the current softirq code. I think there's a real
chance that it improves things like VFS workloads, because it
provides (much!) lower grace period latencies hence provides
fundamentally better cache locality.

If a workload pays the cost of frequent scheduling then it might as
well use a beneficial side-effect of that scheduling: high-freq grace
periods ...

If it improves performance we can figure out all the loose ends. If
it doesnt then the loose ends are not worth worrying about.

Thanks,

Ingo

2011-06-16 23:37:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > The funny thing about this workload is that context-switches are
> > > really a fastpath here and we are using anonymous IRQ-triggered
> > > softirqs embedded in random task contexts as a workaround for
> > > that.
> >
> > The other thing that the IRQ-triggered softirqs do is to get the
> > callbacks invoked in cases where a CPU-bound user thread is never
> > context switching.
>
> Yeah - but this workload didnt have that.

Ah, understood -- I was thinking about the general case as well as this
particular workload.

> > Of course, one alternative might be to set_need_resched() to force
> > entry into the scheduler as needed.
>
> No need for that: we can just do the callback not in softirq but in
> regular syscall context in that case, in the return-to-userspace
> notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> facility)

Good point, I should add this -- there was a similar RCU hook into
the trap and syscall paths in DYNIX/ptx. At first glance, this looks
x86-specific -- or am I missing something?

> Abusing a facility like setting need_resched artificially will
> generally cause trouble.

Yep, from an RCU viewpoint, it would be best to use it only to force a
quiescent state.

> > > [ I think we'll have to revisit this issue and do it properly:
> > > quiescent state is mostly defined by context-switches here, so we
> > > could do the RCU callbacks from the task that turns a CPU
> > > quiescent, right in the scheduler context-switch path - perhaps
> > > with an option for SCHED_FIFO tasks to *not* do GC.
> >
> > I considered this approach for TINY_RCU, but dropped it in favor of
> > reducing the interlocking between the scheduler and RCU callbacks.
> > Might be worth revisiting, though. If SCHED_FIFO tasks omit RCU
> > callback invocation, then there will need to be some override for
> > CPUs with lots of SCHED_FIFO load, probably similar to RCU's
> > current blimit stuff.
>
> I wouldnt complicate it much for SCHED_FIFO: SCHED_FIFO tasks are
> special and should never run long.

Agreed, but if someone creates a badly behaved SCHED_FIFO task, RCU
still needs to make forward progress. Otherwise, a user task can OOM
the system.

> > > That could possibly be more cache-efficient than softirq execution,
> > > as we'll process a still-hot pool of callbacks instead of doing
> > > them only once per timer tick. It will also make the RCU GC
> > > behavior HZ independent. ]
> >
> > Well, the callbacks will normally be cache-cold in any case due to
> > the grace-period delay, [...]
>
> The workloads that are the most critical in this regard tend to be
> context switch intense, so the grace period expiry latency should be
> pretty short.

And TINY_RCU does in fact do core RCU processing at context-switch
time from within rcu_preempt_note_context_switch(), which is called
from rcu_note_context_switch(). I never tried that in TREE_RCU because
of the heavier weight of the processing. But it is easy to test --
six-line change, see patch below. (Untested, probably doesn't compile.)

> Or at least significantly shorter than today's HZ frequency, right?
> HZ would still provide an upper bound for the latency.

Yes, this should shorten the average grace-period length, at some cost
in overhead. (Probably a -lot- less than the RT kthread change, but hey!)

> Btw., the current worst-case grace period latency is in reality more
> like two timer ticks: one for the current CPU to expire and another
> for the longest "other CPU" expiry, right? Average expiry (for
> IRQ-poor workloads) would be 1.5 timer ticks. (if i got my stat
> calculations right!)

Yep, typical best-case grace period latency is indeed less than two ticks,
but it depends on exactly what you are measuring. It costs another
tick on the average to get the callbacks invoked (assuming no backlog),
but you can save a tick for back-to-back grace periods.

But this assumes CONFIG_NO_HZ=n with current RCU -- it takes about six
scheduler-clock ticks to detect dyntick-idle CPUs. Certain CPU-hotplug
races can leave RCU confused about whether or not a given CPU is part
of the current grace period, in which case RCU also takes about six
scheduler-clock ticks to get itself unconfused. If a process spends too
much time in the kernel without scheduling, RCU will send it a resched
IPI after about six scheduler-clock ticks. And of course a very
long RCU read-side critical section will extend the grace period, as
it absolutely must.

> > [...] but on the other hand, both tick-independence and the ability
> > to shield a given CPU from RCU callback execution might be quite
> > useful. [...]
>
> Yeah.
>
> > [...] The tick currently does the following for RCU:
> >
> > 1. Informs RCU of user-mode execution (rcu_sched and rcu_bh
> > quiescent state).
> >
> > 2. Informs RCU of non-dyntick idle mode (again, rcu_sched and
> > rcu_bh quiescent state).
> >
> > 3. Kicks the current CPU's RCU core processing as needed in
> > response to actions from other CPUs.
> >
> > Frederic's work avoiding ticks in long-running user-mode tasks
> > might take care of #1, and it should be possible to make use of the
> > current dyntick-idle APIs to deal with #2. Replacing #3
> > efficiently will take some thought.
>
> What is the longest delay the scheduler tick can take typically - 40
> msecs? That would then be the worst-case grace period latency for
> workloads that neither do context switches nor trigger IRQs, right?

40 milliseconds would be 25Hz. I haven't run that myself, though.
The lowest-HZ systems I use regularly are 250HZ, for 4-millisecond
scheduler-tick period. So that gets you a grace-period latency from
6 to about 20 milliseconds, depending on what is happening and exactly
what you are measuring.

> > > In any case the proxy kthread model clearly sucked, no argument
> > > about that.
> >
> > Indeed, I lost track of the global nature of real-time scheduling.
> > :-(
>
> Btw., i think that test was pretty bad: running exim as SCHED_FIFO??
>
> But it does not excuse the kthread model.

"It seemed like a good idea at the time." ;-)

> > Whatever does the boosting will need to have process context and
> > can be subject to delays, so that pretty much needs to be a
> > kthread. But it will context-switch quite rarely, so should not be
> > a problem.
>
> So user-return notifiers ought to be the ideal platform for that,
> right? We don't even have to touch the scheduler: anything that
> schedules will eventually return to user-space, at which point the
> RCU GC magic can run.

The RCU-bh variant requires some other mechanism, as it must to work
even if some crazy piece of network infrastructure is hit so hard by a
DoS attack that it never executes in user mode at all.

> And user-return-notifiers can be triggered from IRQs as well.
>
> That allows us to get rid of softirqs altogether and maybe even speed
> the whole thing up and allow it to be isolated better.

I am a bit concerned about how this would work with extreme workloads.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 405a5fd..f58ce53 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -159,8 +159,14 @@ void rcu_bh_qs(int cpu)
  */
 void rcu_note_context_switch(int cpu)
 {
+	unsigned long flags;
+
 	rcu_sched_qs(cpu);
 	rcu_preempt_note_context_switch(cpu);
+	local_irq_save(flags);
+	if (rcu_pending(cpu))
+		invoke_rcu_core();
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);

2011-06-17 00:24:09

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


> The fact is, glibc is just total crap.
>
> I tried to send uli a patch to just add caching. No go. I sent
> *another* patch to at least make glibc use a sane interface (and the
> cache if it needs to fall back on /proc/stat for some legacy reason).
> We'll see what happens.

FWIW a rerun with this modified LD_PRELOAD that does caching seems
to have the same performance as the version that does sched_getaffinity.

So you're right. Caching indeed helps and my assumption that the child
would only do it once was incorrect.

The only problem I see with it is that it doesn't handle CPU hotplug,
but Paul's suggestion would fix that too.

> Paul Eggbert suggested "caching for one second" - by just calling
> "gettimtofday()" to see how old the cache is. That would work too.
>

Maybe we need a "standard LD_PRELOAD library to improve glibc" @)

-Andi


Attachments:
sysconf-caching.c (471.00 B)
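
The attached sysconf-caching.c is not reproduced here; as a rough illustration of the caching idea combined with the "refresh about once per second" suggestion, a sketch (assumptions only, it may well differ from the attachment):

/* Sketch only, not the attached sysconf-caching.c: cache the CPU count
 * and refresh it at most about once per second, so hotplug events are
 * eventually noticed without hitting /proc/stat on every sysconf()
 * call. Not thread-safe; just enough to illustrate the idea. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/time.h>
#include <unistd.h>

long sysconf(int name)
{
	static long (*real_sysconf)(int);
	static long cached_ncpus;
	static time_t cached_at;

	if (!real_sysconf)
		real_sysconf = dlsym(RTLD_NEXT, "sysconf");

	if (name == _SC_NPROCESSORS_ONLN) {
		struct timeval now;

		gettimeofday(&now, NULL);
		if (!cached_ncpus || now.tv_sec - cached_at >= 1) {
			/* Slow path (glibc's /proc/stat parse) runs at
			 * most about once per second. */
			cached_ncpus = real_sysconf(name);
			cached_at = now.tv_sec;
		}
		return cached_ncpus;
	}
	return real_sysconf(name);
}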

2011-06-17 00:45:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > There's a crazy solution for that: the idle thread could process
> > > RCU callbacks carefully, as if it was running user-space code.
> >
> > In Ben's kernel NFS server case the system may not be idle.
>
> An always-100%-busy NFS server is very unlikely, but even in the
> hypothetical case a kernel NFS server is really performing system
> calls from a kernel thread in essence. If it doesn't do it explicitly
> then its main loop can easily include a "check RCU callbacks" call.

As long as they make sure to call it in a clean environment: no locks
held and so on. But I am a bit worried about the possibility of someone
forgetting to put one of these where it is needed -- it would work just
fine for most workloads, but could fail only for rare workloads.

That said, invoking RCU core/callback processing from the scheduler
context certainly sounds like an interesting way to speed up grace
periods.

Thanx, Paul

2011-06-17 04:05:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, Jun 16, 2011 at 2:26 PM, Linus Torvalds
<[email protected]> wrote:
>
> So the unlink_anon_vmas() case is actually much more complicated than
> the clone case.
>
> In other words, just forget that second patch. I'll have to think about it.

Ok, I'm still thinking. I have an approach that I think will handle it
fairly cleanly, but that involves walking the same_vma list twice:
once to actually unlink the anon_vma's under the lock, and then a
second pass that does the rest. It should work.

But in the meantime I cleaned up the patch I already sent out a bit,
because the lock/unlock sequence will be the same, so I abstracted it
out a bit and added a couple of comments.

So Tim, I'd like you to test out my first patch (that only does the
anon_vma_clone() case) once again, but now in the cleaned-up version.
Does this patch really make a big improvement for you? If so, this
first step is probably worth doing regardless of the more complicated
second step, but I'd want to really make sure it's ok, and that the
performance improvement you saw is consistent and not a fluke.

Linus


Attachments:
patch.diff (4.02 kB)

2011-06-17 09:14:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > I tried to send uli a patch to just add caching. No go. I sent
> > *another* patch to at least make glibc use a sane interface (and
> > the cache if it needs to fall back on /proc/stat for some legacy
> > reason). We'll see what happens.
>
> FWIW a rerun with this modified LD_PRELOAD that does caching seems
> to have the same performance as the version that does
> sched_getaffinity.
>
> So you're right. Caching indeed helps and my assumption that the
> child would only do it once was incorrect.

You should have known that your assumption was wrong not just from a
quick look at the strace output or a quick look at the glibc sources,
but also because i pointed out the caching angle to you in the
sysconf() discussion:

http://lkml.org/lkml/2011/5/14/9

repeatedly:

http://lkml.org/lkml/2011/5/17/149

and Denys Vlasenko pointed out the caching angle as well:

http://lkml.org/lkml/2011/5/17/183

But you kept pushing for your new syscall for upstream integration,
ignoring all contrary evidence and ignoring all contrary feedback,
without even *once* checking where and how it would integrate into
glibc ...

Thanks,

Ingo

2011-06-17 09:44:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Paul E. McKenney <[email protected]> wrote:

> On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
> >
> > * Andi Kleen <[email protected]> wrote:
> >
> > > > There's a crazy solution for that: the idle thread could process
> > > > RCU callbacks carefully, as if it was running user-space code.
> > >
> > > In Ben's kernel NFS server case the system may not be idle.
> >
> > An always-100%-busy NFS server is very unlikely, but even in the
> > hypothetical case a kernel NFS server is really performing system
> > calls from a kernel thread in essence. If it doesn't do it explicitly
> > then its main loop can easily include a "check RCU callbacks" call.
>
> As long as they make sure to call it in a clean environment: no
> locks held and so on. But I am a bit worried about the possibility
> of someone forgetting to put one of these where it is needed -- it
> would work just fine for most workloads, but could fail only for
> rare workloads.

Yeah, some sort of worst-case-tick mechanism would guarantee that we
won't remain without RCU GC.

> That said, invoking RCU core/callback processing from the scheduler
> context certainly sounds like an interesting way to speed up grace
> periods.

It also moves whatever priority logic is needed closer to the
scheduler that has to touch those data structures anyway.

RCU, at least partially, is a scheduler driven garbage collector even
today: beyond context switch quiescent states the main practical role
of the per CPU timer tick itself is scheduling. So having it close to
when we do context-switches anyway looks pretty natural - worth
trying.

It might not work out in practice, but at first sight it would
simplify a few things i think.

Thanks,

Ingo

2011-06-17 11:29:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 20:58 -0700, Linus Torvalds wrote:
> Ok, I'm still thinking. I have an approach that I think will handle it
> fairly cleanly, but that involves walking the same_vma list twice:
> once to actually unlink the anon_vma's under the lock, and then a
> second pass that does the rest. It should work.

Something like so? Compiles and runs the benchmark in question.

---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -324,36 +324,52 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
+static int anon_vma_unlink(struct anon_vma_chain *anon_vma_chain, struct anon_vma *anon_vma)
{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
list_del(&anon_vma_chain->same_anon_vma);

/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
+ if (list_empty(&anon_vma->head))
+ return 1;

- if (empty)
- put_anon_vma(anon_vma);
+ return 0;
}

void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (anon_vma_unlink(avc, anon_vma))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to anon_vma_free()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 11:55:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 13:28 +0200, Peter Zijlstra wrote:
> +static int anon_vma_unlink(struct anon_vma_chain *anon_vma_chain,
> struct anon_vma *anon_vma)
> {
> list_del(&anon_vma_chain->same_anon_vma);
>
> /* We must garbage collect the anon_vma if it's empty */
> + if (list_empty(&anon_vma->head))
> + return 1;
>
> + return 0;
> }

Is of course a little pathetic, so let's kill it.

---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -324,36 +324,42 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
-{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
- list_del(&anon_vma_chain->same_anon_vma);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
-
- if (empty)
- put_anon_vma(anon_vma);
-}
-
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&avc->same_anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (list_empty(&anon_vma->head))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 15:19:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 01:02:47AM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > On Thu, Jun 16, 2011 at 10:25:50PM +0200, Ingo Molnar wrote:
> > >
> > > * Paul E. McKenney <[email protected]> wrote:
> > >
> > > > > The funny thing about this workload is that context-switches are
> > > > > really a fastpath here and we are using anonymous IRQ-triggered
> > > > > softirqs embedded in random task contexts as a workaround for
> > > > > that.
> > > >
> > > > The other thing that the IRQ-triggered softirqs do is to get the
> > > > callbacks invoked in cases where a CPU-bound user thread is never
> > > > context switching.
> > >
> > > Yeah - but this workload didnt have that.
> > >
> > > > Of course, one alternative might be to set_need_resched() to force
> > > > entry into the scheduler as needed.
> > >
> > > No need for that: we can just do the callback not in softirq but in
> > > regular syscall context in that case, in the return-to-userspace
> > > notifier. (see TIF_USER_RETURN_NOTIFY and the USER_RETURN_NOTIFIER
> > > facility)
> > >
> > > Abusing a facility like setting need_resched artificially will
> > > generally cause trouble.
> >
> > If the task enqueued callbacks in the kernel, thus started a new
> > grace period, it might return to userspace before every CPUs have
> > completed that grace period, and you need that full completion to
> > happen before invoking the callbacks.
> >
> > I think you need to keep the tick in such case because you can't
> > count on the other CPUs to handle that completion as they may be
> > all idle.
> >
> > So when you resume to userspace and you started a GP, either you
> > find another CPU to handle the GP completion and callbacks
> > executions, or you keep the tick until you are done.
>
> We'll have a scheduler tick in any case, which will act as a
> worst-case RCU tick.
>
> My main point is that we need to check whether this solution improves
> performance over the current softirq code. I think there's a real
> chance that it improves things like VFS workloads, because it
> provides (much!) lower grace period latencies hence provides
> fundamentally better cache locality.
>
> If a workload pays the cost of frequent scheduling then it might as
> well use a beneficial side-effect of that scheduling: high-freq grace
> periods ...
>
> If it improves performance we can figure out all the loose ends. If
> it doesnt then the loose ends are not worth worrying about.

Yeah I see your point, seems worth trying.

2011-06-17 16:49:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> Something like so? Compiles and runs the benchmark in question.

Yup.

Except I really think that test for a NULL anon_vma should go away.

If an avc entry has a NULL anon_vma, something is seriously wrong. The
comment about anon_vma_fork failure is definitely just bogus: the
anon_vma is allocated before the avc entry, so there's no way an avc
can have a NULL anon_vma from there.

But yes, your patch is cleaner than the one I was playing around with
(your "remove if not list empty" is prettier than what I was toying
with - having a separate flag in the avc)

Tim, can you test Peter's (second - the cleaned up one) patch on top
of mine, and see if that helps things further?

The only thing I don't love about the batching is that we now do hold
the lock over some situations where we _could_ have allowed
concurrency (notably some avc allocations), but I think it's a good
trade-off. And walking the list twice at unlink_anon_vmas() should be
basically free.

Linus

2011-06-17 16:49:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> Something like so? Compiles and runs the benchmark in question.

Oh, and can you do this with a commit log and sign-off, and I'll put
it in my "anon_vma-locking" branch that I have. I'm not going to
actually merge that branch into mainline until I've seen a few more
acks or more testing by Tim.

But if Tim's numbers hold up (-32% to +15% performance by just the
first one, and +15% isn't actually an improvement since tmpfs
read-ahead should have gotten us to +66%), I think we have to do this
just to avoid the performance regression.

Linus

2011-06-17 16:48:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [GIT PULL] Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:43:33AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > On Fri, Jun 17, 2011 at 12:58:03AM +0200, Ingo Molnar wrote:
> > >
> > > * Andi Kleen <[email protected]> wrote:
> > >
> > > > > There's a crazy solution for that: the idle thread could process
> > > > > RCU callbacks carefully, as if it was running user-space code.
> > > >
> > > > In Ben's kernel NFS server case the system may not be idle.
> > >
> > > An always-100%-busy NFS server is very unlikely, but even in the
> > > hypothetical case a kernel NFS server is really performing system
> > > calls from a kernel thread in essence. If it doesn't do it explicitly
> > > then its main loop can easily include a "check RCU callbacks" call.
> >
> > As long as they make sure to call it in a clean environment: no
> > locks held and so on. But I am a bit worried about the possibility
> > of someone forgetting to put one of these where it is needed -- it
> > would work just fine for most workloads, but could fail only for
> > rare workloads.
>
> Yeah, some sort of worst-case-tick mechanism would guarantee that we
> won't remain without RCU GC.

Agreed!

> > That said, invoking RCU core/callback processing from the scheduler
> > context certainly sounds like an interesting way to speed up grace
> > periods.
>
> It also moves whatever priority logic is needed closer to the
> scheduler that has to touch those data structures anyway.
>
> RCU, at least partially, is a scheduler driven garbage collector even
> today: beyond context switch quiescent states the main practical role
> of the per CPU timer tick itself is scheduling. So having it close to
> when we do context-switches anyway looks pretty natural - worth
> trying.
>
> It might not work out in practice, but at first sight it would
> simplify a few things i think.

OK, please see below for a patch that not only builds, but actually
passes minimal testing. ;-)

Possible expectations and outcomes:

1. Reduced grace-period latencies on !NO_HZ systems and
on NO_HZ systems where each CPU goes non-idle frequently.
(My unscientific testing shows little or no benefit, but
then again, I was running rcutorture, which specializes
in long read-side critical sections. And I was running
NO_HZ, which is less likely to show benefit.)

I would not expect direct call to have any benefit over
softirq invocation -- sub-microsecond softirq overhead
won't matter to multi-millisecond grace-period latencies.

2. Better cache locality. I am a bit skeptical, but there is
some chance that this might reduce cache misses on task_struct.
Direct call might well do better than softirq here.

3. Easier conversion to user-space operation. I was figuring
on using POSIX signals for scheduler_tick() and for softirq,
so wasn't worried about it, but might well be simpler.

Anything else?

On eliminating softirq, I must admit that I am a bit worried about
invoking the RCU core code from scheduler_tick(), but there are some
changes that I could make that would reduce force_quiescent_state()
worst-case latency. At the expense of lengthening grace periods,
unfortunately, but the reduction will be needed for RT on larger systems
anyway, so might as well try it.

Thanx, Paul

------------------------------------------------------------------------

rcu: Experimental change driving RCU core from scheduler

This change causes RCU's context-switch code to check to see if the RCU
core needs anything from the current CPU, and, if so, invoke the RCU
core code. One possible improvement from this experiment is reduced
grace-period latency on systems that either have NO_HZ=n or that have
all CPUs frequently going non-idle. The invocation of the RCU core code
is currently via softirq, though a plausible next step in the experiment
might be to instead use a direct function call.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 405a5fd..6b7a43a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -87,6 +87,10 @@ static struct rcu_state *rcu_state;
int rcu_scheduler_active __read_mostly;
EXPORT_SYMBOL_GPL(rcu_scheduler_active);

+static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
+static int rcu_pending(int cpu);
+static void rcu_process_callbacks(struct softirq_action *unused);
+
#ifdef CONFIG_RCU_BOOST

/*
@@ -159,8 +163,14 @@ void rcu_bh_qs(int cpu)
*/
void rcu_note_context_switch(int cpu)
{
+ unsigned long flags;
+
rcu_sched_qs(cpu);
rcu_preempt_note_context_switch(cpu);
+ local_irq_save(flags);
+ if (rcu_pending(cpu))
+ invoke_rcu_core();
+ local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

@@ -182,9 +192,6 @@ module_param(qlowmark, int, 0);
int rcu_cpu_stall_suppress __read_mostly;
module_param(rcu_cpu_stall_suppress, int, 0644);

-static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
-
/*
* Return the number of RCU-sched batches processed thus far for debug & stats.
*/

2011-06-17 17:30:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 9:46 AM, Linus Torvalds
<[email protected]> wrote:
>
> Oh, and can you do this with a commit log and sign-off, and I'll put
> it in my "anon_vma-locking" branch that I have. I'm not going to
> actually merge that branch into mainline until I've seen a few more
> acks or more testing by Tim.

Attached is the tentative commit I have, which is yours but with the
tests for anon_vma being NULL removed, and a made-up commit log. It
works for me, but needs more testing and eyeballs looking at it.

Tim? This is on top of my previous patch, replacing Peter's two patches.

Linus


Attachments:
patch.diff (2.84 kB)

2011-06-17 17:41:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 17 Jun 2011, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Something like so? Compiles and runs the benchmark in question.
>
> Yup.
>
> Except I really think that test for a NULL anon_vma should go away.
>
> If an avc entry has a NULL anon_vma, something is seriously wrong. The
> comment about anon_vma_fork failure is definitely just bogus: the
> anon_vma is allocated before the avc entry, so there's no way an avc
> can have a NULL anon_vma from there.
>
> But yes, your patch is cleaner than the one I was playing around with
> (your "remove if not list empty" is prettier than what I was toying
> with - having a separate flag in the avc)
>
> Tim, can you test Peter's (second - the cleaned up one) patch on top
> of mine, and see if that helps things further?
>
> The only thing I don't love about the batching is that we now do hold
> the lock over some situations where we _could_ have allowed
> concurrency (notably some avc allocations), but I think it's a good
> trade-off. And walking the list twice at unlink_anon_vmas() should be
> basically free.

Applying load with those two patches applied (combined patch shown at
the bottom, in case you can tell me I misunderstood what to apply,
and have got the wrong combination on), lockdep very soon protested.

I've not given it _any_ thought, and won't be able to come back to
it for a couple of hours: chucked over the wall for your delectation.

Hugh

[ 65.981291] =================================
[ 65.981354] [ INFO: inconsistent lock state ]
[ 65.981393] 3.0.0-rc3 #2
[ 65.981418] ---------------------------------
[ 65.981456] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 65.981513] cp/1335 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 65.981556] (&anon_vma->mutex){+.+.?.}, at: [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.981644] {RECLAIM_FS-ON-W} state was registered at:
[ 65.981688] [<7817954f>] mark_held_locks+0x46/0x67
[ 65.981738] [<78179a1c>] lockdep_trace_alloc+0x7d/0x96
[ 65.981791] [<781c8572>] kmem_cache_alloc+0x21/0xe1
[ 65.981842] [<781baae9>] anon_vma_clone+0x38/0x124
[ 65.981892] [<781babf7>] anon_vma_fork+0x22/0xf3
[ 65.981940] [<7814f971>] dup_mmap+0x1b7/0x302
[ 65.981986] [<78150063>] dup_mm+0xa5/0x150
[ 65.982030] [<7815075c>] copy_process+0x62e/0xbeb
[ 65.982079] [<78150e1d>] do_fork+0xd5/0x1fc
[ 65.982123] [<7812ca7c>] sys_clone+0x1c/0x21
[ 65.982169] [<78500c31>] ptregs_clone+0x15/0x24
[ 65.982218] irq event stamp: 4625633
[ 65.982251] hardirqs last enabled at (4625633): [<784fe18a>] mutex_trylock+0xe7/0x118
[ 65.982323] hardirqs last disabled at (4625632): [<784fe0f0>] mutex_trylock+0x4d/0x118
[ 65.982394] softirqs last enabled at (4624962): [<781568a6>] __do_softirq+0xf5/0x104
[ 65.982467] softirqs last disabled at (4624835): [<78128007>] do_softirq+0x56/0xa7
[ 65.982537]
[ 65.982538] other info that might help us debug this:
[ 65.982595] Possible unsafe locking scenario:
[ 65.982596]
[ 65.982649] CPU0
[ 65.982672] ----
[ 65.982696] lock(&anon_vma->mutex);
[ 65.982738] <Interrupt>
[ 65.982762] lock(&anon_vma->mutex);
[ 65.982805]
[ 65.982806] *** DEADLOCK ***
[ 65.982807]
[ 65.982864] no locks held by cp/1335.
[ 65.982896]
[ 65.982897] stack backtrace:
[ 65.982939] Pid: 1335, comm: cp Not tainted 3.0.0-rc3 #2
[ 65.984010] Call Trace:
[ 65.984010] [<784fd0d6>] ? printk+0xf/0x11
[ 65.984010] [<78177ef2>] print_usage_bug+0x152/0x15f
[ 65.984010] [<78177fa0>] mark_lock_irq+0xa1/0x1e9
[ 65.984010] [<78176c8d>] ? print_irq_inversion_bug+0x16e/0x16e
[ 65.984010] [<781782f3>] mark_lock+0x20b/0x2d9
[ 65.984010] [<781784bf>] mark_irqflags+0xfe/0x115
[ 65.984010] [<781789cb>] __lock_acquire+0x4f5/0x6ba
[ 65.984010] [<78178f72>] lock_acquire+0x4a/0x60
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<784feaf4>] mutex_lock_nested+0x45/0x297
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba48f>] ? page_lock_anon_vma+0xb2/0x130
[ 65.984010] [<781ba6c0>] page_referenced_anon+0x12/0x189
[ 65.984010] [<781ba8ba>] page_referenced+0x83/0xaf
[ 65.984010] [<781a8301>] shrink_active_list+0x186/0x240
[ 65.984010] [<781a8e1f>] shrink_zone+0x158/0x1ce
[ 65.984010] [<781a949b>] shrink_zones+0x94/0xe4
[ 65.984010] [<781a9545>] do_try_to_free_pages+0x5a/0x1db
[ 65.984010] [<781a301d>] ? get_page_from_freelist+0x2c4/0x2e1
[ 65.984010] [<781a9865>] try_to_free_pages+0x6c/0x73
[ 65.984010] [<781a3526>] __alloc_pages_nodemask+0x3aa/0x563
[ 65.984010] [<781a4d03>] __do_page_cache_readahead+0xee/0x1cd
[ 65.984010] [<781a4fa4>] ra_submit+0x19/0x1b
[ 65.984010] [<781a51b4>] ondemand_readahead+0x20e/0x219
[ 65.984010] [<781a525b>] page_cache_sync_readahead+0x3e/0x4b
[ 65.984010] [<7819e0ad>] do_generic_file_read.clone.0+0xd1/0x420
[ 65.984010] [<78178c14>] ? lock_release_non_nested+0x84/0x243
[ 65.984010] [<7819ef25>] generic_file_aio_read+0x1c0/0x1f4
[ 65.984010] [<78178ed7>] ? __lock_release+0x104/0x10f
[ 65.984010] [<781b16d0>] ? might_fault+0x45/0x84
[ 65.984010] [<781d3650>] do_sync_read+0x91/0xc5
[ 65.984010] [<781d71a8>] ? cp_new_stat64+0xd8/0xed
[ 65.984010] [<781d3cac>] vfs_read+0x8d/0xf5
[ 65.984010] [<781d3d50>] sys_read+0x3c/0x63
[ 65.984010] [<78500b50>] sysenter_do_call+0x12/0x36

From this combined patch applied to 3.0-rc3:

--- 3.0-rc3/mm/rmap.c 2011-05-29 18:42:37.465882779 -0700
+++ linux/mm/rmap.c 2011-06-17 10:19:10.592857382 -0700
@@ -200,6 +200,32 @@ int anon_vma_prepare(struct vm_area_stru
return -ENOMEM;
}

+/*
+ * This is a useful helper function for locking the anon_vma root as
+ * we traverse the vma->anon_vma_chain, looping over anon_vma's that
+ * have the same vma.
+ *
+ * Such anon_vma's should have the same root, so you'd expect to see
+ * just a single mutex_lock for the whole traversal.
+ */
+static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct anon_vma *anon_vma)
+{
+ struct anon_vma *new_root = anon_vma->root;
+ if (new_root != root) {
+ if (WARN_ON_ONCE(root))
+ mutex_unlock(&root->mutex);
+ root = new_root;
+ mutex_lock(&root->mutex);
+ }
+ return root;
+}
+
+static inline void unlock_anon_vma_root(struct anon_vma *root)
+{
+ if (root)
+ mutex_unlock(&root->mutex);
+}
+
static void anon_vma_chain_link(struct vm_area_struct *vma,
struct anon_vma_chain *avc,
struct anon_vma *anon_vma)
@@ -208,13 +234,11 @@ static void anon_vma_chain_link(struct v
avc->anon_vma = anon_vma;
list_add(&avc->same_vma, &vma->anon_vma_chain);

- anon_vma_lock(anon_vma);
/*
* It's critical to add new vmas to the tail of the anon_vma,
* see comment in huge_memory.c:__split_huge_page().
*/
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
- anon_vma_unlock(anon_vma);
}

/*
@@ -224,16 +248,23 @@ static void anon_vma_chain_link(struct v
int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;
+ struct anon_vma *root = NULL;

list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma;
+
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;
- anon_vma_chain_link(dst, avc, pavc->anon_vma);
+ anon_vma = pavc->anon_vma;
+ root = lock_anon_vma_root(root, anon_vma);
+ anon_vma_chain_link(dst, avc, anon_vma);
}
+ unlock_anon_vma_root(root);
return 0;

enomem_failure:
+ unlock_anon_vma_root(root);
unlink_anon_vmas(dst);
return -ENOMEM;
}
@@ -280,7 +311,9 @@ int anon_vma_fork(struct vm_area_struct
get_anon_vma(anon_vma->root);
/* Mark this anon_vma as the one where our new (COWed) pages go. */
vma->anon_vma = anon_vma;
+ anon_vma_lock(anon_vma);
anon_vma_chain_link(vma, avc, anon_vma);
+ anon_vma_unlock(anon_vma);

return 0;

@@ -291,36 +324,42 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}

-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
-{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
- list_del(&anon_vma_chain->same_anon_vma);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
-
- if (empty)
- put_anon_vma(anon_vma);
-}
-
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;

/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&avc->same_anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (list_empty(&anon_vma->head))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}

2011-06-17 17:58:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 10:41 -0700, Hugh Dickins wrote:
> > The only thing I don't love about the batching is that we now do hold
> > the lock over some situations where we _could_ have allowed
> > concurrency (notably some avc allocations), but I think it's a good
> > trade-off. And walking the list twice at unlink_anon_vmas() should be
> > basically free.
>
> Applying load with those two patches applied (combined patch shown at
> the bottom, in case you can tell me I misunderstood what to apply,
> and have got the wrong combination on), lockdep very soon protested.

Gah, of course. It's exactly the case Linus mentioned not loving. We can
get reclaim recursion due to the avc allocation: we hold anon_vma->mutex
while doing a (blocking) allocation, and reclaim can end up trying to
obtain said lock while trying to free some memory.

Bugger. /me goes investigate

2011-06-17 18:02:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 10:41 AM, Hugh Dickins <[email protected]> wrote:
>
> Applying load with those two patches applied (combined patch shown at
> the bottom, in case you can tell me I misunderstood what to apply,
> and have got the wrong combination on), lockdep very soon protested.

Combined patch looks good, it's just the one without the NULL ptr tests removed.

And yup, that makes sense. Since we now hold the anon_vma lock over an
allocation, the memory allocation might want to start to free things.

> I've not given it _any_ thought, and won't be able to come back to
> it for a couple of hours: chucked over the wall for your delectation.

It's a mis-feature of "page_referenced()": we can call it without
holding any page locks ("is_locked=0"), and that function will then do
a try_lock() on the page, and just consider it referenced if it
failed.

HOWEVER, the code to then do the same thing for the anon_vma lock
doesn't do the same trylock model, because it used to be a spinlock:
so there was "no need" (since you can never do a non-atomic allocation
from within a spinlock).

So this is arguably a bug in memory reclaim, but since the
spinlock->mutex conversion had been pretty mindless, nobody noticed
until the mutex region grew.

So I do think that "page_referenced_anon()" should do a trylock, and
return "referenced" if the trylock fails. Comments?

That said, we have a few other mutexes that are just not allowed to be
held over an allocation. page_referenced_file() has that
mapping->i_mmap_mutex lock, for example. So maybe the rule just has to
be "you cannot hold anon_vma lock over an allocation". Which would be
sad: one of the whole _points_ of turning it from a spinlock to a
mutex would be that it relaxes the locking rules a lot (and not just
the preemptibility)

Linus

2011-06-17 18:14:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 11:01 -0700, Linus Torvalds wrote:

> So I do think that "page_referenced_anon()" should do a trylock, and
> return "referenced" if the trylock fails. Comments?

The only problem I can immediately see with that is when a single
process' anon memory is dominant, then such an allocation will never
succeed in freeing these pages because the one lock will pin pretty much
all anon. Then again, there's always a few file pages to drop.

That said, it's rather unlikely, and iirc people were working on removing
direct reclaim, or at least relying less on it.


2011-06-17 18:22:18

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Thu, 2011-06-16 at 20:58 -0700, Linus Torvalds wrote:

>
> So Tim, I'd like you to test out my first patch (that only does the
> anon_vma_clone() case) once again, but now in the cleaned-up version.
> Does this patch really make a big improvement for you? If so, this
> first step is probably worth doing regardless of the more complicated
> second step, but I'd want to really make sure it's ok, and that the
> performance improvement you saw is consistent and not a fluke.
>
> Linus

Linus,

For this patch, I've run it 10 times and got an average throughput of
104.9% compared with 2.6.39 vanilla baseline. Wide variations are seen
run to run and the difference between max and min throughput is 52% of
average value.

So to recap,

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus 115.7% (+15.7%)
3.0-rc2+linus+softirq 86.2% (-17.3%)
3.0-rc2+linus (v2) 104.9% (+4.9%)

The time spent in the anon_vma mutex seems to directly affect
throughput.

In one run on your patch, I got a low throughput of 90.1% vs 2.6.39
throughput. The mutex_lock occupied 15.6% of cpu.

In another run, I got a high throughput of 120.8% vs 2.6.39 throughput.
The mutex lock occupied 7.5% of cpu.

I've attached the profiles of the two runs and a 3.0-rc2 vanilla run for
your reference.

I will follow up later with numbers that have Peter's patch added.

Thanks.

Tim

----------Profiles Below-------------------------

3.0-rc2+linus(v2) run 1 (90.1% throughput vs 2.6.39)

- 15.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
+ 75.52% anon_vma_lock.clone.10
+ 23.88% anon_vma_clone
- 4.38% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 82.83% cpupri_set
+ 6.75% try_to_wake_up
+ 5.35% release_pages
+ 1.72% pagevec_lru_move_fn
+ 0.93% get_page_from_freelist
+ 0.51% lock_timer_base.clone.20
+ 3.22% exim [kernel.kallsyms] [k] page_fault
+ 2.62% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.30% exim [kernel.kallsyms] [k] mutex_unlock
+ 2.02% exim [kernel.kallsyms] [k] unmap_vmas


3.0-rc2_linus(v2) run 2 (120.8% throughput vs 2.6.39)

- 7.53% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
+ 75.99% anon_vma_lock.clone.10
+ 22.68% anon_vma_clone
+ 0.70% unlink_file_vma
- 4.15% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 83.37% cpupri_set
+ 7.06% release_pages
+ 2.74% pagevec_lru_move_fn
+ 2.18% try_to_wake_up
+ 0.99% get_page_from_freelist
+ 0.59% lock_timer_base.clone.20
+ 0.58% lock_hrtimer_base.clone.16
+ 4.06% exim [kernel.kallsyms] [k] page_fault
+ 2.33% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.22% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.05% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 1.98% exim [kernel.kallsyms] [k] mutex_unlock


3.0-rc2 vanilla run

- 18.60% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.99% __mutex_lock_slowpath
- mutex_lock
- 99.54% anon_vma_lock.clone.10
+ 38.99% anon_vma_clone
+ 37.56% unlink_anon_vmas
+ 11.92% anon_vma_fork
+ 11.53% anon_vma_free
- 4.03% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 87.25% cpupri_set
+ 4.75% release_pages
+ 3.68% try_to_wake_up
+ 1.17% pagevec_lru_move_fn
+ 0.71% get_page_from_freelist
+ 3.00% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.90% exim [kernel.kallsyms] [k] page_fault
+ 2.25% exim [kernel.kallsyms] [k] mutex_unlock
+ 1.82% exim [kernel.kallsyms] [k] unmap_vmas
+ 1.62% exim [kernel.kallsyms] [k] copy_page_c

2011-06-17 18:28:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 20:18 +0200, Peter Zijlstra wrote:
> On Fri, 2011-06-17 at 11:01 -0700, Linus Torvalds wrote:
>
> > So I do think that "page_referenced_anon()" should do a trylock, and
> > return "referenced" if the trylock fails. Comments?
>
> The only problem I can immediately see with that is when a single
> process' anon memory is dominant, then such an allocation will never
> succeed in freeing these pages because the one lock will pin pretty much
> all anon. Then again, there's always a few file pages to drop.
>
> That said, it's rather unlikely, and iirc people were working on removing
> direct reclaim, or at least relying less on it.
>
>

something like so I guess, completely untested etc..

Also, there's a page_lock_anon_vma() user in split_huge_page(), which is
used in mm/swap_state.c:add_to_swap(), which is also in the reclaim
path; not quite sure what to do there.

Aside from the THP thing there's a user in memory-failure.c, which looks
to be broken as it is because it's calling things under tasklist_lock
which isn't preemptible, but it looks like we can simply swap the
tasklist_lock vs page_lock_anon_vma.

---
kernel/Makefile | 1 +
mm/rmap.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 2d64cfc..f6d05de 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
+obj-m += test.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/mm/rmap.c b/mm/rmap.c
index 0eb463e..40cd399 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -400,6 +400,41 @@ out:
return anon_vma;
}

+struct anon_vma *page_trylock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma = NULL;
+ struct anon_vma *root_anon_vma;
+ unsigned long anon_mapping;
+
+ rcu_read_lock();
+ anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+ if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
+ goto out;
+ if (!page_mapped(page))
+ goto out;
+
+ anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+ root_anon_vma = ACCESS_ONCE(anon_vma->root);
+ if (!mutex_trylock(&root_anon_vma->mutex)) {
+ anon_vma = NULL;
+ goto out;
+ }
+
+ /*
+ * If the page is still mapped, then this anon_vma is still
+ * its anon_vma, and holding the mutex ensures that it will
+ * not go away, see anon_vma_free().
+ */
+ if (!page_mapped(page)) {
+ mutex_unlock(&root_anon_vma->mutex);
+ anon_vma = NULL;
+ }
+
+out:
+ rcu_read_unlock();
+ return anon_vma;
+}
+
/*
* Similar to page_get_anon_vma() except it locks the anon_vma.
*
@@ -694,7 +729,7 @@ static int page_referenced_anon(struct page *page,
struct anon_vma_chain *avc;
int referenced = 0;

- anon_vma = page_lock_anon_vma(page);
+ anon_vma = page_trylock_anon_vma(page);
if (!anon_vma)
return referenced;

@@ -1396,7 +1431,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;

- anon_vma = page_lock_anon_vma(page);
+ anon_vma = page_trylock_anon_vma(page);
if (!anon_vma)
return ret;


2011-06-17 18:39:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:32 AM, Peter Zijlstra <[email protected]> wrote:
>
> something like so I guess, completely untested etc..

Having gone over it a bit more, I actually think I prefer to just
special-case the allocation instead.

We already have to drop the anon_vma lock for the "out of memory"
case, and a slight re-organization of clone_anon_vma() makes it easy
to just first try a NOIO allocation with the lock still held, and then
if that fails do the "drop lock, retry, and hard-fail" case.

IOW, something like the attached (on top of the patches already posted
except for your memory reclaim thing)

Hugh, does this fix the lockdep issue?

Linus


Attachments:
patch.diff (1.96 kB)
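[That attachment isn't inlined here either. A rough sketch of the shape
Linus describes, inside the already-batched anon_vma_clone() loop; the
exact GFP flags and the gfp_t argument to anon_vma_chain_alloc() are
assumptions on my part, not necessarily what the attached diff does:]

	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
		struct anon_vma *anon_vma;

		/* Fast path: non-blocking allocation with the root mutex held. */
		avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
		if (unlikely(!avc)) {
			/*
			 * Slow path: drop the lock, retry with a blocking
			 * allocation, and hard-fail only if that also fails.
			 */
			unlock_anon_vma_root(root);
			root = NULL;
			avc = anon_vma_chain_alloc(GFP_KERNEL);
			if (!avc)
				goto enomem_failure;
		}
		anon_vma = pavc->anon_vma;
		root = lock_anon_vma_root(root, anon_vma);
		anon_vma_chain_link(dst, avc, anon_vma);
	}

[With the lock already dropped on the slow path, the enomem_failure label
no longer needs to call unlock_anon_vma_root() before unwinding with
unlink_anon_vmas().]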

2011-06-17 18:48:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:39 AM, Linus Torvalds
<[email protected]> wrote:
>
> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.

Just to explain my thinking: the thing I disliked most about doing an
allocation while holding the lock wasn't that I thought we would
deadlock on page reclaim. I don't claim that kind of far-sight.

No, the thing I disliked was the lack of concurrency if we're low on
memory and actually have to wait. I'm ok with
holding the mutex over a few more CPU cycles, but anything longer
might actually hurt throughput. So the patch I just sent out should
fix both the page reclaim deadlock, and avoid any problems with delays
due to holding the critical lock over an expensive allocation.

Linus

2011-06-17 19:06:03

by Ray Lee

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 11:22 AM, Tim Chen <[email protected]> wrote:
> For this patch, I've run it 10 times and got an average throughput of
> 104.9% compared with 2.6.39 vanilla baseline.  Wide variations are seen
> run to run and the difference between max and min throughput is 52% of
> average value.

If the variations are that drastic, it's useful to include the
standard deviation of the mean for each kernel as well. I know it's
only 10 runs, but it's often useful to know if a slow but reliable
process is being replaced by a sometimes faster but more erratic one,
for example.

Just a thought.

~r.
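
[For reference, the mean and sample standard deviation over a set of runs
is a few lines of user-space C; the throughput values below are
placeholders, not Tim's actual per-run numbers:]

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Placeholder throughputs, normalized to the 2.6.39 baseline (%). */
	double runs[] = { 104.9, 90.1, 120.8, 99.5, 101.2,
			  115.3, 97.8, 108.4, 95.6, 116.4 };
	int n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0.0, sq = 0.0, mean, stddev;
	int i;

	for (i = 0; i < n; i++)
		sum += runs[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		sq += (runs[i] - mean) * (runs[i] - mean);
	stddev = sqrt(sq / (n - 1));	/* sample standard deviation */

	printf("mean %.1f%%  stddev %.1f%%\n", mean, stddev);
	return 0;
}

[Build with something like "gcc -O2 stats.c -lm"; reporting the standard
deviation alongside the mean makes it easier to tell a reliably slower
kernel from an erratic one, which is Ray's point.]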

2011-06-17 19:08:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> > On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> > However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> > acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> > the 52% throughput regression.
> >
> This patch makes the mutex in Tim's workload take a bit less CPU time
> (4% down) but it doesn't really fix the regression. When spinning for a
> value it's always better to read it first before attempting to write it.
> This saves expensive operations on the interconnect.
>
> So it's not really a fix for this, but may be a slight improvement for
> other workloads.
>
> -Andi
>
> From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
> From: Andi Kleen <[email protected]>
> Date: Tue, 14 Jun 2011 16:27:54 -0700
> Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg
>
> Under heavy contention it's better to read first before trying to
> do an atomic operation on the interconnect.
>
> This gives a few percent improvement for the mutex CPU time under
> heavy contention and likely saves some power too.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> kernel/mutex.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index d607ed5..1abffa9 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> if (owner && !mutex_spin_on_owner(lock, owner))
> break;
>
> - if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> + if (atomic_read(&lock->count) == 1 &&
> + atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> lock_acquired(&lock->dep_map, ip);
> mutex_set_owner(lock);
> preempt_enable();


What is *sorely* missing from your changelog (again) is the
explanation of *why* it improves performance in the contended case:
because the cacheline is brought into shared-read MESI state which
the CMPXCHG might not dirty if the CMPXCHG fails in the contended
case.

Firstly, this reduces the cost of hard bounces somewhat because the
owner CPU still has the cacheline in read-shared state, which it can
invalidate from the other CPU's cache on unlock in a somewhat cheaper
way than if it were forced to bounce the cacheline back.

Secondly, in the contended case more than 2 CPUs are looping on the
same cacheline and bringing it in shared and doing the cmpxchg loop
will not bounce the cacheline around (if CMPXCHG is smart enough to
not dirty the cacheline even in the failed case - this is CPU model
dependent) - it will spread to all interested CPUs in read-shared
state. This is most likely the dominant factor in Tim's tests.

Had you considered and described all that then you'd inevitably have
been led to consider the much more important issue that is missing
from your patch: the analysis of what happens to the cacheline under
*light* contention, which is by far the most important case ...

In the lightly contended case it's ultimately bad to bring in the
cacheline as read-shared first, because the cmpxchg will have to go
out to the MESI interconnect *again*: this time to flush the
cacheline from the previous owner CPU's cache.

So i don't think we want your patch, without some really good
supporting facts and analysis that show that the lightly contended
case does not suffer.

Thanks,

Ingo

2011-06-17 19:41:55

by Andi Kleen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 09:46:00AM -0700, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Something like so? Compiles and runs the benchmark in question.
>
> Oh, and can you do this with a commit log and sign-off, and I'll put
> it in my "anon_vma-locking" branch that I have. I'm not going to
> actually merge that branch into mainline until I've seen a few more
> acks or more testing by Tim.
>
> But if Tim's numbers hold up (-32% to +15% performance by just the
> first one, and +15% isn't actually an improvement since tmpfs
> read-ahead should have gotten us to +66%), I think we have to do this
> just to avoid the performance regression.

You could also add the mutex "optimize caching protocol"
patch I posted earlier to that branch.

It didn't actually improve Tim's throughput number, but it made the CPU
consumption of the mutex go down.

-Andi

---
From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
From: Andi Kleen <[email protected]>
Date: Tue, 14 Jun 2011 16:27:54 -0700
Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg

Under heavy contention it's better to read first before trying
to do an atomic operation on the interconnect.

This gives a few percent improvement for the mutex CPU time
under heavy contention and likely saves some power too.

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d607ed5..1abffa9 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;

- if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+ if (atomic_read(&lock->count) == 1 &&
+ atomic_cmpxchg(&lock->count, 1, 0) == 1) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();



--
[email protected] -- Speaking for myself only

2011-06-17 19:49:25

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] mm, memory-failure: Fix spinlock vs mutex order

On Fri, 2011-06-17 at 20:32 +0200, Peter Zijlstra wrote:
> Aside from the THP thing there's a user in memory-failure.c, which looks
> to be broken as it is because it's calling things under tasklist_lock
> which isn't preemptible, but it looks like we can simply swap the
> tasklist_lock vs page_lock_anon_vma.
>

I thought about maybe using rcu, but then thought the thing is probably
wanting to exclude new tasks as it wants to kill all mm users.

---
Subject: mm, memory-failure: Fix spinlock vs mutex order

We cannot take a mutex while holding a spinlock, so flip the order as
it's documented to be random.

Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/memory-failure.c | 21 ++++++---------------
mm/rmap.c | 5 ++---
2 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eac0ba5..740c4f5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -391,10 +391,11 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct anon_vma *av;

- read_lock(&tasklist_lock);
av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
- goto out;
+ return;
+
+ read_lock(&tasklist_lock);
for_each_process (tsk) {
struct anon_vma_chain *vmac;

@@ -408,9 +409,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
- page_unlock_anon_vma(av);
-out:
read_unlock(&tasklist_lock);
+ page_unlock_anon_vma(av);
}

/*
@@ -424,17 +424,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct prio_tree_iter iter;
struct address_space *mapping = page->mapping;

- /*
- * A note on the locking order between the two locks.
- * We don't rely on this particular order.
- * If you have some other code that needs a different order
- * feel free to switch them around. Or add a reverse link
- * from mm_struct to task_struct, then this could be all
- * done without taking tasklist_lock and looping over all tasks.
- */
-
- read_lock(&tasklist_lock);
mutex_lock(&mapping->i_mmap_mutex);
+ read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

@@ -454,8 +445,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
- mutex_unlock(&mapping->i_mmap_mutex);
read_unlock(&tasklist_lock);
+ mutex_unlock(&mapping->i_mmap_mutex);
}

/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 0eb463e..5e51855 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -38,9 +38,8 @@
* in arch-dependent flush_dcache_mmap_lock,
* within inode_wb_list_lock in __sync_single_inode)
*
- * (code doesn't rely on that order so it could be switched around)
- * ->tasklist_lock
- * anon_vma->mutex (memory_failure, collect_procs_anon)
+ * anon_vma->mutex,mapping->i_mutex (memory_failure, collect_procs_anon)
+ * ->tasklist_lock
* pte map lock
*/


2011-06-17 20:05:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: Fix spinlock vs mutex order

> I thought about maybe using rcu, but then thought the thing is probably
> wanting to exclude new tasks as it wants to kill all mm users.

Probably both would work.

Looks good to me. hwpoison patches are usually directly merged by Andrew
these days.

Acked-by: Andi Kleen <[email protected]>

-Andi

2011-06-17 20:19:28

by Tim Chen

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 2011-06-17 at 11:39 -0700, Linus Torvalds wrote:

> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.
>
> We already have to drop the anon_vma lock for the "out of memory"
> case, and a slight re-organization of clone_anon_vma() makes it easy
> to just first try a NOIO allocation with the lock still held, and then
> if that fails do the "drop lock, retry, and hard-fail" case.
>
> IOW, something like the attached (on top of the patches already posted
> except for your memory reclaim thing)
>

Linus,

I've applied this patch, plus the other two patches on batching anon_vma
clone and anon_vma unlink. This improved throughput further. I now see
average throughput at 140.2% vs 2.6.39-vanilla over 10 runs. The mutex
lock has also gone down to 3.7% of cpu in my profile. Certainly a great
deal of improvement.

To summarize,

Throughput
2.6.39(vanilla) 100.0%
2.6.39+ra-patch 166.7% (+66.7%)
3.0-rc2(vanilla) 68.0% (-32%)
3.0-rc2+linus (v1) 115.7% (+15.7%) (anon_vma clone v1)
3.0-rc2+linus+softirq 86.2% (-17.3%)
3.0-rc2+linus (v2) 104.9% (+4.9%) (anon_vma clone v2)
3.0-rc2+linus (v3) 140.3% (+40.3%) (anon_vma clone v2 + unlink + chain_alloc_tweak)


(Max-Min)/avg Standard Dev
2.6.39(vanilla) 3% 1.1%
2.6.39+ra-patch 3% 1.2%
3.0-rc2(vanilla) 20% 7.3%
3.0-rc2+linus 36% 12.2%
3.0-rc2+linus+softirq 40% 15.2%
3.0-rc2+linus (v2) 53% 14.8%
3.0-rc2+linus (v3) 27% 8.1%

Thanks.

Tim


------------Profile attached--------------

Profile from latest run 3.0-rc2+linus (v3):

- 5.44% exim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 87.81% cpupri_set
+ 5.67% release_pages
+ 1.71% pagevec_lru_move_fn
+ 1.31% try_to_wake_up
+ 0.85% get_page_from_freelist
+ 4.15% exim [kernel.kallsyms] [k] page_fault
- 3.76% exim [kernel.kallsyms] [k] __mutex_lock_common.clone.5
- __mutex_lock_common.clone.5
- 99.97% __mutex_lock_slowpath
- mutex_lock
+ 55.46% lock_anon_vma_root.clone.13
+ 41.94% anon_vma_lock.clone.10
+ 1.14% dup_mm
+ 1.02% unlink_file_vma
+ 2.44% exim [kernel.kallsyms] [k] unmap_vmas
+ 2.06% exim [kernel.kallsyms] [k] do_raw_spin_lock
+ 1.91% exim [kernel.kallsyms] [k] page_cache_get_speculative
+ 1.89% exim [kernel.kallsyms] [k] copy_page_c


2011-06-17 22:20:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, 17 Jun 2011, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 11:32 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > something like so I guess, completely untested etc..
>
> Having gone over it a bit more, I actually think I prefer to just
> special-case the allocation instead.
>
> We already have to drop the anon_vma lock for the "out of memory"
> case, and a slight re-organization of clone_anon_vma() makes it easy
> to just first try a NOIO allocation with the lock still held, and then
> if that fails do the "drop lock, retry, and hard-fail" case.
>
> IOW, something like the attached (on top of the patches already posted
> except for your memory reclaim thing)
>
> Hugh, does this fix the lockdep issue?

Yes, that fixed the lockdep issue, and ran nicely under load for an hour.

I agree that it's better to do this GFP_NOWAIT and fallback,
than trylock the anon_vma.

And I'm happy that you've still got that WARN_ON_ONCE(root) in: I do not
have a fluid mental model of the anon_vma_chains and get lost there; and
though it's obvious that we must have the same anon_vma->root going
down the same_anon_vma list, I could not put my finger on a killer
demonstration for why the same has to be true of the same_vma list.

But I've not seen your WARN_ON_ONCE fire, and it's hard to imagine
how there could be more than one root in the whole bundle of lists.

Hugh

2011-06-18 04:48:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

On Fri, Jun 17, 2011 at 3:20 PM, Hugh Dickins <[email protected]> wrote:
>
> Yes, that fixed the lockdep issue, and ran nicely under load for an hour.

Ok, nobody screamed or complained, so the thing is now merged and pushed out.

Linus

2011-06-18 08:09:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex


* Andi Kleen <[email protected]> wrote:

> On Fri, Jun 17, 2011 at 09:46:00AM -0700, Linus Torvalds wrote:
> > On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Something like so? Compiles and runs the benchmark in question.
> >
> > Oh, and can you do this with a commit log and sign-off, and I'll put
> > it in my "anon_vma-locking" branch that I have. I'm not going to
> > actually merge that branch into mainline until I've seen a few more
> > acks or more testing by Tim.
> >
> > But if Tim's numbers hold up (-32% to +15% performance by just the
> > first one, and +15% isn't actually an improvement since tmpfs
> > read-ahead should have gotten us to +66%), I think we have to do this
> > just to avoid the performance regression.
>
> You could also add the mutex "optimize caching protocol"
> patch I posted earlier to that branch.
>
> It didn't actually improve Tim's throughput number, but it made the
> CPU consumption of the mutex go down.

Why have you ignored the negative feedback for that patch:

http://marc.info/[email protected]

and why have you resent this patch without addressing that feedback?

Thanks,

Ingo