2015-05-26 12:11:06

by Peter Zijlstra

Subject: [RFC][PATCH 0/5] Optimize percpu-rwsem

Hi all,

This is a derived work of the cpu hotplug lock rework I did in 2013 which never
really went anywhere because Linus didn't like it.

This applies those same optimizations to the percpu-rwsem. Seeing how we did
all the work it seemed a waste to not use it at all.



2015-05-26 18:12:08

by Linus Torvalds

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Tue, May 26, 2015 at 4:43 AM, Peter Zijlstra <[email protected]> wrote:
>
> This is a derived work of the cpu hotplug lock rework I did in 2013 which never
> really went anywhere because Linus didn't like it.
>
> This applies those same optimizations to the percpu-rwsem. Seeing how we did
> all the work it seemed a waste to not use it at all.

So I *still* don't like it.

We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.

One.

Not "a couple". Not "hundreds". ONE.

And that single use that you are optimizing for isn't even explained.
It's not really even clear that that thing is needed: fork() already
takes the mmap_sem for writing, and I'm not at all convinced that the
uprobes code couldn't just use mmap_sem for every mm it traverses,
rather than use this global percpu lock that we use absolutely nowhere
else.

So I'm not convinced that the right thing to do is to try to optimize
that lock at ALL. I'd rather get rid of it entirely.

Is there some new use that I don't know about? Have people really
looked at that uprobes code deeply? OF COURSE global locks will have
problems, I'm not at all convinced that "let's make that global lock
really complicated and clever" is the proper solution.

For example, the write lock is only ever taken when registering an
uprobe. Not exactly likely to be timing-critical. Is there some reason
why we couldn't get rid of the magical per-cpu rwsem entirely, and
replace it with something truly simple like a hashed rwsem, where the
readers (fork) just read-lock one hashed rwsem, and the writer
write-locks every hashed rwsem?

The hash might be "pick the currently executing CPU at the time of
fork(), modulo 16". That one doesn't even need to disable preemption,
since it doesn't actually matter that the CPU stays constant, it's
just a trivial heuristic to avoid cacheline pingpongs under very
heavy fork() load.

So we'd have

#define NR_UPROBE_RWSEM 16

// One cacheline per rwsem.
struct cache_line_aligned_rw_semaphore uprobe_rwsem[NR_UPROBE_RWSEM];

and fork() would basically do

int idx = raw_smp_processor_id() & (NR_UPROBE_RWSEM-1);
struct rw_semaphore *usem = &uprobe_rwsem[idx].rwsem;
..
down_read(usem);
... do the dup_mmap() here ..
up_read(usem);

and the uprobe register thing could just do

for (i = 0; i < NR_UPROBE_RWSEM; i++)
down_write(&uprobe_rwsem[i].rwsem);
...
for (i = NR_UPROBE_RWSEM - 1; i >= 0; i--)
up_write(&uprobe_rwsem[i].rwsem);

which doesn't look any slower than what we do now (I suspect the
fork() part would be faster), and is much simpler, and would get rid
of the percpu-rwsem entirely.

Hmm?

Linus

2015-05-26 18:34:46

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Tue, May 26, 2015 at 11:12:04AM -0700, Linus Torvalds wrote:
> Is there some new use that I don't know about?

TJ wants to use it in cgroups.

lkml.kernel.org/r/[email protected]

2015-05-26 18:35:36

by Tejun Heo

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

Hello, Linus.

On Tue, May 26, 2015 at 11:12:04AM -0700, Linus Torvalds wrote:
...
> Is there some new use that I don't know about? Have people really
> looked at that uprobes code deeply? OF COURSE global locks will have
> problems, I'm not at all convinced that "let's make that global lock
> really complicated and clever" is the proper solution.

I've posted a patchset to make threadgroup rwsem cgroup-specific (it
is used only by cgroups) and replace the per-threadgroup rwsems with a
global percpu_rwsem. This is primarily to allow migrations of
multiple tasks to either succeed or fail atomically. We already try to do
multiple task migrations in some corner cases and the unified
hierarchy will make wider use of it. Currently, the implementation is
broken as there's no easy way of reverting changes if the operation
fails in the middle.

Given that task migrations aren't particularly high frequency
operations, a global percpu_rwsem is likely to be workable while being
less tedious and slightly less costly for the usual fork/exit paths.

That said, from cgroup's pov, write-locking per-threadgroup rwsems of
all target tasks works too although we'd need an outer mutex to
serialize down-write attempts to avoid locking order issues.

Thanks.

--
tejun

2015-05-26 18:42:44

by Davidlohr Bueso

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Tue, 2015-05-26 at 11:12 -0700, Linus Torvalds wrote:
> On Tue, May 26, 2015 at 4:43 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > This is a derived work of the cpu hotplug lock rework I did in 2013 which never
> > really went anywhere because Linus didn't like it.
> >
> > This applies those same optimizations to the percpu-rwsem. Seeing how we did
> > all the work it seemed a waste to not use it at all.
>
> So I *still* don't like it.
>
> We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.
>
> One.
>
> Not "a couple". Not "hundreds". ONE.

Instead of dropping percpu-rwsem, I was thinking we could look
for opportunities to convert new users, for instance shrinkers, where the
write lock is also taken just for register and unregister purposes,
similar to uprobes.

Thanks,
Davidlohr

2015-05-26 18:58:52

by Oleg Nesterov

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On 05/26, Linus Torvalds wrote:
>
> On Tue, May 26, 2015 at 4:43 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > This is a derived work of the cpu hotplug lock rework I did in 2013 which never
> > really went anywhere because Linus didn't like it.
> >
> > This applies those same optimizations to the percpu-rwsem. Seeing how we did
> > all the work it seemed a waste to not use it at all.
>
> So I *still* don't like it.
>
> We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.
>
> One.

Well. IIRC Tejun is going to turn signal_struct->group_rwsem into
percpu-rwsem.

And it can have more users. Say, __sb_start_write/etc does something
similar, and last time I checked this code it looked buggy to me.

> And that single use that you are optimizing for isn't even explained.
> It's not really even clear that that thing is needed: fork() already
> takes the mmap_sem for writing, and I'm not at all convinced that the
> uprobes code couldn't just use mmap_sem for every mm it traverses,

I don't think we can do this... At least I do not see how.

register_for_each_vma() would need to lock all ->mmap_sem's in system
to avoid the race with fork().

Oleg.

2015-05-26 19:14:30

by Oleg Nesterov

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On 05/26, Oleg Nesterov wrote:
>
> On 05/26, Linus Torvalds wrote:
> >
> >
> > We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.
> >
> > One.
>
> Well. IIRC Tejun is going to turn signal_struct->group_rwsem into
> percpu-rwsem.
>
> And it can have more users. Say, __sb_start_write/etc does something
> similar, and last time I checked this code it looked buggy to me.

I have found my old email, see below. Perhaps this code was changed
since 2013 when I sent this email, I didn't verify... but in any
case this logic doesn't look simple, imo it would be nice to rely
on the generic helpers from kernel/locking.

Oleg.

------------------------------------------------------------------------------
When I look at __sb_start_write/etc I am not sure this locking
is correct. OK, __sb_start_write() does:

percpu_counter_inc();

mb();

if (sb->s_writers.frozen)
abort_and_retry;

freeze_super() does STORE + mb + LOAD in reverse order so either
__sb_start_write() must see SB_FREEZE_WRITE or freeze_super() must
see the change in ->s_writers.counter. This is correct.

Still I am not sure sb_wait_write() can trust percpu_counter_sum(),
because it can also see _other_ changes.

To simplify the discussion, suppose that percpu_counter doesn't have
lock/count/batch/whatever and inc/dec/sum only uses "__percpu *counters".
Lets denote sb->s_writers.counter[level] as CTR[cpu].

Suppose that some thread did __sb_start_write() on CPU_1 and sleeps
"forever". CTR[0] == 0, CTR[1] == 1, freeze_super() should block.

Now:

1. freeze_super() sets SB_FREEZE_WRITE, does mb(), and
starts sb_wait_write()->percpu_counter_sum().

2. __percpu_counter_sum() does for_each_online_cpu(),
reads CTR[0] == 0. ret = 0.

3. Another thread comes, calls __sb_start_write() on CPU_0,
increments CTR[0].

Then it notices sb->s_writers.frozen >= level and starts
__sb_end_write() before retry.

Then it migrates to CPU_1. And decrements CTR[1] before
__percpu_counter_sum() reads it.

So CTR[0] == 1, CTR[1] == 0. Everything is fine except
sb_wait_write() has already read CTR[0].

4. __percpu_counter_sum() continues, reads CTR[1] == 0
and returns ret == 0.

sb_wait_write() returns while it should not?

2015-05-26 19:30:56

by Oleg Nesterov

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

Sorry for noise, forgot to mention...

On 05/26, Oleg Nesterov wrote:
>
> And it can have more users.

I may be wrong and I am obviously biased... but it seems to me that
even rcu_sync alone could have more users in the future. This interface
looks natural and simple.

Oleg.

2015-05-26 19:58:09

by Davidlohr Bueso

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

While at it, I'll go ahead and add this triviality.

From: Davidlohr Bueso <[email protected]>
Date: Tue, 26 May 2015 12:51:15 -0700
Subject: [PATCH] doc,locking: Move percpu-rwsem doc into proper subdir

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Documentation/locking/percpu-rw-semaphore.txt | 27 +++++++++++++++++++++++++++
Documentation/percpu-rw-semaphore.txt | 27 ---------------------------
2 files changed, 27 insertions(+), 27 deletions(-)
create mode 100644 Documentation/locking/percpu-rw-semaphore.txt
delete mode 100644 Documentation/percpu-rw-semaphore.txt

diff --git a/Documentation/locking/percpu-rw-semaphore.txt b/Documentation/locking/percpu-rw-semaphore.txt
new file mode 100644
index 0000000..7d3c824
--- /dev/null
+++ b/Documentation/locking/percpu-rw-semaphore.txt
@@ -0,0 +1,27 @@
+Percpu rw semaphores
+--------------------
+
+Percpu rw semaphores is a new read-write semaphore design that is
+optimized for locking for reading.
+
+The problem with traditional read-write semaphores is that when multiple
+cores take the lock for reading, the cache line containing the semaphore
+is bouncing between L1 caches of the cores, causing performance
+degradation.
+
+Locking for reading is very fast, it uses RCU and it avoids any atomic
+instruction in the lock and unlock path. On the other hand, locking for
+writing is very expensive, it calls synchronize_rcu() that can take
+hundreds of milliseconds.
+
+The lock is declared with "struct percpu_rw_semaphore" type.
+The lock is initialized percpu_init_rwsem, it returns 0 on success and
+-ENOMEM on allocation failure.
+The lock must be freed with percpu_free_rwsem to avoid memory leak.
+
+The lock is locked for read with percpu_down_read, percpu_up_read and
+for write with percpu_down_write, percpu_up_write.
+
+The idea of using RCU for optimized rw-lock was introduced by
+Eric Dumazet <[email protected]>.
+The code was written by Mikulas Patocka <[email protected]>
diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/percpu-rw-semaphore.txt
deleted file mode 100644
index 7d3c824..0000000
--- a/Documentation/percpu-rw-semaphore.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-Percpu rw semaphores
---------------------
-
-Percpu rw semaphores is a new read-write semaphore design that is
-optimized for locking for reading.
-
-The problem with traditional read-write semaphores is that when multiple
-cores take the lock for reading, the cache line containing the semaphore
-is bouncing between L1 caches of the cores, causing performance
-degradation.
-
-Locking for reading is very fast, it uses RCU and it avoids any atomic
-instruction in the lock and unlock path. On the other hand, locking for
-writing is very expensive, it calls synchronize_rcu() that can take
-hundreds of milliseconds.
-
-The lock is declared with "struct percpu_rw_semaphore" type.
-The lock is initialized percpu_init_rwsem, it returns 0 on success and
--ENOMEM on allocation failure.
-The lock must be freed with percpu_free_rwsem to avoid memory leak.
-
-The lock is locked for read with percpu_down_read, percpu_up_read and
-for write with percpu_down_write, percpu_up_write.
-
-The idea of using RCU for optimized rw-lock was introduced by
-Eric Dumazet <[email protected]>.
-The code was written by Mikulas Patocka <[email protected]>
--
2.1.4


2015-05-26 21:57:56

by Linus Torvalds

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Tue, May 26, 2015 at 11:42 AM, Davidlohr Bueso <[email protected]> wrote:
>
> Instead of dropping percpu-rwsem, I was thinking we could look
> for opportunities to convert new users, for instance shrinkers, where the
> write lock is also taken just for register and unregister purposes,
> similar to uprobes.

So if there really are useful use cases for this, I don't object to
the patch. It seems to just improve on a currently very low-usage
locking primitive.

And it's not like I conceptually mind the notion of a percpu rwsem, I
just hate seeing specialty locking that isn't really worth it.

Because as it is, with the current single use, I don't think it's even
worth improving on.

I _would_ ask that people who are looking at this also look at our
"lglock" thing. It's pretty much *exactly* the same thing, except for
spinlocks, and that one too has exactly two users (the documentation
states that the only user is stop_machine, but in fact file locking
uses it too).

Because that is another example of a complete failure of a locking
primitive that was just too specialized to be worth it.

Linus

2015-06-05 01:46:15

by Al Viro

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Tue, May 26, 2015 at 02:57:53PM -0700, Linus Torvalds wrote:

> Because that is another example of a complete failure of a locking
> primitive that was just too specialized to be worth it.

<notices stale include in fs/file_table.c and removes it>

FWIW, I hadn't really looked into stop_machine uses, but fs/locks.c one
is really not all that great - there we have a large trashcan of a list
(every file_lock on the system) and the only use of that list is /proc/locks
output generation. Sure, additions take this CPU's spinlock. And removals
take pretty much a random one - losing the timeslice and regaining it on
a different CPU is quite likely with the uses there.

Why do we need a global lock there, anyway? Why not hold only one for
the chain currently being traversed? Sure, we'll need to get and drop
them in ->next() that way; so what?

2015-06-05 21:09:59

by Oleg Nesterov

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On 06/05, Al Viro wrote:
>
> FWIW, I hadn't really looked into stop_machine uses, but fs/locks.c one
> is really not all that great - there we have a large trashcan of a list
> (every file_lock on the system) and the only use of that list is /proc/locks
> output generation. Sure, additions take this CPU's spinlock. And removals
> take pretty much a random one - losing the timeslice and regaining it on
> a different CPU is quite likely with the uses there.
>
> Why do we need a global lock there, anyway? Why not hold only one for
> the chain currently being traversed? Sure, we'll need to get and drop
> them in ->next() that way; so what?

And note that fs/seq_file.c:seq_hlist_next_percpu() has no other users.

And given that locks_delete_global_locks() takes the random lock anyway,
perhaps the hashed lists/locking makes no sense, I dunno.

Oleg.

2015-06-05 22:11:23

by Al Viro

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On Fri, Jun 05, 2015 at 11:08:57PM +0200, Oleg Nesterov wrote:
> On 06/05, Al Viro wrote:
> >
> > FWIW, I hadn't really looked into stop_machine uses, but fs/locks.c one
> > is really not all that great - there we have a large trashcan of a list
> > (every file_lock on the system) and the only use of that list is /proc/locks
> > output generation. Sure, additions take this CPU's spinlock. And removals
> > take pretty much a random one - losing the timeslice and regaining it on
> > a different CPU is quite likely with the uses there.
> >
> > Why do we need a global lock there, anyway? Why not hold only one for
> > the chain currently being traversed? Sure, we'll need to get and drop
> > them in ->next() that way; so what?
>
> And note that fs/seq_file.c:seq_hlist_next_percpu() has no other users.
>
> And given that locks_delete_global_locks() takes the random lock anyway,
> perhaps the hashed lists/locking makes no sense, I dunno.

It's not about making life easier for /proc/locks; it's about not screwing
those who add/remove file_lock... And no, that "random lock" isn't held
when modifying the (per-cpu) lists - it protects the list hanging off each
element of the global list, and /proc/locks scans those lists, so rather
than taking/dropping it in each ->show(), it's taken once in ->start()...

2015-06-05 23:37:22

by Oleg Nesterov

Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

On 06/05, Al Viro wrote:
>
> On Fri, Jun 05, 2015 at 11:08:57PM +0200, Oleg Nesterov wrote:
> > On 06/05, Al Viro wrote:
> > >
> > > FWIW, I hadn't really looked into stop_machine uses, but fs/locks.c one
> > > is really not all that great - there we have a large trashcan of a list
> > > (every file_lock on the system) and the only use of that list is /proc/locks
> > > output generation. Sure, additions take this CPU's spinlock. And removals
> > > take pretty much a random one - losing the timeslice and regaining it on
> > > a different CPU is quite likely with the uses there.
> > >
> > > Why do we need a global lock there, anyway? Why not hold only one for
> > > the chain currently being traversed? Sure, we'll need to get and drop
> > > them in ->next() that way; so what?
> >
> > And note that fs/seq_file.c:seq_hlist_next_percpu() has no other users.
> >
> > And given that locks_delete_global_locks() takes the random lock anyway,
> > perhaps the hashed lists/locking makes no sense, I dunno.
>
> It's not about making life easier for /proc/locks; it's about not screwing
> those who add/remove file_lock...

I meant, seq_hlist_next_percpu() could be "static" in fs/locks.c.

> And no, that "random lock" isn't held
> when modifying the (per-cpu) lists - it protects the list hanging off each
> element of the global list, and /proc/locks scans those lists, so rather
> than taking/dropping it in each ->show(), it's taken once in ->start()...

Sure, I understand. I meant that (perhaps) something like

struct {
	spinlock_t lock;
	struct hlist_head head;
} file_lock_list[];

void locks_insert_global_locks(struct file_lock *fl)
{
	int idx = fl->idx = hash(fl);

	spin_lock(&file_lock_list[idx].lock);
	hlist_add_head(&fl->fl_link, &file_lock_list[idx].head);
	spin_unlock(&file_lock_list[idx].lock);
}

seq_hlist_next_percpu() could scan file_lock_list[] and unlock/lock ->lock
when it changes the index.

But please forget, this is really minor. Just I think that file_lock_list
is not actually "per-cpu", exactly because every locks_delete_global_locks()
needs lg_local_lock_cpu(fl->fl_link_cpu) as you pointed out.

Oleg.