2016-11-18 18:55:33

by Davidlohr Bueso

Subject: [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations

Hi,

Here are two optimizations intended for those rare updater cases, plus a
trivial documentation move.

Wrt numbers, I'm still trying to figure out this Android boot environment
(which calls the cgroup_proc_write ipc paths) that might care about these
changes. In the meantime, I'd like to know if anyone has any objections
to these optimizations.

This passed kernel builds with lockdep, as well as an overnight run of
locktorture.

Thanks.

Davidlohr Bueso (3):
locking/percpu-rwsem: Move text file into Documentation/locking/
locking/percpu-rwsem: Replace bulky wait-queues with swait
locking/percpu-rwsem: Avoid unnecessary writer wakeups

Documentation/locking/percpu-rw-semaphore.txt | 27 ++++++++++
Documentation/percpu-rw-semaphore.txt | 27 ----------
include/linux/percpu-rwsem.h | 6 +--
kernel/locking/percpu-rwsem.c | 78 +++++++++++++++------------
4 files changed, 73 insertions(+), 65 deletions(-)
create mode 100644 Documentation/locking/percpu-rw-semaphore.txt
delete mode 100644 Documentation/percpu-rw-semaphore.txt

--
2.6.6


2016-11-18 18:55:41

by Davidlohr Bueso

Subject: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

There is obviously no point in doing the wakeup if the
reader, on its way to unlocking itself, can do the
active-readers check on behalf of the writer.

The downside is that when readers_active_check() does
return true we end up iterating the per-CPU counters twice.
This trade-off, however, seems reasonable: if we are here,
(i) we have already lost any hope of reader-side performance,
and (ii) because this lock is used mainly for sharing, it is
not crazy to expect readers to keep arriving during this
window, sending the writer right back to sleep on every
reader up().

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/locking/percpu-rwsem.c | 72 ++++++++++++++++++++++++-------------------
1 file changed, 40 insertions(+), 32 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index cb71201855f2..8e6fbf117f14 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -8,6 +8,44 @@
#include <linux/sched.h>
#include <linux/errno.h>

+#define per_cpu_sum(var) \
+({ \
+ typeof(var) __sum = 0; \
+ int cpu; \
+ compiletime_assert_atomic_type(__sum); \
+ for_each_possible_cpu(cpu) \
+ __sum += per_cpu(var, cpu); \
+ __sum; \
+})
+
+/*
+ * Return true if the modular sum of the sem->read_count per-CPU variable is
+ * zero. If this sum is zero, then it is stable due to the fact that if any
+ * newly arriving readers increment a given counter, they will immediately
+ * decrement that same counter.
+ */
+static bool readers_active_check(struct percpu_rw_semaphore *sem)
+{
+ if (per_cpu_sum(*sem->read_count) != 0)
+ return false;
+
+ /*
+ * If we observed the decrement; ensure we see the entire critical
+ * section. In the case of __readers_active_check we avoid the
+ * critical section sync, as the writer wakee will fully re-check
+ * to continue.
+ */
+
+ smp_mb(); /* C matches B */
+
+ return true;
+}
+
+static bool __readers_active_check(struct percpu_rw_semaphore *sem)
+{
+ return !(per_cpu_sum(*sem->read_count) !=0);
+}
+
int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
const char *name, struct lock_class_key *rwsem_key)
{
@@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
__this_cpu_dec(*sem->read_count);

/* Prod writer to recheck readers_active */
- swake_up(&sem->writer);
+ if (__readers_active_check(sem))
+ swake_up(&sem->writer);
}
EXPORT_SYMBOL_GPL(__percpu_up_read);

-#define per_cpu_sum(var) \
-({ \
- typeof(var) __sum = 0; \
- int cpu; \
- compiletime_assert_atomic_type(__sum); \
- for_each_possible_cpu(cpu) \
- __sum += per_cpu(var, cpu); \
- __sum; \
-})
-
-/*
- * Return true if the modular sum of the sem->read_count per-CPU variable is
- * zero. If this sum is zero, then it is stable due to the fact that if any
- * newly arriving readers increment a given counter, they will immediately
- * decrement that same counter.
- */
-static bool readers_active_check(struct percpu_rw_semaphore *sem)
-{
- if (per_cpu_sum(*sem->read_count) != 0)
- return false;
-
- /*
- * If we observed the decrement; ensure we see the entire critical
- * section.
- */
-
- smp_mb(); /* C matches B */
-
- return true;
-}
-
void percpu_down_write(struct percpu_rw_semaphore *sem)
{
/* Notify readers to take the slow path. */
--
2.6.6

2016-11-18 18:55:56

by Davidlohr Bueso

Subject: [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait

In the case of the percpu-rwsem, we don't need any of the
fancy/bulky wait-queue features, such as custom callbacks or
fine-grained wakeups.

Users that can convert to simple wait-queues are encouraged to
do so for the various rt and (indirect) performance benefits.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/percpu-rwsem.h | 6 +++---
kernel/locking/percpu-rwsem.c | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5b2e6159b744..82d54a4b9988 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -4,7 +4,7 @@
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/wait.h>
+#include <linux/swait.h>
#include <linux/rcu_sync.h>
#include <linux/lockdep.h>

@@ -12,7 +12,7 @@ struct percpu_rw_semaphore {
struct rcu_sync rss;
unsigned int __percpu *read_count;
struct rw_semaphore rw_sem;
- wait_queue_head_t writer;
+ struct swait_queue_head writer;
int readers_block;
};

@@ -22,7 +22,7 @@ static struct percpu_rw_semaphore name = { \
.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC), \
.read_count = &__percpu_rwsem_rc_##name, \
.rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \
- .writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer), \
+ .writer = __SWAIT_QUEUE_HEAD_INITIALIZER(name.writer), \
}

extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ce182599cf2e..cb71201855f2 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -1,7 +1,7 @@
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/wait.h>
+#include <linux/swait.h>
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/rcupdate.h>
@@ -18,7 +18,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
rcu_sync_init(&sem->rss, RCU_SCHED_SYNC);
__init_rwsem(&sem->rw_sem, name, rwsem_key);
- init_waitqueue_head(&sem->writer);
+ init_swait_queue_head(&sem->writer);
sem->readers_block = 0;
return 0;
}
@@ -103,7 +103,7 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
__this_cpu_dec(*sem->read_count);

/* Prod writer to recheck readers_active */
- wake_up(&sem->writer);
+ swake_up(&sem->writer);
}
EXPORT_SYMBOL_GPL(__percpu_up_read);

@@ -160,7 +160,7 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
*/

/* Wait for all now active readers to complete. */
- wait_event(sem->writer, readers_active_check(sem));
+ swait_event(sem->writer, readers_active_check(sem));
}
EXPORT_SYMBOL_GPL(percpu_down_write);

--
2.6.6

2016-11-18 18:55:58

by Davidlohr Bueso

Subject: [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/

Although this document is rather useless (the actual code is orders
of magnitude more detailed and informative), we might as well keep
it.

It is already there, and it can give a user a quick why/when for
using these vs regular rwsems.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Documentation/locking/percpu-rw-semaphore.txt | 27 +++++++++++++++++++++++++++
Documentation/percpu-rw-semaphore.txt | 27 ---------------------------
2 files changed, 27 insertions(+), 27 deletions(-)
create mode 100644 Documentation/locking/percpu-rw-semaphore.txt
delete mode 100644 Documentation/percpu-rw-semaphore.txt

diff --git a/Documentation/locking/percpu-rw-semaphore.txt b/Documentation/locking/percpu-rw-semaphore.txt
new file mode 100644
index 000000000000..7d3c82431909
--- /dev/null
+++ b/Documentation/locking/percpu-rw-semaphore.txt
@@ -0,0 +1,27 @@
+Percpu rw semaphores
+--------------------
+
+Percpu rw semaphores is a new read-write semaphore design that is
+optimized for locking for reading.
+
+The problem with traditional read-write semaphores is that when multiple
+cores take the lock for reading, the cache line containing the semaphore
+is bouncing between L1 caches of the cores, causing performance
+degradation.
+
+Locking for reading is very fast, it uses RCU and it avoids any atomic
+instruction in the lock and unlock path. On the other hand, locking for
+writing is very expensive, it calls synchronize_rcu() that can take
+hundreds of milliseconds.
+
+The lock is declared with "struct percpu_rw_semaphore" type.
+The lock is initialized percpu_init_rwsem, it returns 0 on success and
+-ENOMEM on allocation failure.
+The lock must be freed with percpu_free_rwsem to avoid memory leak.
+
+The lock is locked for read with percpu_down_read, percpu_up_read and
+for write with percpu_down_write, percpu_up_write.
+
+The idea of using RCU for optimized rw-lock was introduced by
+Eric Dumazet <[email protected]>.
+The code was written by Mikulas Patocka <[email protected]>
diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/percpu-rw-semaphore.txt
deleted file mode 100644
index 7d3c82431909..000000000000
--- a/Documentation/percpu-rw-semaphore.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-Percpu rw semaphores
---------------------
-
-Percpu rw semaphores is a new read-write semaphore design that is
-optimized for locking for reading.
-
-The problem with traditional read-write semaphores is that when multiple
-cores take the lock for reading, the cache line containing the semaphore
-is bouncing between L1 caches of the cores, causing performance
-degradation.
-
-Locking for reading is very fast, it uses RCU and it avoids any atomic
-instruction in the lock and unlock path. On the other hand, locking for
-writing is very expensive, it calls synchronize_rcu() that can take
-hundreds of milliseconds.
-
-The lock is declared with "struct percpu_rw_semaphore" type.
-The lock is initialized percpu_init_rwsem, it returns 0 on success and
--ENOMEM on allocation failure.
-The lock must be freed with percpu_free_rwsem to avoid memory leak.
-
-The lock is locked for read with percpu_down_read, percpu_up_read and
-for write with percpu_down_write, percpu_up_write.
-
-The idea of using RCU for optimized rw-lock was introduced by
-Eric Dumazet <[email protected]>.
-The code was written by Mikulas Patocka <[email protected]>
--
2.6.6
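
For reference, a minimal usage sketch of the API described in this document
(hypothetical caller code; the percpu_* calls are the ones named in the text,
everything else is made up for illustration):

	static struct percpu_rw_semaphore big_lock;

	static int __init my_subsys_init(void)
	{
		/* Returns 0 on success, -ENOMEM on allocation failure. */
		return percpu_init_rwsem(&big_lock);
	}

	static void my_reader(void)
	{
		percpu_down_read(&big_lock);	/* fast: no atomics, no cache-line bouncing */
		/* ... read-side critical section ... */
		percpu_up_read(&big_lock);
	}

	static void my_writer(void)
	{
		percpu_down_write(&big_lock);	/* slow: may wait in synchronize_rcu() */
		/* ... write-side critical section ... */
		percpu_up_write(&big_lock);
	}

	/* On teardown, to avoid leaking the per-CPU counter: */
	/* percpu_free_rwsem(&big_lock); */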

2016-11-21 12:23:50

by Oleg Nesterov

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On 11/18, Davidlohr Bueso wrote:
>
> +static bool __readers_active_check(struct percpu_rw_semaphore *sem)
> +{
> + return !(per_cpu_sum(*sem->read_count) !=0);
> +}

Hmm,

return per_cpu_sum(*sem->read_count) == 0;

looks more clear, but this is minor,

> int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
> const char *name, struct lock_class_key *rwsem_key)
> {
> @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> __this_cpu_dec(*sem->read_count);
>
> /* Prod writer to recheck readers_active */
> - swake_up(&sem->writer);
> + if (__readers_active_check(sem))
> + swake_up(&sem->writer);

Suppose we have 2 active readers which call __percpu_up_read() at the same
time and the pending writer sleeps.

What guarantees that one of these readers will observe per_cpu_sum() == 0 ?
They both can read the old value of the remote per-cpu counter, no?

Oleg.

2016-11-21 12:29:41

by Peter Zijlstra

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On Mon, Nov 21, 2016 at 01:23:44PM +0100, Oleg Nesterov wrote:
> On 11/18, Davidlohr Bueso wrote:
> >
> > +static bool __readers_active_check(struct percpu_rw_semaphore *sem)
> > +{
> > + return !(per_cpu_sum(*sem->read_count) !=0);
> > +}
>
> Hmm,
>
> return per_cpu_sum(*sem->read_count) == 0;
>
> looks more clear, but this is minor,

Very much so; that must be one of the most convoluted statements
possible :-).

>
> > int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
> > const char *name, struct lock_class_key *rwsem_key)
> > {
> > @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> > __this_cpu_dec(*sem->read_count);
> >
> > /* Prod writer to recheck readers_active */
> > - swake_up(&sem->writer);
> > + if (__readers_active_check(sem))
> > + swake_up(&sem->writer);
>
> Suppose we have 2 active readers which call __percpu_up_read() at the same
> time and the pending writer sleeps.
>
> What guarantees that one of these readers will observe per_cpu_sum() == 0 ?
> They both can read the old value of the remote per-cpu counter, no?

In particular, you're thinking of what provides the guarantee that the
woken CPU observes the same state the waker saw? Isn't this one of the
Program-Order guarantees the scheduler _should_ provide?

2016-11-21 12:47:30

by Oleg Nesterov

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On 11/21, Peter Zijlstra wrote:
>
> On Mon, Nov 21, 2016 at 01:23:44PM +0100, Oleg Nesterov wrote:
> > > int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
> > > const char *name, struct lock_class_key *rwsem_key)
> > > {
> > > @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> > > __this_cpu_dec(*sem->read_count);
> > >
> > > /* Prod writer to recheck readers_active */
> > > - swake_up(&sem->writer);
> > > + if (__readers_active_check(sem))
> > > + swake_up(&sem->writer);
> >
> > Suppose we have 2 active readers which call __percpu_up_read() at the same
> > time and the pending writer sleeps.
> >
> > What guarantees that one of these readers will observe per_cpu_sum() == 0 ?
> > They both can read the old value of the remote per-cpu counter, no?
>
> In particular, you're thinking of what provides the guarantee that the
> woken CPU observes the same state the waker saw?

No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and
thus the writer won't be woken up. Till the next down_read/up_read.

Suppose that we have 2 CPUs, both counters == 1, and both readers
decrement their counter at the same time.

    READER_ON_CPU_0                 READER_ON_CPU_1

        --ctr_0;                        --ctr_1;

        if (ctr_0 + ctr_1)              if (ctr_0 + ctr_1)
                wakeup();                       wakeup();

Why can't we miss a wakeup?

This patch doesn't even add a barrier, but I think wmb() won't be enough
anyway.

Oleg.

2016-11-21 12:55:42

by Oleg Nesterov

Subject: Re: [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait

On 11/18, Davidlohr Bueso wrote:
>
> @@ -12,7 +12,7 @@ struct percpu_rw_semaphore {
> struct rcu_sync rss;
> unsigned int __percpu *read_count;
> struct rw_semaphore rw_sem;
> - wait_queue_head_t writer;
> + struct swait_queue_head writer;

I won't argue, but even swait_queue_head is overkill in this case.

We can just add "struct task_struct *writer" into percpu_rw_semaphore,

__percpu_up_read:

	rcu_read_lock();
	writer = task_rcu_dereference(&sem->writer);
	if (writer)
		wake_up_process(writer);
	rcu_read_unlock();

percpu_down_write() can set sem->writer = current and do the simple
while-not-condition-schedule() loop.

But this probably needs a couple of new helpers, and probably they
can have more users.

Oleg.
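
A sketch of the writer side being suggested might look as follows
(hypothetical code, not from the message above; it is essentially what the
v2 patch later in this thread implements):

	/* percpu_down_write(), after readers were switched to the slow path: */
	WRITE_ONCE(sem->writer, current);
	for (;;) {
		/* Publish our task and state before re-checking the counters. */
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (readers_active_check(sem))
			break;
		schedule();
	}
	WRITE_ONCE(sem->writer, NULL);
	__set_current_state(TASK_RUNNING);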

2016-11-21 15:07:30

by Oleg Nesterov

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On 11/21, Oleg Nesterov wrote:
>
> No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and
> thus the writer won't be woken up. Till the next down_read/up_read.
>
> Suppose that we have 2 CPUs, both counters == 1, and both readers
> decrement their counter at the same time.
>
>     READER_ON_CPU_0                 READER_ON_CPU_1
>
>         --ctr_0;                        --ctr_1;
>
>         if (ctr_0 + ctr_1)              if (ctr_0 + ctr_1)
>                 wakeup();                       wakeup();
>
> Why can't we miss a wakeup?
>
> This patch doesn't even add a barrier, but I think wmb() won't be enough
> anyway.

And in fact I am not sure this optimization makes sense... But it would be
nice to avoid the wake_up() when the writer sleeps in rcu_sync_enter(), or
when this is the "slow mode" sem (cgroup_threadgroup_rwsem).

I need to re-check, but what do you think about the change below?

Oleg.

--- x/kernel/locking/percpu-rwsem.c
+++ x/kernel/locking/percpu-rwsem.c
@@ -103,7 +103,9 @@ void __percpu_up_read(struct percpu_rw_s
__this_cpu_dec(*sem->read_count);

/* Prod writer to recheck readers_active */
- wake_up(&sem->writer);
+ smp_mb();
+ if (sem->readers_block)
+ wake_up(&sem->writer);
}
EXPORT_SYMBOL_GPL(__percpu_up_read);


2016-11-21 17:26:58

by Davidlohr Bueso

Subject: Re: [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait

On Mon, 21 Nov 2016, Oleg Nesterov wrote:

>On 11/18, Davidlohr Bueso wrote:
>>
>> @@ -12,7 +12,7 @@ struct percpu_rw_semaphore {
>> struct rcu_sync rss;
>> unsigned int __percpu *read_count;
>> struct rw_semaphore rw_sem;
>> - wait_queue_head_t writer;
>> + struct swait_queue_head writer;
>
>I won't argue, but even swait_queue_head is overkill in this case.
>
>We can just add "struct task_struct *writer" into percpu_rw_semaphore,

Given that this is how all things locking/ work, I very much agree with
you; lemme send a v2.

2016-11-22 03:59:22

by Davidlohr Bueso

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On Mon, 21 Nov 2016, Oleg Nesterov wrote:

>On 11/21, Oleg Nesterov wrote:
>>
>> No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and
>> thus the writer won't be woken up. Till the next down_read/up_read.
>>
>> Suppose that we have 2 CPUs, both counters == 1, and both readers
>> decrement their counter at the same time.
>>
>>     READER_ON_CPU_0                 READER_ON_CPU_1
>>
>>         --ctr_0;                        --ctr_1;
>>
>>         if (ctr_0 + ctr_1)              if (ctr_0 + ctr_1)
>>                 wakeup();                       wakeup();
>>
>> Why can't we miss a wakeup?

But the patch is really: if (!(ctr_0 + ctr_1)). Wrt stale values, is it the
case that, due to the data dependency, we only see the real value of this
CPU's ctr, with no guarantee for the other CPUs? If so, I had not considered
that scenario, and yes, we'd need stronger guarantees.

I have to wonder if other users of per_cpu_sum() would fall into a similar
trap. Hmm, and each user seems to implement its own copy of the same thing.

>And in fact I am not sure this optimization makes sense... But it would be
>nice to avoid wake_up() when the writer sleeps in rcu_sync_enter(). Or this
>is the "slow mode" sem (cgroup_threadgroup_rwsem).

Why do you think using per_cpu_sum() does not make sense? As mentioned in the
changelog it optimizes for incoming readers while the writer is doing sync_enter
and getting the regular rwsem. What am I missing?

>
>I need to re-check, but what do you think about the change below?

While optimizing for multiple writers (rcu_sync_enter) is certainly valid
(at least considering the cgroups rwsem you mention), I think that my
heuristic covers the otherwise more common case. Could both optimizations
not work together?

Of course, the window where readers_block == 1 is quite large, so there
can be a lot of false positives.

Thanks,
Davidlohr

2016-11-23 14:43:17

by Oleg Nesterov

Subject: Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups

On 11/21, Davidlohr Bueso wrote:
>
> On Mon, 21 Nov 2016, Oleg Nesterov wrote:
>
>> On 11/21, Oleg Nesterov wrote:
>>>
>>> No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and
>>> thus the writer won't be woken up. Till the next down_read/up_read.
>>>
>>> Suppose that we have 2 CPUs, both counters == 1, and both readers
>>> decrement their counter at the same time.
>>>
>>>     READER_ON_CPU_0                 READER_ON_CPU_1
>>>
>>>         --ctr_0;                        --ctr_1;
>>>
>>>         if (ctr_0 + ctr_1)              if (ctr_0 + ctr_1)
>>>                 wakeup();                       wakeup();
>>>
>>> Why can't we miss a wakeup?
>
> But the patch is really: if (!(ctr_0 + ctr_1)).

Of course, I meant if (ctr_0 + ctr_1 == 0).

>> And in fact I am not sure this optimization makes sense... But it would be
>> nice to avoid the wake_up() when the writer sleeps in rcu_sync_enter(), or
>> when this is the "slow mode" sem (cgroup_threadgroup_rwsem).
>
> Why do you think using per_cpu_sum() does not make sense? As mentioned in the
> changelog it optimizes for incoming readers while the writer is doing sync_enter
> and getting the regular rwsem. What am I missing?

And this does make sense, but see below,

>> I need to re-check, but what do you think about the change below?
>
> While optimizing for multiple writers (rcu_sync_enter) is certainly valid
> (at least considering the cgroups rwsem you mention),

No, it is not for multiple writers. rcu_sync_enter() is slow, and new
readers can come and acquire/release this lock in the meantime. And if it is
a "slow mode" sem, then every up() does a wakeup, which we want to eliminate.

But after sem->readers_block is already true, I am not sure the additional
per_cpu_sum() is a win (even if it were correct); new readers can't come in.

Except for __percpu_down_read()->__percpu_up_read(), which we want to
optimize too, but in that case we do not need per_cpu_sum() either.

I'll try to make a patch this week... I had this optimization in mind from
the very beginning, I even mentioned it during the last discussion, but
never had time. Basically we should not inc if readers_block == T.

Oleg.

2016-12-03 02:19:14

by Davidlohr Bueso

Subject: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

The use of any kind of wait queue is overkill for pcpu-rwsems.
While one option would be to use the less heavy simple (swait)
flavor, even that is still too much for what pcpu-rwsems need. For one,
we do not care about any sort of queuing in that the only (rare) time
writers (and readers, for that matter) are queued is when trying to
acquire the regular contended rw_sem. There cannot be any further
queuing as writers are serialized by the rw_sem in the first place.

This patch, therefore, implements custom wait/wake, with an rcu-aware
writer task pointer. The only time this pointer is non-nil is while a
writer is determining whether it is going to block, and it is reset as
soon as we know that the percpu_down_write() call has succeeded. All this
is obviously done while holding the regular rw_sem. As such, we can avoid
the queue handling and locking overhead (although the current code only
ever takes the waitqueue spinlock fastpath, so the win would not be very big).

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/percpu-rwsem.h | 5 ++---
kernel/locking/percpu-rwsem.c | 26 +++++++++++++++++++++-----
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5b2e6159b744..9942b7e8bde8 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -4,7 +4,6 @@
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/wait.h>
#include <linux/rcu_sync.h>
#include <linux/lockdep.h>

@@ -12,7 +11,7 @@ struct percpu_rw_semaphore {
struct rcu_sync rss;
unsigned int __percpu *read_count;
struct rw_semaphore rw_sem;
- wait_queue_head_t writer;
+ struct task_struct *writer; /* blocked writer */
int readers_block;
};

@@ -22,7 +21,7 @@ static struct percpu_rw_semaphore name = { \
.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC), \
.read_count = &__percpu_rwsem_rc_##name, \
.rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \
- .writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer), \
+ .writer = NULL, \
}

extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ce182599cf2e..7856a77396d3 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -1,7 +1,6 @@
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/wait.h>
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/rcupdate.h>
@@ -18,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
rcu_sync_init(&sem->rss, RCU_SCHED_SYNC);
__init_rwsem(&sem->rw_sem, name, rwsem_key);
- init_waitqueue_head(&sem->writer);
+ sem->writer = NULL;
sem->readers_block = 0;
return 0;
}
@@ -94,6 +93,8 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);

void __percpu_up_read(struct percpu_rw_semaphore *sem)
{
+ struct task_struct *writer;
+
smp_mb(); /* B matches C */
/*
* In other words, if they see our decrement (presumably to aggregate
@@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
* critical section.
*/
__this_cpu_dec(*sem->read_count);

+ rcu_read_lock();
+ writer = rcu_dereference(sem->writer);
+
/* Prod writer to recheck readers_active */
- wake_up(&sem->writer);
+ if (writer)
+ wake_up_process(writer);
+ rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(__percpu_up_read);

@@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
* will wait for them.
*/

- /* Wait for all now active readers to complete. */
- wait_event(sem->writer, readers_active_check(sem));
+ WRITE_ONCE(sem->writer, current);
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ if (readers_active_check(sem))
+ break;
+
+ schedule();
+ }
+
+ rcu_assign_pointer(sem->writer, NULL);
+ __set_current_state(TASK_RUNNING);
}
EXPORT_SYMBOL_GPL(percpu_down_write);

--
2.6.6

2016-12-05 08:36:49

by Peter Zijlstra

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

On Fri, Dec 02, 2016 at 06:18:39PM -0800, Davidlohr Bueso wrote:
> @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> */
> __this_cpu_dec(*sem->read_count);
>
> + rcu_read_lock();
> + writer = rcu_dereference(sem->writer);

I don't think this is correct; I think Oleg suggested using
task_rcu_dereference(), which is a giant pile of magic.

The problem is that task_struct isn't RCU protected as such.

> +
> /* Prod writer to recheck readers_active */
> - wake_up(&sem->writer);
> + if (writer)
> + wake_up_process(writer);
> + rcu_read_unlock();
> }
> EXPORT_SYMBOL_GPL(__percpu_up_read);
>
> @@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
> * will wait for them.
> */
>
> - /* Wait for all now active readers to complete. */
> - wait_event(sem->writer, readers_active_check(sem));
> + WRITE_ONCE(sem->writer, current);

So this one matches rcu_dereference(), which is weird, because you now
have unmatched barriers.

> + for (;;) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + if (readers_active_check(sem))
> + break;
> +
> + schedule();
> + }
> +
> + rcu_assign_pointer(sem->writer, NULL);

And this one does not; with the value being NULL, it actually reverts to
WRITE_ONCE().

> + __set_current_state(TASK_RUNNING);
> }
> EXPORT_SYMBOL_GPL(percpu_down_write);

2016-12-05 11:27:00

by Oleg Nesterov

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

Davidlohr, Peter, I'll try to read this patch later, just one note.

On 12/05, Peter Zijlstra wrote:
>
> On Fri, Dec 02, 2016 at 06:18:39PM -0800, Davidlohr Bueso wrote:
> > @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> > */
> > __this_cpu_dec(*sem->read_count);
> >
> > + rcu_read_lock();
> > + writer = rcu_dereference(sem->writer);
>
> Don't think this is correct, I think Oleg suggested using
> task_rcu_dereference(), which is a giant pile of magic.

Yes, but on second thought task_rcu_dereference() won't really help,
and we can just use rcu_dereference().

> The problem is that task_struct isn't RCU protected as such.

Yes. But percpu_down_write() should not be used after exit_notify(), so we
can rely on rcu_read_lock(): release_task()->call_rcu(delayed_put_task_struct)
can't be called until an exiting task passes exit_notify().

But then we probably need WARN_ON(current->exit_state) in percpu_down_write().

And personally I think this change should add the new helpers; they could
have more users. Something like:

	struct xxx {
		struct task_struct *task;
	};

	xxx_wake_up(struct xxx *xxx)
	{
		rcu_read_lock();
		task = rcu_dereference(xxx->task);
		if (task)
			wake_up_process(task);
		rcu_read_unlock();
	}


	#define xxx_wait_event(xxx, event) {
		// comment to explain why
		WARN_ON(current->exit_state);

		xxx->task = current;

		...
	}

Oleg.

2016-12-05 11:43:43

by Oleg Nesterov

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

On 12/05, Oleg Nesterov wrote:
>
> Yes, but on second thought task_rcu_dereference() won't really help,

I forgot to explain why, see below.

> #define xxx_wait_event(xxx, event) {
> // comment to explain why
> WARN_ON(current->exit_state);

Otherwise this process/thread can already be (auto)reaped, and the wakeup
can't rely on RCU.

And task_rcu_dereference() can't help because it can return NULL in
this case.

Oleg.

2016-12-05 17:14:15

by Oleg Nesterov

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

On 12/02, Davidlohr Bueso wrote:
>
> @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
> */
> __this_cpu_dec(*sem->read_count);
>
> + rcu_read_lock();
> + writer = rcu_dereference(sem->writer);
> +
> /* Prod writer to recheck readers_active */
> - wake_up(&sem->writer);
> + if (writer)
> + wake_up_process(writer);
> + rcu_read_unlock();

This needs a barrier between __this_cpu_dec() and rcu_dereference(), I think.

> @@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
> * will wait for them.
> */
>
> - /* Wait for all now active readers to complete. */
> - wait_event(sem->writer, readers_active_check(sem));
> + WRITE_ONCE(sem->writer, current);
> + for (;;) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + if (readers_active_check(sem))
> + break;

This looks fine; we can rely on set_current_state(), which inserts a barrier
between the WRITE_ONCE() and readers_active_check(). So we do not even need
the WRITE_ONCE().

And the fact that this needs the barriers and the comments makes me think,
again, that you should add the new helpers.

Oleg.
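
Concretely, the reader-side ordering being asked for could look something
like the following sketch (hypothetical; the second smp_mb() illustrates the
comment above and is not code from the thread):

	void __percpu_up_read(struct percpu_rw_semaphore *sem)
	{
		struct task_struct *writer;

		smp_mb(); /* B matches C */
		__this_cpu_dec(*sem->read_count);

		/*
		 * Order the decrement against the sem->writer load: either
		 * the writer observes our decrement, or we observe its task
		 * pointer and wake it up. Without a full barrier here, both
		 * sides could miss each other and the wakeup would be lost.
		 */
		smp_mb();

		rcu_read_lock();
		writer = rcu_dereference(sem->writer);
		if (writer)
			wake_up_process(writer);
		rcu_read_unlock();
	}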

2016-12-05 17:19:39

by Oleg Nesterov

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

On 12/05, Peter Zijlstra wrote:
>
> > + for (;;) {
> > + set_current_state(TASK_UNINTERRUPTIBLE);
> > +
> > + if (readers_active_check(sem))
> > + break;
> > +
> > + schedule();
> > + }
> > +
> > + rcu_assign_pointer(sem->writer, NULL);
>
> And this one does not; with the value being NULL, it actually reverts to
> WRITE_ONCE().

Do we really care? We do not even need WRITE_ONCE() afaics; this is like the
__set_current_state(TASK_RUNNING) after the main loop. We can't avoid
spurious wakeups after returning from percpu_down_write() anyway.

Oleg.

2016-12-05 17:37:23

by Davidlohr Bueso

Subject: Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues

On Mon, 05 Dec 2016, Oleg Nesterov wrote:

>Yes. But percpu_down_write() should not be used after exit_notify(), so we
>can rely on rcu_read_lock(): release_task()->call_rcu(delayed_put_task_struct)
>can't be called until an exiting task passes exit_notify().
>
>But then we probably need WARN_ON(current->exit_state) in percpu_down_write().

Hmm, my immediate thought would have been doing a PF_EXITING check, but of
course that enlarges the window in which the warn can trigger, while still
maintaining what you are saying, in that percpu_down_write should not be used
after do_exit/exit_notify. Furthermore, reading the comment in
task_rcu_dereference, I get your point: we can lose the reference to the
task, iow a busted rcu read critical region.

>
>And personally I think this change should add the new helpers, they can have
>more users. Something like
>
> struct xxx {

What do you think of s/xxx/rcuwait?

Thanks,
Davidlohr
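
For context, the helpers being proposed here could plausibly take the
following shape, combining Oleg's xxx outline above with the barrier and
exit_state points raised in this thread (a sketch only; the rcuwait name
and details are this thread's proposal, not a settled API):

	struct rcuwait {
		struct task_struct *task;
	};

	static inline void rcuwait_wake_up(struct rcuwait *w)
	{
		struct task_struct *task;

		rcu_read_lock();
		task = rcu_dereference(w->task);
		if (task)
			wake_up_process(task);
		rcu_read_unlock();
	}

	#define rcuwait_wait_event(w, condition)			\
	({								\
		/*							\
		 * Must not be used after exit_notify(): the task	\
		 * could otherwise already be (auto)reaped, and the	\
		 * RCU read-side section in rcuwait_wake_up() would	\
		 * no longer protect the task_struct.			\
		 */							\
		WARN_ON(current->exit_state);				\
									\
		(w)->task = current;					\
		for (;;) {						\
			/*						\
			 * set_current_state() provides the barrier	\
			 * between publishing ->task and checking	\
			 * the condition.				\
			 */						\
			set_current_state(TASK_UNINTERRUPTIBLE);	\
			if (condition)					\
				break;					\
			schedule();					\
		}							\
		(w)->task = NULL;					\
		__set_current_state(TASK_RUNNING);			\
	})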