From: Daniel Wagner <[email protected]>
Hi,
To avoid a size increase of struct completion, I split the done
field into two 16-bit halves.
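The resulting layout, excerpted from the patch below:

struct completion {
        union {
                struct {
                        unsigned short flags;
                        unsigned short done;
                };
                unsigned int val;
        };
        struct swait_queue_head wait;
};

bloat-o-meter reports: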
add/remove: 3/0 grow/shrink: 3/10 up/down: 242/-236 (6)
function                                      old     new   delta
swake_up_all_locked                             -     181    +181
__kstrtab_swake_up_all_locked                   -      20     +20
__ksymtab_swake_up_all_locked                   -      16     +16
complete_all                                   73      87     +14
try_wait_for_completion                        99     107      +8
completion_done                                40      43      +3
complete                                       73      65      -8
wait_for_completion_timeout                   283     265     -18
wait_for_completion_killable_timeout          319     301     -18
wait_for_completion_io_timeout                283     265     -18
wait_for_completion_io                        275     257     -18
wait_for_completion                           275     257     -18
wait_for_completion_interruptible_timeout     304     285     -19
kexec_purgatory                             26473   26449     -24
wait_for_completion_killable                  544     499     -45
wait_for_completion_interruptible             522     472     -50
Obviously, this reduces the number of waiters that can be handled to
~32k (USHRT_MAX/2) instead of ~2G (UINT_MAX/2). I have started to
review all complete_all() users (I am about halfway through) and none
so far seems to use a huge number of waiters. In fact, it looks as if
most usages of complete_all() could simply be turned into complete();
a typical single-waiter pattern is sketched below. Let's see what the
rest of the complete_all() users do.
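A hypothetical single-waiter pattern (the names, including the
request_fw_async() helper, are made up) where complete() would do just
as well:

#include <linux/completion.h>

/* Hypothetical example: exactly one task ever waits on this
 * completion, so complete() would be sufficient where complete_all()
 * is used today. */
static DECLARE_COMPLETION(fw_loaded);

static void fw_load_done(void *ctx)
{
        complete_all(&fw_loaded);       /* one waiter: complete() would do */
}

static int dev_setup(void)
{
        request_fw_async(fw_load_done); /* made-up async registration */
        wait_for_completion(&fw_loaded);
        return 0;
}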
So far I have found only one hot-path user of complete_all(): ceph's
rbd code. Luckily I was able to play with fio against a real Ceph
cluster.
[global]
ioengine=rbd
pool=rbd
rbdname=fio_test
rw=rw
bs=4k
runtime=60
[rbd_iodepth32]
iodepth=32
baseline (4.5.0-00038-g5b87ae2):
rbd_iodepth32: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=2370: Wed Mar 30 15:43:52 2016
read : io=85596KB, bw=1425.8KB/s, iops=356, runt= 60038msec
slat (usec): min=0, max=73, avg= 1.76, stdev= 3.91
clat (usec): min=27, max=174894, avg=28943.97, stdev=31059.18
lat (usec): min=33, max=174908, avg=28945.73, stdev=31059.18
clat percentiles (usec):
| 1.00th=[ 179], 5.00th=[ 708], 10.00th=[ 1112], 20.00th=[ 1720],
| 30.00th=[ 2256], 40.00th=[ 2832], 50.00th=[24704], 60.00th=[35584],
| 70.00th=[44288], 80.00th=[54528], 90.00th=[72192], 95.00th=[86528],
| 99.00th=[116224], 99.50th=[136192], 99.90th=[148480], 99.95th=[175104],
| 99.99th=[175104]
bw (KB /s): min= 464, max= 2576, per=100.00%, avg=1435.27, stdev=451.79
write: io=85464KB, bw=1423.6KB/s, iops=355, runt= 60038msec
slat (usec): min=0, max=63, avg= 1.99, stdev= 4.05
clat (msec): min=6, max=177, avg=60.84, stdev=24.07
lat (msec): min=6, max=177, avg=60.84, stdev=24.07
clat percentiles (msec):
| 1.00th=[ 23], 5.00th=[ 30], 10.00th=[ 34], 20.00th=[ 41],
| 30.00th=[ 46], 40.00th=[ 50], 50.00th=[ 57], 60.00th=[ 64],
| 70.00th=[ 72], 80.00th=[ 81], 90.00th=[ 93], 95.00th=[ 104],
| 99.00th=[ 137], 99.50th=[ 145], 99.90th=[ 176], 99.95th=[ 176],
| 99.99th=[ 178]
bw (KB /s): min= 543, max= 2401, per=100.00%, avg=1432.49, stdev=437.34
lat (usec) : 50=0.15%, 100=0.20%, 250=0.38%, 500=0.84%, 750=1.23%
lat (usec) : 1000=1.75%
lat (msec) : 2=8.49%, 4=10.52%, 10=0.21%, 20=0.26%, 50=34.17%
lat (msec) : 100=36.94%, 250=4.86%
cpu : usr=0.29%, sys=0.07%, ctx=2950, majf=0, minf=69
IO depths : 1=2.4%, 2=4.7%, 4=9.5%, 8=19.8%, 16=58.9%, 32=4.7%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=96.1%, 8=0.4%, 16=0.8%, 32=2.7%, 64=0.0%, >=64=0.0%
issued : total=r=21399/w=21366/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: io=85596KB, aggrb=1425KB/s, minb=1425KB/s, maxb=1425KB/s, mint=60038msec, maxt=60038msec
WRITE: io=85464KB, aggrb=1423KB/s, minb=1423KB/s, maxb=1423KB/s, mint=60038msec, maxt=60038msec
Disk stats (read/write):
sde: ios=3/7, merge=0/0, ticks=1/0, in_queue=1, util=0.00%
swait-completion (4.5.0-00040-g5e20c37):
rbd_iodepth32: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=2655: Wed Mar 30 15:40:59 2016
read : io=85180KB, bw=1417.3KB/s, iops=354, runt= 60103msec
slat (usec): min=0, max=75, avg= 1.74, stdev= 4.10
clat (usec): min=25, max=200465, avg=29529.78, stdev=31181.31
lat (usec): min=32, max=200466, avg=29531.52, stdev=31181.29
clat percentiles (usec):
| 1.00th=[ 239], 5.00th=[ 788], 10.00th=[ 1256], 20.00th=[ 1800],
| 30.00th=[ 2416], 40.00th=[ 3216], 50.00th=[26752], 60.00th=[36608],
| 70.00th=[44288], 80.00th=[54016], 90.00th=[72192], 95.00th=[90624],
| 99.00th=[118272], 99.50th=[138240], 99.90th=[154624], 99.95th=[175104],
| 99.99th=[195584]
bw (KB /s): min= 418, max= 2567, per=100.00%, avg=1420.58, stdev=431.94
write: io=85136KB, bw=1416.6KB/s, iops=354, runt= 60103msec
slat (usec): min=0, max=413, avg= 1.98, stdev= 5.15
clat (msec): min=7, max=204, avg=60.72, stdev=24.68
lat (msec): min=7, max=204, avg=60.73, stdev=24.68
clat percentiles (msec):
| 1.00th=[ 24], 5.00th=[ 30], 10.00th=[ 35], 20.00th=[ 41],
| 30.00th=[ 46], 40.00th=[ 51], 50.00th=[ 57], 60.00th=[ 62],
| 70.00th=[ 71], 80.00th=[ 81], 90.00th=[ 94], 95.00th=[ 108],
| 99.00th=[ 137], 99.50th=[ 153], 99.90th=[ 202], 99.95th=[ 204],
| 99.99th=[ 204]
bw (KB /s): min= 544, max= 2304, per=100.00%, avg=1420.61, stdev=397.98
lat (usec) : 50=0.14%, 100=0.10%, 250=0.30%, 500=0.73%, 750=1.08%
lat (usec) : 1000=1.53%
lat (msec) : 2=8.29%, 4=10.62%, 10=0.50%, 20=0.23%, 50=34.29%
lat (msec) : 100=37.51%, 250=4.69%
cpu : usr=0.31%, sys=0.08%, ctx=2827, majf=0, minf=31
IO depths : 1=2.4%, 2=4.7%, 4=9.5%, 8=20.0%, 16=58.8%, 32=4.6%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=96.1%, 8=0.3%, 16=0.8%, 32=2.8%, 64=0.0%, >=64=0.0%
issued : total=r=21295/w=21284/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: io=85180KB, aggrb=1417KB/s, minb=1417KB/s, maxb=1417KB/s, mint=60103msec, maxt=60103msec
WRITE: io=85136KB, aggrb=1416KB/s, minb=1416KB/s, maxb=1416KB/s, mint=60103msec, maxt=60103msec
Disk stats (read/write):
sde: ios=0/13, merge=0/1, ticks=0/0, in_queue=0, util=0.00%
My micro benchmark shows some small improvements when there is only
one waiter, which I think is the common use case. You can clearly see
the latency increase when complete_all() is called from irq context
and there is more than one waiter, because the wake-up of the
remaining waiters is deferred.

irqwork-[0|1]:   complete_all() is called from thread (0) or irq (1) context
counter-10000:   number of iterations
waiter_nr-[1-5]: number of waiters

In the tables below, the second row lists each value followed by its
change versus the baseline in percent (positive numbers are
improvements).
proc-irqwork-0-counter-10000-waiter_nr-1
mean std var max min
4.5.0-00038-g5b87ae2 0.01515 0.002489 6.194e-06 0.028 0.012
4.5.0-00040-g5e20c37 0.01483 2.09 0.002614 -5.05 6.835e-06 -10.35 0.029 -3.57 0.011 8.33
proc-irqwork-0-counter-10000-waiter_nr-2
mean std var max min
4.5.0-00038-g5b87ae2 0.1081 0.005127 2.629e-05 0.161 0.098
4.5.0-00040-g5e20c37 0.1066 1.40 0.003345 34.76 1.119e-05 57.43 0.142 11.80 0.101 -3.06
proc-irqwork-0-counter-10000-waiter_nr-3
mean std var max min
4.5.0-00038-g5b87ae2 0.1185 0.004941 2.441e-05 0.181 0.111
4.5.0-00040-g5e20c37 0.1254 -5.78 0.004619 6.51 2.134e-05 12.60 0.176 2.76 0.121 -9.01
proc-irqwork-0-counter-10000-waiter_nr-4
mean std var max min
4.5.0-00038-g5b87ae2 0.1343 0.004129 1.705e-05 0.173 0.128
4.5.0-00040-g5e20c37 0.1289 4.03 0.004587 -11.11 2.104e-05 -23.45 0.177 -2.31 0.124 3.13
proc-irqwork-0-counter-10000-waiter_nr-5
mean std var max min
4.5.0-00038-g5b87ae2 0.1463 0.004261 1.815e-05 0.175 0.141
4.5.0-00040-g5e20c37 0.1469 -0.44 0.004617 -8.35 2.131e-05 -17.40 0.201 -14.86 0.141 0.00
proc-irqwork-1-counter-10000-waiter_nr-1
mean std var max min
4.5.0-00038-g5b87ae2 0.01992 0.002911 8.476e-06 0.038 0.019
4.5.0-00040-g5e20c37 0.01932 3.00 0.002845 2.28 8.095e-06 4.50 0.037 2.63 0.018 5.26
proc-irqwork-1-counter-10000-waiter_nr-2
mean std var max min
4.5.0-00038-g5b87ae2 0.1357 0.005088 2.589e-05 0.184 0.118
4.5.0-00040-g5e20c37 0.1558 -14.85 0.004727 7.10 2.234e-05 13.69 0.213 -15.76 0.151 -27.97
proc-irqwork-1-counter-10000-waiter_nr-3
mean std var max min
4.5.0-00038-g5b87ae2 0.1309 0.004545 2.066e-05 0.173 0.119
4.5.0-00040-g5e20c37 0.1598 -22.08 0.005181 -13.98 2.684e-05 -29.93 0.226 -30.64 0.155 -30.25
proc-irqwork-1-counter-10000-waiter_nr-4
mean std var max min
4.5.0-00038-g5b87ae2 0.1413 0.004988 2.488e-05 0.168 0.131
4.5.0-00040-g5e20c37 0.1651 -16.90 0.005854 -17.36 3.427e-05 -37.74 0.24 -42.86 0.159 -21.37
proc-irqwork-1-counter-10000-waiter_nr-5
mean std var max min
4.5.0-00038-g5b87ae2 0.1478 0.004044 1.635e-05 0.181 0.144
4.5.0-00040-g5e20c37 0.1755 -18.76 0.004753 -17.54 2.259e-05 -38.15 0.231 -27.62 0.166 -15.28
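(The benchmark code itself is not part of this posting. For reference,
a minimal sketch of a one-waiter round-trip measurement of this kind,
as a kernel module, could look like the following; illustrative only,
not the code used for the numbers above.)

#include <linux/completion.h>
#include <linux/kthread.h>
#include <linux/ktime.h>
#include <linux/module.h>

#define ITERS 10000

static DECLARE_COMPLETION(ping);
static DECLARE_COMPLETION(pong);

/* Waiter: wait for a ping, answer with a pong, ITERS times. */
static int waiter_fn(void *unused)
{
        int i;

        for (i = 0; i < ITERS; i++) {
                wait_for_completion(&ping);
                complete(&pong);
        }
        return 0;
}

static int __init bench_init(void)
{
        struct task_struct *t;
        ktime_t t0;
        int i;

        t = kthread_run(waiter_fn, NULL, "completion-bench");
        if (IS_ERR(t))
                return PTR_ERR(t);

        /* Time ITERS complete()/wait_for_completion() round trips. */
        t0 = ktime_get();
        for (i = 0; i < ITERS; i++) {
                complete(&ping);
                wait_for_completion(&pong);
        }
        pr_info("%d round trips in %lld ns\n", ITERS,
                ktime_to_ns(ktime_sub(ktime_get(), t0)));
        return 0;
}
module_init(bench_init);

MODULE_LICENSE("GPL");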
With Tom Zanussi's hist patches I was able to gather some more data
on the distribution of the latencies. This is with only one waiter.
irqwork-0
4.5.0-00038-g5b87ae2
{ latency: ~ 2^9 } hitcount: 10
{ latency: ~ 2^10 } hitcount: 100790083
{ latency: ~ 2^11 } hitcount: 141105
{ latency: ~ 2^12 } hitcount: 84360
{ latency: ~ 2^13 } hitcount: 34744
{ latency: ~ 2^14 } hitcount: 8632
{ latency: ~ 2^15 } hitcount: 443
{ latency: ~ 2^16 } hitcount: 35
{ latency: ~ 2^17 } hitcount: 4
{ latency: ~ 2^19 } hitcount: 149
4.5.0-00040-g5e20c37
{ latency: ~ 2^9 } hitcount: 193
{ latency: ~ 2^10 } hitcount: 103169910
{ latency: ~ 2^11 } hitcount: 101648
{ latency: ~ 2^12 } hitcount: 78514
{ latency: ~ 2^13 } hitcount: 35127
{ latency: ~ 2^14 } hitcount: 9922
{ latency: ~ 2^15 } hitcount: 562
{ latency: ~ 2^16 } hitcount: 19
{ latency: ~ 2^17 } hitcount: 18
{ latency: ~ 2^19 } hitcount: 146
irqwork-1
4.5.0-00038-g5b87ae2
{ latency: ~ 2^10 } hitcount: 9088981
{ latency: ~ 2^11 } hitcount: 48496198
{ latency: ~ 2^12 } hitcount: 90199
{ latency: ~ 2^13 } hitcount: 55650
{ latency: ~ 2^14 } hitcount: 9940
{ latency: ~ 2^15 } hitcount: 766
{ latency: ~ 2^16 } hitcount: 19
{ latency: ~ 2^17 } hitcount: 28
{ latency: ~ 2^19 } hitcount: 130
4.5.0-00040-g5e20c37
{ latency: ~ 2^10 } hitcount: 28371223
{ latency: ~ 2^11 } hitcount: 28260205
{ latency: ~ 2^12 } hitcount: 99544
{ latency: ~ 2^13 } hitcount: 47821
{ latency: ~ 2^14 } hitcount: 9860
{ latency: ~ 2^15 } hitcount: 443
{ latency: ~ 2^16 } hitcount: 28
{ latency: ~ 2^17 } hitcount: 21
{ latency: ~ 2^19 } hitcount: 121
The patches are also available here:
git://git.kernel.org/pub/scm/linux/kernel/git/wagi/linux.git completion-test-3
git://git.kernel.org/pub/scm/linux/kernel/git/wagi/linux.git baseline
cheers,
daniel
Daniel Wagner (1):
sched/completion: convert completions to use simple wait queues
include/linux/completion.h | 23 ++++++++++++++++-------
include/linux/swait.h | 1 +
kernel/sched/completion.c | 43 ++++++++++++++++++++++++++-----------------
kernel/sched/swait.c | 24 ++++++++++++++++++++++++
4 files changed, 67 insertions(+), 24 deletions(-)
--
2.5.5
From: Daniel Wagner <[email protected]>
Completions have no long lasting callbacks and therefore do not need
the complex waitqueue variant. Use simple waitqueues which reduces
the contention on the waitqueue lock.
This was carried forward from v3.10-rt, with some RT-specific chunks
dropped, and updated to align with the names that were chosen to match
the simple waitqueue support.
[wagi: Added flag to defer swake_up_all() from irq context]
Signed-off-by: Daniel Wagner <[email protected]>
---
include/linux/completion.h | 23 ++++++++++++++++-------
include/linux/swait.h | 1 +
kernel/sched/completion.c | 43 ++++++++++++++++++++++++++-----------------
kernel/sched/swait.c | 24 ++++++++++++++++++++++++
4 files changed, 67 insertions(+), 24 deletions(-)
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 5d5aaae..45fd91a 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -8,7 +8,7 @@
* See kernel/sched/completion.c for details.
*/
-#include <linux/wait.h>
+#include <linux/swait.h>
/*
* struct completion - structure used to maintain state for a "completion"
@@ -22,13 +22,22 @@
* reinit_completion(), and macros DECLARE_COMPLETION(),
* DECLARE_COMPLETION_ONSTACK().
*/
+
+#define COMPLETION_DEFER (1 << 0)
+
struct completion {
- unsigned int done;
- wait_queue_head_t wait;
+ union {
+ struct {
+ unsigned short flags;
+ unsigned short done;
+ };
+ unsigned int val;
+ };
+ struct swait_queue_head wait;
};
#define COMPLETION_INITIALIZER(work) \
- { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
+ { 0, 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
#define COMPLETION_INITIALIZER_ONSTACK(work) \
({ init_completion(&work); work; })
@@ -72,8 +81,8 @@ struct completion {
*/
static inline void init_completion(struct completion *x)
{
- x->done = 0;
- init_waitqueue_head(&x->wait);
+ x->val = 0;
+ init_swait_queue_head(&x->wait);
}
/**
@@ -85,7 +94,7 @@ static inline void init_completion(struct completion *x)
*/
static inline void reinit_completion(struct completion *x)
{
- x->done = 0;
+ x->val = 0;
}
extern void wait_for_completion(struct completion *);
diff --git a/include/linux/swait.h b/include/linux/swait.h
index c1f9c62..83f004a 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -87,6 +87,7 @@ static inline int swait_active(struct swait_queue_head *q)
extern void swake_up(struct swait_queue_head *q);
extern void swake_up_all(struct swait_queue_head *q);
extern void swake_up_locked(struct swait_queue_head *q);
+extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state);
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 8d0f35d..d4dccd3 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -30,10 +30,10 @@ void complete(struct completion *x)
{
unsigned long flags;
- spin_lock_irqsave(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_locked(&x->wait, TASK_NORMAL, 1);
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ swake_up_locked(&x->wait);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);
@@ -50,10 +50,15 @@ void complete_all(struct completion *x)
{
unsigned long flags;
- spin_lock_irqsave(&x->wait.lock, flags);
- x->done += UINT_MAX/2;
- __wake_up_locked(&x->wait, TASK_NORMAL, 0);
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
+ x->done += USHRT_MAX/2;
+ if (irqs_disabled_flags(flags)) {
+ x->flags = COMPLETION_DEFER;
+ swake_up_locked(&x->wait);
+ } else {
+ swake_up_all_locked(&x->wait);
+ }
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete_all);
@@ -62,20 +67,20 @@ do_wait_for_common(struct completion *x,
long (*action)(long), long timeout, int state)
{
if (!x->done) {
- DECLARE_WAITQUEUE(wait, current);
+ DECLARE_SWAITQUEUE(wait);
- __add_wait_queue_tail_exclusive(&x->wait, &wait);
+ __prepare_to_swait(&x->wait, &wait);
do {
if (signal_pending_state(state, current)) {
timeout = -ERESTARTSYS;
break;
}
__set_current_state(state);
- spin_unlock_irq(&x->wait.lock);
+ raw_spin_unlock_irq(&x->wait.lock);
timeout = action(timeout);
- spin_lock_irq(&x->wait.lock);
+ raw_spin_lock_irq(&x->wait.lock);
} while (!x->done && timeout);
- __remove_wait_queue(&x->wait, &wait);
+ __finish_swait(&x->wait, &wait);
if (!x->done)
return timeout;
}
@@ -89,9 +94,13 @@ __wait_for_common(struct completion *x,
{
might_sleep();
- spin_lock_irq(&x->wait.lock);
+ raw_spin_lock_irq(&x->wait.lock);
timeout = do_wait_for_common(x, action, timeout, state);
- spin_unlock_irq(&x->wait.lock);
+ raw_spin_unlock_irq(&x->wait.lock);
+ if (x->flags & COMPLETION_DEFER) {
+ x->flags = 0;
+ swake_up_all(&x->wait);
+ }
return timeout;
}
@@ -277,12 +286,12 @@ bool try_wait_for_completion(struct completion *x)
if (!READ_ONCE(x->done))
return 0;
- spin_lock_irqsave(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
else
x->done--;
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(try_wait_for_completion);
@@ -311,7 +320,7 @@ bool completion_done(struct completion *x)
* after it's acquired the lock.
*/
smp_rmb();
- spin_unlock_wait(&x->wait.lock);
+ raw_spin_unlock_wait(&x->wait.lock);
return true;
}
EXPORT_SYMBOL(completion_done);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 82f0dff..efe366b 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -72,6 +72,30 @@ void swake_up_all(struct swait_queue_head *q)
}
EXPORT_SYMBOL(swake_up_all);
+void swake_up_all_locked(struct swait_queue_head *q)
+{
+ struct swait_queue *curr;
+ LIST_HEAD(tmp);
+
+ if (!swait_active(q))
+ return;
+
+ list_splice_init(&q->task_list, &tmp);
+ while (!list_empty(&tmp)) {
+ curr = list_first_entry(&tmp, typeof(*curr), task_list);
+
+ wake_up_state(curr->task, TASK_NORMAL);
+ list_del_init(&curr->task_list);
+
+ if (list_empty(&tmp))
+ break;
+
+ raw_spin_unlock_irq(&q->lock);
+ raw_spin_lock_irq(&q->lock);
+ }
+}
+EXPORT_SYMBOL(swake_up_all_locked);
+
void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
{
wait->task = current;
--
2.5.5
On Wed, Mar 30, 2016 at 04:53:05PM +0200, Daniel Wagner wrote:
> From: Daniel Wagner <[email protected]>
>
> Completions have no long lasting callbacks and therefore do not need
> the complex waitqueue variant. Use simple waitqueues which reduces
> the contention on the waitqueue lock.
Changelog really should have talked about the determinism thing. The last
time you posted this, the point was raised that we should wake the
highest-prio waiter in the defer case; you did not address this.
Also, you make no mention of the reduction from UINT_MAX to USHRT_MAX and
the implications of that.
On 03/30/2016 05:07 PM, Peter Zijlstra wrote:
> On Wed, Mar 30, 2016 at 04:53:05PM +0200, Daniel Wagner wrote:
>> From: Daniel Wagner <[email protected]>
>>
>> Completions have no long lasting callbacks and therefore do not need
>> the complex waitqueue variant. Use simple waitqueues which reduces
>> the contention on the waitqueue lock.
>
> Changelog really should have talked about the determinism thing. The last
> time you posted this, the point was raised that we should wake the
> highest-prio waiter in the defer case; you did not address this.
So do we really want to go down this road? I didn't find any numbers on
what the highest count of queued sleepers was in Daniel's
complete_all() testing.
As for the latest -RT, I received only one report, from Clark Williams,
with something like 3 to 9 sleepers woken up during one complete_all(),
and this happens in the resume code.
Based on this, deferring wake-ups from IRQ context and an RB-tree (or
something like that for priority sorting) looks like a lot of complexity,
and it does not look like we gain much.
> Also, you make no mention of the reduction from UINT_MAX to USHRT_MAX and
> the implications of that.
Wasn't this
|To avoid a size increase of struct completion, I split the done
|field into two 16-bit halves.
Later he mentions that we can't have ~2G sleepers anymore.
Sebastian
On Wed, Mar 30, 2016 at 05:17:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 03/30/2016 05:07 PM, Peter Zijlstra wrote:
> > On Wed, Mar 30, 2016 at 04:53:05PM +0200, Daniel Wagner wrote:
> >> From: Daniel Wagner <[email protected]>
> >>
> >> Completions have no long lasting callbacks and therefore do not need
> >> the complex waitqueue variant. Use simple waitqueues which reduces
> >> the contention on the waitqueue lock.
> >
> > Changelog really should have talked about the determinism thing. The last
> > time you posted this, the point was raised that we should wake the
> > highest-prio waiter in the defer case; you did not address this.
>
> So do we really want to go down this road?
Dunno, but at least mention why it wouldn't matter.
> I didn't find any numbers on what the
> highest count of queued sleepers was in Daniel's complete_all() testing.
>
> As for the latest -RT, I received only one report, from Clark Williams,
> with something like 3 to 9 sleepers woken up during one complete_all(),
> and this happens in the resume code.
> Based on this, deferring wake-ups from IRQ context and an RB-tree (or
> something like that for priority sorting) looks like a lot of complexity,
> and it does not look like we gain much.
Sure, but that equally puts the whole defer thing into question. If we
can put a hard cap on the max number (and WARN when exceeded), we're
also good.
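A rough sketch of such a cap against the split counter from this
series (illustrative only, not part of the posted patch; the helper
name is made up):

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/swait.h>

/* Warn if a completion accumulates more waiters than the 16-bit
 * 'done' field can absorb: complete_all() adds USHRT_MAX/2 and every
 * woken waiter decrements it by one. Caller must hold x->wait.lock;
 * the O(n) list walk is only meant as a debug check.
 */
static bool completion_too_many_waiters(struct completion *x)
{
        struct swait_queue *w;
        unsigned int n = 0;

        list_for_each_entry(w, &x->wait.task_list, task_list)
                if (++n > USHRT_MAX / 2)
                        return true;
        return false;
}

/* e.g. in do_wait_for_common(), after __prepare_to_swait():
 *      WARN_ON_ONCE(completion_too_many_waiters(x));
 */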
> > Also, you make no mention of the reduction from UINT_MAX to USHRT_MAX and
> > the implications of that.
>
> Wasn't this
> |To avoid a size increase of struct completion, I split the done
> |field into two 16-bit halves.
>
> Later he mentions that we can't have ~2G sleepers anymore.
That wasn't in this changelog, therefore it wasn't read ;-)
On 03/30/2016 05:21 PM, Peter Zijlstra wrote:
> On Wed, Mar 30, 2016 at 05:17:29PM +0200, Sebastian Andrzej Siewior wrote:
>> On 03/30/2016 05:07 PM, Peter Zijlstra wrote:
>>> On Wed, Mar 30, 2016 at 04:53:05PM +0200, Daniel Wagner wrote:
>>>> From: Daniel Wagner <[email protected]>
>>>>
>>>> Completions have no long lasting callbacks and therefore do not need
>>>> the complex waitqueue variant. Use simple waitqueues which reduces
>>>> the contention on the waitqueue lock.
>>>
>>> Changelog really should have talked about the determinism thing. The last
>>> time you posted this, the point was raised that we should wake the
>>> highest-prio waiter in the defer case; you did not address this.
>>
>> So do we really want to go down this road?
>
> Dunno, but at least mention why it wouldn't matter.
It seems I put too much effort into the cover letter; I should have
spent that time on the changelog. Anyway, I am going through the users
of complete_all() and it looks like most of them are either in setup
code paths or are just making sure that the single waiter really wakes
up.
>> I didn't find any numbers on what the
>> highest count of queued sleepers was in Daniel's complete_all() testing.
>>
>> As for the latest -RT, I received only one report, from Clark Williams,
>> with something like 3 to 9 sleepers woken up during one complete_all(),
>> and this happens in the resume code.
>> Based on this, deferring wake-ups from IRQ context and an RB-tree (or
>> something like that for priority sorting) looks like a lot of complexity,
>> and it does not look like we gain much.
>
> Sure, but that equally puts the whole defer thing into question. If we
> can put a hard cap on the max number (and WARN when exceeded), we're
> also good.
>
>>> Also, you make no mention of the reduction from UINT_MAX to USHRT_MAX and
>>> the implications of that.
>>
>> Wasn't this
>> |To avoid a size increase of struct completion, I split the done
>> |field into two 16-bit halves.
>>
>> Later he mentions that we can't have ~2G sleepers anymore.
>
> That wasn't in this changelog, therefore it wasn't read ;-)
Got it; the next version will have all the info in the changelog and
not in the cover letter.