2015-04-19 19:17:57

by Davidlohr Bueso

Subject: [PATCH 0/2] lockless wake-queues

Hello,

This series is aimed at addressing some of the futex hash bucket
lock hold times by introducing lockless wake-queues for futex_wake.

patch-1: introduces the lockless wake-queue machinery.
patch-2: makes use of patch 1 for futexes.

Details in the individual patches.

This was suggested some time ago by peterz but, because it can
potentially cause spurious wakeups, was never given much consideration.
However, nowadays I am reliably booting a 45-core box with peterz's
patch to trigger spurious wakeups. While there are drivers out there
that do not play nice with schedule(), they can be fixed over time,
whereas this is a production problem for some customers. Furthermore,
after some auditing, there really aren't that many: in a lot of cases
the functions that end up calling schedule() are merely wrapped in a
loop, so it is just not clear at first sight.

Other code that could be migrated to lockless wake_q includes
s390 pfault_interrupt and sysv msg (sem already does the lockless
wakeups without the ipc object lock), among others. However, futexes
are, afaik, the only user where this solves a _real_ issue.

Applies on top of Linus' tree 4.0-64fb1d0e975e. Thanks!

sched: lockless wake-queues
futex: lockless wakeups

include/linux/sched.h | 30 ++++++++++++++++++++++++++++++
kernel/futex.c | 33 +++++++++++++++++----------------
kernel/sched/core.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 96 insertions(+), 16 deletions(-)

--
2.1.4


2015-04-19 19:18:07

by Davidlohr Bueso

Subject: [PATCH 1/2] sched: lockless wake-queues

From: Peter Zijlstra <[email protected]>

This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid lock-internal lock contention
by delaying the wakeups until we've released the lock-internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

Properties of a wake_q are:
- Lockless, as the queue head must reside on the stack.
- Being a queue, it maintains the wakeup order passed by the callers.
This can be important, as otherwise, in scenarios with highly contended
locks, any reliance on lock fairness could be affected. It also respects
the callers' requested order of wakeups.
- A queued task cannot be added again until it has been woken up.

This patch adds the needed infrastructure into the scheduler code
and uses the new wake_q to delay the futex wakeups until after
we've released the corresponding user locks.
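
For illustration (not part of this patch), the intended calling pattern
is roughly the following; example_lock, waiters and waiter->task are
made-up stand-ins for whatever state the locking primitive protects:

        WAKE_Q(wake_q);

        spin_lock(&example_lock);
        /* Select the tasks to wake while holding the contended lock... */
        list_for_each_entry(waiter, &waiters, list) {
                /* ...but only queue them; wake_q_add() grabs a reference. */
                wake_q_add(&wake_q, waiter->task);
        }
        spin_unlock(&example_lock);

        /* The actual wakeups happen only after the lock is dropped. */
        wake_up_q(&wake_q);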

Signed-off-by: Peter Zijlstra <[email protected]>
[tweaks, adjustments, comments, etc.]
Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/sched.h | 30 ++++++++++++++++++++++++++++++
kernel/sched/core.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae4..3b20fe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -947,6 +947,34 @@ static inline int cpu_numa_flags(void)
}
#endif

+/*
+ * Wake-queues are lists of tasks with a pending wakeup, whose
+ * callers have already marked the task as woken internally,
+ * and can thus carry on. A common use case is being able to
+ * do the wakeups once the corresponding lock has been released.
+ *
+ * We hold reference to each task in the list across the wakeup,
+ * thus guaranteeing that the memory is still valid by the time
+ * the actual wakeups are performed in wake_up_q().
+ */
+struct wake_q_node {
+ struct wake_q_node *next;
+};
+
+struct wake_q_head {
+ struct wake_q_node *first;
+ struct wake_q_node *last;
+};
+
+#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
+
+#define WAKE_Q(name) \
+ struct wake_q_head name = { WAKE_Q_TAIL, WAKE_Q_TAIL }
+
+extern void wake_q_add(struct wake_q_head *head,
+ struct task_struct *task);
+extern void wake_up_q(struct wake_q_head *head);
+
struct sched_domain_attr {
int relax_domain_level;
};
@@ -1519,6 +1547,8 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct wake_q_node wake_q;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct rb_root pi_waiters;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a8..ebe6890 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -541,6 +541,55 @@ static bool set_nr_if_polling(struct task_struct *p)
#endif
#endif

+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+ struct wake_q_node *node = &task->wake_q;
+
+ /*
+ * Atomically grab the task, if ->wake_q is !nil already it means
+ * its already queued (either by us or someone else) and will get the
+ * wakeup due to that.
+ *
+ * This cmpxchg() implies a full barrier, which pairs with the write
+ * barrier implied by the wakeup in wake_up_list().
+ */
+ if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+ return;
+
+ get_task_struct(task);
+
+ /*
+ * The head is context local, there can be no concurrency.
+ */
+ if (head->first == WAKE_Q_TAIL)
+ head->first = node;
+ else
+ head->last->next = node;
+
+ head->last = node;
+}
+
+void wake_up_q(struct wake_q_head *head)
+{
+ struct wake_q_node *node = head->first;
+
+ while (node != WAKE_Q_TAIL) {
+ struct task_struct *task;
+
+ task = container_of(node, struct task_struct, wake_q);
+ BUG_ON(!task);
+ node = node->next;
+ task->wake_q.next = NULL; /* task can safely be re-inserted now */
+
+ /*
+ * wake_up_process() implies a wmb() to pair with the queueing
+ * in wake_q_add() so as not to miss wakeups.
+ */
+ wake_up_process(task);
+ put_task_struct(task);
+ }
+}
+
/*
* resched_curr - mark rq's current task 'to be rescheduled now'.
*
--
2.1.4

2015-04-19 19:18:17

by Davidlohr Bueso

Subject: [PATCH 2/2] futex: lockless wakeups

Given the overall futex architecture, any chance of reducing
hb->lock contention is welcome. In this particular case, using
wake-queues to enable lockless wakeups addresses very real
world performance concerns, including soft-lockups with large
numbers of blocked tasks (which are not hard to trigger on large
boxes using just a handful of futexes).

At the lowest level, this patch can reduce the latency of a single
thread attempting to acquire hb->lock in highly contended scenarios
by up to 2x. At lower counts of nr_wake there are no regressions,
confirming, of course, that the wake_q handling overhead is practically
non-existent. For instance, while there is a fair amount of variation,
the extended perf-bench wakeup benchmark shows the following avg
per-thread time to wake up its share of tasks on a 20-core machine:

nr_thr   ms-before   ms-after
    16      0.0590     0.0215
    32      0.0396     0.0220
    48      0.0417     0.0182
    64      0.0536     0.0236
    80      0.0414     0.0097
    96      0.0672     0.0152

Naturally, this can cause spurious wakeups. However there is core code
that cannot handle them afaict, and furthermore tglx does have the point
that other events can already trigger them anyway.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e40..f9984c3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1090,9 +1090,11 @@ static void __unqueue_futex(struct futex_q *q)

/*
* The hash bucket lock must be held when this is called.
- * Afterwards, the futex_q must not be accessed.
+ * Afterwards, the futex_q must not be accessed. Callers
+ * must ensure to later call wake_up_q() for the actual
+ * wakeups to occur.
*/
-static void wake_futex(struct futex_q *q)
+static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
{
struct task_struct *p = q->task;

@@ -1100,14 +1102,10 @@ static void wake_futex(struct futex_q *q)
return;

/*
- * We set q->lock_ptr = NULL _before_ we wake up the task. If
- * a non-futex wake up happens on another CPU then the task
- * might exit and p would dereference a non-existing task
- * struct. Prevent this by holding a reference on p across the
- * wake up.
+ * Queue the task for later wakeup for after we've released
+ * the hb->lock. wake_q_add() grabs reference to p.
*/
- get_task_struct(p);
-
+ wake_q_add(wake_q, p);
__unqueue_futex(q);
/*
* The waiting task can free the futex_q as soon as
@@ -1117,9 +1115,6 @@ static void wake_futex(struct futex_q *q)
*/
smp_wmb();
q->lock_ptr = NULL;
-
- wake_up_state(p, TASK_NORMAL);
- put_task_struct(p);
}

static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
@@ -1217,6 +1212,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
int ret;
+ WAKE_Q(wake_q);

if (!bitset)
return -EINVAL;
@@ -1244,13 +1240,14 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!(this->bitset & bitset))
continue;

- wake_futex(this);
+ mark_wake_futex(&wake_q, this);
if (++ret >= nr_wake)
break;
}
}

spin_unlock(&hb->lock);
+ wake_up_q(&wake_q);
out_put_key:
put_futex_key(&key);
out:
@@ -1269,6 +1266,7 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
+ WAKE_Q(wake_q);

retry:
ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
@@ -1320,7 +1318,7 @@ retry_private:
ret = -EINVAL;
goto out_unlock;
}
- wake_futex(this);
+ mark_wake_futex(&wake_q, this);
if (++ret >= nr_wake)
break;
}
@@ -1334,7 +1332,7 @@ retry_private:
ret = -EINVAL;
goto out_unlock;
}
- wake_futex(this);
+ mark_wake_futex(&wake_q, this);
if (++op_ret >= nr_wake2)
break;
}
@@ -1344,6 +1342,7 @@ retry_private:

out_unlock:
double_unlock_hb(hb1, hb2);
+ wake_up_q(&wake_q);
out_put_keys:
put_futex_key(&key2);
out_put_key1:
@@ -1503,6 +1502,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
+ WAKE_Q(wake_q);

if (requeue_pi) {
/*
@@ -1679,7 +1679,7 @@ retry_private:
* woken by futex_unlock_pi().
*/
if (++task_count <= nr_wake && !requeue_pi) {
- wake_futex(this);
+ mark_wake_futex(&wake_q, this);
continue;
}

@@ -1719,6 +1719,7 @@ retry_private:
out_unlock:
free_pi_state(pi_state);
double_unlock_hb(hb1, hb2);
+ wake_up_q(&wake_q);
hb_waiters_dec(hb2);

/*
--
2.1.4

2015-04-20 06:18:42

by Ingo Molnar

Subject: Re: [PATCH 2/2] futex: lockless wakeups


* Davidlohr Bueso <[email protected]> wrote:

> Given the overall futex architecture, any chance of reducing
> hb->lock contention is welcome. In this particular case, using
> wake-queues to enable lockless wakeups addresses very real
> world performance concerns, including soft-lockups with large
> numbers of blocked tasks (which are not hard to trigger on large
> boxes using just a handful of futexes).
>
> At the lowest level, this patch can reduce the latency of a single
> thread attempting to acquire hb->lock in highly contended scenarios
> by up to 2x. At lower counts of nr_wake there are no regressions,
> confirming, of course, that the wake_q handling overhead is practically
> non-existent. For instance, while there is a fair amount of variation,
> the extended perf-bench wakeup benchmark shows the following avg
> per-thread time to wake up its share of tasks on a 20-core machine:
>
> nr_thr   ms-before   ms-after
>     16      0.0590     0.0215
>     32      0.0396     0.0220
>     48      0.0417     0.0182
>     64      0.0536     0.0236
>     80      0.0414     0.0097
>     96      0.0672     0.0152
>
> Naturally, this can cause spurious wakeups. [...]

Please write a small description we can cite to driver authors once
the (inevitable) breakages appear, outlining this new behavior and its
implications, so that we can fix any remaining bugs ASAP.

I'll also leave this pending a bit longer than other changes, to make
sure we shake out any bugs/regressions triggered by this change.

Third, it might make sense to add a new 'spurious wakeup injection
debug mechanism' that, if enabled in the .config, automatically and
continuously inserts spurious wakeups at a given, slightly randomized
rate - which would ensure that all kernel facilities can robustly
handle spurious wakeups.

My guess would be that most remaining fragilities against spurious
wakeups ought to be in the boot/init phase, so I'd keep an eye out for
suspend/resume regressions.

> [...] However there is core code that cannot handle them afaict, and
> furthermore tglx does have the point that other events can already
> trigger them anyway.

s/there is core code/there is no core code

Thanks,

Ingo

2015-04-20 13:55:58

by Davidlohr Bueso

Subject: Re: [PATCH 2/2] futex: lockless wakeups

On Mon, 2015-04-20 at 08:18 +0200, Ingo Molnar wrote:
> Please write a small description we can cite to driver authors once
> the (inevitable) breakages appear, outlining this new behavior and its
> implications, so that we can fix any remaining bugs ASAP.

I wouldn't call this new behavior, simply because changing a critical
region should not be labeled as such imho. Other than asking driver
authors to put their schedule() in a loop to confirm that the expected
condition has in fact occurred, I'm not sure what else we can ask them
to do -- as you know, this is not just about futexes.
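
Concretely, something along these lines (a minimal sketch; 'cond'
stands for whatever condition the driver is actually waiting for):

        /* Robust against spurious wakeups: always re-check the condition. */
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (cond)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);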

> I'll also leave this pending a bit longer than other changes, to make
> sure we shake out any bugs/regressions triggered by this change.
>
> Third, it might make sense to add a new 'spurious wakeup injection
> debug mechanism' that, if enabled in the .config, automatically and
> continuously inserts spurious wakeups at a given, slightly randomized
> rate - which would ensure that all kernel facilities can robustly
> handle spurious wakeups.

I have been using this from Peter to test against:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..fdf1f68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,9 +214,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_PARKED 512
-#define TASK_STATE_MAX 1024
+#define TASK_YIELD 1024
+#define TASK_STATE_MAX 2048

-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPY"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..2c938ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,7 +1005,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* ttwu() will sort out the placement.
*/
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
- !p->on_rq);
+ !p->on_rq && !(p->state & TASK_YIELD));

#ifdef CONFIG_LOCKDEP
/*
@@ -2743,6 +2743,23 @@ static void __sched __schedule(void)
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
+
+ /*
+ * Provide an auto-yield feature on schedule().
+ *
+ * The thought is to avoid a sleep+wakeup cycle
+ * if simply yielding the cpu will suffice to
+ * satisfy the required condition.
+ *
+ * Assumes the calling schedule() site can deal
+ * with spurious wakeups.
+ */
+ if (prev->state & TASK_YIELD) {
+ prev->state &= ~TASK_YIELD;
+ if (rq->nr_running > 1)
+ goto no_deactivate;
+ }
+
deactivate_task(rq, prev, DEQUEUE_SLEEP);
prev->on_rq = 0;

@@ -2759,6 +2776,7 @@ static void __sched __schedule(void)
try_to_wake_up_local(to_wakeup);
}
}
+ no_deactivate:
switch_count = &prev->nvcsw;
}


> My guess would be that most remaining fragilities against spurious
> wakeups ought to be in the boot/init phase, so I'd keep an eye out for
> suspend/resume regressions.

Correct, which is why I'm not that concerned anymore about spurious
wakeups; in fact, the code above now boots and behaves correctly on
rather large systems.

>
> > [...] However there is core code that cannot handle them afaict, and
> > furthermore tglx does have the point that other events can already
> > trigger them anyway.
>
> s/there is core code/there is no core code

heh yes.

Thanks,
Davidlohr

2015-04-20 14:42:53

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched: lockless wake-queues

On Sun, Apr 19, 2015 at 12:17:39PM -0700, Davidlohr Bueso wrote:
> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +{
> + struct wake_q_node *node = &task->wake_q;
> +
> + /*
> + * Atomically grab the task, if ->wake_q is !nil already it means
> + * its already queued (either by us or someone else) and will get the
> + * wakeup due to that.
> + *
> + * This cmpxchg() implies a full barrier, which pairs with the write
> + * barrier implied by the wakeup in wake_up_list().
> + */
> + if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> + return;
> +
> + get_task_struct(task);
> +
> + /*
> + * The head is context local, there can be no concurrency.
> + */
> + if (head->first == WAKE_Q_TAIL)
> + head->first = node;
> + else
> + head->last->next = node;
> +
> + head->last = node;
> +}

Do we want a sched_feat() that makes the above do an immediate wake-up
instead of the fancy thing? -- just for debugging/performance-measurement
type things?
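
Something like the below, say -- a rough sketch only; WAKE_Q_IMMEDIATE
is a made-up feature bit, not something that exists today:

        /* kernel/sched/features.h -- hypothetical debug knob */
        SCHED_FEAT(WAKE_Q_IMMEDIATE, false)

        /* kernel/sched/core.c */
        void wake_q_add(struct wake_q_head *head, struct task_struct *task)
        {
                /* Debug/measurement path: bypass the queue, wake right away. */
                if (sched_feat(WAKE_Q_IMMEDIATE)) {
                        wake_up_process(task);
                        return;
                }

                /* ... otherwise fall through to the lockless queueing ... */
        }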

2015-04-20 17:13:54

by Peter Zijlstra

Subject: Re: [PATCH 2/2] futex: lockless wakeups

On Mon, Apr 20, 2015 at 06:55:39AM -0700, Davidlohr Bueso wrote:
> I have been using this from Peter to test against:
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6d77432..fdf1f68 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -214,9 +214,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
> #define TASK_WAKEKILL 128
> #define TASK_WAKING 256
> #define TASK_PARKED 512
> -#define TASK_STATE_MAX 1024
> +#define TASK_YIELD 1024
> +#define TASK_STATE_MAX 2048
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPY"
>
> extern char ___assert_task_state[1 - 2*!!(
> sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f0f831e..2c938ae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1005,7 +1005,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> * ttwu() will sort out the placement.
> */
> WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
> - !p->on_rq);
> + !p->on_rq && !(p->state & TASK_YIELD));
>
> #ifdef CONFIG_LOCKDEP
> /*
> @@ -2743,6 +2743,23 @@ static void __sched __schedule(void)
> if (unlikely(signal_pending_state(prev->state, prev))) {
> prev->state = TASK_RUNNING;
> } else {
> +
> + /*
> + * Provide an auto-yield feature on schedule().
> + *
> + * The thought is to avoid a sleep+wakeup cycle
> + * if simply yielding the cpu will suffice to
> + * satisfy the required condition.
> + *
> + * Assumes the calling schedule() site can deal
> + * with spurious wakeups.
> + */
> + if (prev->state & TASK_YIELD) {
> + prev->state &= ~TASK_YIELD;
> + if (rq->nr_running > 1)
> + goto no_deactivate;
> + }
> +
> deactivate_task(rq, prev, DEQUEUE_SLEEP);
> prev->on_rq = 0;
>
> @@ -2759,6 +2776,7 @@ static void __sched __schedule(void)
> try_to_wake_up_local(to_wakeup);
> }
> }
> + no_deactivate:
> switch_count = &prev->nvcsw;
> }
>

Haha, you dug that out did you ;-)

Does this patch on its own actually help anything? Back when I wrote
that there was the vague hope it would actually help some of the
client-server ping-pong workloads, where the client does sync requests
to the server.

In that case, if the server can handle the req. while the client yields
we've saved on an expensive wakeup.

Reality never really worked out for that -- even after we fixed a few
boot and runtime issues with it.

2015-04-20 16:04:24

by Linus Torvalds

Subject: Re: [PATCH 2/2] futex: lockless wakeups

On Sun, Apr 19, 2015 at 12:17 PM, Davidlohr Bueso <[email protected]> wrote:
>
> Naturally, this can cause spurious wakeups. However there is core code
> that cannot handle them afaict, and furthermore tglx does have the point
> that other events can already trigger them anyway.

Indeed. We need to make this *very* explicit. We have absolutely
_always_ had spurious wakeups. The semaphore code does it today, other
code has done it historically. Nobody should ever expect that there is
only one unique wakeup source. Anybody who sleeps on a condition needs
to re-check the condition rather than assume that "because I was woken
up the condition must now be true".
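
The usual waitqueue idiom already encodes exactly that re-check (a
sketch only; 'wq' and 'cond' are placeholders):

        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
                if (cond)
                        break;
                schedule();
        }
        finish_wait(&wq, &wait);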

Linus

2015-04-20 17:31:23

by Davidlohr Bueso

Subject: Re: [PATCH 2/2] futex: lockless wakeups

On Mon, 2015-04-20 at 16:57 +0200, Peter Zijlstra wrote:
> Haha, you dug that out did you ;-)
>
> Does this patch on its own actually help anything? Back when I wrote
> that there was the vague hope it would actually help some of the
> client-server ping-pong workloads, where the client does sync requests
> to the server.
>
> In that case, if the server can handle the req. while the client yields
> we've saved on an expensive wakeup.
>
> Reality never really worked out for that -- even after we fixed a few
> boot and runtime issues with it.

I had boot problems with it about 2 years ago (even with qemu).
Nowadays, I cannot trigger any issues with it, and on a large machine
I've been using it just fine, throwing a lot of workloads at it. All in
all, my schedule() auditing has only been done by searching the code,
not via this patch.

Thanks,
Davidlohr

2015-04-20 18:24:56

by George Spelvin

Subject: Re: [PATCH 1/2] sched: lockless wake-queues

+struct wake_q_head {
+ struct wake_q_node *first;
+ struct wake_q_node *last;
+};
+
+#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
+
+#define WAKE_Q(name) \
+ struct wake_q_head name = { WAKE_Q_TAIL, WAKE_Q_TAIL }

Is there some reason you don't use the simpler singly-linked list
construction with the tail being a pointer to a pointer:

struct wake_q_head {
struct wake_q_node *first, **lastp;
};

#define WAKE_Q(name) \
struct wake_q_head name = { WAKE_Q_TAIL, &name.first }


That removes a conditional from wake_q_add:

+/*
+ * Queue a task for later wake-up by wake_up_q(). If the task is already
+ * queued by someone else, leave it to them to deliver the wakeup.
+ *
+ * This property makes it impossible to guarantee the order of wakeups,
+ * but for efficiency we try to deliver wakeups in the order tasks
+ * are added. If we didn't mind reversing the order, a LIFO stack
+ * would be simpler.
+ */
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+ struct wake_q_node *node = &task->wake_q;
+
+ /*
+ * Atomically grab the task, if ->wake_q is !nil already it means
+ * its already queued (either by us or someone else) and will get the
+ * wakeup due to that.
+ *
+ * This cmpxchg() implies a full barrier, which pairs with the write
+ * barrier implied by the wakeup in wake_up_list().
+ */
+ if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+ return;
+
+ get_task_struct(task);
+
+ /*
+ * The head is context local, there can be no concurrency.
+ */
+ *head->lastp = node;
+ head->lastp = &node->next;
+}

It may also be worth commenting the fact that wake_up_q() leaves the
struct wake_q_head in a corrupt state, so don't try to do it again.

2015-04-20 20:08:35

by Davidlohr Bueso

Subject: Re: [PATCH 1/2] sched: lockless wake-queues

On Mon, 2015-04-20 at 14:24 -0400, George Spelvin wrote:
> +struct wake_q_head {
> + struct wake_q_node *first;
> + struct wake_q_node *last;
> +};
> +
> +#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> +
> +#define WAKE_Q(name) \
> + struct wake_q_head name = { WAKE_Q_TAIL, WAKE_Q_TAIL }
>
> Is there some reason you don't use the simpler singly-linked list
> construction with the tail being a pointer to a pointer:

Sure, that would also work.

>
> struct wake_q_head {
> struct wake_q_node *first, **lastp;
> };
>
> #define WAKE_Q(name) \
> struct wake_q_head name = { WAKE_Q_TAIL, &name.first }
>
>
> That removes a conditional from wake_q_add:
>
> +/*
> + * Queue a task for later wake-up by wake_up_q(). If the task is already
> + * queued by someone else, leave it to them to deliver the wakeup.

This is already commented in the cmpxchg.

> + *
> + * This property makes it impossible to guarantee the order of wakeups,
> + * but for efficiency we try to deliver wakeups in the order tasks
> + * are added.

Ok.

> If we didn't mind reversing the order, a LIFO stack
> + * would be simpler.

While true, I don't think it belongs here.

> + */
> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +{
> + struct wake_q_node *node = &task->wake_q;
> +
> + /*
> + * Atomically grab the task, if ->wake_q is !nil already it means
> + * its already queued (either by us or someone else) and will get the
> + * wakeup due to that.
> + *
> + * This cmpxchg() implies a full barrier, which pairs with the write
> + * barrier implied by the wakeup in wake_up_list().
> + */
> + if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> + return;
> +
> + get_task_struct(task);
> +
> + /*
> + * The head is context local, there can be no concurrency.
> + */
> + *head->lastp = node;
> + head->lastp = &node->next;
> +}
>
> It may also be worth commenting the fact that wake_up_q() leaves the
> struct wake_q_head in a corrupt state, so don't try to do it again.

Right, we could re-init the list once the loop is complete, yes. But it
shouldn't matter due to how we use wake-queues.

Thanks,
Davidlohr

2015-04-21 01:39:41

by George Spelvin

Subject: Re: [PATCH 1/2] sched: lockless wake-queues

>> Is there some reason you don't use the simpler singly-linked list
>> construction with the tail being a pointer to a pointer:

> Sure, that would also work.

It's just a convenient simplification, already used in struct hlist_node.

>> +/*
>> + * Queue a task for later wake-up by wake_up_q(). If the task is already
>> + * queued by someone else, leave it to them to deliver the wakeup.
>
> This is already commented in the cmpxchg.
>
>> + *
>> + * This property makes it impossible to guarantee the order of wakeups,
>> + * but for efficiency we try to deliver wakeups in the order tasks
>> + * are added.
>
> Ok.

This is just me thinking "out loud" about the semantics.

>> It may also be worth commenting the fact that wake_up_q() leaves the
>> struct wake_q_head in a corrupt state, so don't try to do it again.

> Right, we could re-init the list once the loop is complete, yes. But it
> shouldn't matter due to how we use wake-queues.

Oh, indeed, there's no point. Unless it's worth a debugging option,
but as you say the usage patterns are such that I don't expect it's
needed.

It just seemed worth commenting explicitly.


If I were going to comment it, here's what I'd write. Feel free
to copy any or none of this:

/*
* Wake-queues are lists of tasks about to be woken up.
* Deferring the wakeup is useful when the waker is waking up multiple
* tasks while holding a lock which the woken tasks will need, so they'd
* go straight into a wait queue anyway.
*
* So instead, the waker can wake_q_add(&q, task) under the lock,
* and then wake_up_q(&q) afterward.
*
* The list head is allocated on the waker's stack, and the queue nodes
* are preallocated as part of the task struct.
*
* A reference to each task (get_task_struct()) is held during the wait,
* so the list will remain valid through wake_up_q().
*
* One per task suffices, because there's never a need for a task to be
* in two wake queues simultaneously; it is forbidden to abandon a task
* in a wake queue (a call to wake_up_q() _must_ follow), so if a task is
* already in a wake queue, the wakeup will happen soon and the second
* waker can just skip it.
*
* As with all Linux wakeup primitives, there is no guarantee about the
* order, but this code tries to wake tasks in wake_q_add order.
*
* The WAKE_Q macro declares and initializes the list head.
* wake_up_q() does NOT reinitialize the list; it's expected to be
* called near the end of a function, where the fact that the queue is
* not used again will be easy to see by inspection.
*/