Message-ID: <1429538139.2042.10.camel@stgolabs.net>
Subject: Re: [PATCH 2/2] futex: lockless wakeups
From: Davidlohr Bueso <dave@stgolabs.net>
To: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Chris Mason <clm@fb.com>, Steven Rostedt <rostedt@goodmis.org>,
        fredrik.markstrom@windriver.com, linux-kernel@vger.kernel.org,
        Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 20 Apr 2015 06:55:39 -0700
In-Reply-To: <20150420061836.GA11191@gmail.com>
References: <1429471060-21271-1-git-send-email-dave@stgolabs.net>
	 <1429471060-21271-3-git-send-email-dave@stgolabs.net>
	 <20150420061836.GA11191@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3811
Lines: 110

On Mon, 2015-04-20 at 08:18 +0200, Ingo Molnar wrote:
> Please write a small description we can cite to driver authors once 
> the (inevitable) breakages appear, outlining this new behavior and its 
> implications, so that we can fix any remaining bugs ASAP.

I wouldn't call this new behavior, simply because changing a critical
region should not be labeled as such imho. Other than asking driver
authors to put their schedule() in a loop to confirm that the expected
condition has in fact occurred, I'm not sure what else we can ask them
to do -- as you know, this is not just about futexes.

> I'll also let this pending a bit longer than other changes, to make 
> sure we shake out any bugs/regressions triggered by this change.
> 
> Third, it might make sense to add a new 'spurious wakeup injection 
> debug mechanism' that, if enabled in the .config, automatically and 
> continuously inserts spurious wakeups at a given, slightly randomized 
> rate - which would ensure that all kernel facilities can robustly 
> handle spurious wakeups.

I have been using this from Peter to test against:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..fdf1f68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,9 +214,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
 #define TASK_PARKED		512
-#define TASK_STATE_MAX		1024
+#define TASK_YIELD              1024
+#define TASK_STATE_MAX		2048
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPY"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..2c938ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,7 +1005,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * ttwu() will sort out the placement.
 	 */
 	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
-			!p->on_rq);
+			!p->on_rq && !(p->state & TASK_YIELD));
 
 #ifdef CONFIG_LOCKDEP
 	/*
@@ -2743,6 +2743,23 @@ static void __sched __schedule(void)
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+
+			/*
+			 * Provide an auto-yield feature on schedule().
+			 *
+			 * The thought is to avoid a sleep+wakeup cycle
+			 * if simply yielding the cpu will suffice to
+			 * satisfy the required condition.
+			 *
+			 * Assumes the calling schedule() site can deal
+			 * with spurious wakeups.
+			 */
+			if (prev->state & TASK_YIELD) {
+				prev->state &= ~TASK_YIELD;
+				if (rq->nr_running > 1)
+					goto no_deactivate;
+			}
+
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2759,6 +2776,7 @@ static void __sched __schedule(void)
 					try_to_wake_up_local(to_wakeup);
 			}
 		}
+	no_deactivate:
 		switch_count = &prev->nvcsw;
 	}
 

> My guess would be that most remaining fragilities against spurious 
> wakeups ought to be in the boot/init phase, so I'd keep an eye out for 
> suspend/resume regressions.

Correct, which is why I'm not that concerned anymore about spurious
wakups, in fact that code above now boots and handles correctly on
rather large systems.

> 
> > [...] However there is core code that cannot handle them afaict, and 
> > furthermore tglx does have the point that other events can already 
> > trigger them anyway.
> 
> s/there is core code/there is no core code

heh yes.

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/