Message-ID: <1433940278.6814.66.camel@gmail.com>
Subject: RFC: futex_wait() can DoS the tick
From: Mike Galbraith
To: LKML
Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt
Date: Wed, 10 Jun 2015 14:44:38 +0200

Greetings,

Like so...

#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int sys_futex(void *addr1, int op, int val1, struct timespec *timeout,
	      void *addr2, int val3)
{
	return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}

int main()
{
	struct timespec t;
	int f = 1;

	clock_gettime(CLOCK_REALTIME, &t);
	t.tv_sec -= 10;

	while (1) {
		sys_futex(&f, FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME, 1, &t,
			  NULL, FUTEX_BITSET_MATCH_ANY);
	}
}

The above was handed to me by a colleague working on a Xen guest that
livelocked.  I at first thought the Xen arch must have a weird problem,
but when I tried the proggy on my desktop box, while it didn't stop the
tick completely as it did on the Xen box, it slowed it to a crawl.

I noticed that this did not happen with newer kernels, so a bisecting I
did go, and found that...

279f14614 x86: apic: Use tsc deadline for oneshot when available

...is what fixed it up.  Trouble is, while it fixes up my Haswell box,
a Xen dom0 remains busted by that testcase whether that patch is applied
to the host or not, even though the hypervisor supports the deadline
timer, and seemingly regardless of CPU type altogether.
Of all the x86_64 bare metal boxen I've tested, only those with the TSC
deadline timer have shown the issue, and there it goes away as of v3.8
unless you boot lapic=notscdeadline.  However, given that any x86_64
Intel box with the TSC deadline timer (ivy, sandy, hasbeen) can be made
to exhibit the symptom, there may be other arches that get seriously
dinged up, or maybe even as thoroughly b0rked as Xen does, when
hrtimer_interrupt() is pounded into the ground by userspace.
Alternatively, should someone out there know that all bare metal is in
fact fine post 279f14614, that person will likely also know what the
Xen folks need to do to fix up their busted arch.

The below targets the symptom, consider it hrtimer cluebat attractant.

---
 kernel/time/hrtimer.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -933,6 +933,8 @@ remove_hrtimer(struct hrtimer *timer, st
 	return 0;
 }
 
+static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
+
 int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 		unsigned long delta_ns, const enum hrtimer_mode mode,
 		int wakeup)
@@ -980,8 +982,27 @@ int __hrtimer_start_range_ns(struct hrti
 		 * on dynticks target.
 		 */
 		wake_up_nohz_cpu(new_base->cpu_base->cpu);
-	} else if (new_base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
-			hrtimer_reprogram(timer, new_base)) {
+	} else if (new_base->cpu_base == this_cpu_ptr(&hrtimer_bases)) {
+		int res = hrtimer_reprogram(timer, new_base);
+
+		if (!res)
+			goto out;
+
+		/*
+		 * If a buggy app tries forever to be awakened in the past,
+		 * banging on hrtimer_interrupt() at high speed can stall
+		 * the tick, and on a Xen box, forever.  On haswell with
+		 * tsc_deadline_timer disabled you can see it, though it
+		 * only slows the tick way down.  Other bare metal boxes
+		 * may also be terminally affected.
+		 */
+		if (unlikely(wakeup && !ret && IS_ERR_VALUE(res) &&
+		    timer->function == hrtimer_wakeup)) {
+			debug_deactivate(timer);
+			__remove_hrtimer(timer, new_base, 0, 0);
+			ret = -ETIMEDOUT;
+		}
+
 		/*
 		 * Only allow reprogramming if the new base is on this CPU.
 		 * (it might still be on another CPU if the timer was pending)
@@ -994,7 +1015,10 @@ int __hrtimer_start_range_ns(struct hrti
 		 * lock ordering issue vs. rq->lock.
 		 */
 		raw_spin_unlock(&new_base->cpu_base->lock);
-		raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		if (!IS_ERR_VALUE(ret))
+			raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		else
+			hrtimer_wakeup(timer);
 		local_irq_restore(flags);
 		return ret;
 	} else {
@@ -1002,6 +1026,7 @@ int __hrtimer_start_range_ns(struct hrti
 		}
 	}
 
+out:
 	unlock_hrtimer_base(timer, &flags);
 
 	return ret;