RT throttling aims to prevent starvation of non-SCHED_FIFO threads
when a rogue RT thread is hogging the CPU. It does so by piggybacking
on the rt_bandwidth system and allocating at most rt_runtime per
rt_period to SCHED_FIFO tasks (e.g. 950ms out of every second,
allowing 'regular' tasks to run for at least 50ms every second).
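As a rough userspace illustration (not part of this patch), the ratio comes from the usual global knobs; the paths and default values below are assumptions about a stock configuration, not taken from this thread:

/*
 * Illustrative sketch only: how the global knobs translate into the
 * guaranteed non-RT slice per period. Values are the usual defaults of
 * /proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us.
 */
#include <stdio.h>

int main(void)
{
	long long rt_period_us  = 1000000;	/* sched_rt_period_us  */
	long long rt_runtime_us =  950000;	/* sched_rt_runtime_us */

	printf("RT tasks may consume %lld us of every %lld us period\n",
	       rt_runtime_us, rt_period_us);
	printf("non-RT tasks keep at least %lld us per period\n",
	       rt_period_us - rt_runtime_us);
	return 0;
}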
However, when multiple cores are available, rt_bandwidth allows cores
to borrow rt_runtime from one another. This means that a core with a
rogue RT thread, consuming 100% CPU cycles, can borrow enough runtime
from other cores to allow the RT thread to run continuously, with no
runtime for regular tasks on this core.
Although regular tasks can get scheduled on other available cores
(which are guaranteed to have some non-RT runtime available, since they
just lent some RT time to us), tasks that are specifically affined to
a particular core may not be able to make progress (e.g. workqueues,
timer functions). This can break e.g. watchdog-like functionality that
is supposed to kill the rogue RT thread.
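A minimal reproducer sketch of that failure mode (my construction under assumed conditions, not code from this thread): pin a SCHED_FIFO spinner to one CPU and, with runtime sharing in effect, watch per-cpu work affined to that CPU stall.

/*
 * Hypothetical reproducer: run as root, pins a busy loop to CPU 0 at
 * SCHED_FIFO priority 99. With RT runtime borrowing in effect, the local
 * rt_rq can accumulate a full period of runtime and starve per-cpu
 * kworkers/timer callbacks on CPU 0.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = 99 };

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	for (;;)
		;	/* rogue RT hog */
}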
This patch changes do_balance_runtime() in such a way that no core can
acquire (borrow) more runtime than allowed by the globally set rt_runtime /
rt_period ratio. This guarantees there will always be some non-RT
runtime available on every individual core.
Signed-off-by: Peter Boonstoppel <[email protected]>
---
kernel/sched/rt.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 127a2c4..5ec4eab 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -571,11 +571,25 @@ static int do_balance_runtime(struct rt_rq *rt_rq)
struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
int i, weight, more = 0;
u64 rt_period;
+ u64 max_runtime;
weight = cpumask_weight(rd->span);
raw_spin_lock(&rt_b->rt_runtime_lock);
rt_period = ktime_to_ns(rt_b->rt_period);
+
+ /* Don't allow more runtime than global ratio */
+ if (global_rt_runtime() == RUNTIME_INF)
+ max_runtime = rt_period;
+ else
+ max_runtime = div64_u64(global_rt_runtime() * rt_period,
+ global_rt_period());
+
+ if (rt_rq->rt_runtime >= max_runtime) {
+ raw_spin_unlock(&rt_b->rt_runtime_lock);
+ return more;
+ }
+
for_each_cpu(i, rd->span) {
struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
s64 diff;
@@ -592,6 +606,7 @@ static int do_balance_runtime(struct rt_rq *rt_rq)
if (iter->rt_runtime == RUNTIME_INF)
goto next;
+
/*
* From runqueues with spare time, take 1/n part of their
* spare time, but no more than our period.
@@ -599,12 +614,12 @@ static int do_balance_runtime(struct rt_rq *rt_rq)
diff = iter->rt_runtime - iter->rt_time;
if (diff > 0) {
diff = div_u64((u64)diff, weight);
- if (rt_rq->rt_runtime + diff > rt_period)
- diff = rt_period - rt_rq->rt_runtime;
+ if (rt_rq->rt_runtime + diff > max_runtime)
+ diff = max_runtime - rt_rq->rt_runtime;
iter->rt_runtime -= diff;
rt_rq->rt_runtime += diff;
more = 1;
- if (rt_rq->rt_runtime == rt_period) {
+ if (rt_rq->rt_runtime == max_runtime) {
raw_spin_unlock(&iter->rt_runtime_lock);
break;
}
--
1.7.0.4
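Worked through with the default 95% global ratio (editor's arithmetic, not part of the changelog), the new per-core cap comes out as follows:

/*
 * Sketch of the max_runtime computation above, using assumed default
 * values: global_rt_runtime() = 950ms, global_rt_period() = 1s, local
 * rt_period = 1s. The pre-patch cap was a full rt_period (1000ms); the
 * patched cap leaves 50ms of non-RT time on every core.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long global_runtime_ns = 950000000ULL;	/* 950 ms */
	unsigned long long global_period_ns  = 1000000000ULL;	/*   1 s  */
	unsigned long long rt_period_ns      = 1000000000ULL;	/* local period */

	unsigned long long max_runtime =
		global_runtime_ns * rt_period_ns / global_period_ns;

	printf("per-core cap: %llu ns RT, %llu ns left for non-RT\n",
	       max_runtime, rt_period_ns - max_runtime);
	return 0;
}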
On Tue, 2013-05-21 at 14:30 -0700, Peter Boonstoppel wrote:
> RT throttling aims to prevent starvation of non-SCHED_FIFO threads
> when a rogue RT thread is hogging the CPU. It does so by piggybacking
> on the rt_bandwidth system and allocating at most rt_runtime per
> rt_period to SCHED_FIFO tasks (e.g. 950ms out of every second,
> allowing 'regular' tasks to run for at least 50ms every second).
>
> However, when multiple cores are available, rt_bandwidth allows cores
> to borrow rt_runtime from one another. This means that a core with a
> rogue RT thread, consuming 100% CPU cycles, can borrow enough runtime
> from other cores to allow the RT thread to run continuously, with no
> runtime for regular tasks on this core.
>
> Although regular tasks can get scheduled on other available cores
> (which are guaranteed to have some non-RT runtime available, since they
> just lent some RT time to us), tasks that are specifically affined to
> a particular core may not be able to make progress (e.g. workqueues,
> timer functions). This can break e.g. watchdog-like functionality that
> is supposed to kill the rogue RT thread.
>
> This patch changes do_balance_runtime() in such a way that no core can
> acquire (borrow) more runtime than allowed by the globally set rt_runtime /
> rt_period ratio. This guarantees there will always be some non-RT
> runtime available on every individual core.
Seems to me it'd be better to just invert the RT_RUNTIME_SHARE default
setting, since the real-world throttle mission tends to be saving the
user from his own brilliance a lot more often than any group scheduling
or debugging.
I see this fairly frequently: "My application is the most important
thing in the universe, so of course I run it at SCHED_FIFO priority 99.
Why do all kinds of bad juju happen, frequently ending with the box
ignoring all gefingerpoken und mittengrabben, when I do something as
innocuous as turning my pet cpuhog from hell loose in godmode?"
-Mike
Hi, Peter
On 05/22/2013 05:30 AM, Peter Boonstoppel wrote:
> RT throttling aims to prevent starvation of non-SCHED_FIFO threads
> when a rogue RT thread is hogging the CPU. It does so by piggybacking
> on the rt_bandwidth system and allocating at most rt_runtime per
> rt_period to SCHED_FIFO tasks (e.g. 950ms out of every second,
> allowing 'regular' tasks to run for at least 50ms every second).
>
> However, when multiple cores are available, rt_bandwidth allows cores
> to borrow rt_runtime from one another. This means that a core with a
> rogue RT thread, consuming 100% CPU cycles, can borrow enough runtime
> from other cores to allow the RT thread to run continuously, with no
> runtime for regular tasks on this core.
IMHO, this kind of starvation should be attributed to the admin...
Reserving CPU time would make 'realtime' a misnomer, and then the admin
will blame the scheduler when his RT task sees higher latency...
Regards,
Michael Wang
On Tue, May 21, 2013 at 02:30:09PM -0700, Peter Boonstoppel wrote:
> RT throttling aims to prevent starvation of non-SCHED_FIFO threads
> when a rogue RT thread is hogging the CPU. It does so by piggybacking
> on the rt_bandwidth system and allocating at most rt_runtime per
> rt_period to SCHED_FIFO tasks (e.g. 950ms out of every second,
> allowing 'regular' tasks to run for at least 50ms every second).
>
> However, when multiple cores are available, rt_bandwidth allows cores
> to borrow rt_runtime from one another. This means that a core with a
> rogue RT thread, consuming 100% CPU cycles, can borrow enough runtime
> from other cores to allow the RT thread to run continuously, with no
> runtime for regular tasks on this core.
>
> Although regular tasks can get scheduled on other available cores
> (which are guaranteed to have some non-RT runtime available, since they
> just lent some RT time to us), tasks that are specifically affined to
> a particular core may not be able to make progress (e.g. workqueues,
> timer functions). This can break e.g. watchdog-like functionality that
> is supposed to kill the rogue RT thread.
>
> This patch changes do_balance_runtime() in such a way that no core can
> acquire (borrow) more runtime than allowed by the globally set rt_runtime /
> rt_period ratio. This guarantees there will always be some non-RT
> runtime available on every individual core.
>
> Signed-off-by: Peter Boonstoppel <[email protected]>
I thought we already had a knob for that; RT_RUNTIME_SHARE. And as Mike
said, we might consider flipping the default on that.
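For reference, RT_RUNTIME_SHARE can also be flipped at runtime without a rebuild; this is a hedged sketch that assumes CONFIG_SCHED_DEBUG is set and debugfs is mounted at /sys/kernel/debug:

/*
 * Sketch: disable RT runtime borrowing on a running system by writing the
 * feature name with a NO_ prefix to the sched_features debugfs file.
 * Requires root; the file only exists with CONFIG_SCHED_DEBUG.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

	if (!f) {
		perror("/sys/kernel/debug/sched_features");
		return 1;
	}
	fputs("NO_RT_RUNTIME_SHARE", f);
	return fclose(f) ? 1 : 0;
}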
On Wed, 2013-05-22 at 10:16 +0200, Peter Zijlstra wrote:
> I thought we already had a knob for that; RT_RUNTIME_SHARE. And as Mike
> said, we might consider flipping the default on that.
sched,rt: disable rt_runtime borrowing by default
Make the default RT_RUNTIME_SHARE setting reflect the most common
throttle role: that of a safety mechanism to protect the box.
Signed-off-by: Mike Galbraith <[email protected]>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..0945d38 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,7 +57,7 @@ SCHED_FEAT(NONTASK_POWER, true)
SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
-SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(RT_RUNTIME_SHARE, false)
SCHED_FEAT(LB_MIN, false)
/*
I think my patch was aiming at enabling rt_runtime borrowing, while preserving the global invariant of not allowing 100% allocation of time to rt_rq. Disabling RT_RUNTIME_SHARE definitely addresses the safety mechanism concern.
However, I'd like to understand the original goal of this rt_runtime borrowing a little bit better. At a high level it seems to me that bandwidth allocation (with possible borrowing of unused bandwidth) and providing a safety mechanism against rogue rt threads are two independent goals, which just happen to share the same mechanism. And if you'd want both, you'd have to provision some amount of bandwidth that cannot be borrowed.
Other than that I second Mike's change for flipping the default.
Peter
________________________________________
From: Mike Galbraith [[email protected]]
Sent: Wednesday, May 22, 2013 1:46 AM
To: Peter Zijlstra
Cc: Peter Boonstoppel; Ingo Molnar; [email protected]; Paul Walmsley
Subject: Re: [PATCH RFC] sched/rt: preserve global runtime/period ratio in do_balance_runtime()
/*