The rt throttling mechanism prevents the starvation of non-real-time
tasks by CPU intensive real-time tasks. In terms of percentage,
the default behavior allows real-time tasks to run up to 95% of a
given period, leaving the other 5% of the period for non-real-time
tasks. In the absence of non-rt tasks, the system goes idle for 5%
of the period.
Although this behavior works fine for the purpose of avoiding
bad real-time tasks that can hang the system, some greed users
want to allow the real-time task to continue running in the absence
of non-real-time tasks starving. In other words, they do not want to
see the system going idle.
This patch implements the RT_RUNTIME_GREED scheduler feature for greedy
users (TM). When enabled, this feature will check if non-rt tasks are
starving before throttling the real-time task. If the real-time task
becomes throttled, it will be unthrottled as soon as the system goes
idle, or when the next period starts, whichever comes first.
This feature is enabled with the following command:
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
The user might also want to disable NO_RT_RUNTIME_SHARE logic,
to keep all CPUs with the same rt_runtime.
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
With these two options set, the user will guarantee some runtime
for non-rt-tasks on all CPUs, while keeping real-time tasks running
as much as possible.
The feature is disabled by default, keeping the current behavior.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: linux-rt-users <[email protected]>
Cc: LKML <[email protected]>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..c4c62ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3275,7 +3275,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie
if (unlikely(!p))
p = idle_sched_class.pick_next_task(rq, prev, cookie);
- return p;
+ if (likely(p != RETRY_TASK))
+ return p;
}
again:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa..3bd7a6d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,6 +66,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(RT_RUNTIME_GREED, false)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 5405d3f..0f23e06 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -26,6 +26,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
{
+ if (sched_feat(RT_RUNTIME_GREED))
+ if (try_to_unthrottle_rt_rq(&rq->rt))
+ return RETRY_TASK;
+
put_prev_task(rq, prev);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2516b8d..a6961a5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -631,6 +631,22 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
#endif /* CONFIG_RT_GROUP_SCHED */
+static inline void unthrottle_rt_rq(struct rt_rq *rt_rq)
+{
+ rt_rq->rt_time = 0;
+ rt_rq->rt_throttled = 0;
+ sched_rt_rq_enqueue(rt_rq);
+}
+
+int try_to_unthrottle_rt_rq(struct rt_rq *rt_rq)
+{
+ if (rt_rq_throttled(rt_rq)) {
+ unthrottle_rt_rq(rt_rq);
+ return 1;
+ }
+ return 0;
+}
+
bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
{
struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -920,6 +936,18 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
* but accrue some time due to boosting.
*/
if (likely(rt_b->rt_runtime)) {
+ if (sched_feat(RT_RUNTIME_GREED)) {
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ /*
+ * If there is no other tasks able to run
+ * on this rq, lets be greed and reset our
+ * rt_time.
+ */
+ if (rq->nr_running == rt_rq->rt_nr_running) {
+ rt_rq->rt_time = 0;
+ return 0;
+ }
+ }
rt_rq->rt_throttled = 1;
printk_deferred_once("sched: RT throttling activated\n");
} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..450ca34 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,6 +502,8 @@ struct rt_rq {
#endif
};
+int try_to_unthrottle_rt_rq(struct rt_rq *rt_rq);
+
/* Deadline class' related fields in a runqueue */
struct dl_rq {
/* runqueue is an rbtree, ordered by deadline */
--
2.7.4
as anticipated live to Daniel:
-) +1 for the general concept, we'd need something similar also for SCHED_DEADLINE
-) only issue might be that, if a non-RT task wakes up after the unthrottle, it will have to wait, but worst-case it will have a chance in the next throttling window
-) an alternative to unthrottling might be temporary class downgrade to sched_other, but that might be much more complex, instead this Daniel's one looks quite simple
-) when considering also DEADLINE tasks, it might be good to think about how we'd like the throttling of DEADLINE and RT tasks to inter-relate, e.g.:
a) DEADLINE unthrottles if there's no RT nor OTHER tasks? what if there's an unthrottled RT?
b) DEADLINE throttles by downgrading to OTHER?
c) DEADLINE throttles by downgrading to RT (RR/FIFO and what prio?)
My2c, thanks!
T.
On 07/11/2016 09:17, Daniel Bristot de Oliveira wrote:
> The rt throttling mechanism prevents the starvation of non-real-time
> tasks by CPU intensive real-time tasks. In terms of percentage,
> the default behavior allows real-time tasks to run up to 95% of a
> given period, leaving the other 5% of the period for non-real-time
> tasks. In the absence of non-rt tasks, the system goes idle for 5%
> of the period.
>
> Although this behavior works fine for the purpose of avoiding
> bad real-time tasks that can hang the system, some greed users
> want to allow the real-time task to continue running in the absence
> of non-real-time tasks starving. In other words, they do not want to
> see the system going idle.
>
> This patch implements the RT_RUNTIME_GREED scheduler feature for greedy
> users (TM). When enabled, this feature will check if non-rt tasks are
> starving before throttling the real-time task. If the real-time task
> becomes throttled, it will be unthrottled as soon as the system goes
> idle, or when the next period starts, whichever comes first.
>
> This feature is enabled with the following command:
> # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
>
> The user might also want to disable NO_RT_RUNTIME_SHARE logic,
> to keep all CPUs with the same rt_runtime.
> # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
>
> With these two options set, the user will guarantee some runtime
> for non-rt-tasks on all CPUs, while keeping real-time tasks running
> as much as possible.
>
> The feature is disabled by default, keeping the current behavior.
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> Reviewed-by: Steven Rostedt <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: linux-rt-users <[email protected]>
> Cc: LKML <[email protected]>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 42d4027..c4c62ee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3275,7 +3275,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie
> if (unlikely(!p))
> p = idle_sched_class.pick_next_task(rq, prev, cookie);
>
> - return p;
> + if (likely(p != RETRY_TASK))
> + return p;
> }
>
> again:
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 69631fa..3bd7a6d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -66,6 +66,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>
> SCHED_FEAT(FORCE_SD_OVERLAP, false)
> SCHED_FEAT(RT_RUNTIME_SHARE, true)
> +SCHED_FEAT(RT_RUNTIME_GREED, false)
> SCHED_FEAT(LB_MIN, false)
> SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 5405d3f..0f23e06 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -26,6 +26,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> static struct task_struct *
> pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
> {
> + if (sched_feat(RT_RUNTIME_GREED))
> + if (try_to_unthrottle_rt_rq(&rq->rt))
> + return RETRY_TASK;
> +
> put_prev_task(rq, prev);
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 2516b8d..a6961a5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -631,6 +631,22 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
>
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> +static inline void unthrottle_rt_rq(struct rt_rq *rt_rq)
> +{
> + rt_rq->rt_time = 0;
> + rt_rq->rt_throttled = 0;
> + sched_rt_rq_enqueue(rt_rq);
> +}
> +
> +int try_to_unthrottle_rt_rq(struct rt_rq *rt_rq)
> +{
> + if (rt_rq_throttled(rt_rq)) {
> + unthrottle_rt_rq(rt_rq);
> + return 1;
> + }
> + return 0;
> +}
> +
> bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
> {
> struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> @@ -920,6 +936,18 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
> * but accrue some time due to boosting.
> */
> if (likely(rt_b->rt_runtime)) {
> + if (sched_feat(RT_RUNTIME_GREED)) {
> + struct rq *rq = rq_of_rt_rq(rt_rq);
> + /*
> + * If there is no other tasks able to run
> + * on this rq, lets be greed and reset our
> + * rt_time.
> + */
> + if (rq->nr_running == rt_rq->rt_nr_running) {
> + rt_rq->rt_time = 0;
> + return 0;
> + }
> + }
> rt_rq->rt_throttled = 1;
> printk_deferred_once("sched: RT throttling activated\n");
> } else {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 055f935..450ca34 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -502,6 +502,8 @@ struct rt_rq {
> #endif
> };
>
> +int try_to_unthrottle_rt_rq(struct rt_rq *rt_rq);
> +
> /* Deadline class' related fields in a runqueue */
> struct dl_rq {
> /* runqueue is an rbtree, ordered by deadline */
Hi Tommaso,
On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote:
> as anticipated live to Daniel:
> -) +1 for the general concept, we'd need something similar also for
> SCHED_DEADLINE
Resumed: the sum of the runtime of deadline tasks will not be greater
than the "to_ratio(global_rt_period(), global_rt_runtime())" - see
init_dl_bw(). Therefore, DL rq will not be throttle by the RT throttling
mechanism.
Extended: RT tasks' throttling aims to bound, for all CPUS of a domain -
when RT_RUNTIME_SHARING sharing is enabled; or per-rq - when
RT_RUNTIME_SHARING is disabled; the amount of time that RT tasks can run
continuously, in such way to provide some CPU time for non-real-time
tasks to run. RT tasks need this global/local throttling mechanism to
avoid the starvation of non-rt tasks because RT tasks do not have a
limited runtime - RT task (or taskset) can run for an infinity runtime.
DL tasks' throttling has another meaning. DL tasks' throttling aims to
avoid *a* DL task for running for more than *its own* pre-allocated runtime.
The sum of allocated runtime for all DL tasks will not to be greater
than RT throttling enforcement runtime. The DL scheduler admission
control already avoids this by limiting the amount of CPU time all DL
tasks can consume (see init_dl_bw()). So, DL tasks are avoid ind the
"global" throttling on before hand - in the admission control.
GRUB might implement something <<similar>> for the DEADLINE scheduler.
With GRUB, a deadline tasks will have more runtime than previously
set/granted..... But I am quite sure it will still be bounded by the sum
of the already allocated DL runtime, that will continue being smaller
than "to_ratio(global_rt_period(), global_rt_runtime())".
Am I missing something?
> -) only issue might be that, if a non-RT task wakes up after the
> unthrottle, it will have to wait, but worst-case it will have a chance
> in the next throttling window
In the current default behavior (RT_RUNTIME_SHARING), in a domain with
more than two CPUs, the worst case easily become "infinity," because a
CPU can borrow runtime from another CPU. There is no guarantee for
minimum latency for non-rt tasks. Anyway, if the user wants to provide
such guarantee, they just need not enable this feature, while disabling
RT_RUNTIME_SHARING (or run the non-rt task as a deadline task ;-))
> -) an alternative to unthrottling might be temporary class downgrade to
> sched_other, but that might be much more complex, instead this Daniel's
> one looks quite simple
Yeah, decrease the priority of the task would be something way more
complicated and prone to errors. RT tasks would need to reduce its
priority to a level higher than the IDLE task, but lower than SCHED_IDLE...
> -) when considering also DEADLINE tasks, it might be good to think about
> how we'd like the throttling of DEADLINE and RT tasks to inter-relate,
> e.g.:
Currently, DL tasks are limited (in the bw control) to the global RT
throttling limit...
I think that this might be an extension to GRUB... that is extending the
current behavior... so... things for the future - and IMHO it is another
topic - way more challenging.
Comments are welcome :-)
-- Daniel
On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote:
> With these two options set, the user will guarantee some runtime
> for non-rt-tasks on all CPUs, while keeping real-time tasks running
> as much as possible.
Excellent this would improve the situation with deadlocks as a result of
cgroup_locks not being released due to lack of workqueue processing.
On 07/11/2016 14:51, Daniel Bristot de Oliveira wrote:
> Hi Tommaso,
Hi,
I'm cc-ing Luca for GRUB et al., pls find a few further notes below...
> On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote:
>> as anticipated live to Daniel:
>> -) +1 for the general concept, we'd need something similar also for
>> SCHED_DEADLINE
>
> Resumed: the sum of the runtime of deadline tasks will not be greater
> than the "to_ratio(global_rt_period(), global_rt_runtime())" - see
> init_dl_bw(). Therefore, DL rq will not be throttle by the RT throttling
> mechanism.
>
> Extended: RT tasks' throttling aims to bound, for all CPUS of a domain -
> when RT_RUNTIME_SHARING sharing is enabled; or per-rq - when
> RT_RUNTIME_SHARING is disabled; the amount of time that RT tasks can run
> continuously, in such way to provide some CPU time for non-real-time
> tasks to run. RT tasks need this global/local throttling mechanism to
> avoid the starvation of non-rt tasks because RT tasks do not have a
> limited runtime - RT task (or taskset) can run for an infinity runtime.
>
> DL tasks' throttling has another meaning. DL tasks' throttling aims to
> avoid *a* DL task for running for more than *its own* pre-allocated runtime.
sure, and having an option to let it run for longer, if there's nothing else
running in the system, is still interesting for pretty much similar reasons
to those being discussed in this thread ...
> The sum of allocated runtime for all DL tasks will not to be greater
> than RT throttling enforcement runtime. The DL scheduler admission
> control already avoids this by limiting the amount of CPU time all DL
> tasks can consume (see init_dl_bw()). So, DL tasks are avoid ind the
> "global" throttling on before hand - in the admission control.
>
> GRUB might implement something <<similar>> for the DEADLINE scheduler.
> With GRUB, a deadline tasks will have more runtime than previously
> set/granted.....
yes, the main difference being: GRUB will let a DL task run for longer
than its own runtime, but still let it starve anything below (RT as well
as OTHER tasks); perhaps Luca (cc) has some further comment on this...
Thanks,
T.
> But I am quite sure it will still be bounded by the sum
> of the already allocated DL runtime, that will continue being smaller
> than "to_ratio(global_rt_period(), global_rt_runtime())".
>
> Am I missing something?
>
>> -) only issue might be that, if a non-RT task wakes up after the
>> unthrottle, it will have to wait, but worst-case it will have a chance
>> in the next throttling window
>
> In the current default behavior (RT_RUNTIME_SHARING), in a domain with
> more than two CPUs, the worst case easily become "infinity," because a
> CPU can borrow runtime from another CPU. There is no guarantee for
> minimum latency for non-rt tasks. Anyway, if the user wants to provide
> such guarantee, they just need not enable this feature, while disabling
> RT_RUNTIME_SHARING (or run the non-rt task as a deadline task ;-))
>
>> -) an alternative to unthrottling might be temporary class downgrade to
>> sched_other, but that might be much more complex, instead this Daniel's
>> one looks quite simple
>
> Yeah, decrease the priority of the task would be something way more
> complicated and prone to errors. RT tasks would need to reduce its
> priority to a level higher than the IDLE task, but lower than SCHED_IDLE...
>
>> -) when considering also DEADLINE tasks, it might be good to think about
>> how we'd like the throttling of DEADLINE and RT tasks to inter-relate,
>> e.g.:
>
> Currently, DL tasks are limited (in the bw control) to the global RT
> throttling limit...
>
> I think that this might be an extension to GRUB... that is extending the
> current behavior... so... things for the future - and IMHO it is another
> topic - way more challenging.
>
> Comments are welcome :-)
>
> -- Daniel
>
--
Tommaso Cucinotta, Computer Engineering PhD
Associate Professor at the Real-Time Systems Laboratory (ReTiS)
Scuola Superiore Sant'Anna, Pisa, Italy
http://retis.sssup.it/people/tommaso
On Mon, 7 Nov 2016 09:17:55 +0100
Daniel Bristot de Oliveira <[email protected]> wrote:
> The rt throttling mechanism prevents the starvation of non-real-time
> tasks by CPU intensive real-time tasks. In terms of percentage,
> the default behavior allows real-time tasks to run up to 95% of a
> given period, leaving the other 5% of the period for non-real-time
> tasks. In the absence of non-rt tasks, the system goes idle for 5%
> of the period.
>
> Although this behavior works fine for the purpose of avoiding
> bad real-time tasks that can hang the system, some greed users
> want to allow the real-time task to continue running in the absence
> of non-real-time tasks starving. In other words, they do not want to
> see the system going idle.
>
> This patch implements the RT_RUNTIME_GREED scheduler feature for greedy
> users (TM). When enabled, this feature will check if non-rt tasks are
> starving before throttling the real-time task. If the real-time task
> becomes throttled, it will be unthrottled as soon as the system goes
> idle, or when the next period starts, whichever comes first.
>
> This feature is enabled with the following command:
> # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
>
I'm still reviewing the patch, but I have to wonder why bother with making it a scheduler feature?
The SCHED_FIFO definition allows a fifo thread to starve others because a fifo task will run until it yields. Throttling was added as a safety valve to allow starved SCHED_OTHER tasks to get some cpu time. Adding this unconditionally gets us a safety valve for throttling a badly written fifo task, but allows the fifo task to continue to consume cpu cycles if it's not starving anyone.
Or am I missing something that's blazingly obvious?
Clark
On Mon, 7 Nov 2016 19:03:08 +0100
Tommaso Cucinotta <[email protected]> wrote:
> On 07/11/2016 14:51, Daniel Bristot de Oliveira wrote:
> > Hi Tommaso,
>
> Hi,
>
> I'm cc-ing Luca for GRUB et al., pls find a few further notes below...
Thanks Tommaso! I've seen the email on the linux-rt-users mailing list,
and I'll reply there about GRUB...
Thanks,
Luca
>
> > On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote:
> >> as anticipated live to Daniel:
> >> -) +1 for the general concept, we'd need something similar also for
> >> SCHED_DEADLINE
> >
> > Resumed: the sum of the runtime of deadline tasks will not be
> > greater than the "to_ratio(global_rt_period(),
> > global_rt_runtime())" - see init_dl_bw(). Therefore, DL rq will not
> > be throttle by the RT throttling mechanism.
> >
> > Extended: RT tasks' throttling aims to bound, for all CPUS of a
> > domain - when RT_RUNTIME_SHARING sharing is enabled; or per-rq -
> > when RT_RUNTIME_SHARING is disabled; the amount of time that RT
> > tasks can run continuously, in such way to provide some CPU time
> > for non-real-time tasks to run. RT tasks need this global/local
> > throttling mechanism to avoid the starvation of non-rt tasks
> > because RT tasks do not have a limited runtime - RT task (or
> > taskset) can run for an infinity runtime.
> >
> > DL tasks' throttling has another meaning. DL tasks' throttling aims
> > to avoid *a* DL task for running for more than *its own*
> > pre-allocated runtime.
>
> sure, and having an option to let it run for longer, if there's
> nothing else running in the system, is still interesting for pretty
> much similar reasons to those being discussed in this thread ...
>
> > The sum of allocated runtime for all DL tasks will not to be greater
> > than RT throttling enforcement runtime. The DL scheduler admission
> > control already avoids this by limiting the amount of CPU time all
> > DL tasks can consume (see init_dl_bw()). So, DL tasks are avoid ind
> > the "global" throttling on before hand - in the admission control.
> >
> > GRUB might implement something <<similar>> for the DEADLINE
> > scheduler. With GRUB, a deadline tasks will have more runtime than
> > previously set/granted.....
>
> yes, the main difference being: GRUB will let a DL task run for longer
> than its own runtime, but still let it starve anything below (RT as
> well as OTHER tasks); perhaps Luca (cc) has some further comment on
> this...
>
> Thanks,
>
> T.
>
> > But I am quite sure it will still be bounded by the sum
> > of the already allocated DL runtime, that will continue being
> > smaller than "to_ratio(global_rt_period(), global_rt_runtime())".
> >
> > Am I missing something?
> >
> >> -) only issue might be that, if a non-RT task wakes up after the
> >> unthrottle, it will have to wait, but worst-case it will have a
> >> chance in the next throttling window
> >
> > In the current default behavior (RT_RUNTIME_SHARING), in a domain
> > with more than two CPUs, the worst case easily become "infinity,"
> > because a CPU can borrow runtime from another CPU. There is no
> > guarantee for minimum latency for non-rt tasks. Anyway, if the user
> > wants to provide such guarantee, they just need not enable this
> > feature, while disabling RT_RUNTIME_SHARING (or run the non-rt task
> > as a deadline task ;-))
> >
> >> -) an alternative to unthrottling might be temporary class
> >> downgrade to sched_other, but that might be much more complex,
> >> instead this Daniel's one looks quite simple
> >
> > Yeah, decrease the priority of the task would be something way more
> > complicated and prone to errors. RT tasks would need to reduce its
> > priority to a level higher than the IDLE task, but lower than
> > SCHED_IDLE...
> >
> >> -) when considering also DEADLINE tasks, it might be good to think
> >> about how we'd like the throttling of DEADLINE and RT tasks to
> >> inter-relate, e.g.:
> >
> > Currently, DL tasks are limited (in the bw control) to the global RT
> > throttling limit...
> >
> > I think that this might be an extension to GRUB... that is
> > extending the current behavior... so... things for the future - and
> > IMHO it is another topic - way more challenging.
> >
> > Comments are welcome :-)
> >
> > -- Daniel
> >
>
>
On Mon, 7 Nov 2016 12:22:21 -0600
Clark Williams <[email protected]> wrote:
> I'm still reviewing the patch, but I have to wonder why bother with making it a scheduler feature?
>
> The SCHED_FIFO definition allows a fifo thread to starve others
> because a fifo task will run until it yields. Throttling was added as
> a safety valve to allow starved SCHED_OTHER tasks to get some cpu
> time. Adding this unconditionally gets us a safety valve for
> throttling a badly written fifo task, but allows the fifo task to
> continue to consume cpu cycles if it's not starving anyone.
>
> Or am I missing something that's blazingly obvious?
Or I say make it the default. If people want the old behavior, they can
modify SCHED_FEATURES to do so.
-- Steve
On Mon, 7 Nov 2016 10:55:38 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote:
>
> > With these two options set, the user will guarantee some runtime
> > for non-rt-tasks on all CPUs, while keeping real-time tasks running
> > as much as possible.
>
> Excellent this would improve the situation with deadlocks as a result of
> cgroup_locks not being released due to lack of workqueue processing.
?? What deadlocks do you see? I mean, can you show the situation that
throttling RT tasks will cause deadlock?
Sorry, but I'm just not seeing it.
-- Steve
On 11/07/2016 07:30 PM, Steven Rostedt wrote:
>> I'm still reviewing the patch, but I have to wonder why bother with making it a scheduler feature?
>> >
>> > The SCHED_FIFO definition allows a fifo thread to starve others
>> > because a fifo task will run until it yields. Throttling was added as
>> > a safety valve to allow starved SCHED_OTHER tasks to get some cpu
>> > time. Adding this unconditionally gets us a safety valve for
>> > throttling a badly written fifo task, but allows the fifo task to
>> > continue to consume cpu cycles if it's not starving anyone.
>> >
>> > Or am I missing something that's blazingly obvious?
> Or I say make it the default. If people want the old behavior, they can
> modify SCHED_FEATURES to do so.
I added it as a feature to keep the current behavior by default.
Currently, we have two throttling modes:
With RT_RUNTIME_SHARING (default):
before throttle, try to borrow some runtime from other CPU.
Without RT_RUNTIME_SHARING:
throttle the RT task, even if there is nothing else to do.
The problem of the first is that an CPU easily borrow enough runtime to
make the spin-rt-task to run forever, allowing the starvation of the
non-rt-tasks, hence invalidating the mechanism.
The problem of the second is that (with the default values) the CPU will
be idle 5% of the time.
IMHO, the balanced behavior is using GREED option + without
RT_RUNTIME_SHARING: the non-rt tasks will be able to run, while avoiding
CPU going idle.
We can turn it by default setting default options.
Moreover, AFAICS, these sched options are static keys, so they are very
very low overhead conditions.
-- Daniel
On 11/07/2016 07:32 PM, Steven Rostedt wrote:
>> Excellent this would improve the situation with deadlocks as a result of
>> > cgroup_locks not being released due to lack of workqueue processing.
> ?? What deadlocks do you see? I mean, can you show the situation that
> throttling RT tasks will cause deadlock?
>
> Sorry, but I'm just not seeing it.
It is not a deadlock in the theoretical sense of the word, but it is
more a side effect of the starvation - that looks like a deadlock.
There is a case where the removal of a cgroup dir calls
lru_add_drain_all(), that might schedule a kworker in the CPU that is
running the spinning-rt task. The kworker will starve - because they are
SCHED_OTHER by design, the lru_add_drain_all() will wait forever while
holding the cgroup lock and this will cause a lot of problems on other
tasks.
This problem was fixed on -rt using a -rt specific lock mechanism, but
the problem still exist in the non-rt kernel. Btw, this is just an
example of side effects of the starvation of non-rt-tasks. We can have
more examples...
-- Daniel
On Mon, 7 Nov 2016 13:30:46 -0500
Steven Rostedt <[email protected]> wrote:
> On Mon, 7 Nov 2016 12:22:21 -0600
> Clark Williams <[email protected]> wrote:
>
> > I'm still reviewing the patch, but I have to wonder why bother with making it a scheduler feature?
> >
> > The SCHED_FIFO definition allows a fifo thread to starve others
> > because a fifo task will run until it yields. Throttling was added as
> > a safety valve to allow starved SCHED_OTHER tasks to get some cpu
> > time. Adding this unconditionally gets us a safety valve for
> > throttling a badly written fifo task, but allows the fifo task to
> > continue to consume cpu cycles if it's not starving anyone.
> >
> > Or am I missing something that's blazingly obvious?
>
> Or I say make it the default. If people want the old behavior, they can
> modify SCHED_FEATURES to do so.
>
Ok, I can see wanting the previous behavior.
Clark
On Mon, 7 Nov 2016 19:49:03 +0100
Daniel Bristot de Oliveira <[email protected]> wrote:
> On 11/07/2016 07:32 PM, Steven Rostedt wrote:
> >> Excellent this would improve the situation with deadlocks as a result of
> >> > cgroup_locks not being released due to lack of workqueue processing.
> > ?? What deadlocks do you see? I mean, can you show the situation that
> > throttling RT tasks will cause deadlock?
> >
> > Sorry, but I'm just not seeing it.
>
> It is not a deadlock in the theoretical sense of the word, but it is
> more a side effect of the starvation - that looks like a deadlock.
>
> There is a case where the removal of a cgroup dir calls
> lru_add_drain_all(), that might schedule a kworker in the CPU that is
> running the spinning-rt task. The kworker will starve - because they are
> SCHED_OTHER by design, the lru_add_drain_all() will wait forever while
> holding the cgroup lock and this will cause a lot of problems on other
> tasks.
I understand the issue with not throttling an RT task, but this patch
is about not not throttling! That is, what scenario is there that will
cause a "deadlock" or deadlock like to happen when we *do* throttle,
where not throttling will work better, as this patch would have?
-- Steve
On Mon, 7 Nov 2016, Steven Rostedt wrote:
> On Mon, 7 Nov 2016 10:55:38 -0600 (CST)
> Christoph Lameter <[email protected]> wrote:
>
> > On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote:
> >
> > > With these two options set, the user will guarantee some runtime
> > > for non-rt-tasks on all CPUs, while keeping real-time tasks running
> > > as much as possible.
> >
> > Excellent this would improve the situation with deadlocks as a result of
> > cgroup_locks not being released due to lack of workqueue processing.
>
> ?? What deadlocks do you see? I mean, can you show the situation that
> throttling RT tasks will cause deadlock?
>
> Sorry, but I'm just not seeing it.
SCHED_RR tasks alternately running on on cpu can cause endless deferral of
kworker threads. With the global effect of the OS processing reserved
it may be the case that the processor we are executing never gets any
time. And if that kworker threads role is releasing a mutex (like the
cgroup_lock) then deadlocks can result.
On Mon, 7 Nov 2016, Steven Rostedt wrote:
> On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
> Christoph Lameter <[email protected]> wrote:
>
> > SCHED_RR tasks alternately running on on cpu can cause endless deferral of
> > kworker threads. With the global effect of the OS processing reserved
> > it may be the case that the processor we are executing never gets any
> > time. And if that kworker threads role is releasing a mutex (like the
> > cgroup_lock) then deadlocks can result.
>
> I believe SCHED_RR tasks will still throttle if they use up too much of
> the CPU. But I still don't see how this patch helps your situation.
The kworker thread will be able to make progress? Or am I not reading this
correctly?
On Mon, 7 Nov 2016 13:54:12 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Mon, 7 Nov 2016, Steven Rostedt wrote:
>
> > On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
> > Christoph Lameter <[email protected]> wrote:
> >
> > > SCHED_RR tasks alternately running on on cpu can cause endless deferral of
> > > kworker threads. With the global effect of the OS processing reserved
> > > it may be the case that the processor we are executing never gets any
> > > time. And if that kworker threads role is releasing a mutex (like the
> > > cgroup_lock) then deadlocks can result.
> >
> > I believe SCHED_RR tasks will still throttle if they use up too much of
> > the CPU. But I still don't see how this patch helps your situation.
>
> The kworker thread will be able to make progress? Or am I not reading this
> correctly?
>
If kworker is SCHED_OTHER, then it will be able to make progress if the
RT tasks are throttled.
What Daniel's patch does, is to turn off throttling of the RT tasks if
there's no other task on the run queue.
-- Steve
On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> SCHED_RR tasks alternately running on on cpu can cause endless deferral of
> kworker threads. With the global effect of the OS processing reserved
> it may be the case that the processor we are executing never gets any
> time. And if that kworker threads role is releasing a mutex (like the
> cgroup_lock) then deadlocks can result.
I believe SCHED_RR tasks will still throttle if they use up too much of
the CPU. But I still don't see how this patch helps your situation.
-- Steve
On 11/07/2016 09:00 PM, Steven Rostedt wrote:
> On Mon, 7 Nov 2016 13:54:12 -0600 (CST)
> Christoph Lameter <[email protected]> wrote:
>
>> On Mon, 7 Nov 2016, Steven Rostedt wrote:
>>
>>> On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
>>> Christoph Lameter <[email protected]> wrote:
>>>
>>>> SCHED_RR tasks alternately running on on cpu can cause endless deferral of
>>>> kworker threads. With the global effect of the OS processing reserved
>>>> it may be the case that the processor we are executing never gets any
>>>> time. And if that kworker threads role is releasing a mutex (like the
>>>> cgroup_lock) then deadlocks can result.
>>>
>>> I believe SCHED_RR tasks will still throttle if they use up too much of
>>> the CPU. But I still don't see how this patch helps your situation.
>>
>> The kworker thread will be able to make progress? Or am I not reading this
>> correctly?
>>
>
> If kworker is SCHED_OTHER, then it will be able to make progress if the
> RT tasks are throttled.
>
> What Daniel's patch does, is to turn off throttling of the RT tasks if
> there's no other task on the run queue.
Here in the example of two spinning RR tasks (f-22466 & f-22473) and an
other task (o-22506):
f-22466 [002] d... 79045.641364: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=o next_pid=22506 next_prio=120
o-22506 [002] d... 79045.690379: sched_switch: prev_comm=o prev_pid=22506 prev_prio=120 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79045.725359: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79045.825356: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79045.925350: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.025346: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.125346: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.225337: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.325333: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.425328: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.525324: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.625319: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.641320: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=o next_pid=22506 next_prio=120
o-22506 [002] d... 79046.690335: sched_switch: prev_comm=o prev_pid=22506 prev_prio=120 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
The throttling is per-rq, so even if the RR tasks keep switching
between each other, the throttling will take place if there is
any sched other task.
On Christoph's case, the other task will be the kworker, like bellow:
f-22466 [002] d... 79294.430542: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79294.483539: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=kworker/2:1 next_pid=22198 next_prio=120
kworker/2:1-22198 [002] d... 79294.483544: sched_switch: prev_comm=kworker/2:1 prev_pid=22198 prev_prio=120 prev_state=S ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79294.530537: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79294.630541: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
The throttling allowed the kworker to run, but once the kworker went to
sleep, the RT tasks started to work again. In the previous behavior,
the system would either go idle, or the kworker would starve because
the runtime become infinity for RR tasks.
-- Daniel
On Mon, 7 Nov 2016 21:06:50 +0100
Daniel Bristot de Oliveira <[email protected]> wrote:
> The throttling allowed the kworker to run, but once the kworker went to
> sleep, the RT tasks started to work again. In the previous behavior,
> the system would either go idle, or the kworker would starve because
> the runtime become infinity for RR tasks.
I'm confused? Are you saying that RR tasks don't get throttled in the
current code? That sounds like a bug to me.
-- Steve
On 11/07/2016 09:16 PM, Steven Rostedt wrote:
> I'm confused? Are you saying that RR tasks don't get throttled in the
> current code? That sounds like a bug to me.
If the RT_RUNTIME_SHARING is enabled, the CPU in which the RR tasks are
running (and pinned) will borrow RT runtime from another CPU, allowing
the RR tasks to run forever. For example:
[root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 950.000000
.rt_runtime : 950.000000
.rt_runtime : 950.000000
.rt_runtime : 950.000000
[root@kiron debug]# echo RT_RUNTIME_SHARE > sched_features
[root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
[1] 23908
[root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
[2] 23915
[root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 900.000000
.rt_runtime : 950.000000
.rt_runtime : 1000.000000
.rt_runtime : 950.000000
You see? the rt_runtime of the CPU 2 was borrowed time from CPU 0.
It is not a BUG but a feature (no jokes haha). With RT_RUNTIME_SHARE,
the rt_runtime is such a global runtime. It works fine for tasks that
can migrate... but that is not the case for per-cpu kworkers.
-- Daniel
On Mon, 7 Nov 2016 21:33:02 +0100
Daniel Bristot de Oliveira <[email protected]> wrote:
> On 11/07/2016 09:16 PM, Steven Rostedt wrote:
> > I'm confused? Are you saying that RR tasks don't get throttled in the
> > current code? That sounds like a bug to me.
>
> If the RT_RUNTIME_SHARING is enabled, the CPU in which the RR tasks are
> running (and pinned) will borrow RT runtime from another CPU, allowing
> the RR tasks to run forever. For example:
>
> [root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
> .rt_runtime : 950.000000
> .rt_runtime : 950.000000
> .rt_runtime : 950.000000
> .rt_runtime : 950.000000
> [root@kiron debug]# echo RT_RUNTIME_SHARE > sched_features
> [root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
> [1] 23908
> [root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
> [2] 23915
> [root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
> .rt_runtime : 900.000000
> .rt_runtime : 950.000000
> .rt_runtime : 1000.000000
> .rt_runtime : 950.000000
>
> You see? the rt_runtime of the CPU 2 was borrowed time from CPU 0.
>
> It is not a BUG but a feature (no jokes haha). With RT_RUNTIME_SHARE,
> the rt_runtime is such a global runtime. It works fine for tasks that
> can migrate... but that is not the case for per-cpu kworkers.
This still looks like a bug, or not the expected result. Perhaps we
shouldn't share when tasks are pinned. It doesn't make sense. It's like
pinning two deadline tasks to the same CPU and giving them 100% of that
CPU and saying that it's really just 1/nr_cpus of usage, which would
have the same effect.
OK, it appears this is specific to RT_RUNTIME_SHARE which is what
causes this strange behavior, and even more rational to make this a
default option and perhaps even turn RT_RUNTIME_SHARE off by default.
-- Steve
Hi all,
since GRUB reclaiming has been mentioned, I am going to add some
comments on it :)
On Mon, 7 Nov 2016 14:51:37 +0100
Daniel Bristot de Oliveira <[email protected]> wrote:
[...]
> The sum of allocated runtime for all DL tasks will not to be greater
> than RT throttling enforcement runtime. The DL scheduler admission
> control already avoids this by limiting the amount of CPU time all DL
> tasks can consume (see init_dl_bw()). So, DL tasks are avoid ind the
> "global" throttling on before hand - in the admission control.
>
> GRUB might implement something <<similar>> for the DEADLINE scheduler.
> With GRUB, a deadline tasks will have more runtime than previously
> set/granted..... But I am quite sure it will still be bounded by the
> sum of the already allocated DL runtime, that will continue being
> smaller than "to_ratio(global_rt_period(), global_rt_runtime())".
Well, it's not exactly like this... In the original GRUB algorithm[1]
(that was uni-processor only), the tasks were able to reclaim 100% of
the CPU time (in other words: with the original GRUB algorithm,
SCHED_DEADLINE tasks can starve all of the non-deadline tasks).
But in the patchset I submitted I modified the algorithm to reclaim
only a specified fraction of the CPU time[2] (so that some CPU time is
left for non-deadline tasks). See patch 5/6 in my latest submission
(v3). I set the percentage of reclaimable CPU time equal to
"to_ratio(global_rt_period(), global_rt_runtime())" (so, deadline tasks
can be able to consume up to this fraction), but this can be changed if
needed.
Finally, notice that if we are interested in hard schedulability (hard
respect of all the deadlines - this is not something that
SCHED_DEADLINE currently does) then the reclaiming algorithm must be
modified and can reclaim a smaller amount of CPU time (see [3,4] for
details)
Luca
[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused
bandwidth in constant-bandwidth servers. In Real-Time Systems, 2000.
Euromicro RTS 2000. 12th Euromicro Conference on (pp. 193-200). IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October).
Greedy CPU reclaiming for SCHED DEADLINE. In Proceedings of the
Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany.
[3] Abeni, L., Lipari, G., Parri, A., & Sun, Y. (2016, April).
Multicore CPU reclaiming: parallel or sequential?. In Proceedings of
the 31st Annual ACM Symposium on Applied Computing (pp. 1877-1884). ACM.
[4] https://arxiv.org/abs/1512.01984
>
> Am I missing something?
>
> > -) only issue might be that, if a non-RT task wakes up after the
> > unthrottle, it will have to wait, but worst-case it will have a
> > chance in the next throttling window
>
> In the current default behavior (RT_RUNTIME_SHARING), in a domain with
> more than two CPUs, the worst case easily become "infinity," because a
> CPU can borrow runtime from another CPU. There is no guarantee for
> minimum latency for non-rt tasks. Anyway, if the user wants to provide
> such guarantee, they just need not enable this feature, while
> disabling RT_RUNTIME_SHARING (or run the non-rt task as a deadline
> task ;-))
>
> > -) an alternative to unthrottling might be temporary class
> > downgrade to sched_other, but that might be much more complex,
> > instead this Daniel's one looks quite simple
>
> Yeah, decrease the priority of the task would be something way more
> complicated and prone to errors. RT tasks would need to reduce its
> priority to a level higher than the IDLE task, but lower than
> SCHED_IDLE...
>
> > -) when considering also DEADLINE tasks, it might be good to think
> > about how we'd like the throttling of DEADLINE and RT tasks to
> > inter-relate, e.g.:
>
> Currently, DL tasks are limited (in the bw control) to the global RT
> throttling limit...
>
> I think that this might be an extension to GRUB... that is extending
> the current behavior... so... things for the future - and IMHO it is
> another topic - way more challenging.
>
> Comments are welcome :-)
>
> -- Daniel
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-rt-users" in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[ looks good? if so, I will send a v2 patch set including the ]
[ RT_RUNTIME_GREED patch and this one. ]
Currently, the option RT_RUNTIME_SHARE is enabled by default. This
option enables the sharing of rt_runtime between CPUs, allowing a CPU to
borrow rt_runtime from another CPU, permitting a real-time task to run
100% of the time on a single CPU. The problem is that this can lead to
the starvation of a non-real-time task pinned to the CPU running the CPU
bound RT workload. One example of non-real-time task pinned to a CPU are
the kworkers. Often kworkers starve on this scenario, causing a system
hang.
This patch changes the default setup for RT THROTTLING, disabling the
RT_RUNTIME_SHARE option while enabling the RT_RUNTIME_GREED option. In
such configuration, real-time tasks will be able to run 100% of the time
in the absence of non-real-time tasks starving in the local CPU.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: linux-rt-users <[email protected]>
Cc: LKML <[email protected]>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3bd7a6d..265c0db 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -65,8 +65,8 @@ SCHED_FEAT(RT_PUSH_IPI, true)
#endif
SCHED_FEAT(FORCE_SD_OVERLAP, false)
-SCHED_FEAT(RT_RUNTIME_SHARE, true)
-SCHED_FEAT(RT_RUNTIME_GREED, false)
+SCHED_FEAT(RT_RUNTIME_SHARE, false)
+SCHED_FEAT(RT_RUNTIME_GREED, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)
Hi Daniel,
On 07/11/16 14:51, Daniel Bristot de Oliveira wrote:
> On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote:
[...]
> > -) only issue might be that, if a non-RT task wakes up after the
> > unthrottle, it will have to wait, but worst-case it will have a chance
> > in the next throttling window
>
> In the current default behavior (RT_RUNTIME_SHARING), in a domain with
> more than two CPUs, the worst case easily become "infinity," because a
> CPU can borrow runtime from another CPU. There is no guarantee for
> minimum latency for non-rt tasks. Anyway, if the user wants to provide
> such guarantee, they just need not enable this feature, while disabling
> RT_RUNTIME_SHARING (or run the non-rt task as a deadline task ;-))
>
I could only skim through the patch, so please forgive me if I'm talking
gibberish, but I think what Tommaso is saying is that with your current
approach if an unlucky OTHER task wakes up just after you unthrottled an
rt_rq (by replenishment), it will have to wait until the next throttling
event. I agree that this is still better than current status, and that
you can still configure the system to avoid this from happening. What
I'm wondering though is if we could modify your implementation and only
do the replenishment when the replenishment timer actually fires, but
let RT tasks continue to run, while their rt_rq is throttled, if no
OTHER task is present, or wakes up. I guess this will complicate things,
and maybe doesn't buy us much, just an idea. :)
Otherwise, the patch looks good and useful to me.
Best,
- Juri
No, none of this stands a chance of being accepted.
This is making bad code worse.
On Tue, 8 Nov 2016 12:59:58 +0100
Peter Zijlstra <[email protected]> wrote:
> No, none of this stands a chance of being accepted.
>
> This is making bad code worse.
Peter,
Instead of a flat out rejection, can you please provide some
constructive criticism to let those that are working on this know what
would be accepted? And what their next steps should be.
There's obviously a problem with the current code, what steps do you
recommend to fix it?
-- Steve
On Tue, Nov 08, 2016 at 09:07:40AM -0500, Steven Rostedt wrote:
> On Tue, 8 Nov 2016 12:59:58 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > No, none of this stands a chance of being accepted.
> >
> > This is making bad code worse.
>
> Peter,
>
> Instead of a flat out rejection, can you please provide some
> constructive criticism to let those that are working on this know what
> would be accepted? And what their next steps should be.
>
> There's obviously a problem with the current code, what steps do you
> recommend to fix it?
You really should already know this.
As stands the current rt cgroup code (and all this throttling code) is a
giant mess (as in, its not actually correct from a RT pov). We should
not make it worse by adding random hacks to it.
The right way to to about doing this is by replacing it with something
better; like the proposed DL server for FIFO tasks -- which is entirely
non-trivial as well, see existing discussion on that.
I'm not entirely sure what this patch was supposed to fix, but it could
be running CFS tasks with higher priority than RT for a window, instead
of throttling RT tasks. This seems fairly ill specified, but something
like that could easily done with an explicit or slack time DL server for
CFS tasks.
On Tue, 8 Nov 2016 17:51:33 +0100
Peter Zijlstra <[email protected]> wrote:
> You really should already know this.
I know what we want to do, but there's some momentous problems that
need to be solved first. Until then, we may be forced to continue with
hacks.
>
> As stands the current rt cgroup code (and all this throttling code) is a
> giant mess (as in, its not actually correct from a RT pov). We should
> not make it worse by adding random hacks to it.
>
> The right way to to about doing this is by replacing it with something
> better; like the proposed DL server for FIFO tasks -- which is entirely
> non-trivial as well, see existing discussion on that.
Right. The biggest issue that I see is how to assign affinities to
FIFO tasks and use a DL server to keep them from starving other tasks?
>
> I'm not entirely sure what this patch was supposed to fix, but it could
> be running CFS tasks with higher priority than RT for a window, instead
I'm a bit confused with the above sentence. Do you mean that this patch
causes CFS tasks to run for a period with a higher priority than RT?
Well, currently we have the both CFS tasks and the "idle" task run
higher than RT, but this patch changes that to be just CFS tasks.
> of throttling RT tasks. This seems fairly ill specified, but something
> like that could easily done with an explicit or slack time DL server for
> CFS tasks.
If we can have a DL scheduler that can handle arbitrary affinities,
then all could be solved with that.
-- Steve
On Tue, Nov 08, 2016 at 12:17:10PM -0500, Steven Rostedt wrote:
> On Tue, 8 Nov 2016 17:51:33 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > You really should already know this.
>
> I know what we want to do, but there's some momentous problems that
> need to be solved first.
Like what?
> Until then, we may be forced to continue with
> hacks.
Well, the more ill specified hacks we put in, the harder if will be to
replace because people will end up depending on it.
On 11/08/2016 07:05 PM, Peter Zijlstra wrote:
>> >
>> > I know what we want to do, but there's some momentous problems that
>> > need to be solved first.
> Like what?
The problem is that using RT_RUNTIME_SHARE a CPU will almost always
borrow enough runtime to make a CPU intensive rt task to run forever...
well not forever, but until the system crash because a kworker starved
in this CPU. Kworkers are sched fair by design and users do not always
have a way to avoid them in an isolated CPU, for example.
The user then can disable RT_RUNTIME_SHARE, but then the user will have
the CPU going idle for (period - runtime) at each period... throwing CPU
time in the trash.
>> > Until then, we may be forced to continue with
>> > hacks.
> Well, the more ill specified hacks we put in, the harder if will be to
> replace because people will end up depending on it.
The proposed patch seems to be the expected behavior for users/rt
throttling - they want a safeguard for fair tasks while allowing -rt
tasks to run as much as possible.
I see (and completely agree) that a DL server for fair/rt task would be
the best way to deal with this problem, but it will take some time until
such solution :-(. We even discussed this at Retis today, but yeah, it
will take sometime even in the best case.
(thinking aloud... a DL Server would react like the proposed patch, in
the sense that it would not be activated without tasks to run and would
return the CPU for other tasks if the tasks inside the server finish
their job before the end of the DL server runtime...)
-- Daniel
On Tue, Nov 08, 2016 at 08:29:49PM +0100, Daniel Bristot de Oliveira wrote:
>
>
> On 11/08/2016 07:05 PM, Peter Zijlstra wrote:
> >> >
> >> > I know what we want to do, but there's some momentous problems that
> >> > need to be solved first.
> > Like what?
>
> The problem is that using RT_RUNTIME_SHARE a CPU will almost always
> borrow enough runtime to make a CPU intensive rt task to run forever...
> well not forever, but until the system crash because a kworker starved
> in this CPU. Kworkers are sched fair by design and users do not always
> have a way to avoid them in an isolated CPU, for example.
>
> The user then can disable RT_RUNTIME_SHARE, but then the user will have
> the CPU going idle for (period - runtime) at each period... throwing CPU
> time in the trash.
So why is this a problem? You really should not be running that much
FIFO tasks to begin with.
So I'm willing to take out (or at least default disable
RT_RUNTIME_SHARE). But other than this, this never really worked to
begin with. So it cannot be a regression. And we've lived this long with
the 'problem'.
And that means this is a 'feature' and that means I say no.
We really should be doing the right thing here, not make a bigger mess.
On Mon, 7 Nov 2016, Steven Rostedt wrote:
> On Mon, 7 Nov 2016 21:06:50 +0100
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
> > The throttling allowed the kworker to run, but once the kworker went to
> > sleep, the RT tasks started to work again. In the previous behavior,
> > the system would either go idle, or the kworker would starve because
> > the runtime become infinity for RR tasks.
>
> I'm confused? Are you saying that RR tasks don't get throttled in the
> current code? That sounds like a bug to me.
Good. Thats what I wanted to hear after all these justifications that the
system is just behaving as designed.
On 11/08/2016 08:50 PM, Peter Zijlstra wrote:
>> The problem is that using RT_RUNTIME_SHARE a CPU will almost always
>> > borrow enough runtime to make a CPU intensive rt task to run forever...
>> > well not forever, but until the system crash because a kworker starved
>> > in this CPU. Kworkers are sched fair by design and users do not always
>> > have a way to avoid them in an isolated CPU, for example.
>> >
>> > The user then can disable RT_RUNTIME_SHARE, but then the user will have
>> > the CPU going idle for (period - runtime) at each period... throwing CPU
>> > time in the trash.
> So why is this a problem? You really should not be running that much
> FIFO tasks to begin with.
I agree that a spinning real-time task is not a good practice, but there
are people using it and they have their own reasons/metrics/evaluations
motivating them.
> So I'm willing to take out (or at least default disable
> RT_RUNTIME_SHARE). But other than this, this never really worked to
> begin with. So it cannot be a regression. And we've lived this long with
> the 'problem'.
I agree! It would work perfectly in the absence of tasks pinned to a
processor, but that is not the case.
Trying to attend the users that want as much CPU time for -rt tasks as
possible, the proposed patch seems to be a better solution when compared
to RT_RUNTIME_SHARE, and it is way simpler! Even though it is not as
perfect as a DL Server would be in the future, it seems to be useful
until there...
> We really should be doing the right thing here, not make a bigger mess.
I see, agree and I am anxious to have it! :-). Tommaso and I discussed
about DL servers implementing such rt throttling. The more complicated
point so far (as Rostedt pointed on another e-mail) will be to have DL
servers with arbitrary affinity, or serving task with arbitrary
affinity. For example, one DL server pinned to each CPU providing
bandwidth for fair tasks to run for (rt_period - rt_runtime)us at each
rt_period... it will take sometime until someone propose it.
-- Daniel
On Thu, 10 Nov 2016, Daniel Vacek wrote:
> I believe Daniel's patches are the best thing we can do in current
> situation as the behavior now seems rather buggy and does not provide above
> mentioned expectations set when rt throttling was merged with default
> budget of 95% of CPU time. Nor if you configure so that it does (by
> disabling RT_RUNTIME_SHARE), it also forces CPUs to go idle needlessly when
> there still can be rt task running not really starving anyone. At least
> till a proper rework of rt scheduling with DL Server is implemented.
This looks like a fix for a bug and the company I work for is suffering
as a result. Could we please merge that ASAP?
On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> On Thu, 10 Nov 2016, Daniel Vacek wrote:
>
> > I believe Daniel's patches are the best thing we can do in current
> > situation as the behavior now seems rather buggy and does not provide above
> > mentioned expectations set when rt throttling was merged with default
> > budget of 95% of CPU time. Nor if you configure so that it does (by
> > disabling RT_RUNTIME_SHARE), it also forces CPUs to go idle needlessly when
> > there still can be rt task running not really starving anyone. At least
> > till a proper rework of rt scheduling with DL Server is implemented.
>
> This looks like a fix for a bug and the company I work for is suffering
> as a result. Could we please merge that ASAP?
>
What bug? And no, the patch has as many technical issues as it has
conceptual ones.
On Fri, 11 Nov 2016, Peter Zijlstra wrote:
> On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> > On Thu, 10 Nov 2016, Daniel Vacek wrote:
> >
> > > I believe Daniel's patches are the best thing we can do in current
> > > situation as the behavior now seems rather buggy and does not provide above
> > > mentioned expectations set when rt throttling was merged with default
> > > budget of 95% of CPU time. Nor if you configure so that it does (by
> > > disabling RT_RUNTIME_SHARE), it also forces CPUs to go idle needlessly when
> > > there still can be rt task running not really starving anyone. At least
> > > till a proper rework of rt scheduling with DL Server is implemented.
> >
> > This looks like a fix for a bug and the company I work for is suffering
> > as a result. Could we please merge that ASAP?
> >
>
> What bug? And no, the patch has as many technical issues as it has
> conceptual ones.
There is a deadlock, Peter!!!
On Sun, Nov 13, 2016 at 12:53:28PM -0600, Christoph Lameter wrote:
> On Fri, 11 Nov 2016, Peter Zijlstra wrote:
>
> > On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> > > On Thu, 10 Nov 2016, Daniel Vacek wrote:
> > >
> > > > I believe Daniel's patches are the best thing we can do in current
> > > > situation as the behavior now seems rather buggy and does not provide above
> > > > mentioned expectations set when rt throttling was merged with default
> > > > budget of 95% of CPU time. Nor if you configure so that it does (by
> > > > disabling RT_RUNTIME_SHARE), it also forces CPUs to go idle needlessly when
> > > > there still can be rt task running not really starving anyone. At least
> > > > till a proper rework of rt scheduling with DL Server is implemented.
> > >
> > > This looks like a fix for a bug and the company I work for is suffering
> > > as a result. Could we please merge that ASAP?
> > >
> >
> > What bug? And no, the patch has as many technical issues as it has
> > conceptual ones.
>
> There is a deadlock, Peter!!!
Describe please? Also, have you tried disabling RT_RUNTIME_SHARE ?