Deadline tasks behave differently from other tasks because they also
depend on their period, deadline and runtime.
Hence, the well-known sched:sched_wakeup and sched:sched_switch
tracepoints are not always enough to precisely explain the behavior of a
deadline task with respect to the task's period, deadline and runtime
consumption.
This patch series implements tracepoints for the following deadline
scheduler points of interest:
- sched:sched_deadline_replenish: Informs the periodic runtime
replenishment of a deadline task.
- sched:sched_deadline_yield: Informs that a deadline task called
sched_yield(), and will wait for the next period.
- sched:sched_deadline_throttle: Informs that a task consumed all its
available runtime and was throttled.
- sched:sched_deadline_block: Informs that a deadline task went to sleep
waiting to be awakened by another task.
Changes since v1:
Cleanup in the sched:sched_deadline_yield tracepoint
Fix compilation warning on 32-bit Intel
Daniel Bristot de Oliveira (2):
sched: Move deadline container_of() helper functions into sched.h
sched/deadline: Tracepoints for deadline scheduler
Steven Rostedt (Red Hat) (1):
tracing: Add __print_ns_to_secs() and __print_ns_without_secs()
helpers
include/trace/events/sched.h | 88 ++++++++++++++++++++++++++++++++++++++++++++
include/trace/trace_events.h | 25 +++++++++++++
kernel/sched/deadline.c | 26 ++++---------
kernel/sched/sched.h | 18 +++++++++
4 files changed, 139 insertions(+), 18 deletions(-)
--
2.5.0
Move dl_task_of(), dl_rq_of_se() and rq_of_dl_rq() helper functions
from kernel/sched/deadline.c to kernel/sched/sched.h, so they
can be used in other scheduler files.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index affd97e..f6bfb0a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -20,24 +20,6 @@
struct dl_bandwidth def_dl_bandwidth;
-static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
-{
- return container_of(dl_se, struct task_struct, dl);
-}
-
-static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
-{
- return container_of(dl_rq, struct rq, dl);
-}
-
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
-{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = task_rq(p);
-
- return &rq->dl;
-}
-
static inline int on_dl_rq(struct sched_dl_entity *dl_se)
{
return !RB_EMPTY_NODE(&dl_se->rb_node);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d2..8671291 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,6 +750,24 @@ static inline void rq_clock_skip_update(struct rq *rq, bool skip)
rq->clock_skip_update &= ~RQCF_REQ_SKIP;
}
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+ return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+ return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+ struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->dl;
+}
+
#ifdef CONFIG_NUMA
enum numa_topology_type {
NUMA_DIRECT,
--
2.5.0
Deadline tasks behave differently from other tasks because they also
depend on their period, deadline and runtime.
Hence, the well-known sched:sched_wakeup and sched:sched_switch
tracepoints are not always enough to precisely explain the behavior of a
deadline task with respect to the task's period, deadline, and runtime
consumption and replenishment.
In order to provide more precise information about the scheduling of
deadline tasks, this patch implements the following tracepoints:
- sched:sched_deadline_replenish: Informs the runtime replenishment of
a deadline task. Trace example:
<idle>-0 [010] d.h. 268.428878: sched_deadline_replenish: \
comm=y pid=1608 now=268.428876113 deadline=268.458863627 \
runtime=20000000
The task y-1608 was replenished with 20000000 ns at 268.428876113
to be used until the deadline at 268.458863627.
- sched:sched_deadline_yield: Informs that a deadline task called
sched_yield(), and will wait for the next period. Trace example:
y-1608 [010] d... 268.428892: sched_deadline_yield: \
now=268.428891932 deadline=268.458863627 \
remaining_runtime=19995653
The task y-1608 yielded before its deadline, with 19995653 ns of
remaining runtime.
- sched:sched_deadline_throttle: Informs that a task consumed all its
available runtime and was throttled. Trace example:
t-1613 [011] d.h. 280.419823: sched_deadline_throttle: \
now=280.419823282 deadline=280.430683206 \
remaining_runtime=-13280
The task t-1613 overran its runtime by 13280 ns and was throttled.
- sched:sched_deadline_block: Informs that a deadline task went to sleep
waiting to be awakened by another task. Trace example:
b-1611 [011] d... 276.228298: sched_deadline_block: \
now=276.228295889 deadline=276.258262555
remaining_runtime=19966666
The task b-1611 blocked waiting for an external event. Its deadline is at
276.258262555, and it still has 19966666 ns of remaining runtime in the
current period.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..8696c1a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,94 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu)
);
+
+/*
+ * Tracepoint for runtime replenishment of deadline tasks:
+ */
+TRACE_EVENT(sched_deadline_replenish,
+
+ TP_PROTO(struct sched_dl_entity *dl_se),
+
+ TP_ARGS(dl_se),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( u64, now )
+ __field( u64, deadline )
+ __field( s64, runtime )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, dl_task_of(dl_se)->comm, TASK_COMM_LEN);
+ __entry->now = rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
+ __entry->pid = dl_task_of(dl_se)->pid;
+ __entry->deadline = dl_se->deadline;
+ __entry->runtime = dl_se->runtime;
+ ),
+
+ TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
+ __entry->comm,
+ __entry->pid,
+ __print_ns_to_secs(__entry->now),
+ __print_ns_without_secs(__entry->now),
+ __print_ns_to_secs(__entry->deadline),
+ __print_ns_without_secs(__entry->deadline),
+ __entry->runtime)
+);
+
+DECLARE_EVENT_CLASS(sched_deadline_template,
+
+ TP_PROTO(struct sched_dl_entity *dl_se),
+
+ TP_ARGS(dl_se),
+
+ TP_STRUCT__entry(
+ __field( u64, now )
+ __field( u64, deadline )
+ __field( s64, runtime )
+ ),
+
+ TP_fast_assign(
+ __entry->now = rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
+ __entry->deadline = dl_se->deadline;
+ __entry->runtime = dl_se->runtime;
+ ),
+
+ TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
+ __print_ns_to_secs(__entry->now),
+ __print_ns_without_secs(__entry->now),
+ __print_ns_to_secs(__entry->deadline),
+ __print_ns_without_secs(__entry->deadline),
+ __entry->runtime)
+);
+
+/*
+ * Tracepoint for sched_yield() of a deadline task (the task
+ * went to sleep waiting for the next period).
+ */
+DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_yield,
+ TP_PROTO(struct sched_dl_entity *dl_se),
+ TP_ARGS(dl_se),
+ TP_CONDITION(dl_se->dl_yielded));
+
+/*
+ * Tracepoint for throttling of a deadline task that consumed all its
+ * runtime.
+ */
+DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_throttle,
+ TP_PROTO(struct sched_dl_entity *dl_se),
+ TP_ARGS(dl_se),
+ TP_CONDITION(!dl_se->dl_yielded));
+
+/*
+ * Tracepoint for blocking of a deadline task. The deadline task was
+ * dequeued, but neither by being throttled nor yielding.
+ */
+DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_block,
+ TP_PROTO(struct sched_dl_entity *dl_se),
+ TP_ARGS(dl_se),
+ TP_CONDITION(!dl_se->dl_yielded && !dl_se->dl_throttled));
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f6bfb0a..e8551fc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -17,6 +17,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <trace/events/sched.h>
struct dl_bandwidth def_dl_bandwidth;
@@ -351,6 +352,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
*/
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
dl_se->runtime = pi_se->dl_runtime;
+ trace_sched_deadline_replenish(dl_se);
}
/*
@@ -417,6 +419,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
dl_se->runtime = pi_se->dl_runtime;
}
+ trace_sched_deadline_replenish(dl_se);
+
if (dl_se->dl_yielded)
dl_se->dl_yielded = 0;
if (dl_se->dl_throttled)
@@ -496,6 +500,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
dl_se->runtime = pi_se->dl_runtime;
+ trace_sched_deadline_replenish(dl_se);
}
}
@@ -733,7 +738,9 @@ static void update_curr_dl(struct rq *rq)
throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
+ trace_sched_deadline_yield(&rq->curr->dl);
dl_se->dl_throttled = 1;
+ trace_sched_deadline_throttle(dl_se);
__dequeue_task_dl(rq, curr, 0);
if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -910,6 +917,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
{
__dequeue_dl_entity(dl_se);
+ trace_sched_deadline_block(dl_se);
}
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
--
2.5.0
From: "Steven Rostedt (Red Hat)" <[email protected]>
To have nanosecond output displayed in a more human readable format, it is
nicer to convert it to a seconds format (XXX.YYYYYYYYY). The problem is that
to do so, the numbers must be divided by NSEC_PER_SEC, with the remainder
taken as well. But as these numbers are 64 bit, this cannot be done simply
with the '/' and '%' operators, but must use do_div() instead.
Instead of performing the expensive do_div() in the hot path of the
tracepoint, it is more efficient to perform it during the output phase. But
passing in do_div() can confuse the parser, and do_div() doesn't work
exactly like a normal C function. It modifies the number in place, and we
don't want to modify the actual values in the ring buffer.
Two helper functions are now created:
__print_ns_to_secs() and __print_ns_without_secs()
They both take a value of nanoseconds, and the former will return that
number divided by NSEC_PER_SEC, and the latter will mod it with NSEC_PER_SEC
giving a way to print a nice human readable format:
__print_fmt("time=%llu.%09u",
__print_ns_to_secs(REC->nsec_val),
__print_ns_without_secs(REC->nsec_val))
Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 170c93b..e9c3f93 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -283,6 +283,21 @@ TRACE_MAKE_SYSTEM_STR();
trace_print_symbols_seq(p, value, symbols); \
})
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(value) \
+ ({ \
+ u64 ____val = (u64)value; \
+ do_div(____val, NSEC_PER_SEC); \
+ ____val; \
+ })
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(value) \
+ ({ \
+ u64 ____val = (u64)value; \
+ (u32) do_div(____val, NSEC_PER_SEC); \
+ })
+
#undef __print_symbolic_u64
#if BITS_PER_LONG == 32
#define __print_symbolic_u64(value, symbol_array...) \
@@ -720,6 +735,16 @@ static inline void ftrace_test_probe_##call(void) \
#undef __get_bitmask
#undef __print_array
+/*
+ * The below is not executed in the kernel. It is only what is
+ * displayed in the print format for userspace to parse.
+ */
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(val) val / 1000000000UL
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(val) val % 1000000000UL
+
#undef TP_printk
#define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)
--
2.5.0
On Mon, Mar 28, 2016 at 01:50:51PM -0300, Daniel Bristot de Oliveira wrote:
> @@ -733,7 +738,9 @@ static void update_curr_dl(struct rq *rq)
>
> throttle:
> if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> + trace_sched_deadline_yield(&rq->curr->dl);
> dl_se->dl_throttled = 1;
> + trace_sched_deadline_throttle(dl_se);
This is just really very sad.
> __dequeue_task_dl(rq, curr, 0);
> if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> @@ -910,6 +917,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
> {
> __dequeue_dl_entity(dl_se);
> + trace_sched_deadline_block(dl_se);
> }
And that's just not going to happen.
On Tue, 29 Mar 2016 17:16:49 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 28, 2016 at 01:50:51PM -0300, Daniel Bristot de Oliveira wrote:
> > @@ -733,7 +738,9 @@ static void update_curr_dl(struct rq *rq)
> >
> > throttle:
> > if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > + trace_sched_deadline_yield(&rq->curr->dl);
> > dl_se->dl_throttled = 1;
> > + trace_sched_deadline_throttle(dl_se);
>
> This is just really very sad.
I agree. This should be a single tracepoint here. Especially since it
seems that dl_se == &rq->curr->dl :-)
But perhaps we should add that generic sys_yield() tracepoint, to be
able to see that the task was throttled because of a yield call.
We still want to see a task yield, and then throttle because of it. The
deadline/runtime should reflect the information correctly.
>
> > __dequeue_task_dl(rq, curr, 0);
> > if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> > enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> > @@ -910,6 +917,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> > static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
> > {
> > __dequeue_dl_entity(dl_se);
> > + trace_sched_deadline_block(dl_se);
> > }
>
> And that's just not going to happen.
Sure, we'll probably want to figure out a better way to see deadline
tasks blocked. Probably can see that from sched switch though, as it
would be in the blocked state as it scheduled out.
Hmm, I probably could add tracing infrastructure that would let us
extend existing tracepoints. That is, without modifying sched_switch,
we could add a new tracepoint that when enabled, would attach itself to
the sched_switch tracepoint and record different information. Like a
special sched_switch_deadline tracepoint, that would record the existing
runtime, deadline and period for deadline tasks. It won't add more
tracepoints into the core scheduler, but use the existing one.
Maybe something to play with while I'm on the flight to San Diego or
Portland.
-- Steve
On Tue, Mar 29, 2016 at 11:57:00AM -0400, Steven Rostedt wrote:
> Hmm, I probably could add tracing infrastructure that would let us
> extend existing tracepoints. That is, without modifying sched_switch,
> we could add a new tracepoint that when enabled, would attach itself to
> the sched_switch tracepoint and record different information. Like a
> special sched_switch_deadline tracepoint, that would record the existing
> runtime, deadline and period for deadline tasks. It won't add more
> tracepoints into the core scheduler, but use the existing one.
Urgh; maybe. But I would not want the new thing to be called
_deadline, maybe _v{n} or anything, and have a KERN_WARNING emitted when
people enable the old one.
Ideally we'd rename the old one, but I suspect even that would break
stuff :/
On 03/29/2016 12:16 PM, Peter Zijlstra wrote:
>> + trace_sched_deadline_yield(&rq->curr->dl);
ouch, it should be trace_sched_deadline_yield(dl_se). It works
as is, but it is really very sad, my bad, sorry.
>> > dl_se->dl_throttled = 1;
>> > + trace_sched_deadline_throttle(dl_se);
> This is just really very sad.
Am I missing any other really very sad thing here?
>> > __dequeue_task_dl(rq, curr, 0);
>> > if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
>> > enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>> > @@ -910,6 +917,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
>> > static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
>> > {
>> > __dequeue_dl_entity(dl_se);
>> > + trace_sched_deadline_block(dl_se);
>> > }
> And that's just not going to happen.
It will, if a task goes to sleep during the activation,
e.g., when blocking on a system call. For example:
<idle>-0 [007] d..3 78377.688969: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=18973 next_prio=-1
b-18973 [007] d..3 78377.688979: sched_deadline_block: now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
b-18973 [007] d..3 78377.688981: sched_switch: prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
On Tue, 29 Mar 2016 18:04:01 +0200
Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 29, 2016 at 11:57:00AM -0400, Steven Rostedt wrote:
> > Hmm, I probably could add tracing infrastructure that would let us
> > extend existing tracepoints. That is, without modifying sched_switch,
> > we could add a new tracepoint that when enabled, would attach itself to
> > the sched_switch tracepoint and record different information. Like a
> > special sched_switch_deadline tracepoint, that would record the existing
> > runtime, deadline and period for deadline tasks. It won't add more
> > tracepoints into the core scheduler, but use the existing one.
>
> Urgh; maybe. But I would not want the new thing to be called
> _deadline, maybe _v{n} or anything, and have a KERN_WARNING emitted when
> people enable the old one.
I wasn't thinking of having a new sched switch, I was thinking of
having multiple ones. And not versions, as the one for a deadline task
wouldn't be applicable for a non deadline task. But regardless, I'm
also thinking of something else.
>
> Ideally we'd rename the old one, but I suspect even that would break
> stuff :/
Yes, we don't want to get rid of the old one. But it shouldn't break
anything if we extend it. I'm thinking of extending it with a dynamic
array to store the deadline task values (runtime, period). And for non
deadline tasks, the array would be empty (size zero). I think that
could be doable and maintain backward compatibility.
-- Steve
On Tue, 29 Mar 2016 13:10:34 -0300
Daniel Bristot de Oliveira <[email protected]> wrote:
> On 03/29/2016 12:16 PM, Peter Zijlstra wrote:
> >> + trace_sched_deadline_yield(&rq->curr->dl);
>
> ouch, it should be trace_sched_deadline_yield(dl_se). It works
> as is, but it is really very sad, my bad, sorry.
>
> >> > dl_se->dl_throttled = 1;
> >> > + trace_sched_deadline_throttle(dl_se);
> > This is just really very sad.
>
> Am I missing any other really very sad thing here?
Well, we shouldn't have two tracepoints back to back.
>
> >> > __dequeue_task_dl(rq, curr, 0);
> >> > if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> >> > enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> >> > @@ -910,6 +917,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> >> > static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
> >> > {
> >> > __dequeue_dl_entity(dl_se);
> >> > + trace_sched_deadline_block(dl_se);
> >> > }
> > And that's just not going to happen.
>
> It will, if a task goes to sleep during the activation,
> e.g., when blocking on a system call. For example:
>
> <idle>-0 [007] d..3 78377.688969: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=18973 next_prio=-1
> b-18973 [007] d..3 78377.688979: sched_deadline_block: now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
> b-18973 [007] d..3 78377.688981: sched_switch: prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
Why did it go to sleep? The above is still not very useful. What do you
mean "blocking on a system call"?
-- Steve
On 03/29/2016 12:57 PM, Steven Rostedt wrote:
> Peter Zijlstra <[email protected]> wrote:
>
>> > On Mon, Mar 28, 2016 at 01:50:51PM -0300, Daniel Bristot de Oliveira wrote:
>>> > > @@ -733,7 +738,9 @@ static void update_curr_dl(struct rq *rq)
>>> > >
>>> > > throttle:
>>> > > if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
>>> > > + trace_sched_deadline_yield(&rq->curr->dl);
>>> > > dl_se->dl_throttled = 1;
>>> > > + trace_sched_deadline_throttle(dl_se);
>> >
>> > This is just really very sad.
> I agree. This should be a single tracepoint here. Especially since it
> seems that dl_se == &rq->curr->dl :-)
>
> But perhaps we should add that generic sys_yield() tracepoint, to be
> able to see that the task was throttled because of a yield call.
>
> We still want to see a task yield, and then throttle because of it. The
> deadline/runtime should reflect the information correctly.
The above tracepoints are conditional: if dl_se->dl_yielded, only the
yield tracepoint will happen; if !dl_se->dl_yielded, only the throttle
tracepoint will happen.
We can try to join the sched_deadline_(yield|throttle|block) into a single
tracepoint, but IMHO having them separated is more intuitive for users.
> Sure, we'll probably want to figure out a better way to see deadline
> tasks blocked. Probably can see that from sched switch though, as it
> would be in the blocked state as it scheduled out.
We can guess that from sched_switch, but it currently does not show
deadline-specific information (deadline, runtime, and now, the clock used
by the scheduler), which is relevant in the analysis of deadline tasks.
>
> Hmm, I probably could add tracing infrastructure that would let us
> extend existing tracepoints. That is, without modifying sched_switch,
> we could add a new tracepoint that when enabled, would attach itself to
> the sched_switch tracepoint and record different information. Like a
> special sched_switch_deadline tracepoint, that would record the existing
> runtime, deadline and period for deadline tasks. It won't add more
> tracepoints into the core scheduler, but use the existing one.
You can display the joined version of the sched deadline
(yield|throttle|block) tracepoint, but IMHO this will just make things
more complex than they need to be, and will possibly add overhead to the
sched_switch tracepoint, which is used more by non-deadline users than by
deadline users.
Moreover, deadline users will probably want to see only the deadline
data, and having to use complex enabling/filtering options is really
not intuitive for users (who are not kernel developers).
On Tue, 29 Mar 2016 14:37:18 -0300
Daniel Bristot de Oliveira <[email protected]> wrote:
> The above tracepoints are conditional, if dl_se->dl_yielded, only the
> yield tracepoint will happen. If !dl_se->dl_yielded, only the throttle
> tracepoint will happen.
Yes, I know they are conditional, but that doesn't change the fact
that there are two tracepoints back to back, which should not happen.
>
> We can try to join the sched_deadline_(yield|throttle|block) on a single
> tracepoint, but IMHO having them separated is more intuitive for users.
A userspace tool can make it more intuitive for users.
Note, tracepoints are very low overhead when not enabled. But they are
not zero overhead. That means, two tracepoints just doubled that small
overhead. Each one also adds to the I$.
>
> > Sure, we'll probably want to figure out a better way to see deadline
> > tasks blocked. Probably can see that from sched switch though, as it
> > would be in the blocked state as it scheduled out.
>
> We can guess that in the sched switch, but it currently does not show
> deadline specific information (deadline, runtime, now (in the timer used
> by the scheduler), and they are relevant in the analysis of deadline tasks.
But as I said in another email, it can be extended dynamically.
>
> >
> > Hmm, I probably could add tracing infrastructure that would let us
> > extend existing tracepoints. That is, without modifying sched_switch,
> > we could add a new tracepoint that when enabled, would attach itself to
> > the sched_switch tracepoint and record different information. Like a
> > special sched_switch_deadline tracepoint, that would record the existing
> > runtime, deadline and period for deadline tasks. It won't add more
> > tracepoints into the core scheduler, but use the existing one.
>
> You can display the joined version of sched deadline
> (yield|throttle|block) tracepoint, but IMHO this will just turn things
> more complex than they need to be, and will possibly add overhead to the
> sched_switch tracepoint, that is more used by non-deadline users than by
> deadline users.
Adding to the sched_switch tracepoint isn't as big a deal as slowing
down non-tracing. All new overhead must be focused into the
tracing-enabled case. That is, if we slow down tracing by 100% where we
could have an alternative that slows down non-tracing by .01%, we still
choose the slowdown of tracing.
>
> Moreover, deadline users will probably want to see only the deadline
> data, and have to use complex enabling/filtering options is really
> really not intuitive for users (that are not kernel developers).
That's what userspace filtering tools are for ;-)
-- Steve
On 03/29/2016 02:13 PM, Steven Rostedt wrote:
>> <idle>-0 [007] d..3 78377.688969: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=18973 next_prio=-1
>> > b-18973 [007] d..3 78377.688979: sched_deadline_block: now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
>> > b-18973 [007] d..3 78377.688981: sched_switch: prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> Why did it go to sleep? The above is still not very useful. What do you
> mean "blocking on a system call"?
A task can go to sleep in a blocking system call, like waiting for
a network packet, or any other external event.
The "block state" is a possible state of a task running under the deadline
scheduler. It means that the task voluntarily left the processor, not
by calling sched_yield(), but by blocking (or sleeping) while waiting
for another event.
This state is described in the Figure 2 of the article "Deadline
scheduling in the Linux kernel", available at:
http://onlinelibrary.wiley.com/doi/10.1002/spe.2335/abstract
The block state affects the replenishment of the task, and that is why
it is different from both yield and throttle. If the task blocks and is
awakened prior to the deadline, the task's runtime will not be
replenished; if it is awakened after the deadline, it will be. For example:
Blocking, and then waking up after the deadline:
b-5152 [007] d..3 3983.376428: sched_deadline_replenish: comm=b pid=5152 now=3983.376425148 deadline=3983.406425148 runtime=10000000
b-5152 [007] d..3 3983.376515: sched_deadline_block: now=3983.376511101 deadline=3983.406425148 remaining_runtime=9909566
b-5152 [007] d..3 3983.376529: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
<idle>-0 [007] d.h4 3983.476592: sched_deadline_replenish: comm=b pid=5152 now=3983.476589573 deadline=3983.506589573 runtime=10000000
<idle>-0 [007] dNh4 3983.476596: sched_wakeup: comm=b pid=5152 prio=-1 target_cpu=007
<idle>-0 [007] d..3 3983.476648: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5152 next_prio=-1
b-5152 [007] d..3 3983.476660: sched_deadline_block: now=3983.476656613 deadline=3983.506589573 remaining_runtime=9932960
b-5152 [007] d..3 3983.476663: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
Blocking, and then waking up before the deadline:
b-5139 [007] d..3 3964.148290: sched_deadline_replenish: comm=b pid=5139 now=3964.148285227 deadline=3964.178285227 runtime=10000000
b-5139 [007] d..3 3964.148396: sched_deadline_block: now=3964.148385308 deadline=3964.178285227 remaining_runtime=9895243
b-5139 [007] d..3 3964.148400: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
<idle>-0 [007] dNh5 3964.148411: sched_wakeup: comm=b pid=5139 prio=-1 target_cpu=007
<idle>-0 [007] d..3 3964.148419: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5139 next_prio=-1
b-5139 [007] d..3 3964.148427: sched_deadline_block: now=3964.148426022 deadline=3964.178285227 remaining_runtime=9878164
b-5139 [007] d..3 3964.148429: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
-- Daniel
On Tue, 29 Mar 2016 16:12:38 -0300
Daniel Bristot de Oliveira <[email protected]> wrote:
> On 03/29/2016 02:13 PM, Steven Rostedt wrote:
> >> <idle>-0 [007] d..3 78377.688969: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=18973 next_prio=-1
> >> > b-18973 [007] d..3 78377.688979: sched_deadline_block: now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
> >> > b-18973 [007] d..3 78377.688981: sched_switch: prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> > Why did it go to sleep? The above is still not very useful. What do you
> > mean "blocking on a system call"?
>
> A task can go to sleep in a blocking system call, like waiting for
> a network packet, or any other external event.
Note, waiting for a network packet or some other external event is a
userspace call. A schedule out in 'S' state means exactly that. But
I hate the term "blocked" because that is more like waiting for
something else to finish (like blocked on a lock). In which case, if
that did happen, the state would be "D" not "S".
"S" is basically "sleeping" and it gets woken up by some other event. A
slight difference to the term "blocked".
>
> The "block state" is a possible state of a task running in the deadline
> scheduler. It means that a task voluntarily left the processor, not
> by calling sched_yield(), but by blocking (or sleeping) waiting another
> event.
>
> This state is described in the Figure 2 of the article "Deadline
> scheduling in the Linux kernel", available at:
> http://onlinelibrary.wiley.com/doi/10.1002/spe.2335/abstract
Bah, confusing terminology.
>
> The block state affects the replenishment of the task, and that is why
> it is different of both yeild and throttle. If the task blocks and is
> awakened prior to the deadline, the task's runtime will not be
> replenished. On the other hand it will. For example:
>
> Blocking, and then waking up after the deadline:
> b-5152 [007] d..3 3983.376428: sched_deadline_replenish: comm=b pid=5152 now=3983.376425148 deadline=3983.406425148 runtime=10000000
> b-5152 [007] d..3 3983.376515: sched_deadline_block: now=3983.376511101 deadline=3983.406425148 remaining_runtime=9909566
> b-5152 [007] d..3 3983.376529: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
>
> <idle>-0 [007] d.h4 3983.476592: sched_deadline_replenish: comm=b pid=5152 now=3983.476589573 deadline=3983.506589573 runtime=10000000
> <idle>-0 [007] dNh4 3983.476596: sched_wakeup: comm=b pid=5152 prio=-1 target_cpu=007
> <idle>-0 [007] d..3 3983.476648: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5152 next_prio=-1
> b-5152 [007] d..3 3983.476660: sched_deadline_block: now=3983.476656613 deadline=3983.506589573 remaining_runtime=9932960
> b-5152 [007] d..3 3983.476663: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
>
>
> Blocking, and then waking up before the deadline:
> b-5139 [007] d..3 3964.148290: sched_deadline_replenish: comm=b pid=5139 now=3964.148285227 deadline=3964.178285227 runtime=10000000
> b-5139 [007] d..3 3964.148396: sched_deadline_block: now=3964.148385308 deadline=3964.178285227 remaining_runtime=9895243
> b-5139 [007] d..3 3964.148400: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
>
> <idle>-0 [007] dNh5 3964.148411: sched_wakeup: comm=b pid=5139 prio=-1 target_cpu=007
> <idle>-0 [007] d..3 3964.148419: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5139 next_prio=-1
> b-5139 [007] d..3 3964.148427: sched_deadline_block: now=3964.148426022 deadline=3964.178285227 remaining_runtime=9878164
> b-5139 [007] d..3 3964.148429: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
>
I still fail to see the usefulness of the block tracepoint. I could
imagine that if we add the dynamic part of the sched_switch tracepoint
to include deadline and runtime, we would get the same information.
-- Steve
On Tue, Mar 29, 2016 at 01:10:56PM -0400, Steven Rostedt wrote:
> On Tue, 29 Mar 2016 18:04:01 +0200
> Peter Zijlstra <[email protected]> wrote:
> > Urgh; maybe. But I would not want the new thing to be called
> > _deadline, maybe _v{n} or anything, and have a KERN_WARNING emitted when
> > people enable the old one.
>
> I wasn't thinking of having a new sched switch, I was thinking of
> having multiple ones. And not versions, as the one for a deadline task
> wouldn't be applicable to a non-deadline task. But regardless, I'm
> also thinking of something else.
No, it should really stay one tracepoint, useful for all scheduling.
> > Ideally we'd rename the old one, but I suspect even that would break
> > stuff :/
>
> Yes, we don't want to get rid of the old one. But it shouldn't break
> anything if we extend it. I'm thinking of extending it with a dynamic
> array to store the deadline task values (runtime, period). And for non
> deadline tasks, the array would be empty (size zero). I think that
> could be doable and maintain backward compatibility.
Why the complexity? Why not just tack those 32 bytes on and get on with
life?
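As a rough illustration of the dynamic-array idea under discussion, a tracepoint carrying a variable-length deadline payload might be sketched as below. This is kernel-internal code that only builds inside a kernel tree; the event name and field layout are hypothetical, but __dynamic_array(), __get_dynamic_array() and __get_dynamic_array_len() are the existing TRACE_EVENT helpers:

```c
/* Hypothetical sketch: a sched event with a per-class payload.
 * __dynamic_array() records only 'len' entries in the ring buffer,
 * so non-deadline tasks pay for a zero-length array. */
TRACE_EVENT(sched_switch_dl,

	TP_PROTO(struct task_struct *p),

	TP_ARGS(p),

	TP_STRUCT__entry(
		__array(char, comm, TASK_COMM_LEN)
		__field(pid_t, pid)
		/* 0 entries for non-deadline tasks, 2 (runtime, period) otherwise */
		__dynamic_array(u64, dl_values, dl_task(p) ? 2 : 0)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid = p->pid;
		if (dl_task(p)) {
			u64 *v = __get_dynamic_array(dl_values);
			v[0] = p->dl.runtime;
			v[1] = p->dl.dl_period;
		}
	),

	TP_printk("comm=%s pid=%d dl_entries=%d",
		  __entry->comm, __entry->pid,
		  (int)(__get_dynamic_array_len(dl_values) / sizeof(u64)))
);
```

The point of the sketch is the size argument to __dynamic_array(): for the 99.999% non-deadline case the recorded event grows by only the dynamic-array header, not the full 32 bytes.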
On Tue, 29 Mar 2016 22:11:45 +0200
Peter Zijlstra <[email protected]> wrote:
> > Yes, we don't want to get rid of the old one. But it shouldn't break
> > anything if we extend it. I'm thinking of extending it with a dynamic
> > array to store the deadline task values (runtime, period). And for non
> > deadline tasks, the array would be empty (size zero). I think that
> > could be doable and maintain backward compatibility.
>
> Why the complexity? Why not just tack those 32 bytes on and get on with
> life?
32 bytes that are zero and meaningless for 99.999% of scheduling?
The scheduling tracepoint is probably the most common tracepoint used,
and one of the most frequent ones. 32 bytes of wasted space per event
can cause a lot of tracing to be missed.
-- Steve
On Tue, Mar 29, 2016 at 04:29:07PM -0400, Steven Rostedt wrote:
> On Tue, 29 Mar 2016 22:11:45 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>
> > > Yes, we don't want to get rid of the old one. But it shouldn't break
> > > anything if we extend it. I'm thinking of extending it with a dynamic
> > > array to store the deadline task values (runtime, period). And for non
> > > deadline tasks, the array would be empty (size zero). I think that
> > > could be doable and maintain backward compatibility.
> >
> > Why the complexity? Why not just tack those 32 bytes on and get on with
> > life?
>
> 32 bytes that are zero and meaningless for 99.999% of scheduling?
>
> The scheduling tracepoint is probably the most common tracepoint used,
> and one of the most frequent ones. 32 bytes of wasted space per event
> can cause a lot of tracing to be missed.
Typically you don't schedule _that_ often. Sure, if you run pipe-bench
and hit ~.5e6 ctx/s it's ~15M/s extra. But building a kernel gets me
~.5e3 ctx/s (per cpu), at which rate it's 15K/s extra.
But sure, if you want to make it fancy have at.
On 03/29/2016 05:29 PM, Steven Rostedt wrote:
>>> Yes, we don't want to get rid of the old one. But it shouldn't break
>>> anything if we extend it. I'm thinking of extending it with a dynamic
>>> array to store the deadline task values (runtime, period). And for non
>>> deadline tasks, the array would be empty (size zero). I think that
>>> could be doable and maintain backward compatibility.
>>
>> Why the complexity? Why not just tack those 32 bytes on and get on with
>> life?
> 32 bytes that are zero and meaningless for 99.999% of scheduling?
I agree. Not only because of the extra bytes, but also because of extra
information that is not useful for 99.999% of non-deadline users.
> The scheduling tracepoint is probably the most common tracepoint used,
> and one of the most frequent ones. 32 bytes of wasted space per event
> can cause a lot of tracing to be missed.
And any change on it, now and in the future, will cause confusion for
99.999% of raw sched_switch users. Without considering those who wrote
bad applications that will break, and those who wrote nice applications
and will probably have to keep many versions of their handlers for
backward compatibility with old kernels.
If it needs to be generic, I vote for a dynamic set of data, handled
"per-scheduler", as Steven mentioned before... (even though it sounds
contradictory)
-- Daniel
> And any change on it, now and in the future, will cause confusion for
> 99.999% of raw sched_switch users.
Sod that, this attitude makes me want to rip out all sched tracepoints.
> Without considering those who wrote bad applications that will break,
That's bonus points, right?
I'm seriously annoyed with this hard ABI for tracepoints crap.
On Tue, 29 Mar 2016 23:03:43 +0200
Peter Zijlstra <[email protected]> wrote:
> > And any change on it, now and in the future, will cause confusion for
> > 99.999% of raw sched_switch users.
>
> Sod that, this attitude makes me want to rip out all sched tracepoints.
>
> > Without considering those who wrote bad applications that will break,
>
> That's bonus points, right?
>
> I'm seriously annoyed with this hard ABI for tracepoints crap.
No need. Userspace tools that use tracepoints should be able to be
fixed. Linus has been a bit lenient with respect to tracepoint
breakage, if we can get tools updated before we change them. He's
mentioned that tracepoints are a bit special because they are so tied
to the internals of the kernel, and those tools that read them, should
be a bit tied to the kernel as well. But we need to make sure those
tools still work with updates. Thus we need to work with the tools that
might break.
That also means that if there's bad applications that will break, they
should be fixed.
With that said, I'm not concerned at all about users being inconvenienced
that tracepoint data isn't what they want to see, as long as they can
get the information out that they do need. A tool can always massage
the tracepoints into whatever nice format users expect.
Wasted space is a concern to me because that means lack of data, which
is what I don't want. Confusing data can be changed by userspace to be
less confusing. Missing data is gone and there's nothing userspace can
do about it.
-- Steve
Hi,
On 29/03/16 15:25, Steven Rostedt wrote:
> On Tue, 29 Mar 2016 16:12:38 -0300
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
> > On 03/29/2016 02:13 PM, Steven Rostedt wrote:
> > >> <idle>-0 [007] d..3 78377.688969: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=18973 next_prio=-1
> > >> b-18973 [007] d..3 78377.688979: sched_deadline_block: now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
> > >> b-18973 [007] d..3 78377.688981: sched_switch: prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> > > Why did it go to sleep? The above is still not very useful. What do you
> > > mean "blocking on a system call"?
> >
> > A task can go to sleep in a blocking system call, like waiting for
> > a network packet, or any other external event.
>
> Note, waiting for a network packet or some other external event is a
> userspace call. A schedule out in 'S' state means exactly that. But
> I hate the term "blocked" because that is more like waiting for
> something else to finish (like blocked on a lock). In which case, if
> that did happen, the state would be "D" not "S".
>
> "S" is basically "sleeping" and it gets woken up by some other event. A
> slight difference to the term "blocked".
>
> >
> > The "block state" is a possible state of a task running in the deadline
> > scheduler. It means that a task voluntarily left the processor, not
> > by calling sched_yield(), but by blocking (or sleeping) while waiting
> > for another event.
> >
> > This state is described in Figure 2 of the article "Deadline
> > scheduling in the Linux kernel", available at:
> > http://onlinelibrary.wiley.com/doi/10.1002/spe.2335/abstract
>
> Bah, confusing terminology.
>
Mmm, a bit of overloading yes. Should be consistent with RT literature
terminology, though (I hope :-/).
> >
> > The block state affects the replenishment of the task, and that is why
> > it is different from both yield and throttle. If the task blocks and is
> > awakened prior to the deadline, the task's runtime will not be
> > replenished; if it wakes up after the deadline, it will. For example:
> >
Not entirely true. We can have a replenishment even if the task wakes up
before the deadline, if it happens to wake up after the 0-lag point
(after which its runtime can't be recycled if we don't want to affect
others' guarantees). Anyway, this doesn't affect the discussion, I only
wanted to point out that the fact that a replenishment happened might be
useful information to get.
> > Blocking, and then waking up after the deadline:
> > b-5152 [007] d..3 3983.376428: sched_deadline_replenish: comm=b pid=5152 now=3983.376425148 deadline=3983.406425148 runtime=10000000
> > b-5152 [007] d..3 3983.376515: sched_deadline_block: now=3983.376511101 deadline=3983.406425148 remaining_runtime=9909566
> > b-5152 [007] d..3 3983.376529: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> >
> > <idle>-0 [007] d.h4 3983.476592: sched_deadline_replenish: comm=b pid=5152 now=3983.476589573 deadline=3983.506589573 runtime=10000000
> > <idle>-0 [007] dNh4 3983.476596: sched_wakeup: comm=b pid=5152 prio=-1 target_cpu=007
> > <idle>-0 [007] d..3 3983.476648: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5152 next_prio=-1
> > b-5152 [007] d..3 3983.476660: sched_deadline_block: now=3983.476656613 deadline=3983.506589573 remaining_runtime=9932960
> > b-5152 [007] d..3 3983.476663: sched_switch: prev_comm=b prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> >
> >
> > Blocking, and then waking up before the deadline:
> > b-5139 [007] d..3 3964.148290: sched_deadline_replenish: comm=b pid=5139 now=3964.148285227 deadline=3964.178285227 runtime=10000000
> > b-5139 [007] d..3 3964.148396: sched_deadline_block: now=3964.148385308 deadline=3964.178285227 remaining_runtime=9895243
> > b-5139 [007] d..3 3964.148400: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> >
> > <idle>-0 [007] dNh5 3964.148411: sched_wakeup: comm=b pid=5139 prio=-1 target_cpu=007
> > <idle>-0 [007] d..3 3964.148419: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b next_pid=5139 next_prio=-1
> > b-5139 [007] d..3 3964.148427: sched_deadline_block: now=3964.148426022 deadline=3964.178285227 remaining_runtime=9878164
> > b-5139 [007] d..3 3964.148429: sched_switch: prev_comm=b prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
> >
>
> I still fail to see the usefulness of the block tracepoint. I could
> imagine that if we add the dynamic part of the sched_switch tracepoint
> to include deadline and runtime, we would get the same information.
>
Right. It seems to me that knowing when a task switched off should be
enough information (given that we have the dynamic part), complemented
by whether or not a replenishment happened at wakeup.
Thanks,
- Juri