[1]:
A timer tick (and task_tick) can interrupt between reading sum_exec_runtime
and calling task_delta_exec(p). Since the task's runtime is updated in the tick,
the value of cpu->sched then comes out about a tick less than it should be.
So, interrupts should be disabled while sampling runtime.
[2]:
enqueue_task also updates the runtime of the task running on the rq to
which a task is going to be enqueued.
So, the runqueue should be locked while sampling runtime.
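The window described in [1] can be sketched in user space. This is only an illustrative model, not kernel code: struct model_task, model_tick(), model_delta_exec(), sample_racy() and sample_consistent() are hypothetical names, and the "tick" is simulated by calling it explicitly between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two per-task fields involved (values in ns). */
struct model_task {
	uint64_t sum_exec_runtime;	/* runtime already banked */
	uint64_t exec_start;		/* clock value at the last banking */
};

/* Model of the tick: bank everything run since exec_start. */
static void model_tick(struct model_task *t, uint64_t now)
{
	assert(now >= t->exec_start);
	t->sum_exec_runtime += now - t->exec_start;
	t->exec_start = now;
}

/* Model of task_delta_exec(): runtime that is not banked yet. */
static uint64_t model_delta_exec(const struct model_task *t, uint64_t now)
{
	assert(now >= t->exec_start);
	return now - t->exec_start;
}

/* Unprotected sampling: the simulated tick lands between the two reads. */
static uint64_t sample_racy(struct model_task *t, uint64_t now)
{
	uint64_t banked = t->sum_exec_runtime;	/* first read: stale sum */

	model_tick(t, now);			/* "interrupt" arrives here */
	return banked + model_delta_exec(t, now); /* second read: delta is 0 */
}

/* Protected sampling: both reads see the same state. */
static uint64_t sample_consistent(const struct model_task *t, uint64_t now)
{
	return t->sum_exec_runtime + model_delta_exec(t, now);
}
```

In this model, with 187213693 ns banked and 4000000 ns (one tick at HZ=250) not yet banked, sample_consistent() gives 191213693 while sample_racy() gives 187213693, i.e. one tick short.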
From user land (at HZ=250), it looks like:

  prev call       current call    diff (ns)
  (0.191195713) > (0.191200456) :     4743
  (0.191200456) > (0.191205198) :     4742
  (0.191205198) > (0.187213693) : -3991505
  (0.187213693) > (0.191218688) :  4004995
  (0.191218688) > (0.191223436) :     4748
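A check of this kind can be sketched as below. This is not the program that produced the numbers above; it assumes the thread CPU clock (CLOCK_THREAD_CPUTIME_ID) is the one being sampled, and ts_diff_ns() and count_backward_steps() are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Signed difference cur - prev in nanoseconds. */
static int64_t ts_diff_ns(struct timespec prev, struct timespec cur)
{
	/* timespec values returned by clock_gettime() are normalized */
	assert(prev.tv_nsec >= 0 && prev.tv_nsec < 1000000000L);
	assert(cur.tv_nsec >= 0 && cur.tv_nsec < 1000000000L);

	return (int64_t)(cur.tv_sec - prev.tv_sec) * 1000000000LL
		+ (int64_t)(cur.tv_nsec - prev.tv_nsec);
}

/*
 * Sample the calling thread's CPU clock repeatedly and report every
 * pair of consecutive samples where the clock went backwards.
 * Returns the number of backward steps, or -1 if clock_gettime() fails.
 */
static long count_backward_steps(long iterations)
{
	struct timespec prev, cur;
	long backward = 0;

	if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &prev) != 0)
		return -1;

	for (long i = 0; i < iterations; i++) {
		if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cur) != 0)
			return -1;

		int64_t diff = ts_diff_ns(prev, cur);
		if (diff < 0) {
			printf("(%lld.%09ld) > (%lld.%09ld) : %" PRId64 "\n",
			       (long long)prev.tv_sec, prev.tv_nsec,
			       (long long)cur.tv_sec, cur.tv_nsec, diff);
			backward++;
		}
		prev = cur;
	}
	return backward;
}
```

Calling count_backward_steps() with a large iteration count from main() should return 0 on a kernel where the sampling is consistent; any printed line is a sample pair like the third row above.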
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) also has the same problem,
but it is more complicated because it needs to handle the thread group.
There should be another patch to fix it.
Signed-off-by: Hidetoshi Seto <[email protected]>
---
include/linux/kernel_stat.h | 1 +
kernel/posix-cpu-timers.c | 2 +-
kernel/sched.c | 27 +++++++++++++++++++++++++++
3 files changed, 29 insertions(+), 1 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 4a145ca..225f27a 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -66,6 +66,7 @@ static inline unsigned int kstat_irqs(unsigned int irq)
return sum;
}
+extern unsigned long long task_total_exec(struct task_struct *);
extern unsigned long long task_delta_exec(struct task_struct *);
extern void account_user_time(struct task_struct *, cputime_t);
extern void account_user_time_scaled(struct task_struct *, cputime_t);
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 4e5288a..8e828b4 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -294,7 +294,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = p->se.sum_exec_runtime + task_delta_exec(p);
+ cpu->sched = task_total_exec(p);
break;
}
return 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index e4bb1dd..5c7365a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4064,6 +4064,33 @@ DEFINE_PER_CPU(struct kernel_stat, kstat);
EXPORT_PER_CPU_SYMBOL(kstat);
/*
+ * Return total of runtime which already banked and which not yet
+ * @p in case that task is currently running.
+ */
+unsigned long long task_total_exec(struct task_struct *p)
+{
+ unsigned long flags;
+ struct rq *rq;
+ u64 ns = 0;
+
+ rq = task_rq_lock(p, &flags);
+
+ if (task_current(rq, p)) {
+ u64 delta_exec;
+
+ update_rq_clock(rq);
+ delta_exec = rq->clock - p->se.exec_start;
+ if ((s64)delta_exec > 0)
+ ns = delta_exec;
+ }
+ ns += p->se.sum_exec_runtime;
+
+ task_rq_unlock(rq, &flags);
+
+ return ns;
+}
+
+/*
* Return any ns on the sched_clock that have not yet been banked in
* @p in case that task is currently running.
*/
--
1.6.1.rc4
* Hidetoshi Seto <[email protected]> wrote:
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4064,6 +4064,33 @@ DEFINE_PER_CPU(struct kernel_stat, kstat);
> EXPORT_PER_CPU_SYMBOL(kstat);
>
> /*
> + * Return total of runtime which already banked and which not yet
> + * @p in case that task is currently running.
> + */
> +unsigned long long task_total_exec(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct rq *rq;
> + u64 ns = 0;
> +
> + rq = task_rq_lock(p, &flags);
> +
> + if (task_current(rq, p)) {
> + u64 delta_exec;
> +
> + update_rq_clock(rq);
> + delta_exec = rq->clock - p->se.exec_start;
> + if ((s64)delta_exec > 0)
> + ns = delta_exec;
> + }
> + ns += p->se.sum_exec_runtime;
> +
> + task_rq_unlock(rq, &flags);
> +
> + return ns;
> +}
> +
ok - the issue is real, but there are a few implementation details. Did
you know about task_delta_exec()? This new function duplicates that
function's behavior - and task_delta_exec() is already used by posix
task/cpu timers.
I find your new task_total_exec() interface better (it's easier to use),
could you have a look at eliminating task_delta_exec() users and migrating
them over to task_total_exec()?
Maybe even change the name of sum_exec_runtime over to __sum_exec_runtime,
to make sure nobody uses the 'raw' value - which is always stale by up to
one tick's worth of execution. Accessing that raw value would only make
sense after the task has finished executing (i.e. at cutime/cstime
collection time).
Ingo
Ingo Molnar wrote:
> ok - the issue is real, but there are a few implementation details. Did
> you know about task_delta_exec()? This new function duplicates that
> function's behavior - and task_delta_exec() is already used by posix
> task/cpu timers.
>
> I find your new task_total_exec() interface better (it's easier to use),
> could you have a look at eliminating task_delta_exec() users and migrating
> them over to task_total_exec()?
Yes. There are only 2 users (clock_gettime() on the thread timer and on the
process timer), so I'll post a new patch with a better description soon.
Thanks,
H.Seto