Process timers are moving fasters than their corresponding
cpu clock for various reasons:
1. There is a race condition when getting a timer sample that makes the sample
be smaller than it should leading to setting the timer expiration to soon.
2. When initializing the cputimer, by including tasks deltas in the initial
timer value, it makes them be counted twice.
3. When a thread autoreap itself when exiting, the last context switch update
will update the cputimer and not the overall process values stored in
signal.
I have also removed to erractic cputimer start/stop. I am guessing that it
was done to 'resync' once in a while the cputimer with the clock but
you could start the cputimer by calling timer_settimer that finally
do not end up by arming a new posix timer so you could have the cputimer
running with 0 armed timers or have 1 periodic process timer.
every time the periodic timer is moved to the timer list to be fire,
the cputimer is stopped to be restarted immediately after when it
is rescheduled. This lead to unnecessary lock retention, IMO.
With this patch, the glibc/rt unit tests pass with a 100% success rate.
I have been hammered it with invoking tst-cputimer1 in an infinite loop.
Signed-off-by: Olivier Langlois <[email protected]>
---
include/linux/kernel_stat.h | 1 +
include/linux/sched.h | 5 +++
kernel/posix-cpu-timers.c | 93 +++++++++++++++++++++++++++++++++------------
kernel/sched/core.c | 22 +++++++++--
kernel/sched/cputime.c | 47 ++++++++++++++++++++---
kernel/sched/fair.c | 11 +++++-
6 files changed, 144 insertions(+), 35 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index ed5f6ed..9f38c80 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -121,6 +121,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
* Lock/unlock the current runqueue - to extract task statistics:
*/
extern unsigned long long task_delta_exec(struct task_struct *);
+extern unsigned long long group_delta_exec(struct task_struct *);
extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..8666632 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2003,6 +2003,9 @@ static inline void disable_sched_clock_irqtime(void) {}
extern unsigned long long
task_sched_runtime(struct task_struct *task);
+extern unsigned long long
+task_sched_runtime_nodelta(struct task_struct *task, unsigned long long *delta);
+
/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
extern void sched_exec(void);
@@ -2625,6 +2628,8 @@ static inline int spin_needbreak(spinlock_t *lock)
* Thread group CPU time accounting.
*/
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+void thread_group_cputime_nodelta(struct task_struct *tsk, struct task_cputime *times,
+ unsigned long long *delta);
void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);
static inline void thread_group_cputime_init(struct signal_struct *sig)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 8fd709c..0d4a841 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -226,6 +226,9 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
return 0;
}
+/*
+ * Ensure the timer monotonicity.
+ */
static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
{
if (b->utime > a->utime)
@@ -238,29 +241,75 @@ static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
a->sum_exec_runtime = b->sum_exec_runtime;
}
-void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
+/*
+ * Fetch the thread group cputime and the group tasks delta sum
+ * atomically when initializing the timer or make sure that the
+ * race condition does not make timers fire earlier than specified
+ * by having the timer sample earlier than its corresponding clock.
+ *
+ * Except when initializing the cputimer, it is not always necessary
+ * to fetch the delta. It is mandatory only when setting a timer
+ * to avoid shooting it before its time. So enhance the sample
+ * accurary when getting the delta is free or when really needed.
+ */
+#define CPUTIMER_NEED_DELTA 1
+#define CPUTIMER_NO_DELTA 0
+
+static void thread_group_cputimer_withdelta(struct task_struct *tsk,
+ struct task_cputime *times,
+ unsigned long long *delta)
{
struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
struct task_cputime sum;
unsigned long flags;
- if (!cputimer->running) {
+ if (unlikely(!cputimer->running)) {
/*
* The POSIX timer interface allows for absolute time expiry
* values through the TIMER_ABSTIME flag, therefore we have
* to synchronize the timer to the clock every time we start
* it.
+ *
+ * Exclude task deltas or else they will be accounted twice
+ * in the cputimer.
*/
- thread_group_cputime(tsk, &sum);
+ thread_group_cputime_nodelta(tsk, &sum, delta);
raw_spin_lock_irqsave(&cputimer->lock, flags);
cputimer->running = 1;
update_gt_cputime(&cputimer->cputime, &sum);
- } else
+ } else {
+ /*
+ * Ideally, you would expect to get:
+ *
+ * 1. delta = x, times->sum_exec_runtime = y or
+ * 2. delta = 0, times->sum_exec_runtime = y+x
+ *
+ * but because of the race condition between this function and
+ * update_curr(), it is possible to get:
+ *
+ * 3. delta = 0, times->sum_exec_runtime = y by fetching the
+ * cputimer before delta or
+ * 4. delta = x, times->sum_exec_runtime = y+x by inverting the
+ * sequence.
+ *
+ * Situation #3 is to be avoided or else it will make a timer being
+ * fired sooner than requested.
+ *
+ * Calling group_delta_exec() is required to guaranty accurate result
+ */
+ if (delta && *delta == CPUTIMER_NEED_DELTA)
+ *delta = group_delta_exec(tsk);
raw_spin_lock_irqsave(&cputimer->lock, flags);
+ }
*times = cputimer->cputime;
raw_spin_unlock_irqrestore(&cputimer->lock, flags);
}
+void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
+{
+ thread_group_cputimer_withdelta(tsk, times, NULL);
+}
+
/*
* Sample a process (thread group) clock for the given group_leader task.
* Must be called with tasklist_lock held for reading.
@@ -615,22 +664,27 @@ static void cpu_timer_fire(struct k_itimer *timer)
*/
static int cpu_timer_sample_group(const clockid_t which_clock,
struct task_struct *p,
- union cpu_time_count *cpu)
+ union cpu_time_count *cpu,
+ unsigned need_delta)
{
struct task_cputime cputime;
+ unsigned long long delta;
- thread_group_cputimer(p, &cputime);
switch (CPUCLOCK_WHICH(which_clock)) {
default:
return -EINVAL;
case CPUCLOCK_PROF:
+ thread_group_cputimer_withdelta(p, &cputime, NULL);
cpu->cpu = cputime.utime + cputime.stime;
break;
case CPUCLOCK_VIRT:
+ thread_group_cputimer_withdelta(p, &cputime, NULL);
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
- cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
+ delta = need_delta;
+ thread_group_cputimer_withdelta(p, &cputime, &delta);
+ cpu->sched = cputime.sum_exec_runtime + delta;
break;
}
return 0;
@@ -697,7 +751,8 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
cpu_clock_sample(timer->it_clock, p, &val);
} else {
- cpu_timer_sample_group(timer->it_clock, p, &val);
+ cpu_timer_sample_group(timer->it_clock, p, &val,
+ CPUTIMER_NEED_DELTA);
}
if (old) {
@@ -845,7 +900,8 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
read_unlock(&tasklist_lock);
goto dead;
} else {
- cpu_timer_sample_group(timer->it_clock, p, &now);
+ cpu_timer_sample_group(timer->it_clock, p, &now,
+ CPUTIMER_NO_DELTA);
clear_dead = (unlikely(p->exit_state) &&
thread_group_empty(p));
}
@@ -967,16 +1023,6 @@ static void check_thread_timers(struct task_struct *tsk,
}
}
-static void stop_process_timers(struct signal_struct *sig)
-{
- struct thread_group_cputimer *cputimer = &sig->cputimer;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&cputimer->lock, flags);
- cputimer->running = 0;
- raw_spin_unlock_irqrestore(&cputimer->lock, flags);
-}
-
static u32 onecputick;
static void check_cpu_itimer(struct task_struct *tsk, struct cpu_itimer *it,
@@ -1042,7 +1088,7 @@ static void check_process_timers(struct task_struct *tsk,
/*
* Collect the current process totals.
*/
- thread_group_cputimer(tsk, &cputime);
+ thread_group_cputimer_withdelta(tsk, &cputime, NULL);
utime = cputime.utime;
ptime = utime + cputime.stime;
sum_sched_runtime = cputime.sum_exec_runtime;
@@ -1130,8 +1176,6 @@ static void check_process_timers(struct task_struct *tsk,
sig->cputime_expires.prof_exp = prof_expires;
sig->cputime_expires.virt_exp = virt_expires;
sig->cputime_expires.sched_exp = sched_expires;
- if (task_cputime_zero(&sig->cputime_expires))
- stop_process_timers(sig);
}
/*
@@ -1182,7 +1226,8 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
goto out_unlock;
}
spin_lock(&p->sighand->siglock);
- cpu_timer_sample_group(timer->it_clock, p, &now);
+ cpu_timer_sample_group(timer->it_clock, p, &now,
+ CPUTIMER_NO_DELTA);
bump_cpu_timer(timer, now);
/* Leave the tasklist_lock locked for the call below. */
}
@@ -1348,7 +1393,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
union cpu_time_count now;
BUG_ON(clock_idx == CPUCLOCK_SCHED);
- cpu_timer_sample_group(clock_idx, tsk, &now);
+ cpu_timer_sample_group(clock_idx, tsk, &now, CPUTIMER_NEED_DELTA);
if (oldval) {
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..bf73b57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2659,23 +2659,37 @@ unsigned long long task_delta_exec(struct task_struct *p)
/*
* Return accounted runtime for the task.
- * In case the task is currently running, return the runtime plus current's
- * pending runtime that have not been accounted yet.
+ * Return separately the current's pending runtime that have not been
+ * accounted yet.
*/
-unsigned long long task_sched_runtime(struct task_struct *p)
+unsigned long long task_sched_runtime_nodelta(struct task_struct *p, unsigned long long *delta)
{
unsigned long flags;
struct rq *rq;
u64 ns = 0;
rq = task_rq_lock(p, &flags);
- ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+ ns = p->se.sum_exec_runtime;
+ *delta = do_task_delta_exec(p, rq);
task_rq_unlock(rq, p, &flags);
return ns;
}
/*
+ * Return accounted runtime for the task.
+ * In case the task is currently running, return the runtime plus current's
+ * pending runtime that have not been accounted yet.
+ */
+unsigned long long task_sched_runtime(struct task_struct *p)
+{
+ unsigned long long delta;
+ u64 ns = task_sched_runtime_nodelta(p, &delta);
+ ns += delta;
+ return ns;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index ed12cbb..69836c9 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -289,15 +289,14 @@ static __always_inline bool steal_account_process_tick(void)
return false;
}
-/*
- * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
- * tasks (sum on group iteration) belonging to @tsk's group.
- */
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+void thread_group_cputime_nodelta(struct task_struct *tsk, struct task_cputime *times,
+ unsigned long long *delta)
{
struct signal_struct *sig = tsk->signal;
cputime_t utime, stime;
struct task_struct *t;
+ unsigned long long d = 0;
+ unsigned long long td;
times->utime = sig->utime;
times->stime = sig->stime;
@@ -313,10 +312,46 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
task_cputime(tsk, &utime, &stime);
times->utime += utime;
times->stime += stime;
- times->sum_exec_runtime += task_sched_runtime(t);
+ times->sum_exec_runtime += task_sched_runtime_nodelta(t, &td);
+ d += td;
} while_each_thread(tsk, t);
out:
rcu_read_unlock();
+
+ if (delta)
+ *delta = d;
+}
+
+/*
+ * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
+ * tasks (sum on group iteration) belonging to @tsk's group.
+ */
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+{
+ unsigned long long d;
+ thread_group_cputime_nodelta(tsk, times, &d);
+ times->sum_exec_runtime += d;
+}
+
+
+unsigned long long group_delta_exec(struct task_struct *tsk)
+{
+ unsigned long long ns = 0;
+ struct task_struct *t;
+
+ rcu_read_lock();
+ /* make sure we can trust tsk->thread_group list */
+ if (!likely(pid_alive(tsk)))
+ goto out;
+
+ t = tsk;
+ do {
+ ns += task_delta_exec(t);
+ } while_each_thread(tsk, t);
+out:
+ rcu_read_unlock();
+
+ return ns;
}
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..b82d070 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -708,7 +708,16 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cpuacct_charge(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
+
+ /*
+ * Do not update the cputimer if the task is already released by
+ * release_task().
+ *
+ * it would preferable to defer the autoreap release_task
+ * after the last context switch but harder to do.
+ */
+ if (likely(curtask->sighand))
+ account_group_exec_runtime(curtask, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
--
1.8.2
On Fri, 2013-04-05 at 13:59 -0400, Olivier Langlois wrote:
> Process timers are moving fasters than their corresponding
> cpu clock for various reasons:
>
> 1. There is a race condition when getting a timer sample that makes the sample
> be smaller than it should leading to setting the timer expiration to soon.
> 2. When initializing the cputimer, by including tasks deltas in the initial
> timer value, it makes them be counted twice.
> 3. When a thread autoreap itself when exiting, the last context switch update
> will update the cputimer and not the overall process values stored in
> signal.
Please explain these races. Things like task_sched_runtime() on which
most of this stuff is build read both sum_exec_runtime and compute the
delta while holding the rq->lock; this should avoid any and all races
against update_curr() and the sort.
All this fiddling with conditional deltas seems very ugly, esp. since
no attempt is made to explain why it would be impossible to tighten the
synchronization to avoid the races.
> I have also removed to erractic cputimer start/stop. I am guessing that it
> was done to 'resync' once in a while the cputimer with the clock but
> you could start the cputimer by calling timer_settimer that finally
> do not end up by arming a new posix timer so you could have the cputimer
> running with 0 armed timers or have 1 periodic process timer.
No keeping a process wide (over all threads) cputime aggregate running
is _expensive_, so its important to stop doing this once there's nobody
who cares about it anymore.
On Wed, 2013-04-10 at 13:35 +0200, Peter Zijlstra wrote:
> On Fri, 2013-04-05 at 13:59 -0400, Olivier Langlois wrote:
> > Process timers are moving fasters than their corresponding
> > cpu clock for various reasons:
> >
> > 1. There is a race condition when getting a timer sample that makes the sample
> > be smaller than it should leading to setting the timer expiration to soon.
> > 2. When initializing the cputimer, by including tasks deltas in the initial
> > timer value, it makes them be counted twice.
> > 3. When a thread autoreap itself when exiting, the last context switch update
> > will update the cputimer and not the overall process values stored in
> > signal.
>
> Please explain these races. Things like task_sched_runtime() on which
> most of this stuff is build read both sum_exec_runtime and compute the
> delta while holding the rq->lock; this should avoid any and all races
> against update_curr() and the sort.
>
> All this fiddling with conditional deltas seems very ugly, esp. since
> no attempt is made to explain why it would be impossible to tighten the
> synchronization to avoid the races.
Peter,
You have valid concerns and I will attempt to clarify the changes I
propose. Before I do, realise that as a first time patcher, I sincerely
attempted to minimize the changes required to fix the posix cputimers.
The real source of the problem is that the process clock is distinct
from its cputimer. It is not explained why it is done like that in the
code but I understand that the benefit is that you can fetch the
cputimer value and avoiding the cost to traverse the list of tasks
member of the group. The price to pay however it is that it is painful
to make sure that the clock and its corresponding cputimer remain in
sync as they advance. With that in mind, I did all I can to minimize
thread group task list traversal when possible and do it only when
mandatory which is when you set a timer expiration time.
1. To set an accurate expiration time, you need to account for all
thread group tasks delta.
The code used to only get the delta of 1 task when getting a cputimer
sample but this is not enough. Imagine getting this sample on a thread
group of at least same number of tasks than there is processor on the
system just before a timer interrupt. Your timer could fire +- (num of
cores)/HZ sec earlier than requested because just after getting your
sample all these deltas will be added to the process cputimer.
For the race condition, IMO it is not easy or desirable to fix it as it
would be too expensive. Simply acknowledge its presence and work around
it to avoid side effects is wiser. That is what I explain in the comment
inside the function thread_group_cputimer_withdelta().
To compute the sum of deltas or when calling thread_group_cputime(), the
code locks each task rq one by one. After that thread_group_cputimer()
will lock the cputimer lock to get or set the cputimer value.
If you get the cputimer value first from thread_group_cputimer(), you
could have another thread executing update_curr(), it is maybe blocked
on the cputimer lock. This delta will appear nowhere as it will not be
in the cputimer, nor will it appear when call task_delta_exec() after.
Inverse the order, call delta_exec first and then get cputimer, the
delta can be counted twice. A task running concurently for which you
already collected its delta could update the cputimer while you complete
the call on group_delta_exec.
To really fix the race, you would need to hold all the process tasks rqs
and the cputimer lock at the same time. As a first time patcher, I
didn't want to go there or even dare to contemplate the idea :-) but IMO
if the goal is just to make sure that timers do not fire before their
requested expiration, just order the function calls to be sure you reach
the goal is sufficient to me.
There is another manifestation of the same race condition when
initializing the cputimer. When calling thread_group_cputime() and
before cputimer->running is set, nothing stops another thread to drop
its delta when calling account_group_exec_runtime().
This one would affect only absolute timers however. If you really wanted
to address it you could:
1. Lock cputimer
2. Set cputimer->running
3. Save to a local variable its value
3. Unlock cputimer call thread_group_cputime()
4. Relock cputimer.
5. substract previous cputimer value from current one.
6. Set cputimer with cputime add the diff computed in #5.
I experimented with this idea and did measure frequently diff of 20000
and sometimes up to 60000.
But going there is really starting to be ugly and I didn't care about
absolute timers. We can explore the idea if you want tho.
>
> > I have also removed to erractic cputimer start/stop. I am guessing that it
> > was done to 'resync' once in a while the cputimer with the clock but
> > you could start the cputimer by calling timer_settimer that finally
> > do not end up by arming a new posix timer so you could have the cputimer
> > running with 0 armed timers or have 1 periodic process timer.
>
> No keeping a process wide (over all threads) cputime aggregate running
> is _expensive_, so its important to stop doing this once there's nobody
> who cares about it anymore.
>
Please explain how expensive it is. All I am seeing is a couple of additions.
If it is important to stop it then the code will require a fix anyway.
1.
As you could start it
indefinitely with 0 timers if calling posix_cpu_timer_set() does not end
up arming a timer or
2.
Having one periodic timer. The cputimer will be stopped in
check_process_timers() to be immediatly restarted in
posix_cpu_timer_schedule() every single time the timer will be fired.
By adding a small printk() when initializing the cputimer, I have been
puzzled and surprised to see how many times it was intialized by
executing glibc2.17/rt/tst-cputimer1 !
I am not sure how expensive it is to run the timer, but I do know
initializing it is _really_expensive_ expecially if it is done when a
periodic high frequency timer is fired.
To conclude, I have tested the patch on a variety of hardware from a
ATOM N450 to a Intel 3930K I7. Since my initial post, I am sad to
announce that it does not always work on very fast system. IMO, all
small fixes that I have presented are valid and do indeed improve the
posix cpu timers accuracy but I still have a small glitch or 2 to nail
down.
Thank you,
Olivier
On Wed, 2013-04-10 at 13:35 +0200, Peter Zijlstra wrote:
> On Fri, 2013-04-05 at 13:59 -0400, Olivier Langlois wrote:
> > Process timers are moving fasters than their corresponding
> > cpu clock for various reasons:
> >
> > 1. There is a race condition when getting a timer sample that makes the sample
> > be smaller than it should leading to setting the timer expiration to soon.
> > 2. When initializing the cputimer, by including tasks deltas in the initial
> > timer value, it makes them be counted twice.
> > 3. When a thread autoreap itself when exiting, the last context switch update
> > will update the cputimer and not the overall process values stored in
> > signal.
>
> Please explain these races. Things like task_sched_runtime() on which
> most of this stuff is build read both sum_exec_runtime and compute the
> delta while holding the rq->lock; this should avoid any and all races
> against update_curr() and the sort.
>
In my previous reply, I have explained in length the race condition but
I didn't realize that you were also mentioning my refactoring of
task_sched_runtime() so I comment a little bit more about this proposal:
currently:
- cputimer is initialized with the result of thread_group_cputime()
which is (accounted time + tasks deltas)
- cputimer sample value is then cputimer + 1 more task_delta_exec()
- After all active tasks pass through update_curr(), cputimer is
(accounted time + 2*(tasks deltas))
By being able to get separately get accounted time and delta, you can:
- Initialize cputimer to accounted time
- thread group cputimer sample will be cputimer + delta (which is
essentially equivalent to what would thread_group_cputime() return)
- After all the deltas are in by having called
account_group_exec_runtime(), cputimer will be set to (accounted time +
tasks delta) and have the exact same value of the corresponding process
clock.
In other words, currently the way the cputimer is initialized contribute
to make it advance faster than its corressponding process clock.
This part of the patch has nothing to do with race condition, as far as
I can tell, thread_group_cputime() and task_delta_exec() are rock solid.
It is just that you need delta and accounted time separately and
preferably atomically to be able to initialize posix cpu timer
correctly.
Greetings,
Olivier
On Wed, 2013-04-10 at 11:48 -0400, Olivier Langlois wrote:
> Please explain how expensive it is. All I am seeing is a couple of
> additions.
Let me start with this, since your earlier argument also refers to
this.
So yes it does look simple and straight fwd, only one addition. However
its an atomic operation across all threads of the same process. Imagine
a single process with 512 threads, all running on a separate cpu.
Do you see the problem? The cacheline contention of that one atomic is
enough to bring a machine that size to its knees. People tried, it
works.
This is a fundamentally unscalable problem that is part of the POSIX
interface.
Also, since its a concurrent problem, the entire question: "what is the
current runtime of the process" is uncertain and fuzzy. I prefer to
look at it as a Heisenberg uncertainty principle of SMP computing; you
cannot know the exact state of your SMP system and have it run (fast).
On Wed, 2013-04-10 at 11:48 -0400, Olivier Langlois wrote:
> If it is important to stop it then the code will require a fix anyway.
>
> 1.
> As you could start it
> indefinitely with 0 timers if calling posix_cpu_timer_set() does not
> end
> up arming a timer or
>
> 2.
>
> Having one periodic timer. The cputimer will be stopped in
> check_process_timers() to be immediatly restarted in
> posix_cpu_timer_schedule() every single time the timer will be fired.
OK, then yes that wants fixing. That was not intended.
The intent was starting the accounting when required and stopping it
when no longer needed.
On Wed, 2013-04-10 at 11:48 -0400, Olivier Langlois wrote:
>
> You have valid concerns and I will attempt to clarify the changes I
> propose. Before I do, realise that as a first time patcher, I
> sincerely
> attempted to minimize the changes required to fix the posix cputimers.
Right, I suspect some of that is what made me go yuck! when reading the
patch. I feel some interfaces could be avoided if we refactor a bit
more -- and given the complexity of the code its well worth doing.
> The real source of the problem is that the process clock is distinct
> from its cputimer. It is not explained why it is done like that in the
> code but I understand that the benefit is that you can fetch the
> cputimer value and avoiding the cost to traverse the list of tasks
> member of the group. The price to pay however it is that it is painful
> to make sure that the clock and its corresponding cputimer remain in
> sync as they advance. With that in mind, I did all I can to minimize
> thread group task list traversal when possible and do it only when
> mandatory which is when you set a timer expiration time.
Right, I hope my earlier email explained why it is so expensive and
thus why they're separated.
I'll try and dig through the rest of your email later.. sorry for being
a tad slow etc.
On Fri, 2013-04-12 at 12:50 +0200, Peter Zijlstra wrote:
> I'll try and dig through the rest of your email later.. sorry for
> being
> a tad slow etc.
So at thread_group_cputimer() we initialize the cputimer->cputime state
by using thread_group_cputime() which iterates all tasks of the process
and calls task_sched_runtime() upon them (which includes the current
delta).
Upon subsequent account_group_exec_runtime() calls (from all schedule
events and timer ticks) we add the current delta to cputimer->cputime.
However since we already added the first (or part thereof) delta to the
initial state, we account this double. Thus we can be up to
NR_CPUS*TICK_NSEC ahead.
On every timer tick we evaluate the cputimer state using
cpu_timer_sample_group() which adds the current tasks delta. This can
thus be up to (NR_CPUS-1)*TICK_NSEC behind.
The combination of the timeline behind ahead and the sample being
behind make it a virtual guarantee we'll hit early by almost
2*NR_CPUS*TICK_NSEC.
This is what you've been saying right?
So how about we do not include the deltas into the initial sum, so that
we're up to NR_CPUS*TICK_NSEC behind. That way, with the sample up to
(NR_CPUS-1)*TICK_NSEC behind, we're in the order of TICK_NSEC late with
firing.
Hmm?
---
include/linux/sched.h | 5 +++--
kernel/posix-cpu-timers.c | 15 ++++++++++-----
kernel/sched/core.c | 6 ++++--
kernel/sched/cputime.c | 8 ++++----
4 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88ec7f4..abe5870 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1832,7 +1832,7 @@ static inline void disable_sched_clock_irqtime(void) {}
#endif
extern unsigned long long
-task_sched_runtime(struct task_struct *task);
+task_sched_runtime(struct task_struct *task, bool add_delta);
/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
@@ -2496,7 +2496,8 @@ static inline void current_clr_polling(void) { }
/*
* Thread group CPU time accounting.
*/
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times,
+ bool add_delta);
void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);
static inline void thread_group_cputime_init(struct signal_struct *sig)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 8fd709c..d8133ad 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -220,7 +220,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = task_sched_runtime(p);
+ cpu->sched = task_sched_runtime(p, true);
break;
}
return 0;
@@ -250,8 +250,13 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
* values through the TIMER_ABSTIME flag, therefore we have
* to synchronize the timer to the clock every time we start
* it.
+ *
+ * Do no add the current delta, because
+ * account_group_exec_runtime() will also add this delta and we
+ * wouldn't want to double account time and get ahead of
+ * ourselves.
*/
- thread_group_cputime(tsk, &sum);
+ thread_group_cputime(tsk, &sum, false);
raw_spin_lock_irqsave(&cputimer->lock, flags);
cputimer->running = 1;
update_gt_cputime(&cputimer->cputime, &sum);
@@ -275,15 +280,15 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
default:
return -EINVAL;
case CPUCLOCK_PROF:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, &cputime, true);
cpu->cpu = cputime.utime + cputime.stime;
break;
case CPUCLOCK_VIRT:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, &cputime, true);
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, &cputime, true);
cpu->sched = cputime.sum_exec_runtime;
break;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8167e3..704fa44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2677,14 +2677,16 @@ unsigned long long task_delta_exec(struct task_struct *p)
* In case the task is currently running, return the runtime plus current's
* pending runtime that have not been accounted yet.
*/
-unsigned long long task_sched_runtime(struct task_struct *p)
+unsigned long long task_sched_runtime(struct task_struct *p, bool add_delta)
{
unsigned long flags;
struct rq *rq;
u64 ns = 0;
rq = task_rq_lock(p, &flags);
- ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+ ns = p->se.sum_exec_runtime;
+ if (add_delta)
+ ns += do_task_delta_exec(p, rq);
task_rq_unlock(rq, p, &flags);
return ns;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index ea32f02..c3495e1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -277,7 +277,7 @@ static __always_inline bool steal_account_process_tick(void)
* Accumulate raw cputime values of dead tasks (sig->[us]time) and live
* tasks (sum on group iteration) belonging to @tsk's group.
*/
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times, bool add_delta)
{
struct signal_struct *sig = tsk->signal;
cputime_t utime, stime;
@@ -297,7 +297,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
task_cputime(t, &utime, &stime);
times->utime += utime;
times->stime += stime;
- times->sum_exec_runtime += task_sched_runtime(t);
+ times->sum_exec_runtime += task_sched_runtime(t, add_delta);
} while_each_thread(tsk, t);
out:
rcu_read_unlock();
@@ -444,7 +444,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
{
struct task_cputime cputime;
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, &cputime, true);
*ut = cputime.utime;
*st = cputime.stime;
@@ -606,7 +606,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
{
struct task_cputime cputime;
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, &cputime, true);
cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
On Fri, 2013-04-12 at 11:16 +0200, Peter Zijlstra wrote:
> On Wed, 2013-04-10 at 11:48 -0400, Olivier Langlois wrote:
> > Please explain how expensive it is. All I am seeing is a couple of
> > additions.
>
> Let me start with this, since your earlier argument also refers to
> this.
>
> So yes it does look simple and straight fwd, only one addition. However
> its an atomic operation across all threads of the same process. Imagine
> a single process with 512 threads, all running on a separate cpu.
>
Peter,
It now makes perfect sense. Thank you for your explanation. It is
showing me an aspect that I did overlook.
Greetings,
On Fri, 2013-04-12 at 17:55 +0200, Peter Zijlstra wrote:
> On Fri, 2013-04-12 at 12:50 +0200, Peter Zijlstra wrote:
>
> > I'll try and dig through the rest of your email later.. sorry for
> > being
> > a tad slow etc.
>
>
> So at thread_group_cputimer() we initialize the cputimer->cputime state
> by using thread_group_cputime() which iterates all tasks of the process
> and calls task_sched_runtime() upon them (which includes the current
> delta).
>
> Upon subsequent account_group_exec_runtime() calls (from all schedule
> events and timer ticks) we add the current delta to cputimer->cputime.
>
> However since we already added the first (or part thereof) delta to the
> initial state, we account this double. Thus we can be up to
> NR_CPUS*TICK_NSEC ahead.
Peter, I am thrilled to see that we are starting to understand each
other. So far, that is what I am saying!
There is one missing key concept to understand the issue. The initial
value of the cputimer is not really that important (at least for
relative timers). What is really important it is the pace that it will
move. The thread group deltas represent the minimum size of the step
that the cputimer will be incremented at any moment from now.
The timer expiration time computed must include the deltas or else it
will be fire ahead of its time. This will hold true for any initial
value given to the cputimer.
With that said, maybe this code snippet from my patch explaining the race
condition effect choosen by the statement order makes more sense:
+ } else {
+ /*
+ * Ideally, you would expect to get:
+ *
+ * 1. delta = x, times->sum_exec_runtime = y or
+ * 2. delta = 0, times->sum_exec_runtime = y+x
+ *
+ * but because of the race condition between this function and
+ * update_curr(), it is possible to get:
+ *
+ * 3. delta = 0, times->sum_exec_runtime = y by fetching the
+ * cputimer before delta or
+ * 4. delta = x, times->sum_exec_runtime = y+x by inverting the
+ * sequence.
+ *
+ * Situation #3 is to be avoided or else it will make a timer being
+ * fired sooner than requested.
+ *
+ * Calling group_delta_exec() is required to guaranty accurate result
+ */
+ if (delta && *delta == CPUTIMER_NEED_DELTA)
+ *delta = group_delta_exec(tsk);
Setting the initial value of the cputimer to thread_group_cputime()
minus deltas just ensure that its value will be exactly equal to the
corresponding process clock which is nice for absolute timers.
>
> On every timer tick we evaluate the cputimer state using
> cpu_timer_sample_group() which adds the current tasks delta. This can
> thus be up to (NR_CPUS-1)*TICK_NSEC behind.
>
> The combination of the timeline behind ahead and the sample being
> behind make it a virtual guarantee we'll hit early by almost
> 2*NR_CPUS*TICK_NSEC.
The second part is not quite what I have been saying.
On every timer tick in the function check_process_timers() raw cputimer
is evaluated and this looks fine to me find as it is staying on the
conservative side.
Please provide a quote from my e-mails that have led you to this
understanding. I will try to rectify it.
>
> This is what you've been saying right?
>
>
> So how about we do not include the deltas into the initial sum, so that
> we're up to NR_CPUS*TICK_NSEC behind. That way, with the sample up to
> (NR_CPUS-1)*TICK_NSEC behind, we're in the order of TICK_NSEC late with
> firing.
>
> Hmm?
For the reasons listed above, I think that with these proposed changes a
timer could still fire too early. However, this is making the cputimer less
far ahead to its corresponding process clock.
How about if you let me rework my original patch?
Stopping/restarting the cputimer is important for performance.
What else is important to you?
Some new function names to rework?
I could get rid completely of the new function group_delta_exec() and use
thread_group_cputime_nodelta().
The other thing that makes the cputimer moving faster than the process clock,
it is the last call to update_curr() from the last context switch performed
after the call to release_task() for autoreaped tasks.
Nobody commented on that so I am assuming that everyone understand that one. Is
this correct?
I am not totally satisfied with the solution that I am proposing in sched/fair.c
I was hoping for improvement suggestions on that one.
Greetings,
Olivier
2013/4/12 Peter Zijlstra <[email protected]>:
> On Fri, 2013-04-12 at 12:50 +0200, Peter Zijlstra wrote:
>
>> I'll try and dig through the rest of your email later.. sorry for
>> being
>> a tad slow etc.
>
>
> So at thread_group_cputimer() we initialize the cputimer->cputime state
> by using thread_group_cputime() which iterates all tasks of the process
> and calls task_sched_runtime() upon them (which includes the current
> delta).
>
> Upon subsequent account_group_exec_runtime() calls (from all schedule
> events and timer ticks) we add the current delta to cputimer->cputime.
>
> However since we already added the first (or part thereof) delta to the
> initial state, we account this double. Thus we can be up to
> NR_CPUS*TICK_NSEC ahead.
>
> On every timer tick we evaluate the cputimer state using
> cpu_timer_sample_group() which adds the current tasks delta. This can
> thus be up to (NR_CPUS-1)*TICK_NSEC behind.
>
> The combination of the timeline behind ahead and the sample being
> behind make it a virtual guarantee we'll hit early by almost
> 2*NR_CPUS*TICK_NSEC.
>
> This is what you've been saying right?
>
>
> So how about we do not include the deltas into the initial sum, so that
> we're up to NR_CPUS*TICK_NSEC behind. That way, with the sample up to
> (NR_CPUS-1)*TICK_NSEC behind, we're in the order of TICK_NSEC late with
> firing.
>
> Hmm?
I feel we are hitting the same issue than this patch:
https://lkml.org/lkml/2013/4/5/116
I'm adding Kosaki in Cc, who proposed roughly the same fix.
Thanks.
Frederic.
>
> ---
> include/linux/sched.h | 5 +++--
> kernel/posix-cpu-timers.c | 15 ++++++++++-----
> kernel/sched/core.c | 6 ++++--
> kernel/sched/cputime.c | 8 ++++----
> 4 files changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 88ec7f4..abe5870 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1832,7 +1832,7 @@ static inline void disable_sched_clock_irqtime(void) {}
> #endif
>
> extern unsigned long long
> -task_sched_runtime(struct task_struct *task);
> +task_sched_runtime(struct task_struct *task, bool add_delta);
>
> /* sched_exec is called by processes performing an exec */
> #ifdef CONFIG_SMP
> @@ -2496,7 +2496,8 @@ static inline void current_clr_polling(void) { }
> /*
> * Thread group CPU time accounting.
> */
> -void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
> +void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times,
> + bool add_delta);
> void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);
>
> static inline void thread_group_cputime_init(struct signal_struct *sig)
> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 8fd709c..d8133ad 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -220,7 +220,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
> cpu->cpu = virt_ticks(p);
> break;
> case CPUCLOCK_SCHED:
> - cpu->sched = task_sched_runtime(p);
> + cpu->sched = task_sched_runtime(p, true);
> break;
> }
> return 0;
> @@ -250,8 +250,13 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
> * values through the TIMER_ABSTIME flag, therefore we have
> * to synchronize the timer to the clock every time we start
> * it.
> + *
> + * Do no add the current delta, because
> + * account_group_exec_runtime() will also add this delta and we
> + * wouldn't want to double account time and get ahead of
> + * ourselves.
> */
> - thread_group_cputime(tsk, &sum);
> + thread_group_cputime(tsk, &sum, false);
> raw_spin_lock_irqsave(&cputimer->lock, flags);
> cputimer->running = 1;
> update_gt_cputime(&cputimer->cputime, &sum);
> @@ -275,15 +280,15 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
> default:
> return -EINVAL;
> case CPUCLOCK_PROF:
> - thread_group_cputime(p, &cputime);
> + thread_group_cputime(p, &cputime, true);
> cpu->cpu = cputime.utime + cputime.stime;
> break;
> case CPUCLOCK_VIRT:
> - thread_group_cputime(p, &cputime);
> + thread_group_cputime(p, &cputime, true);
> cpu->cpu = cputime.utime;
> break;
> case CPUCLOCK_SCHED:
> - thread_group_cputime(p, &cputime);
> + thread_group_cputime(p, &cputime, true);
> cpu->sched = cputime.sum_exec_runtime;
> break;
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e8167e3..704fa44 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2677,14 +2677,16 @@ unsigned long long task_delta_exec(struct task_struct *p)
> * In case the task is currently running, return the runtime plus current's
> * pending runtime that have not been accounted yet.
> */
> -unsigned long long task_sched_runtime(struct task_struct *p)
> +unsigned long long task_sched_runtime(struct task_struct *p, bool add_delta)
> {
> unsigned long flags;
> struct rq *rq;
> u64 ns = 0;
>
> rq = task_rq_lock(p, &flags);
> - ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
> + ns = p->se.sum_exec_runtime;
> + if (add_delta)
> + ns += do_task_delta_exec(p, rq);
> task_rq_unlock(rq, p, &flags);
>
> return ns;
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index ea32f02..c3495e1 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -277,7 +277,7 @@ static __always_inline bool steal_account_process_tick(void)
> * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> * tasks (sum on group iteration) belonging to @tsk's group.
> */
> -void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> +void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times, bool add_delta)
> {
> struct signal_struct *sig = tsk->signal;
> cputime_t utime, stime;
> @@ -297,7 +297,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> task_cputime(t, &utime, &stime);
> times->utime += utime;
> times->stime += stime;
> - times->sum_exec_runtime += task_sched_runtime(t);
> + times->sum_exec_runtime += task_sched_runtime(t, add_delta);
> } while_each_thread(tsk, t);
> out:
> rcu_read_unlock();
> @@ -444,7 +444,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
> {
> struct task_cputime cputime;
>
> - thread_group_cputime(p, &cputime);
> + thread_group_cputime(p, &cputime, true);
>
> *ut = cputime.utime;
> *st = cputime.stime;
> @@ -606,7 +606,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
> {
> struct task_cputime cputime;
>
> - thread_group_cputime(p, &cputime);
> + thread_group_cputime(p, &cputime, true);
> cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
> }
> #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>
>
> I feel we are hitting the same issue than this patch:
> https://lkml.org/lkml/2013/4/5/116
>
> I'm adding Kosaki in Cc, who proposed roughly the same fix.
Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
However the fix is definitely same and I definitely agree this approach.
thank you.
On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> I feel we are hitting the same issue than this patch:
>> https://lkml.org/lkml/2013/4/5/116
>>
>> I'm adding Kosaki in Cc, who proposed roughly the same fix.
>
> Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
> However the fix is definitely same and I definitely agree this approach.
>
> thank you.
And if I understand correctly, update_gt_cputime() is no longer
necessary after this patch because time never makes backward.
What do you think?
On Fri, 2013-04-19 at 11:08 -0700, KOSAKI Motohiro wrote:
> On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> I feel we are hitting the same issue than this patch:
> >> https://lkml.org/lkml/2013/4/5/116
> >>
> >> I'm adding Kosaki in Cc, who proposed roughly the same fix.
> >
> > Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
> > However the fix is definitely same and I definitely agree this approach.
> >
> > thank you.
>
> And if I understand correctly, update_gt_cputime() is no longer
> necessary after this patch because time never makes backward.
>
> What do you think?
Kosaki, I would tend to say that what you propose is exact. After having
added the task deltas I was puzzled to see the cputimer still moving
faster than the process clock. I was seeing it with the the help of
printk statement inside update_gt_cputime().
After nailing down the last remaining cause of that inside sched/core.c,
I have never seen after the cputimer being in advance.
On Fri, 2013-04-26 at 00:40 -0400, Olivier Langlois wrote:
> On Fri, 2013-04-19 at 11:08 -0700, KOSAKI Motohiro wrote:
> > On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro
> > <[email protected]> wrote:
> > >> I feel we are hitting the same issue than this patch:
> > >> https://lkml.org/lkml/2013/4/5/116
> > >>
> > >> I'm adding Kosaki in Cc, who proposed roughly the same fix.
> > >
> > > Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
> > > However the fix is definitely same and I definitely agree this approach.
> > >
> > > thank you.
> >
> > And if I understand correctly, update_gt_cputime() is no longer
> > necessary after this patch because time never makes backward.
> >
> > What do you think?
>
> Kosaki, I would tend to say that what you propose is exact. After having
> added the task deltas I was puzzled to see the cputimer still moving
> faster than the process clock. I was seeing it with the the help of
> printk statement inside update_gt_cputime().
>
> After nailing down the last remaining cause of that inside sched/core.c,
> I have never seen after the cputimer being in advance.
>
I need to add that I can only confirm that to be true with
sum_exec_runtime.
To affirm it to be true for stime and utime would require more
investigation. I didn't look them at all. I was only concerned with
sum_exec_runtime.
I will prepare a v2 of the patch accounting all the feedbacks that I
received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
and send it back here for further discussion.
Thank you very much all!
Olivier
> I need to add that I can only confirm that to be true with
> sum_exec_runtime.
>
> To affirm it to be true for stime and utime would require more
> investigation. I didn't look them at all. I was only concerned with
> sum_exec_runtime.
>
> I will prepare a v2 of the patch accounting all the feedbacks that I
> received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
> and send it back here for further discussion.
>
> Thank you very much all!
Do you mean your utime test case still failure? If you share your test-case,
I'm going to look at your issue too.
On Fri, 2013-04-26 at 15:08 -0400, KOSAKI Motohiro wrote:
> > I need to add that I can only confirm that to be true with
> > sum_exec_runtime.
> >
> > To affirm it to be true for stime and utime would require more
> > investigation. I didn't look them at all. I was only concerned with
> > sum_exec_runtime.
> >
> > I will prepare a v2 of the patch accounting all the feedbacks that I
> > received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
> > and send it back here for further discussion.
> >
> > Thank you very much all!
>
> Do you mean your utime test case still failure? If you share your test-case,
> I'm going to look at your issue too.
>
Sure with pleasure. My testcase is glibc-2.17/rt/tst-cputimer1.c
That being said, it strictly test CPUCLOCK_SCHED timers. Hence my focus
when modifying the code was strictly on sum_exec_runtime.
If utime and stime components of cputimer are moving faster than their
associated clock, this is something that I did not address.
On Fri, Apr 26, 2013 at 9:51 PM, Olivier Langlois
<[email protected]> wrote:
> On Fri, 2013-04-26 at 15:08 -0400, KOSAKI Motohiro wrote:
>> > I need to add that I can only confirm that to be true with
>> > sum_exec_runtime.
>> >
>> > To affirm it to be true for stime and utime would require more
>> > investigation. I didn't look them at all. I was only concerned with
>> > sum_exec_runtime.
>> >
>> > I will prepare a v2 of the patch accounting all the feedbacks that I
>> > received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
>> > and send it back here for further discussion.
>> >
>> > Thank you very much all!
>>
>> Do you mean your utime test case still failure? If you share your test-case,
>> I'm going to look at your issue too.
>>
> Sure with pleasure. My testcase is glibc-2.17/rt/tst-cputimer1.c
>
> That being said, it strictly test CPUCLOCK_SCHED timers. Hence my focus
> when modifying the code was strictly on sum_exec_runtime.
>
> If utime and stime components of cputimer are moving faster than their
> associated clock, this is something that I did not address.
Hmm... Sorry. I'm confused. 1) I haven't seen any glibc test failure
after applying
my patch. 2) tst-cputimer1.c only have CLOCK_PROCESS_CPUTIME_ID test and
don't have any utime, stime tests.
Please let me know if you've seen any failure after applying my patch.
Forbids the cputimer to drift ahead of its process clock by
blocking its update when a tick occurs while a autoreaping task
is currently in do_exit() between the call to release_task() and
its final call to schedule().
Any task stats update after having called release_task() will
be lost because they are added to the global process stats located
in the signal struct from release_task().
Ideally, you should postpone the release_task() call after the
final context switch to get all the stats added but this is
more complex to achieve.
In other words, this is slowing down the cputimer so it keep the same
pace than the process clock but in fact, what should be done is to
speed up the process clock by adding the missing stats to it.
Signed-off-by: Olivier Langlois <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..52d7b10 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cpuacct_charge(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
+ /*
+ * Do not update the cputimer if the task is already released by
+ * release_task().
+ *
+ * it would preferable to defer the autoreap release_task
+ * after the last context switch but harder to do.
+ */
+ if (likely(curtask->sighand))
+ account_group_exec_runtime(curtask, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
--
1.8.2.1
On Fri, 2013-04-26 at 22:15 -0400, KOSAKI Motohiro wrote:
> On Fri, Apr 26, 2013 at 9:51 PM, Olivier Langlois
> <[email protected]> wrote:
> > On Fri, 2013-04-26 at 15:08 -0400, KOSAKI Motohiro wrote:
> >> > I need to add that I can only confirm that to be true with
> >> > sum_exec_runtime.
> >> >
> >> > To affirm it to be true for stime and utime would require more
> >> > investigation. I didn't look them at all. I was only concerned with
> >> > sum_exec_runtime.
> >> >
> >> > I will prepare a v2 of the patch accounting all the feedbacks that I
> >> > received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
> >> > and send it back here for further discussion.
> >> >
> >> > Thank you very much all!
> >>
> >> Do you mean your utime test case still failure? If you share your test-case,
> >> I'm going to look at your issue too.
> >>
> > Sure with pleasure. My testcase is glibc-2.17/rt/tst-cputimer1.c
> >
> > That being said, it strictly test CPUCLOCK_SCHED timers. Hence my focus
> > when modifying the code was strictly on sum_exec_runtime.
> >
> > If utime and stime components of cputimer are moving faster than their
> > associated clock, this is something that I did not address.
>
> Hmm... Sorry. I'm confused. 1) I haven't seen any glibc test failure
> after applying
> my patch. 2) tst-cputimer1.c only have CLOCK_PROCESS_CPUTIME_ID test and
> don't have any utime, stime tests.
>
> Please let me know if you've seen any failure after applying my patch.
Basically it is to exclude deltas away from the cputimer initialization
value.
Yes I still have the failure. My patch can be broken into 3 elements.
1. fair.c - cputimer slow down
2. Add deltas to cputimer sample
3. Address race condition
If any of these 3 elements are missing, you get the failure.
That being said, it is not a systematic failure. It is intermittent. I
am wrapping the execution of the unittest into a simple perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $i = 0;
while (system("./tst-cputimer1") == 0) { ++$i; }
print "run $i successful iteration\n";
----------------------------------------
I get the failure after 10-300 iterations depending how I am
lucky/unlucky.
Also other factors to consider it is are you doing the test on a very
loaded system? What is your platform?
I have tested it positively on 32 bit, 64 bits build on Atom N450
i7 first and second generation system.
I did vary HZ from 300 to 1000 HZ, I tried the 3 three different
preemption models.
With all these combinations, I still have the problem.
Basically the timer error is usually < 500 uS so depending when the
process is called as a response to the timer event and this depends on
scheduler decisions, your failure rate will vary depending on the system
load IMO.
> 2) tst-cputimer1.c only have CLOCK_PROCESS_CPUTIME_ID test and
> don't have any utime, stime tests.
>
Sorry, I should take a couple minutes break before pressing send to be
sure that I have said everything :-)
CLOCK_PROCESS_CPUTIME_ID is user space name for kernel space
CPUCLOCK_SCHED clock.
and your are correct. It doesn't have any utime and stime test. Hence
this is why I am reticent to state anything about them when we discussed
about the relevence of update_gt_cputime() once we have the patch in.
Greetings,
Olivier
>
>
> Also other factors to consider it is are you doing the test on a very
> loaded system? What is your platform?
>
> I have tested it positively on 32 bit, 64 bits build on Atom N450
>
> i7 first and second generation system.
>
> I did vary HZ from 300 to 1000 HZ, I tried the 3 three different
> preemption models.
>
> With all these combinations, I still have the problem.
>
Kosaki,
Also clocksource accuracy may affect the result. Here are mines:
lano1106@hpmini ~ $
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
lano1106@Wailaba2 ~/dev/linux/src/linux-3.8/kernel $
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
>
>
> I did vary HZ from 300 to 1000 HZ, I tried the 3 three different
> preemption models.
>
> With all these combinations, I still have the problem.
>
Actually, I may have observed more failure with 1000 HZ and
CONFIG_PREEMPT=y (low-latency desktop)
Also make sure that High resolution timers support is enabled
(CONFIG_HIGH_RES_TIMERS).
2013/4/27 Olivier Langlois <[email protected]>:
>
>
> Forbids the cputimer to drift ahead of its process clock by
> blocking its update when a tick occurs while a autoreaping task
> is currently in do_exit() between the call to release_task() and
> its final call to schedule().
>
> Any task stats update after having called release_task() will
> be lost because they are added to the global process stats located
> in the signal struct from release_task().
>
> Ideally, you should postpone the release_task() call after the
> final context switch to get all the stats added but this is
> more complex to achieve.
>
> In other words, this is slowing down the cputimer so it keep the same
> pace than the process clock but in fact, what should be done is to
> speed up the process clock by adding the missing stats to it.
>
> Signed-off-by: Olivier Langlois <[email protected]>
Thanks.
Could you please resend these three patches in a new mail thread to
make reviews easier? Also it would be nice to propose a different
subject for each individual patch. Each of which describing what the
patch does in a few words.
Thanks.
(4/27/13 12:40 AM), Olivier Langlois wrote:
>
>
> Forbids the cputimer to drift ahead of its process clock by
> blocking its update when a tick occurs while a autoreaping task
> is currently in do_exit() between the call to release_task() and
> its final call to schedule().
>
> Any task stats update after having called release_task() will
> be lost because they are added to the global process stats located
> in the signal struct from release_task().
>
> Ideally, you should postpone the release_task() call after the
> final context switch to get all the stats added but this is
> more complex to achieve.
>
> In other words, this is slowing down the cputimer so it keep the same
> pace than the process clock but in fact, what should be done is to
> speed up the process clock by adding the missing stats to it.
>
> Signed-off-by: Olivier Langlois <[email protected]>
> ---
> kernel/sched/fair.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..52d7b10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
>
> trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> cpuacct_charge(curtask, delta_exec);
> - account_group_exec_runtime(curtask, delta_exec);
> + /*
> + * Do not update the cputimer if the task is already released by
> + * release_task().
> + *
> + * it would preferable to defer the autoreap release_task
> + * after the last context switch but harder to do.
> + */
> + if (likely(curtask->sighand))
> + account_group_exec_runtime(curtask, delta_exec);
> }
I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
no seen any issue in this accounting.
On Mon, 2013-04-29 at 01:06 -0400, KOSAKI Motohiro wrote:
> (4/27/13 12:40 AM), Olivier Langlois wrote:
> >
> >
> > Forbids the cputimer to drift ahead of its process clock by
> > blocking its update when a tick occurs while a autoreaping task
> > is currently in do_exit() between the call to release_task() and
> > its final call to schedule().
> >
> > Any task stats update after having called release_task() will
> > be lost because they are added to the global process stats located
> > in the signal struct from release_task().
> >
> > Ideally, you should postpone the release_task() call after the
> > final context switch to get all the stats added but this is
> > more complex to achieve.
> >
> > In other words, this is slowing down the cputimer so it keep the same
> > pace than the process clock but in fact, what should be done is to
> > speed up the process clock by adding the missing stats to it.
> >
> > Signed-off-by: Olivier Langlois <[email protected]>
> > ---
> > kernel/sched/fair.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a33e59..52d7b10 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
> >
> > trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> > cpuacct_charge(curtask, delta_exec);
> > - account_group_exec_runtime(curtask, delta_exec);
> > + /*
> > + * Do not update the cputimer if the task is already released by
> > + * release_task().
> > + *
> > + * it would preferable to defer the autoreap release_task
> > + * after the last context switch but harder to do.
> > + */
> > + if (likely(curtask->sighand))
> > + account_group_exec_runtime(curtask, delta_exec);
> > }
>
> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
> no seen any issue in this accounting.
glibc launch a helper thread to receive timer signal and will also
create a new thread upon signal reception when a timer is created with
sigev_notify = SIGEV_THREAD;
please see:
glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
Greetings,
Olivier
On Mon, 2013-04-29 at 02:45 +0200, Frederic Weisbecker wrote:
> 2013/4/27 Olivier Langlois <[email protected]>:
> >
> >
> > Forbids the cputimer to drift ahead of its process clock by
> > blocking its update when a tick occurs while a autoreaping task
> > is currently in do_exit() between the call to release_task() and
> > its final call to schedule().
> >
> > Any task stats update after having called release_task() will
> > be lost because they are added to the global process stats located
> > in the signal struct from release_task().
> >
> > Ideally, you should postpone the release_task() call after the
> > final context switch to get all the stats added but this is
> > more complex to achieve.
> >
> > In other words, this is slowing down the cputimer so it keep the same
> > pace than the process clock but in fact, what should be done is to
> > speed up the process clock by adding the missing stats to it.
> >
> > Signed-off-by: Olivier Langlois <[email protected]>
>
> Thanks.
>
> Could you please resend these three patches in a new mail thread to
> make reviews easier? Also it would be nice to propose a different
> subject for each individual patch. Each of which describing what the
> patch does in a few words.
>
> Thanks.
Frederic, Ok. I will do it and keep the description brief. If someone
inquire more details, I will make them refer back to specific post
inside this thread to avoid unnecessary repetition.
Is this cool?
On Mon, 2013-04-29 at 13:10 -0400, Olivier Langlois wrote:
> On Mon, 2013-04-29 at 01:06 -0400, KOSAKI Motohiro wrote:
> > (4/27/13 12:40 AM), Olivier Langlois wrote:
> > >
> > >
> > > Forbids the cputimer to drift ahead of its process clock by
> > > blocking its update when a tick occurs while a autoreaping task
> > > is currently in do_exit() between the call to release_task() and
> > > its final call to schedule().
> > >
> > > Any task stats update after having called release_task() will
> > > be lost because they are added to the global process stats located
> > > in the signal struct from release_task().
> > >
> > > Ideally, you should postpone the release_task() call after the
> > > final context switch to get all the stats added but this is
> > > more complex to achieve.
> > >
> > > In other words, this is slowing down the cputimer so it keep the same
> > > pace than the process clock but in fact, what should be done is to
> > > speed up the process clock by adding the missing stats to it.
> > >
> > > Signed-off-by: Olivier Langlois <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 7a33e59..52d7b10 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
> > >
> > > trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> > > cpuacct_charge(curtask, delta_exec);
> > > - account_group_exec_runtime(curtask, delta_exec);
> > > + /*
> > > + * Do not update the cputimer if the task is already released by
> > > + * release_task().
> > > + *
> > > + * it would preferable to defer the autoreap release_task
> > > + * after the last context switch but harder to do.
> > > + */
> > > + if (likely(curtask->sighand))
> > > + account_group_exec_runtime(curtask, delta_exec);
> > > }
> >
> > I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
> > no seen any issue in this accounting.
>
> glibc launch a helper thread to receive timer signal and will also
> create a new thread upon signal reception when a timer is created with
> sigev_notify = SIGEV_THREAD;
>
> please see:
>
> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
>
One very easy way to see the problem is to add a printk statement inside
update_gt_cputime()
if (b->sum_exec_runtime > a->sum_exec_runtime)
a->sum_exec_runtime = b->sum_exec_runtime;
else
printk( KERN_DEBUG "cputimer %llu, process clock %llu, diff %llu\n",
a->sum_exec_runtime, b->sum_exec_runtime,
a->sum_exec_runtime-b->sum_exec_runtime);
Check the output with and without the fair.c modif when executing
tst-cputimer1. As an extra bonus, this trace will show the spurious
start/stop cputimer problem that I was trying to explain to Frederic.
(4/29/13 1:10 PM), Olivier Langlois wrote:
> On Mon, 2013-04-29 at 01:06 -0400, KOSAKI Motohiro wrote:
>> (4/27/13 12:40 AM), Olivier Langlois wrote:
>>>
>>>
>>> Forbids the cputimer to drift ahead of its process clock by
>>> blocking its update when a tick occurs while a autoreaping task
>>> is currently in do_exit() between the call to release_task() and
>>> its final call to schedule().
>>>
>>> Any task stats update after having called release_task() will
>>> be lost because they are added to the global process stats located
>>> in the signal struct from release_task().
>>>
>>> Ideally, you should postpone the release_task() call after the
>>> final context switch to get all the stats added but this is
>>> more complex to achieve.
>>>
>>> In other words, this is slowing down the cputimer so it keep the same
>>> pace than the process clock but in fact, what should be done is to
>>> speed up the process clock by adding the missing stats to it.
>>>
>>> Signed-off-by: Olivier Langlois <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 10 +++++++++-
>>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a33e59..52d7b10 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>>
>>> trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>>> cpuacct_charge(curtask, delta_exec);
>>> - account_group_exec_runtime(curtask, delta_exec);
>>> + /*
>>> + * Do not update the cputimer if the task is already released by
>>> + * release_task().
>>> + *
>>> + * it would preferable to defer the autoreap release_task
>>> + * after the last context switch but harder to do.
>>> + */
>>> + if (likely(curtask->sighand))
>>> + account_group_exec_runtime(curtask, delta_exec);
>>> }
>>
>> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
>> no seen any issue in this accounting.
>
> glibc launch a helper thread to receive timer signal and will also
> create a new thread upon signal reception when a timer is created with
> sigev_notify = SIGEV_THREAD;
>
> please see:
>
> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
not thr[12].
>
> >> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
> >> no seen any issue in this accounting.
> >
> > glibc launch a helper thread to receive timer signal and will also
> > create a new thread upon signal reception when a timer is created with
> > sigev_notify = SIGEV_THREAD;
> >
> > please see:
> >
> > glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
> > glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
>
> I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
> not thr[12].
>
Apart from glibc helper thread, the threads created for handling timer
firing all do exit immediatly as soon as user callback returns.
I count 12 thread exits during tst-cputimer1 execution. The errors do
add up hence you're more likely to see errors after 2.5 sec and up from
start of execution. I have seen sig1, thr[12] fails. I see no reason why
one could not fail.
(4/29/13 2:20 PM), Olivier Langlois wrote:
>
>>
>>>> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
>>>> no seen any issue in this accounting.
>>>
>>> glibc launch a helper thread to receive timer signal and will also
>>> create a new thread upon signal reception when a timer is created with
>>> sigev_notify = SIGEV_THREAD;
>>>
>>> please see:
>>>
>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
>>
>> I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
>> not thr[12].
>>
> Apart from glibc helper thread, the threads created for handling timer
> firing all do exit immediatly as soon as user callback returns.
And, libc ensure its exiting finished before starting actual tests. Why such thread exiting
affect timers code? It shouldn't. becuase signal.cputimer is initialized timer_settime().
The initialization is incorrect, we should fix initialization.
> I count 12 thread exits during tst-cputimer1 execution. The errors do
> add up hence you're more likely to see errors after 2.5 sec and up from
> start of execution. I have seen sig1, thr[12] fails. I see no reason why
> one could not fail.
On Mon, 2013-04-29 at 14:31 -0400, KOSAKI Motohiro wrote:
> (4/29/13 2:20 PM), Olivier Langlois wrote:
> >
> >>
> >>>> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
> >>>> no seen any issue in this accounting.
> >>>
> >>> glibc launch a helper thread to receive timer signal and will also
> >>> create a new thread upon signal reception when a timer is created with
> >>> sigev_notify = SIGEV_THREAD;
> >>>
> >>> please see:
> >>>
> >>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
> >>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
> >>
> >> I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
> >> not thr[12].
> >>
> > Apart from glibc helper thread, the threads created for handling timer
> > firing all do exit immediatly as soon as user callback returns.
>
> And, libc ensure its exiting finished before starting actual tests. Why such thread exiting
> affect timers code? It shouldn't. becuase signal.cputimer is initialized timer_settime().
> The initialization is incorrect, we should fix initialization.
It doesn't have anything to do with initialisation.
Quick Quiz #1: How does the cputimer tick?
Answer: With calls to account_group_exec_runtime()
Every task updates occuring after release_task() has been called in
do_exit() (scheduler ticks or the task final schedule() call) will be
lost because tasks stats are added to the global group stats located in
the signal struct in release_task() So every update after release_task()
will be lost but account_group_exec_runtime is still called.
Tasks that go in zombie state are fine because release_task() will be
called later. Autoreap task (those with CLONE_THREAD ie: pthreads) calls
release_task() before the last context switch. Please do read
kernel/exit.c.
Hence cputimer advance faster than the process clock.
Hence the POSIX compliance from your pseudo code does not hold
sighandler(){
t1 = clock_gettime()
}
t0 = clock_gettime()
timer_settime(timeout);
... wait to fire
assert (t1 - t0 >= timeout)
>
>
> > I count 12 thread exits during tst-cputimer1 execution. The errors do
> > add up hence you're more likely to see errors after 2.5 sec and up from
> > start of execution. I have seen sig1, thr[12] fails. I see no reason why
> > one could not fail.
(4/29/13 2:54 PM), Olivier Langlois wrote:
> On Mon, 2013-04-29 at 14:31 -0400, KOSAKI Motohiro wrote:
>> (4/29/13 2:20 PM), Olivier Langlois wrote:
>>>
>>>>
>>>>>> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
>>>>>> no seen any issue in this accounting.
>>>>>
>>>>> glibc launch a helper thread to receive timer signal and will also
>>>>> create a new thread upon signal reception when a timer is created with
>>>>> sigev_notify = SIGEV_THREAD;
>>>>>
>>>>> please see:
>>>>>
>>>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
>>>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
>>>>
>>>> I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
>>>> not thr[12].
>>>>
>>> Apart from glibc helper thread, the threads created for handling timer
>>> firing all do exit immediatly as soon as user callback returns.
>>
>> And, libc ensure its exiting finished before starting actual tests. Why such thread exiting
>> affect timers code? It shouldn't. becuase signal.cputimer is initialized timer_settime().
>> The initialization is incorrect, we should fix initialization.
>
> It doesn't have anything to do with initialisation.
>
> Quick Quiz #1: How does the cputimer tick?
> Answer: With calls to account_group_exec_runtime()
Only account when cputimer->running. Quick Quiz: When turn on cputimer->running?
> Every task updates occuring after release_task() has been called in
> do_exit() (scheduler ticks or the task final schedule() call) will be
> lost because tasks stats are added to the global group stats located in
> the signal struct in release_task() So every update after release_task()
> will be lost but account_group_exec_runtime is still called.
tick lost doesn't occur an issue. because glibc only test posix conformance and
posix allow inacculacy. In other words, timer must not run faster than real clock.
but lost and makes slower are accepted in the test.
>
> Tasks that go in zombie state are fine because release_task() will be
> called later. Autoreap task (those with CLONE_THREAD ie: pthreads) calls
> release_task() before the last context switch. Please do read
> kernel/exit.c.
>
> Hence cputimer advance faster than the process clock.
Again the lost makes slower, not faster.
> Hence the POSIX compliance from your pseudo code does not hold
>
> sighandler(){
> t1 = clock_gettime()
> }
>
> t0 = clock_gettime()
> timer_settime(timeout);
> ... wait to fire
>
> assert (t1 - t0 >= timeout)
>
>>
>>
>>> I count 12 thread exits during tst-cputimer1 execution. The errors do
>>> add up hence you're more likely to see errors after 2.5 sec and up from
>>> start of execution. I have seen sig1, thr[12] fails. I see no reason why
>>> one could not fail.
I ran over 3000 times run this case. but I have no seen your described issue.
On Mon, 2013-04-29 at 15:09 -0400, KOSAKI Motohiro wrote:
> (4/29/13 2:54 PM), Olivier Langlois wrote:
> > On Mon, 2013-04-29 at 14:31 -0400, KOSAKI Motohiro wrote:
> >> (4/29/13 2:20 PM), Olivier Langlois wrote:
> >>>
> >>>>
> >>>>>> I'm confused. glibc's rt/tst-cputimer1 doesn't have thread exiting code. I have
> >>>>>> no seen any issue in this accounting.
> >>>>>
> >>>>> glibc launch a helper thread to receive timer signal and will also
> >>>>> create a new thread upon signal reception when a timer is created with
> >>>>> sigev_notify = SIGEV_THREAD;
> >>>>>
> >>>>> please see:
> >>>>>
> >>>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_create.c
> >>>>> glibc-2.17/nptl/sysdeps/unix/sysv/linux/timer_routines.c
> >>>>
> >>>> I know. I taled thread exiting. not thread creating. And, as far as I can see, only test sig1 can fail,
> >>>> not thr[12].
> >>>>
> >>> Apart from glibc helper thread, the threads created for handling timer
> >>> firing all do exit immediatly as soon as user callback returns.
> >>
> >> And, libc ensure its exiting finished before starting actual tests. Why such thread exiting
> >> affect timers code? It shouldn't. becuase signal.cputimer is initialized timer_settime().
> >> The initialization is incorrect, we should fix initialization.
> >
> > It doesn't have anything to do with initialisation.
> >
> > Quick Quiz #1: How does the cputimer tick?
> > Answer: With calls to account_group_exec_runtime()
>
> Only account when cputimer->running. Quick Quiz: When turn on cputimer->running?
>
>
> > Every task updates occuring after release_task() has been called in
> > do_exit() (scheduler ticks or the task final schedule() call) will be
> > lost because tasks stats are added to the global group stats located in
> > the signal struct in release_task() So every update after release_task()
> > will be lost but account_group_exec_runtime is still called.
>
> tick lost doesn't occur an issue. because glibc only test posix conformance and
> posix allow inacculacy. In other words, timer must not run faster than real clock.
> but lost and makes slower are accepted in the test.
>
What is lost isn't cputimer tick. They are accounted
account_group_exec_runtime(). What is lost it is what is added to
curr->sum_exec_runtime. Thus making the thread group clock running
slower than the cputimer.
Please spend some time reading the code and less time writing e-mails.
Read the code of release_task() and where it is called in do_exit().
Once it is done, it should be clear to you.
>> tick lost doesn't occur an issue. because glibc only test posix conformance and
>> posix allow inacculacy. In other words, timer must not run faster than real clock.
>> but lost and makes slower are accepted in the test.
>>
> What is lost isn't cputimer tick. They are accounted
> account_group_exec_runtime(). What is lost it is what is added to
> curr->sum_exec_runtime. Thus making the thread group clock running
> slower than the cputimer.
>
> Please spend some time reading the code and less time writing e-mails.
>
> Read the code of release_task() and where it is called in do_exit().
>
> Once it is done, it should be clear to you.
After several grepping, I may got your point. Guys, sig->sum_sched_runtime is gathered
in __exit_signal, not release_task() nor do_exit().
So, I'm now convinced your patch is correct. Even though the comments is terribuly misleading.