2013-05-26 21:36:03

by KOSAKI Motohiro

Subject: [PATCH v5 0/8] posix timers fixlet

From: KOSAKI Motohiro <[email protected]>

Glibc's posix timer testcase found a lot of bugs in posix timer code.
This series, hopefully, fixes all of them. All patches are logically
independent of each other.

Changes from v4
- [1/8] comment clarifications; fix account_group_{user,system}_time too.
- [8/8] added CPUCLOCK_VIRT and CPUCLOCK_PROF timer initialization fix

Changes from v3
- added a task_sched_runtime() micro-optimization that checks tsk->on_cpu,
as suggested by Paul Turner.
- fixed several typos in changelogs.


KOSAKI Motohiro (8):
posix-cpu-timers: don't account cpu timer after stopped thread
runtime accounting
posix-cpu-timers: fix accounting delta_exec twice
posix-cpu-timers: fix wrong timer initialization
posix-cpu-timers: timer functions should use timer time instead of
clock time
posix-cpu-timers: check_thread_timers() uses task_sched_runtime()
sched: task_sched_runtime introduce micro optimization
posix-cpu-timers: cleanup cpu_{clock,timer}_sample{,_group}
posix-cpu-timers: fix cputimer initialization mistake for {u,s}time

fs/binfmt_elf.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
include/linux/kernel_stat.h | 5 --
include/linux/sched.h | 5 +-
kernel/posix-cpu-timers.c | 163 ++++++++++++++++++++++++++-----------------
kernel/sched/core.c | 34 +++++----
kernel/sched/cputime.c | 17 ++++-
kernel/sched/stats.h | 33 ++++++++-
8 files changed, 167 insertions(+), 94 deletions(-)


2013-05-26 21:36:15

by KOSAKI Motohiro

Subject: [PATCH 1/8] posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting

From: KOSAKI Motohiro <[email protected]>

When tsk->signal->cputimer->running is 1, signal->cputimer (i.e. per-process
timer accounting) and tsk->sum_sched_runtime (i.e. per-thread timer accounting)
increase at the same pace, because update_curr() updates both.

However, there is one exception. When a thread is exiting, __exit_signal()
turns over the task's sum_sched_runtime to sig->sum_sched_runtime, but it
doesn't stop the signal->cputimer accounting.

This inconsistency can make a POSIX timer wake up too early. This patch fixes it.
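
For illustration only (not part of the patch), a hypothetical userspace sketch
of the scenario: a process-wide CPU timer is armed while threads repeatedly
run and exit, so each exit is a chance to hit the __exit_signal() window
described above. All names and constants below are made up; build with
gcc -pthread -lrt.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>

static volatile sig_atomic_t fired;

static void on_alarm(int sig)
{
        (void)sig;
        fired = 1;
}

/* Burn a little CPU, then exit; the exit hands the thread's runtime
 * over to sig->sum_sched_runtime. */
static void *worker(void *arg)
{
        volatile unsigned long i;

        (void)arg;
        for (i = 0; i < 20UL * 1000 * 1000; i++)
                ;
        return NULL;
}

int main(void)
{
        timer_t tid;
        pthread_t th;
        struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                .sigev_signo  = SIGALRM };
        struct itimerspec its = { .it_value = { 1, 0 } };  /* 1s of CPU time */
        struct timespec now;

        signal(SIGALRM, on_alarm);
        timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &tid);
        timer_settime(tid, 0, &its, NULL);

        /* Keep creating threads that exit while the group timer runs. */
        while (!fired) {
                pthread_create(&th, NULL, worker, NULL);
                pthread_join(th, NULL);
        }
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now);
        printf("timer fired at %ld.%09ld s of CPU time (expected >= 1s)\n",
               (long)now.tv_sec, now.tv_nsec);
        return 0;
}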

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Olivier Langlois <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/sched/stats.h | 33 ++++++++++++++++++++++++++++++---
1 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 2ef90a5..8a0567f 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -162,6 +162,33 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
*/

/**
+ * cputimer_running - return true if cputimer is running
+ *
+ * @tsk: Pointer to target task.
+ */
+static inline bool cputimer_running(struct task_struct *tsk)
+{
+ struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
+
+ if (!cputimer->running)
+ return false;
+
+ /*
+ * After __exit_signal() turns over se.sum_exec_runtime's accounting
+ * to sig->sum_sched_runtime, the per-thread cputime of the exiting
+ * thread is lost. We must not account exec_runtime here either,
+ * because per-thread and per-process cputime must stay consistent.
+ * Otherwise the inconsistency is observable when a single-threaded
+ * program runs.
+ */
+ if (unlikely(!tsk->sighand))
+ return false;
+
+ return true;
+}
+
+/**
* account_group_user_time - Maintain utime for a thread group.
*
* @tsk: Pointer to task structure.
@@ -176,7 +203,7 @@ static inline void account_group_user_time(struct task_struct *tsk,
{
struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

- if (!cputimer->running)
+ if (!cputimer_running(tsk))
return;

raw_spin_lock(&cputimer->lock);
@@ -199,7 +226,7 @@ static inline void account_group_system_time(struct task_struct *tsk,
{
struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

- if (!cputimer->running)
+ if (!cputimer_running(tsk))
return;

raw_spin_lock(&cputimer->lock);
@@ -222,7 +249,7 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
{
struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

- if (!cputimer->running)
+ if (!cputimer_running(tsk))
return;

raw_spin_lock(&cputimer->lock);
--
1.7.1

2013-05-26 21:36:34

by KOSAKI Motohiro

Subject: [PATCH 2/8] posix-cpu-timers: fix accounting delta_exec twice

From: KOSAKI Motohiro <[email protected]>

Currently the glibc rt/tst-cpuclock2 test(*) sporadically fails because
the scheduler delta can be accounted twice, from thread_group_cputimer()
and account_group_exec_runtime().

As a result, clock_nanosleep() can wake up earlier than the requested
timeout, which is a POSIX violation. This issue was introduced by commit
d670ec1317 (posix-cpu-timers: Cure SMP wobbles).

(*) http://sourceware.org/git/?p=glibc.git;a=blob;f=rt/tst-cpuclock2.c;h=6752721717f959e89c0d692b3f1ee082d507eec2;hb=HEAD
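
For illustration, a minimal sketch in the spirit of that test (a hypothetical
simplification, not the actual glibc source): sleep on another thread's CPU
clock, which Linux supports via pthread_getcpuclockid(), and check how far the
clock really advanced. Build with gcc -pthread -lrt.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Spin so the target thread's CPU clock keeps advancing. */
static void *burn(void *arg)
{
        (void)arg;
        for (;;)
                ;
}

int main(void)
{
        pthread_t th;
        clockid_t cid;
        struct timespec before, after;
        struct timespec sleeptime = { 0, 100 * 1000 * 1000 };  /* 100ms */

        pthread_create(&th, NULL, burn, NULL);
        pthread_getcpuclockid(th, &cid);

        clock_gettime(cid, &before);
        clock_nanosleep(cid, 0, &sleeptime, NULL);  /* relative CPU-time sleep */
        clock_gettime(cid, &after);

        /* With the delta accounted twice, this sleep could end before the
         * thread's CPU clock had advanced by the full 100ms. */
        printf("thread CPU clock advanced by %.6f s\n",
               (double)(after.tv_sec - before.tv_sec) +
               (after.tv_nsec - before.tv_nsec) / 1e9);
        return 0;
}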

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/binfmt_elf.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
include/linux/sched.h | 4 ++--
kernel/posix-cpu-timers.c | 15 ++++++++++-----
kernel/sched/core.c | 6 ++++--
kernel/sched/cputime.c | 6 +++---
6 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index f8a0b0e..9f7e13a 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1326,7 +1326,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
* This is the record for the group leader. It shows the
* group-wide total, not its individual thread total.
*/
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
} else {
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index c166f32..fbcaeeb 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1368,7 +1368,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
* This is the record for the group leader. It shows the
* group-wide total, not its individual thread total.
*/
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
} else {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..4878579 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1841,7 +1841,7 @@ static inline void disable_sched_clock_irqtime(void) {}
#endif

extern unsigned long long
-task_sched_runtime(struct task_struct *task);
+task_sched_runtime(struct task_struct *task, bool add_delta);

/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
@@ -2502,7 +2502,7 @@ static inline void current_clr_polling(void) { }
/*
* Thread group CPU time accounting.
*/
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+void thread_group_cputime(struct task_struct *tsk, bool add_delta, struct task_cputime *times);
void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);

static inline void thread_group_cputime_init(struct signal_struct *sig)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 42670e9..25447c5 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -237,7 +237,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = task_sched_runtime(p);
+ cpu->sched = task_sched_runtime(p, true);
break;
}
return 0;
@@ -267,8 +267,13 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
* values through the TIMER_ABSTIME flag, therefore we have
* to synchronize the timer to the clock every time we start
* it.
+ *
+ * Do not add the current delta, because
+ * account_group_exec_runtime() will also account this delta
+ * and we wouldn't want to double account time and get ahead
+ * of ourselves.
*/
- thread_group_cputime(tsk, &sum);
+ thread_group_cputime(tsk, false, &sum);
raw_spin_lock_irqsave(&cputimer->lock, flags);
cputimer->running = 1;
update_gt_cputime(&cputimer->cputime, &sum);
@@ -292,15 +297,15 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
default:
return -EINVAL;
case CPUCLOCK_PROF:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime + cputime.stime;
break;
case CPUCLOCK_VIRT:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cpu->sched = cputime.sum_exec_runtime;
break;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..a1e823b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2699,14 +2699,16 @@ unsigned long long task_delta_exec(struct task_struct *p)
* In case the task is currently running, return the runtime plus current's
* pending runtime that have not been accounted yet.
*/
-unsigned long long task_sched_runtime(struct task_struct *p)
+unsigned long long task_sched_runtime(struct task_struct *p, bool add_delta)
{
unsigned long flags;
struct rq *rq;
u64 ns = 0;

rq = task_rq_lock(p, &flags);
- ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+ ns = p->se.sum_exec_runtime;
+ if (add_delta)
+ ns += do_task_delta_exec(p, rq);
task_rq_unlock(rq, p, &flags);

return ns;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index cc2dc3e..3ca432c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -277,7 +277,7 @@ static __always_inline bool steal_account_process_tick(void)
* Accumulate raw cputime values of dead tasks (sig->[us]time) and live
* tasks (sum on group iteration) belonging to @tsk's group.
*/
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+void thread_group_cputime(struct task_struct *tsk, bool add_delta, struct task_cputime *times)
{
struct signal_struct *sig = tsk->signal;
cputime_t utime, stime;
@@ -297,7 +297,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
task_cputime(t, &utime, &stime);
times->utime += utime;
times->stime += stime;
- times->sum_exec_runtime += task_sched_runtime(t);
+ times->sum_exec_runtime += task_sched_runtime(t, add_delta);
} while_each_thread(tsk, t);
out:
rcu_read_unlock();
@@ -628,7 +628,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
{
struct task_cputime cputime;

- thread_group_cputime(p, &cputime);
+ thread_group_cputime(p, true, &cputime);
cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
--
1.7.1

2013-05-26 21:36:57

by KOSAKI Motohiro

Subject: [PATCH 3/8] posix-cpu-timers: fix wrong timer initialization

From: KOSAKI Motohiro <[email protected]>

Currently glibc's rt/tst-cputimer1 testcase sporadically fails because
a timer created by timer_create() may fire earlier than specified.

posix_cpu_timer_set() uses "val" as the current time for three purposes:
1) initializing sig->cputimer, 2) calculating the "old" value, and
3) calculating the expiry.

(1) and (2) should use only committed time (i.e. without delta_exec),
because run_posix_cpu_timers() doesn't account for delta_exec and we need
consistency. But (3) needs the exact current time (i.e. CPU clock time),
because an expiry should be "now + timeout" by definition.

This patch distinguishes between two kinds of "now".
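
For example (hypothetical numbers): suppose the committed runtime is 100ms,
the uncommitted delta_exec is 3ms, and the requested timeout is 50ms. Arming
the timer with the committed time gives expires = 150ms; since
run_posix_cpu_timers() also compares committed time, the timer can fire after
only 47ms of real CPU time. Arming with the clock time (103ms) gives
expires = 153ms, which guarantees at least the requested 50ms.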

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/kernel_stat.h | 5 -----
kernel/posix-cpu-timers.c | 14 ++++++++++++--
kernel/sched/core.c | 13 -------------
3 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index ed5f6ed..f5d4fdf 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -117,11 +117,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
return kstat_cpu(cpu).irqs_sum;
}

-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
-
extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
extern void account_steal_time(cputime_t);
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 25447c5..d068808 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -652,7 +652,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
- cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
+ cpu->sched = cputime.sum_exec_runtime;
break;
}
return 0;
@@ -797,7 +797,17 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
}

if (new_expires.sched != 0 && !(flags & TIMER_ABSTIME)) {
- cpu_time_add(timer->it_clock, &new_expires, val);
+ union cpu_time_count now;
+
+ /*
+ * The expiry is "now + timeout" by definition, so
+ * we need the exact current time.
+ */
+ if (CPUCLOCK_PERTHREAD(timer->it_clock))
+ now = val;
+ else
+ cpu_clock_sample_group(timer->it_clock, p, &now);
+ cpu_time_add(timer->it_clock, &new_expires, now);
}

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a1e823b..96512e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2681,19 +2681,6 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
return ns;
}

-unsigned long long task_delta_exec(struct task_struct *p)
-{
- unsigned long flags;
- struct rq *rq;
- u64 ns = 0;
-
- rq = task_rq_lock(p, &flags);
- ns = do_task_delta_exec(p, rq);
- task_rq_unlock(rq, p, &flags);
-
- return ns;
-}
-
/*
* Return accounted runtime for the task.
* In case the task is currently running, return the runtime plus current's
--
1.7.1

2013-05-26 21:37:10

by KOSAKI Motohiro

Subject: [PATCH 4/8] posix-cpu-timers: timer functions should use timer time instead of clock time

From: KOSAKI Motohiro <[email protected]>

For process timers, we use cpu_clock_sample_group() and cpu_timer_sample_group()
correctly. However, for thread timers, we always use cpu_clock_sample(). This is
wrong, because cpu_clock_sample() accounts the uncommitted delta_exec too, which
is inconsistent with run_posix_cpu_timers().

This is not a big problem, because the following workaround code in
posix_cpu_timer_get() hides the timer miscalculation. However, it causes the
rq lock to be taken unnecessarily, and it would be better to fix it.

	if (cpu_time_before(timer->it_clock, now, timer->it.cpu.expires)) {
		...
	} else {
		/*
		 * The timer should have expired already, but the firing
		 * hasn't taken place yet. Say it's just about to expire.
		 */
		itp->it_value.tv_nsec = 1;
		itp->it_value.tv_sec = 0;
	}

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/posix-cpu-timers.c | 37 +++++++++++++++++++++++++++----------
1 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index d068808..1b33c77 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -221,11 +221,10 @@ posix_cpu_clock_set(const clockid_t which_clock, const struct timespec *tp)
}


-/*
- * Sample a per-thread clock for the given task.
- */
-static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
- union cpu_time_count *cpu)
+static int do_cpu_clock_timer_sample(const clockid_t which_clock,
+ struct task_struct *p,
+ bool add_delta,
+ union cpu_time_count *cpu)
{
switch (CPUCLOCK_WHICH(which_clock)) {
default:
@@ -237,12 +236,30 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = task_sched_runtime(p, true);
+ cpu->sched = task_sched_runtime(p, add_delta);
break;
}
return 0;
}

+/*
+ * Sample a per-thread clock for the given task.
+ */
+static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, true, cpu);
+}
+
+/*
+ * Sample a per-thread timer clock for the given task.
+ */
+static int cpu_timer_sample(const clockid_t which_clock, struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, false, cpu);
+}
+
static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
{
if (b->utime > a->utime)
@@ -748,7 +765,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
* check if it's already passed. In short, we need a sample.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
- cpu_clock_sample(timer->it_clock, p, &val);
+ cpu_timer_sample(timer->it_clock, p, &val);
} else {
cpu_timer_sample_group(timer->it_clock, p, &val);
}
@@ -804,7 +821,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
* we need exact current time.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock))
- now = val;
+ cpu_clock_sample(timer->it_clock, p, &now);
else
cpu_clock_sample_group(timer->it_clock, p, &now);
cpu_time_add(timer->it_clock, &new_expires, now);
@@ -894,7 +911,7 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
* Sample the clock to take the difference with the expiry time.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
- cpu_clock_sample(timer->it_clock, p, &now);
+ cpu_timer_sample(timer->it_clock, p, &now);
clear_dead = p->exit_state;
} else {
read_lock(&tasklist_lock);
@@ -1203,7 +1220,7 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
* Fetch the current sample and update the timer's expiry time.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
- cpu_clock_sample(timer->it_clock, p, &now);
+ cpu_timer_sample(timer->it_clock, p, &now);
bump_cpu_timer(timer, now);
if (unlikely(p->exit_state)) {
clear_dead_task(timer, now);
--
1.7.1

2013-05-26 21:37:24

by KOSAKI Motohiro

Subject: [PATCH 5/8] posix-cpu-timers: check_thread_timers() uses task_sched_runtime()

From: KOSAKI Motohiro <[email protected]>

The type of tsk->se.sum_exec_runtime is u64, so reading it is racy on
32-bit. We should use task_sched_runtime() instead.
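
For illustration only, a hypothetical userspace sketch (not kernel code) of
why a plain 64-bit read is unsafe on 32-bit: the load splits into two 32-bit
loads, so a reader racing with a writer can observe a torn value. The
deliberate data race is the point of the demonstration; build with
gcc -pthread (the tear is only observable on 32-bit builds).

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t counter;

/* Monotonic writer; every add carries into the high 32-bit word. */
static void *writer(void *arg)
{
        (void)arg;
        for (;;)
                counter += 0xffffffffULL;
}

int main(void)
{
        pthread_t th;
        unsigned long n;

        pthread_create(&th, NULL, writer, NULL);
        for (n = 0; n < 100000000UL; n++) {
                uint64_t a = counter;
                uint64_t b = counter;  /* b is read after a, so b >= a... */

                if (a > b) {           /* ...unless one of the reads was torn */
                        printf("torn 64-bit read: a=%llu b=%llu\n",
                               (unsigned long long)a, (unsigned long long)b);
                        return 1;
                }
        }
        printf("no tear observed (expected on 64-bit)\n");
        return 0;
}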

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/posix-cpu-timers.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 1b33c77..62466d4 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1008,7 +1008,8 @@ static void check_thread_timers(struct task_struct *tsk,
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
- if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
+ unsigned long long runtime = task_sched_runtime(tsk, false);
+ if (!--maxfire || runtime < t->expires.sched) {
tsk->cputime_expires.sched_exp = t->expires.sched;
break;
}
--
1.7.1

2013-05-26 21:37:34

by KOSAKI Motohiro

Subject: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

From: KOSAKI Motohiro <[email protected]>

The rq lock in task_sched_runtime() is necessary for two reasons: 1)
accessing se.sum_exec_runtime is not atomic on 32-bit, and 2)
do_task_delta_exec() requires it.

So, on 64-bit, we can avoid holding the rq lock when add_delta is false
or when the task is not on a CPU (i.e. delta_exec is 0).

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Suggested-by: Paul Turner <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/sched/core.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96512e9..0f859cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2692,6 +2692,21 @@ unsigned long long task_sched_runtime(struct task_struct *p, bool add_delta)
struct rq *rq;
u64 ns = 0;

+#ifdef CONFIG_64BIT
+ /*
+ * 64-bit doesn't need locks to atomically read a 64bit value. So we
+ * have two optimization chances, 1) when caller doesn't need
+ * delta_exec and 2) when the task's delta_exec is 0. The former is
+ * obvious. The latter is complicated. reading ->on_cpu is racy, but
+ * this is ok. If we race with it leaving cpu, we'll take a lock. So
+ * we're correct. If we race with it entering cpu, unaccounted time
+ * is 0. This is indistinguishable from the read occurring a few
+ * cycles earlier.
+ */
+ if (!add_delta || !p->on_cpu)
+ return p->se.sum_exec_runtime;
+#endif
+
rq = task_rq_lock(p, &flags);
ns = p->se.sum_exec_runtime;
if (add_delta)
--
1.7.1

2013-05-26 21:37:46

by KOSAKI Motohiro

Subject: [PATCH 7/8] posix-cpu-timers: cleanup cpu_{clock,timer}_sample{,_group}

From: KOSAKI Motohiro <[email protected]>

Now we have four similar timer-related functions: cpu_clock_sample(),
cpu_clock_sample_group(), cpu_timer_sample() and cpu_timer_sample_group().

For readability, introduce the do_cpu_clock_timer_sample() and thread_cputime()
helper functions and implement all the *_sample functions on top of them.

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/sched.h | 1 +
kernel/posix-cpu-timers.c | 132 +++++++++++++++++++-------------------------
kernel/sched/cputime.c | 11 ++++
3 files changed, 69 insertions(+), 75 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4878579..ba36321 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2458,6 +2458,7 @@ static inline int spin_needbreak(spinlock_t *lock)
#endif
}

+void thread_cputime(struct task_struct *tsk, bool add_delta, struct task_cputime *times);
/*
* Idle thread specific functions to determine the need_resched
* polling state. We have two versions, one based on TS_POLLING in
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 62466d4..34bb3f1 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -220,46 +220,6 @@ posix_cpu_clock_set(const clockid_t which_clock, const struct timespec *tp)
return error;
}

-
-static int do_cpu_clock_timer_sample(const clockid_t which_clock,
- struct task_struct *p,
- bool add_delta,
- union cpu_time_count *cpu)
-{
- switch (CPUCLOCK_WHICH(which_clock)) {
- default:
- return -EINVAL;
- case CPUCLOCK_PROF:
- cpu->cpu = prof_ticks(p);
- break;
- case CPUCLOCK_VIRT:
- cpu->cpu = virt_ticks(p);
- break;
- case CPUCLOCK_SCHED:
- cpu->sched = task_sched_runtime(p, add_delta);
- break;
- }
- return 0;
-}
-
-/*
- * Sample a per-thread clock for the given task.
- */
-static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
- union cpu_time_count *cpu)
-{
- return do_cpu_clock_timer_sample(which_clock, p, true, cpu);
-}
-
-/*
- * Sample a per-thread timer clock for the given task.
- */
-static int cpu_timer_sample(const clockid_t which_clock, struct task_struct *p,
- union cpu_time_count *cpu)
-{
- return do_cpu_clock_timer_sample(which_clock, p, false, cpu);
-}
-
static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
{
if (b->utime > a->utime)
@@ -301,34 +261,83 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
}

/*
- * Sample a process (thread group) clock for the given group_leader task.
- * Must be called with tasklist_lock held for reading.
+ * Sample time for the given task.
+ * @which_clock: Clock id.
+ * @p: Target task. Must be thread group leader when
+ * thread_group is true.
+ * @thread_group: True to get process (thread group) time.
+ * @clock_time: True to get clock time (including the uncommitted
+ * delta); false to get timer time.
*/
-static int cpu_clock_sample_group(const clockid_t which_clock,
- struct task_struct *p,
- union cpu_time_count *cpu)
+static int do_cpu_clock_timer_sample(const clockid_t which_clock,
+ struct task_struct *p,
+ bool thread_group,
+ bool clock_time,
+ union cpu_time_count *cpu)
{
struct task_cputime cputime;

+ if (thread_group) {
+ if (clock_time)
+ thread_group_cputime(p, true, &cputime);
+ else
+ thread_group_cputimer(p, &cputime);
+ } else
+ thread_cputime(p, clock_time, &cputime);
+
switch (CPUCLOCK_WHICH(which_clock)) {
default:
return -EINVAL;
case CPUCLOCK_PROF:
- thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime + cputime.stime;
break;
case CPUCLOCK_VIRT:
- thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
- thread_group_cputime(p, true, &cputime);
cpu->sched = cputime.sum_exec_runtime;
break;
}
return 0;
}

+/*
+ * Sample a per-thread clock time for the given task.
+ */
+static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, false, true, cpu);
+}
+
+/*
+ * Sample a per-process clock time for the given task.
+ */
+static int cpu_clock_sample_group(const clockid_t which_clock,
+ struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, true, true, cpu);
+}
+
+/*
+ * Sample a per-thread timer time for the given task.
+ */
+static int cpu_timer_sample(const clockid_t which_clock, struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, false, false, cpu);
+}
+
+/*
+ * Sample a per-process timer time for the given task.
+ */
+static int cpu_timer_sample_group(const clockid_t which_clock,
+ struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ return do_cpu_clock_timer_sample(which_clock, p, true, false, cpu);
+}

static int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
{
@@ -648,33 +657,6 @@ static void cpu_timer_fire(struct k_itimer *timer)
}
}

-/*
- * Sample a process (thread group) timer for the given group_leader task.
- * Must be called with tasklist_lock held for reading.
- */
-static int cpu_timer_sample_group(const clockid_t which_clock,
- struct task_struct *p,
- union cpu_time_count *cpu)
-{
- struct task_cputime cputime;
-
- thread_group_cputimer(p, &cputime);
- switch (CPUCLOCK_WHICH(which_clock)) {
- default:
- return -EINVAL;
- case CPUCLOCK_PROF:
- cpu->cpu = cputime.utime + cputime.stime;
- break;
- case CPUCLOCK_VIRT:
- cpu->cpu = cputime.utime;
- break;
- case CPUCLOCK_SCHED:
- cpu->sched = cputime.sum_exec_runtime;
- break;
- }
- return 0;
-}
-
#ifdef CONFIG_NO_HZ_FULL
static void nohz_kick_work_fn(struct work_struct *work)
{
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3ca432c..53a7de8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -273,6 +273,17 @@ static __always_inline bool steal_account_process_tick(void)
return false;
}

+void thread_cputime(struct task_struct *tsk, bool add_delta, struct task_cputime *times)
+{
+ struct signal_struct *sig = tsk->signal;
+ cputime_t utime, stime;
+
+ task_cputime(tsk, &utime, &stime);
+ times->utime = utime;
+ times->stime = stime;
+ times->sum_exec_runtime = task_sched_runtime(tsk, add_delta);
+}
+
/*
* Accumulate raw cputime values of dead tasks (sig->[us]time) and live
* tasks (sum on group iteration) belonging to @tsk's group.
--
1.7.1

2013-05-26 21:37:56

by KOSAKI Motohiro

Subject: [PATCH 8/8] posix-cpu-timers: fix cputimer initialization mistake for {u,s}time

From: KOSAKI Motohiro <[email protected]>

When using CPUCLOCK_VIRT or CPUCLOCK_PROF, we need to round the expiry up,
because jiffies-based accounting may lag by up to one jiffy. So, the current
code can wake a POSIX timer up too early.

This patch adds one jiffy at timer initialization. itimer already has a
similar trick, and it matches userland assumptions.
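
For example (hypothetical numbers, HZ=100, so one jiffy = 10ms): arm a 50ms
CPUCLOCK_PROF timer when the thread has run 9ms since the last tick charged
it. The sampled "now" lags the real CPU consumption by those 9ms, so the
timer can fire after only about 41ms of real CPU time. Rounding the expiry
up by one jiffy restores the guarantee of at least 50ms.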

Luckily, CPUCLOCK_VIRT and CPUCLOCK_PROF POSIX timers are undocumented,
left undefined by the standard, and have no libc support, so maybe nobody
uses them. Still, it would be nice to fix this if possible.

Cc: Olivier Langlois <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/posix-cpu-timers.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 34bb3f1..42874c2 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -339,6 +339,18 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
return do_cpu_clock_timer_sample(which_clock, p, true, false, cpu);
}

+static union cpu_time_count cpu_clock_resolution(clockid_t cpuid)
+{
+ union cpu_time_count ret;
+
+ if (CPUCLOCK_WHICH(cpuid) == CPUCLOCK_SCHED)
+ ret.sched = 1;
+ else
+ ret.cpu = cputime_one_jiffy;
+
+ return ret;
+}
+
static int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
{
const pid_t pid = CPUCLOCK_PID(which_clock);
@@ -806,7 +818,15 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
cpu_clock_sample(timer->it_clock, p, &now);
else
cpu_clock_sample_group(timer->it_clock, p, &now);
+
cpu_time_add(timer->it_clock, &new_expires, now);
+
+ /*
+ * For an inaccurate clock, we need to round the expiry time up
+ * because "now" may point slightly into the past.
+ */
+ cpu_time_add(timer->it_clock, &new_expires,
+ cpu_clock_resolution(timer->it_clock));
}

/*
--
1.7.1

2013-05-31 16:26:58

by KOSAKI Motohiro

Subject: Re: [PATCH v5 0/8] posix timers fixlet

(5/26/13 5:35 PM), [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Glibc's posix timer testcase found a lot of bugs in posix timer code.
> This series, hopefully, fixes all of them. All patches are logically
> independent of each other.
>
> Changes from v4
> - [1/8] comment clarifications; fix account_group_{user,system}_time too.
> - [8/8] added CPUCLOCK_VIRT and CPUCLOCK_PROF timer initialization fix

Ping?

Thomas, Could you pick them up into your tree?

2013-06-03 19:07:39

by Frederic Weisbecker

Subject: Re: [PATCH 2/8] posix-cpu-timers: fix accounting delta_exec twice

On Sun, May 26, 2013 at 05:35:42PM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently the glibc rt/tst-cpuclock2 test(*) sporadically fails because
> the scheduler delta can be accounted twice, from thread_group_cputimer()
> and account_group_exec_runtime().
>
> As a result, clock_nanosleep() can wake up earlier than the requested
> timeout, which is a POSIX violation. This issue was introduced by commit
> d670ec1317 (posix-cpu-timers: Cure SMP wobbles).
>
> (*) http://sourceware.org/git/?p=glibc.git;a=blob;f=rt/tst-cpuclock2.c;h=6752721717f959e89c0d692b3f1ee082d507eec2;hb=HEAD
>
> Cc: Olivier Langlois <[email protected]>
> Cc: Thomas Gleixner <[email protected]>

So with this patch the timer sample can be (nr_threads * TICK_NSEC) nsecs behind the
clock in the worst case.

This is still better than possibly being (nr_threads * TICK_NSEC) nsecs ahead of the
clock in the worst case, like we had before the patch :)

But the worst case can be unbounded with full dynticks tasks. Maybe the ideal thing
would be to perform an update_curr() on the threads before we read their sum_exec_runtime.
This way we make sure we are never far behind.

This needs to be done on top of your change anyway because we still want to avoid
the double count, so: Acked-by: Frederic Weisbecker <[email protected]>

Thanks.

2013-06-18 14:20:59

by Frederic Weisbecker

Subject: Re: [PATCH 3/8] posix-cpu-timers: fix wrong timer initialization

On Sun, May 26, 2013 at 05:35:44PM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently glibc's rt/tst-cputimer1 testcase sporadically fails because
> a timer created by timer_create() may fire earlier than specified.
>
> posix_cpu_timer_set() uses "val" as the current time for three purposes:
> 1) initializing sig->cputimer, 2) calculating the "old" value, and
> 3) calculating the expiry.
>
> (1) and (2) should use only committed time (i.e. without delta_exec),
> because run_posix_cpu_timers() doesn't account for delta_exec and we need
> consistency. But (3) needs the exact current time (i.e. CPU clock time),
> because an expiry should be "now + timeout" by definition.
>
> This patch distinguishes between two kinds of "now".
>
> Cc: Olivier Langlois <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> include/linux/kernel_stat.h | 5 -----
> kernel/posix-cpu-timers.c | 14 ++++++++++++--
> kernel/sched/core.c | 13 -------------
> 3 files changed, 12 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index ed5f6ed..f5d4fdf 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -117,11 +117,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
> return kstat_cpu(cpu).irqs_sum;
> }
>
> -/*
> - * Lock/unlock the current runqueue - to extract task statistics:
> - */
> -extern unsigned long long task_delta_exec(struct task_struct *);
> -
> extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
> extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
> extern void account_steal_time(cputime_t);
> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 25447c5..d068808 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -652,7 +652,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
> cpu->cpu = cputime.utime;
> break;
> case CPUCLOCK_SCHED:
> - cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
> + cpu->sched = cputime.sum_exec_runtime;

Are you sure that all callers of cpu_timer_sample_group() are fine with that change?
Looking at set_process_cpu_timer() it seems we want the committed time as well to
be added on newval. For the same reasons we use cpu_clock_sample_group() in (3) here.

2013-06-18 14:27:52

by Frederic Weisbecker

Subject: Re: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

On Sun, May 26, 2013 at 05:35:47PM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> The rq lock in task_sched_runtime() is necessary for two reasons: 1)
> accessing se.sum_exec_runtime is not atomic on 32-bit, and 2)
> do_task_delta_exec() requires it.
>
> So, on 64-bit, we can avoid holding the rq lock when add_delta is false
> or when the task is not on a CPU (i.e. delta_exec is 0).
>
> Cc: Olivier Langlois <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Suggested-by: Paul Turner <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> kernel/sched/core.c | 15 +++++++++++++++
> 1 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96512e9..0f859cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2692,6 +2692,21 @@ unsigned long long task_sched_runtime(struct task_struct *p, bool add_delta)
> struct rq *rq;
> u64 ns = 0;
>
> +#ifdef CONFIG_64BIT
> + /*
> + * 64-bit doesn't need locks to atomically read a 64bit value. So we
> + * have two optimization chances, 1) when caller doesn't need
> + * delta_exec and 2) when the task's delta_exec is 0. The former is
> + * obvious. The latter is complicated. reading ->on_cpu is racy, but
> + * this is ok. If we race with it leaving cpu, we'll take a lock. So
> + * we're correct. If we race with it entering cpu, unaccounted time
> + * is 0. This is indistinguishable from the read occurring a few
> + * cycles earlier.
> + */
> + if (!add_delta || !p->on_cpu)
> + return p->se.sum_exec_runtime;

I'm not sure this is correct from an smp ordering POV. p->on_cpu may appear
to be 0 whereas the task is actually running for a while and p->se.sum_exec_runtime
can then be past the actual value on the remote CPU.

> +#endif
> +
> rq = task_rq_lock(p, &flags);
> ns = p->se.sum_exec_runtime;
> if (add_delta)
> --
> 1.7.1
>

2013-06-18 15:12:39

by KOSAKI Motohiro

Subject: Re: [PATCH 3/8] posix-cpu-timers: fix wrong timer initialization

On Tue, Jun 18, 2013 at 10:20 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Sun, May 26, 2013 at 05:35:44PM -0400, [email protected] wrote:
>> From: KOSAKI Motohiro <[email protected]>
>>
>> Currently glibc's rt/tst-cputimer1 testcase sporadically fails because
>> a timer created by timer_create() may fire earlier than specified.
>>
>> posix_cpu_timer_set() uses "val" as the current time for three purposes:
>> 1) initializing sig->cputimer, 2) calculating the "old" value, and
>> 3) calculating the expiry.
>>
>> (1) and (2) should use only committed time (i.e. without delta_exec),
>> because run_posix_cpu_timers() doesn't account for delta_exec and we need
>> consistency. But (3) needs the exact current time (i.e. CPU clock time),
>> because an expiry should be "now + timeout" by definition.
>>
>> This patch distinguishes between two kinds of "now".
>>
>> Cc: Olivier Langlois <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Acked-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> ---
>> include/linux/kernel_stat.h | 5 -----
>> kernel/posix-cpu-timers.c | 14 ++++++++++++--
>> kernel/sched/core.c | 13 -------------
>> 3 files changed, 12 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
>> index ed5f6ed..f5d4fdf 100644
>> --- a/include/linux/kernel_stat.h
>> +++ b/include/linux/kernel_stat.h
>> @@ -117,11 +117,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
>> return kstat_cpu(cpu).irqs_sum;
>> }
>>
>> -/*
>> - * Lock/unlock the current runqueue - to extract task statistics:
>> - */
>> -extern unsigned long long task_delta_exec(struct task_struct *);
>> -
>> extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
>> extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
>> extern void account_steal_time(cputime_t);
>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> index 25447c5..d068808 100644
>> --- a/kernel/posix-cpu-timers.c
>> +++ b/kernel/posix-cpu-timers.c
>> @@ -652,7 +652,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
>> cpu->cpu = cputime.utime;
>> break;
>> case CPUCLOCK_SCHED:
>> - cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
>> + cpu->sched = cputime.sum_exec_runtime;
>
> Are you sure that all callers of cpu_timer_sample_group() are fine with that change?

Now, cpu_timer_sample_group() is used from the following four places:

posix_cpu_timer_set(): for timer initialization
posix_cpu_timer_get(): for timer_gettime(2)
posix_cpu_timer_schedule(): timer firing
set_process_cpu_timer(): for itimer

I think all of them are safe because, the point is, the timer firing
procedures (check_thread_timers() and check_process_timers()) don't care
about the uncommitted delta. So the other timer functions need to use the
same notion of time; otherwise the inconsistency leaks to userland sooner
or later.

The other solution would be for check_{thread,process}_timers() to take
all the rq locks and use accurate time. However, of course, that would be
a big performance hit, so I don't want to go that way.


> Looking at set_process_cpu_timer() it seems we want the committed time as well to
> be added on newval. For the same reasons we use cpu_clock_sample_group() in (3) here.

Sorry, I haven't caught your point. Could you elaborate more?

2013-06-18 15:18:05

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

>> +#ifdef CONFIG_64BIT
>> + /*
>> + * 64-bit doesn't need locks to atomically read a 64bit value. So we
>> + * have two optimization chances, 1) when caller doesn't need
>> + * delta_exec and 2) when the task's delta_exec is 0. The former is
>> + * obvious. The latter is complicated. reading ->on_cpu is racy, but
>> + * this is ok. If we race with it leaving cpu, we'll take a lock. So
>> + * we're correct. If we race with it entering cpu, unaccounted time
>> + * is 0. This is indistinguishable from the read occurring a few
>> + * cycles earlier.
>> + */
>> + if (!add_delta || !p->on_cpu)
>> + return p->se.sum_exec_runtime;
>
> I'm not sure this is correct from an smp ordering POV. p->on_cpu may appear
> to be 0 whereas the task is actually running for a while and p->se.sum_exec_runtime
> can then be past the actual value on the remote CPU.

Quote from Paul's last e-mail

>Stronger:
>
>+#ifdef CONFIG_64BIT
>+ if (!p->on_cpu)
>+ return p->se.sum_exec_runtime;
>+#endif
>
>[ Or !p->on_cpu || !add_delta ].
>
>We can take the racy read versus p->on_cpu since:
> If we race with it leaving cpu: we take lock, we're correct
> If we race with it entering cpu: unaccounted time ---> 0, this is
>indistinguishable from the read occurring a few cycles earlier.

That said, even if we get a slightly inaccurate current time, we have no
way to know it is inaccurate. E.g. CPU clock-saving features bring us more
inaccuracy, but we already live in such a world.

2013-06-18 15:28:48

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

On Tue, Jun 18, 2013 at 11:17 AM, KOSAKI Motohiro
<[email protected]> wrote:
>>> +#ifdef CONFIG_64BIT
>>> + /*
>>> + * 64-bit doesn't need locks to atomically read a 64bit value. So we
>>> + * have two optimization chances, 1) when caller doesn't need
>>> + * delta_exec and 2) when the task's delta_exec is 0. The former is
>>> + * obvious. The latter is complicated. reading ->on_cpu is racy, but
>>> + * this is ok. If we race with it leaving cpu, we'll take a lock. So
>>> + * we're correct. If we race with it entering cpu, unaccounted time
>>> + * is 0. This is indistinguishable from the read occurring a few
>>> + * cycles earlier.
>>> + */
>>> + if (!add_delta || !p->on_cpu)
>>> + return p->se.sum_exec_runtime;
>>
>> I'm not sure this is correct from an smp ordering POV. p->on_cpu may appear
>> to be 0 whereas the task is actually running for a while and p->se.sum_exec_runtime
>> can then be past the actual value on the remote CPU.
>
> Quote from Paul's last e-mail
>
>>Stronger:
>>
>>+#ifdef CONFIG_64BIT
>>+ if (!p->on_cpu)
>>+ return p->se.sum_exec_runtime;
>>+#endif
>>
>>[ Or !p->on_cpu || !add_delta ].
>>
>>We can take the racy read versus p->on_cpu since:
>> If we race with it leaving cpu: we take lock, we're correct
>> If we race with it entering cpu: unaccounted time ---> 0, this is
>>indistinguishable from the read occurring a few cycles earlier.
>
> That said, even if we get a slightly inaccurate current time, we have no
> way to know it is inaccurate. E.g. CPU clock-saving features bring us more
> inaccuracy, but we already live in such a world.

Ah, RT folks may want to call preempt_disable() in thread_group_cputime(),
because a preemptible kernel can be preempted during the for-each-thread loop
that gathers accurate time. But that is another story; it's not a new issue
and it wasn't introduced by me. :-)

2013-06-18 17:18:25

by Frederic Weisbecker

Subject: Re: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

On Tue, Jun 18, 2013 at 11:17:41AM -0400, KOSAKI Motohiro wrote:
> >> +#ifdef CONFIG_64BIT
> >> + /*
> >> + * 64-bit doesn't need locks to atomically read a 64bit value. So we
> >> + * have two optimization chances, 1) when caller doesn't need
> >> + * delta_exec and 2) when the task's delta_exec is 0. The former is
> >> + * obvious. The latter is complicated. reading ->on_cpu is racy, but
> >> + * this is ok. If we race with it leaving cpu, we'll take a lock. So
> >> + * we're correct. If we race with it entering cpu, unaccounted time
> >> + * is 0. This is indistinguishable from the read occurring a few
> >> + * cycles earlier.
> >> + */
> >> + if (!add_delta || !p->on_cpu)
> >> + return p->se.sum_exec_runtime;
> >
> > I'm not sure this is correct from an smp ordering POV. p->on_cpu may appear
> > to be 0 whereas the task is actually running for a while and p->se.sum_exec_runtime
> > can then be past the actual value on the remote CPU.
>
> Quote from Paul's last e-mail
>
> >Stronger:
> >
> >+#ifdef CONFIG_64BIT
> >+ if (!p->on_cpu)
> >+ return p->se.sum_exec_runtime;
> >+#endif
> >
> >[ Or !p->on_cpu || !add_delta ].
> >
> >We can take the racy read versus p->on_cpu since:
> > If we race with it leaving cpu: we take lock, we're correct
> > If we race with it entering cpu: unaccounted time ---> 0, this is
> >indistinguishable from the read occurring a few cycles earlier.

Yeah, my worry was more about both p->on_cpu and p->se.sum_exec_runtime being
stale for too long. How much time can pass in the worst case before CPU X sees
the updates done by CPU Y under rq(Y)->lock, considering that CPU X doesn't take
rq(Y)->lock to read that update? I guess it depends on the hardware, locking and
ordering that happened before.

Bah it probably doesn't matter in practice.

Thanks.

2013-06-19 20:57:06

by Frederic Weisbecker

Subject: Re: [PATCH 3/8] posix-cpu-timers: fix wrong timer initialization

On Tue, Jun 18, 2013 at 11:12:17AM -0400, KOSAKI Motohiro wrote:
> On Tue, Jun 18, 2013 at 10:20 AM, Frederic Weisbecker
> <[email protected]> wrote:
> > On Sun, May 26, 2013 at 05:35:44PM -0400, [email protected] wrote:
> >> From: KOSAKI Motohiro <[email protected]>
> >>
> >> Currently glibc's rt/tst-cputimer1 testcase sporadically fails because
> >> a timer created by timer_create() may fire earlier than specified.
> >>
> >> posix_cpu_timer_set() uses "val" as the current time for three purposes:
> >> 1) initializing sig->cputimer, 2) calculating the "old" value, and
> >> 3) calculating the expiry.
> >>
> >> (1) and (2) should use only committed time (i.e. without delta_exec),
> >> because run_posix_cpu_timers() doesn't account for delta_exec and we need
> >> consistency. But (3) needs the exact current time (i.e. CPU clock time),
> >> because an expiry should be "now + timeout" by definition.
> >>
> >> This patch distinguishes between two kinds of "now".
> >>
> >> Cc: Olivier Langlois <[email protected]>
> >> Cc: Thomas Gleixner <[email protected]>
> >> Cc: Frederic Weisbecker <[email protected]>
> >> Cc: Ingo Molnar <[email protected]>
> >> Acked-by: Peter Zijlstra <[email protected]>
> >> Signed-off-by: KOSAKI Motohiro <[email protected]>
> >> ---
> >> include/linux/kernel_stat.h | 5 -----
> >> kernel/posix-cpu-timers.c | 14 ++++++++++++--
> >> kernel/sched/core.c | 13 -------------
> >> 3 files changed, 12 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> >> index ed5f6ed..f5d4fdf 100644
> >> --- a/include/linux/kernel_stat.h
> >> +++ b/include/linux/kernel_stat.h
> >> @@ -117,11 +117,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
> >> return kstat_cpu(cpu).irqs_sum;
> >> }
> >>
> >> -/*
> >> - * Lock/unlock the current runqueue - to extract task statistics:
> >> - */
> >> -extern unsigned long long task_delta_exec(struct task_struct *);
> >> -
> >> extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
> >> extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
> >> extern void account_steal_time(cputime_t);
> >> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> >> index 25447c5..d068808 100644
> >> --- a/kernel/posix-cpu-timers.c
> >> +++ b/kernel/posix-cpu-timers.c
> >> @@ -652,7 +652,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
> >> cpu->cpu = cputime.utime;
> >> break;
> >> case CPUCLOCK_SCHED:
> >> - cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
> >> + cpu->sched = cputime.sum_exec_runtime;
> >
> > Are you sure that all callers of cpu_timer_sample_group() are fine with that change?
>
> Now, cpu_timer_sample_group() is used from the following four places:
>
> posix_cpu_timer_set(): for timer initialization

Right, so to recall what is in your changelog, here we want to:

1) get the initial sample and initialize cputimer->running to 1; here
we don't want to commit the pending deltas, otherwise they may be
accounted twice

2) compute the old value. I would say that here both kinds of samples
work (with or without committed pending deltas)

3) set the new timer. Here we want the pending deltas included in "now"
when computing now + timeout, otherwise the timer might trigger too early

> posix_cpu_timer_get(): for timer_gettime(2)

Here I would say it doesn't matter whether we include the pending delta
or not. But just to stay consistent with clock_gettime(), I'd rather
include the pending deltas.

> posix_cpu_timer_schedule(): timer firing

firing or fired. But rescheduling in any case. I would tend to
think we want to include pending deltas as a base to calculate
the next expiry time on top of interval increments, etc...
Pretty much like posix_cpu_timer_set() in fact.

> set_process_cpu_timer(): for itimer

So here the case seems very similar to posix_cpu_timer_set() again.
We pass a relative expiry delta to setitimer(), so we want the timer
to expire at NOW + timeout. So NOW must be the clock sample that
includes the pending deltas that haven't yet been committed, otherwise
the timer may expire too early.

Shouldn't we use cpu_clock_sample_group() here?

>
> I think all of them are safe because, the point is, the timer firing
> procedures (check_thread_timers() and check_process_timers()) don't care
> about the uncommitted delta. So the other timer functions need to use the
> same notion of time; otherwise the inconsistency leaks to userland sooner
> or later.
>
> The other solution would be for check_{thread,process}_timers() to take
> all the rq locks and use accurate time. However, of course, that would be
> a big performance hit, so I don't want to go that way.

If only we could commit the pending deltas to the task stats (like calling
update_curr()) every time we check the timer/clock sample. This way we wouldn't
worry about all this pending sum_exec_runtime stuff being accounted twice, and
we could just always read it without further thought.

Also, cpu_timer_sample_group() looks fundamentally buggy to begin with: when
thread_group_cputimer() is called with cputimer->running == 0, the fresh group
sample it takes already includes the pending delta, and task_delta_exec(p) then
adds it a second time.

>
>
> > Looking at set_process_cpu_timer() it seems we want the committed time as well to
> > be added on newval. For the same reasons we use cpu_clock_sample_group() in (3) here.
>
> Sorry, I haven't caught your point. Could you elaborate more?

See above when I describe my worries on set_process_cpu_timer().

2013-06-20 08:41:52

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] sched: task_sched_runtime introduce micro optimization

(6/18/13 10:18 AM), Frederic Weisbecker wrote:
> On Tue, Jun 18, 2013 at 11:17:41AM -0400, KOSAKI Motohiro wrote:
>>>> +#ifdef CONFIG_64BIT
>>>> + /*
>>>> + * 64-bit doesn't need locks to atomically read a 64bit value. So we
>>>> + * have two optimization chances, 1) when caller doesn't need
>>>> + * delta_exec and 2) when the task's delta_exec is 0. The former is
>>>> + * obvious. The latter is complicated. reading ->on_cpu is racy, but
>>>> + * this is ok. If we race with it leaving cpu, we'll take a lock. So
>>>> + * we're correct. If we race with it entering cpu, unaccounted time
>>>> + * is 0. This is indistinguishable from the read occurring a few
>>>> + * cycles earlier.
>>>> + */
>>>> + if (!add_delta || !p->on_cpu)
>>>> + return p->se.sum_exec_runtime;
>>>
>>> I'm not sure this is correct from an smp ordering POV. p->on_cpu may appear
>>> to be 0 whereas the task is actually running for a while and p->se.sum_exec_runtime
>>> can then be past the actual value on the remote CPU.
>>
>> Quote from Paul's last e-mail
>>
>>> Stronger:
>>>
>>> +#ifdef CONFIG_64BIT
>>> + if (!p->on_cpu)
>>> + return p->se.sum_exec_runtime;
>>> +#endif
>>>
>>> [ Or !p->on_cpu || !add_delta ].
>>>
>>> We can take the racy read versus p->on_cpu since:
>>> If we race with it leaving cpu: we take lock, we're correct
>>> If we race with it entering cpu: unaccounted time ---> 0, this is
>>> indistinguishable from the read occurring a few cycles earlier.
>
> Yeah, my worry was more about both p->on_cpu and p->se.sum_exec_runtime being
> stale for too long. How much time can pass in the worst case before CPU X sees
> the updates done by CPU Y under rq(Y)->lock, considering that CPU X doesn't take
> rq(Y)->lock to read that update? I guess it depends on the hardware, locking and
> ordering that happened before.

Right. The worst case depends on the memory access cost to the remote node.
If the memory access cost is m, the worst scenario is:

t
-------------
0: CPU X starts to fetch p->on_cpu (it is 0 at this point).
0: CPU Y changes p->on_cpu to 1.
m: CPU X sees p->on_cpu == 0.

Then CPU X uses p->se.sum_exec_runtime even though p has delta m. And, in the
worst case, every CPU hits the same scenario, so the inaccuracy is
m x nr_online_cpus. Here we can ignore the CPU cache, because we don't hit it
in the worst case anyway.

However, I think that's OK. m is on the order of microseconds, and
run_posix_cpu_timers() only has 1/HZ accuracy. Moreover, taking the lock costs
more than m because it needs a lock-prefixed op. So it's a free lunch.

I haven't seen any ordering issue in this code, because each CPU keeps
independent time with no dependencies. Please tell me if I'm missing something.
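
For example (hypothetical numbers): with m on the order of 1us and 64 online
CPUs, the worst-case inaccuracy is about 64us, still well below the 10ms
granularity of a HZ=100 tick.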

> Bah it probably doesn't matter in practice.

I'm OK with dropping this "!p->on_cpu" optimization if you really dislike it.
It is just a micro-optimization and the benefit is not that large.