2013-03-28 16:53:50

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC 0/4] do not make cputime scaling in kernel

This patch series removes cputime scaling from the kernel. Scaling can
easily be done in user space using floating point if we provide
sum_exec_runtime, which is what patches 2/4 and 3/4 do. I have a procps
patch which utilizes that:

http://people.redhat.com/sgruszka/procps-use-sum_exec_runtime.patch

I will post it if this patch set is queued.

The change also affects the getrusage() and times() syscalls, but I don't
think the kernel gives any guarantees about utime/stime precision; as a
matter of fact, before commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa we
did not perform any scaling and provided raw cputime values to user space.

Providing sum_exec_runtime via proc guards against malware that utilizes a
lot of cpu time but hides itself from the top program.

This affects kernels not compiled with CONFIG_VIRT_CPU_ACCOUNTING_{GEN,NATIVE};
if the user chooses to compile the kernel with one of those options, he/she
will get more precise cputime accounting, which is documented in Kconfig.
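The user-space side of the scaling is straightforward. Below is a minimal
sketch (a hypothetical helper, not the actual procps patch linked above) of
how a tool like top could scale the tick-based utime/stime values against
the precise sum_exec_runtime that patches 2/4 and 3/4 export:

```python
def scale_cputime(utime_ticks, stime_ticks, sum_exec_runtime_ns, hz=100):
    """Scale tick-based utime/stime against the scheduler's precise runtime.

    utime/stime are the clock-tick values from /proc/PID/stat;
    sum_exec_runtime is the new nanosecond field.  Floating point
    sidesteps the integer overflow the kernel-side scaling suffers from.
    """
    rtime_ticks = sum_exec_runtime_ns * hz / 1e9
    total = utime_ticks + stime_ticks
    if total == 0:
        # No ticks accounted at all: attribute everything to user time.
        return rtime_ticks, 0.0
    stime = stime_ticks * rtime_ticks / total
    return rtime_ticks - stime, stime
```

With utime=6 and stime=4 ticks and 2 s of scheduler runtime at HZ=100, this
yields (120.0, 80.0): the 10 accounted ticks are stretched to the 200 ticks
the scheduler actually measured, preserving the 6:4 ratio.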


2013-03-28 16:52:56

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC 3/4] sched,proc: add csum_sched_runtime

Account the scheduler-calculated execution time of children and export it
via proc.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 3 +++
include/linux/sched.h | 1 +
kernel/exit.c | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ee47b29..1444dc5 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -404,6 +404,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
char tcomm[sizeof(task->comm)];
unsigned long flags;
u64 sum_exec_runtime = 0;
+ u64 csum_exec_runtime = 0;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -442,6 +443,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
cutime = sig->cutime;
cstime = sig->cstime;
cgtime = sig->cgtime;
+ csum_exec_runtime = sig->csum_sched_runtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);

/* add up live thread stats at the group level */
@@ -558,6 +560,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
seq_put_decimal_ll(m, ' ', 0);

seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+ seq_put_decimal_ull(m, ' ', csum_exec_runtime);

seq_putc(m, '\n');
if (mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eb6005c..6673a59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -574,6 +574,7 @@ struct signal_struct {
* other than jiffies.)
*/
unsigned long long sum_sched_runtime;
+ unsigned long long csum_sched_runtime;

/*
* We don't bother to synchronize most readers of this at all,
diff --git a/kernel/exit.c b/kernel/exit.c
index 5a7023f..eda4fa7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1094,6 +1094,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
sig = p->signal;
psig->cutime += tg_cputime.utime + sig->cutime;
psig->cstime += tg_cputime.stime + sig->cstime;
+ psig->csum_sched_runtime +=
+ tg_cputime.sum_exec_runtime + sig->csum_sched_runtime;
psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
psig->cmin_flt +=
p->min_flt + sig->min_flt + sig->cmin_flt;
--
1.7.1

2013-03-28 16:53:03

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC 4/4] cputime: remove scaling

Scaling cputime causes problems; a bunch of them have been fixed, but it is
still possible to hit a multiplication overflow issue, which makes the
{u,s}time values incorrect. This problem has no good solution in the kernel.

This patch removes the scaling code and exports raw {u,s}time values. Procps
programs can use the newly introduced sum_exec_runtime to find out the
precisely calculated process cpu time and scale the utime/stime values
accordingly.

Unfortunately the times(2) syscall has no such option.

This change affects kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 4 +-
include/linux/sched.h | 20 --------
kernel/exit.c | 2 +-
kernel/fork.c | 3 -
kernel/sched/cputime.c | 117 +-----------------------------------------------
kernel/sys.c | 6 +-
6 files changed, 7 insertions(+), 145 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 1444dc5..5feadc4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -459,7 +459,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime_adjusted(task, &cputime);
+ thread_group_cputime(task, &cputime);
utime = cputime.utime;
stime = cputime.stime;
sum_exec_runtime = cputime.sum_exec_runtime;
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
- task_cputime_adjusted(task, &utime, &stime);
+ task_cputime(task, &utime, &stime);
sum_exec_runtime = task->se.sum_exec_runtime;
gtime = task_gtime(task);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6673a59..26257a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -397,18 +397,6 @@ struct cpu_itimer {
};

/**
- * struct cputime - snaphsot of system and user cputime
- * @utime: time spent in user mode
- * @stime: time spent in system mode
- *
- * Gathers a generic snapshot of user and system time.
- */
-struct cputime {
- cputime_t utime;
- cputime_t stime;
-};
-
-/**
* struct task_cputime - collected CPU time counts
* @utime: time spent in user mode, in &cputime_t units
* @stime: time spent in kernel mode, in &cputime_t units
@@ -558,9 +546,6 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1159,9 +1144,6 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_t vtime_seqlock;
unsigned long long vtime_snap;
@@ -1595,8 +1577,6 @@ static inline cputime_t task_gtime(struct task_struct *t)
return t->gtime;
}
#endif
-extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index eda4fa7..9a7ecbb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1088,7 +1088,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* group, which consolidates times for all threads in the
* group including the group leader.
*/
- thread_group_cputime_adjusted(p, &tg_cputime);
+ thread_group_cputime(p, &tg_cputime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
diff --git a/kernel/fork.c b/kernel/fork.c
index 339f60d..2ae1706 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,9 +1233,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,

p->utime = p->stime = p->gtime = 0;
p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_init(&p->vtime_seqlock);
p->vtime_snap = 0;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a600f7f..23df74b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -448,19 +448,7 @@ EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
#endif /* __ARCH_HAS_VTIME_ACCOUNT */
#endif /* CONFIG_VIRT_CPU_ACCOUNTING */

-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
- *ut = p->utime;
- *st = p->stime;
-}
-
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
- thread_group_cputime(p, cputime);
-}
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
/*
* Account a single tick of cpu time.
* @p: the process that the cpu time gets accounted to
@@ -516,109 +504,6 @@ void account_idle_ticks(unsigned long ticks)
account_idle_time(jiffies_to_cputime(ticks));
}

-/*
- * Perform (stime * rtime) / total with reduced chances
- * of multiplication overflows by using smaller factors
- * like quotient and remainders of divisions between
- * rtime and total.
- */
-static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
-{
- u64 rem, res, scaled;
-
- if (rtime >= total) {
- /*
- * Scale up to rtime / total then add
- * the remainder scaled to stime / total.
- */
- res = div64_u64_rem(rtime, total, &rem);
- scaled = stime * res;
- scaled += div64_u64(stime * rem, total);
- } else {
- /*
- * Same in reverse: scale down to total / rtime
- * then substract that result scaled to
- * to the remaining part.
- */
- res = div64_u64_rem(total, rtime, &rem);
- scaled = div64_u64(stime, res);
- scaled -= div64_u64(scaled * rem, total);
- }
-
- return (__force cputime_t) scaled;
-}
-
-/*
- * Adjust tick based cputime random precision against scheduler
- * runtime accounting.
- */
-static void cputime_adjust(struct task_cputime *curr,
- struct cputime *prev,
- cputime_t *ut, cputime_t *st)
-{
- cputime_t rtime, stime, total;
-
- if (vtime_accounting_enabled()) {
- *ut = curr->utime;
- *st = curr->stime;
- return;
- }
-
- stime = curr->stime;
- total = stime + curr->utime;
-
- /*
- * Tick based cputime accounting depend on random scheduling
- * timeslices of a task to be interrupted or not by the timer.
- * Depending on these circumstances, the number of these interrupts
- * may be over or under-optimistic, matching the real user and system
- * cputime with a variable precision.
- *
- * Fix this by scaling these tick based values against the total
- * runtime accounted by the CFS scheduler.
- */
- rtime = nsecs_to_cputime(curr->sum_exec_runtime);
-
- if (!rtime) {
- stime = 0;
- } else if (!total) {
- stime = rtime;
- } else {
- stime = scale_stime((__force u64)stime,
- (__force u64)rtime, (__force u64)total);
- }
-
- /*
- * If the tick based count grows faster than the scheduler one,
- * the result of the scaling may go backward.
- * Let's enforce monotonicity.
- */
- prev->stime = max(prev->stime, stime);
- prev->utime = max(prev->utime, rtime - prev->stime);
-
- *ut = prev->utime;
- *st = prev->stime;
-}
-
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
- struct task_cputime cputime = {
- .sum_exec_runtime = p->se.sum_exec_runtime,
- };
-
- task_cputime(p, &cputime.utime, &cputime.stime);
- cputime_adjust(&cputime, &p->prev_cputime, ut, st);
-}
-
-/*
- * Must be called with siglock held.
- */
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
- thread_group_cputime(p, cputime);
- cputime_adjust(cputime, &p->signal->prev_cputime,
- &cputime->utime, &cputime->stime);
-}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/sys.c b/kernel/sys.c
index 2f555c1..00f143e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1049,7 +1049,7 @@ void do_sys_times(struct tms *tms)
struct task_cputime tg_cputime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime_adjusted(current, &tg_cputime);
+ thread_group_cputime(current, &tg_cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
@@ -1708,7 +1708,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
utime = stime = 0;

if (who == RUSAGE_THREAD) {
- task_cputime_adjusted(current, &utime, &stime);
+ task_cputime(current, &utime, &stime);
accumulate_thread_rusage(p, r);
maxrss = p->signal->maxrss;
goto out;
@@ -1734,7 +1734,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime_adjusted(p, &tg_cputime);
+ thread_group_cputime(p, &tg_cputime);
utime += tg_cputime.utime;
stime += tg_cputime.stime;
r->ru_nvcsw += p->signal->nvcsw;
--
1.7.1

2013-03-28 16:53:29

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC 1/4] cputime: change parameter of thread_group_cputime_adjusted

This is preparation for further changes in the cputime code.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 5 ++++-
include/linux/sched.h | 2 +-
kernel/exit.c | 8 ++++----
kernel/sched/cputime.c | 18 ++++++------------
kernel/sys.c | 18 ++++++++++--------
5 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..fef7850 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -446,6 +446,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
/* add up live thread stats at the group level */
if (whole) {
struct task_struct *t = task;
+ struct task_cputime cputime;
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
@@ -455,7 +456,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime_adjusted(task, &utime, &stime);
+ thread_group_cputime_adjusted(task, &cputime);
+ utime = cputime.utime;
+ stime = cputime.stime;
gtime += sig->gtime;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10626e2..eb6005c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1595,7 +1595,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
}
#endif
extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 51e485c..5a7023f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1067,7 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
- cputime_t tgutime, tgstime;
+ struct task_cputime tg_cputime;

/*
* The resource counters for the group leader are in its
@@ -1088,12 +1088,12 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* group, which consolidates times for all threads in the
* group including the group leader.
*/
- thread_group_cputime_adjusted(p, &tgutime, &tgstime);
+ thread_group_cputime_adjusted(p, &tg_cputime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
- psig->cutime += tgutime + sig->cutime;
- psig->cstime += tgstime + sig->cstime;
+ psig->cutime += tg_cputime.utime + sig->cutime;
+ psig->cstime += tg_cputime.stime + sig->cstime;
psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
psig->cmin_flt +=
p->min_flt + sig->min_flt + sig->cmin_flt;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4006b42..a600f7f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -456,14 +456,9 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
*st = p->stime;
}

-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
{
- struct task_cputime cputime;
-
- thread_group_cputime(p, &cputime);
-
- *ut = cputime.utime;
- *st = cputime.stime;
+ thread_group_cputime(p, cputime);
}
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
/*
@@ -618,12 +613,11 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
/*
* Must be called with siglock held.
*/
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
{
- struct task_cputime cputime;
-
- thread_group_cputime(p, &cputime);
- cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
+ thread_group_cputime(p, cputime);
+ cputime_adjust(cputime, &p->signal->prev_cputime,
+ &cputime->utime, &cputime->stime);
}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..2f555c1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1045,15 +1045,16 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- cputime_t tgutime, tgstime, cutime, cstime;
+ cputime_t cutime, cstime;
+ struct task_cputime tg_cputime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime_adjusted(current, &tgutime, &tgstime);
+ thread_group_cputime_adjusted(current, &tg_cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(tgutime);
- tms->tms_stime = cputime_to_clock_t(tgstime);
+ tms->tms_utime = cputime_to_clock_t(tg_cputime.utime);
+ tms->tms_stime = cputime_to_clock_t(tg_cputime.stime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1699,8 +1700,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t tgutime, tgstime, utime, stime;
+ cputime_t utime, stime;
unsigned long maxrss = 0;
+ struct task_cputime tg_cputime;

memset((char *) r, 0, sizeof *r);
utime = stime = 0;
@@ -1732,9 +1734,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime_adjusted(p, &tgutime, &tgstime);
- utime += tgutime;
- stime += tgstime;
+ thread_group_cputime_adjusted(p, &tg_cputime);
+ utime += tg_cputime.utime;
+ stime += tg_cputime.stime;
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.7.1

2013-03-28 16:52:55

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC 2/4] procfs: add sum_exec_runtime to /proc/PID/stat

Allow procps programs (e.g. top) to read the precise CPU time usage of
processes.
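For illustration, a user-space reader could pick up the appended counter
like this (a sketch; it assumes the new value is the final field of the stat
line, as this patch makes it):

```python
def last_stat_field(stat_line):
    """Return the last field of a /proc/PID/stat line as an integer.

    Fields are split after the closing ')' of the comm field, since
    comm itself may contain spaces or parentheses.
    """
    return int(stat_line.rsplit(')', 1)[1].split()[-1])
```

Splitting on the last ')' first is the standard way to parse stat safely;
naive whitespace splitting breaks on a comm like "(a b)".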

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index fef7850..ee47b29 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -403,6 +403,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
+ u64 sum_exec_runtime = 0;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -459,6 +460,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
thread_group_cputime_adjusted(task, &cputime);
utime = cputime.utime;
stime = cputime.stime;
+ sum_exec_runtime = cputime.sum_exec_runtime;
gtime += sig->gtime;
}

@@ -475,6 +477,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
min_flt = task->min_flt;
maj_flt = task->maj_flt;
task_cputime_adjusted(task, &utime, &stime);
+ sum_exec_runtime = task->se.sum_exec_runtime;
gtime = task_gtime(task);
}

@@ -554,6 +557,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
else
seq_put_decimal_ll(m, ' ', 0);

+ seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+
seq_putc(m, '\n');
if (mm)
mmput(mm);
--
1.7.1

2013-04-10 12:02:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling


* Stanislaw Gruszka <[email protected]> wrote:

> Scaling cputime cause problems, bunch of them was fixed, but still is possible
> to hit multiplication overflow issue, which make {u,s}time values incorrect.
> This problem has no good solution in kernel.

Wasn't 128-bit math a solution to the overflow problems? 128-bit math isn't nice,
but at least for multiplication it's defensible.

> This patch remove scaling code and export raw values of {u,t}ime . Procps
> programs can use newly introduced sum_exec_runtime to find out precisely
> calculated process cpu time and scale utime, stime values accordingly.
>
> Unfortunately times(2) syscall has no such option.
>
> This change affect kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.

So, the concern here is that 'top hiding' code can now hide again. It's also that
we are not really solving the problem, we are pushing it to user-space - which in
the best case gets updated to solve the problem in some similar fashion - and in
the worst case does not get updated or does it in a buggy way.

So while user-space has it a bit easier because it can do floating point math, is
there really no workable solution to the current kernel side integer overflow bug?
I really prefer robust kernel side accounting/instrumentation.

Thanks,

Ingo

2013-04-10 14:29:46

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling

I have a patch that does scaling by multiply for 64-bit architectures. I probably should clean it up and send it in. I need to see if it fixes this problem.

Ingo Molnar <[email protected]> wrote:

>
> * Stanislaw Gruszka <[email protected]> wrote:
>
> > Scaling cputime cause problems, bunch of them was fixed, but still is possible
> > to hit multiplication overflow issue, which make {u,s}time values incorrect.
> > This problem has no good solution in kernel.
>
> Wasn't 128-bit math a solution to the overflow problems? 128-bit math isn't nice,
> but at least for multiplication it's defensible.
>
> > This patch remove scaling code and export raw values of {u,t}ime . Procps
> > programs can use newly introduced sum_exec_runtime to find out precisely
> > calculated process cpu time and scale utime, stime values accordingly.
> >
> > Unfortunately times(2) syscall has no such option.
> >
> > This change affect kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.
>
> So, the concern here is that 'top hiding' code can now hide again. It's also that
> we are not really solving the problem, we are pushing it to user-space - which in
> the best case gets updated to solve the problem in some similar fashion - and in
> the worst case does not get updated or does it in a buggy way.
>
> So while user-space has it a bit easier because it can do floating point math, is
> there really no workable solution to the current kernel side integer overflow bug?
> I really prefer robust kernel side accounting/instrumentation.
>
> Thanks,
>
> Ingo

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-04-11 08:36:05

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling

On Wed, Apr 10, 2013 at 02:02:28PM +0200, Ingo Molnar wrote:
>
> * Stanislaw Gruszka <[email protected]> wrote:
>
> > Scaling cputime cause problems, bunch of them was fixed, but still is possible
> > to hit multiplication overflow issue, which make {u,s}time values incorrect.
> > This problem has no good solution in kernel.
>
> Wasn't 128-bit math a solution to the overflow problems? 128-bit math isn't nice,
> but at least for multiplication it's defensible.

128 bit division is needed, unfortunately. Though in 99.9% of cases it will
go through the 64 bit fast path.

> > This patch remove scaling code and export raw values of {u,t}ime . Procps
> > programs can use newly introduced sum_exec_runtime to find out precisely
> > calculated process cpu time and scale utime, stime values accordingly.
> >
> > Unfortunately times(2) syscall has no such option.
> >
> > This change affect kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.
>
> So, the concern here is that 'top hiding' code can now hide again. It's also that
> we are not really solving the problem, we are pushing it to user-space - which in
> the best case gets updated to solve the problem in some similar fashion - and in
> the worst case does not get updated or does it in a buggy way.
>
> So while user-space has it a bit easier because it can do floating point math, is
> there really no workable solution to the current kernel side integer overflow bug?

I do not see any. Basically, everything we have either makes the problem
less reproducible or just defers it. The best solution I found, short of
full 128 bit math, is something like this (dropping precision when the
values are big and an overflow would happen):

u64 _scale_time(u64 rtime, u64 total, u64 time)
{
	const int zero_bits = clzll(time) + clzll(rtime);
	u64 scaled;

	if (zero_bits < 64) {
		/* Drop precision */
		const int drop_bits = 64 - zero_bits;

		time >>= drop_bits;
		rtime >>= drop_bits;
		total >>= 2 * drop_bits;

		if (total == 0)
			return time;
	}

	scaled = (time * rtime) / total;

	return scaled;
}

It defers the problem for quite a long period. My testing script detects
failure at:

FAIL!
rtime: 1954463459156 <- 22621 days (one thread, CONFIG_HZ=1000)
total: 1771603722423
stime: 354320744484
kernel: 391351504748 <- kernel value
python: 390892691830 <- correct value

For one thread this is fine, but for 512 threads the inaccuracy will show up
after only 40 days (due to dropping too many bits of the "total" variable).
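The failure above is easy to reproduce in user space. Here is a direct
Python model of the _scale_time() sketch (clzll emulated with
int.bit_length), compared against exact big-integer arithmetic:

```python
def clzll(x):
    # Count leading zeros of a 64-bit value, like __builtin_clzll().
    return 64 - x.bit_length()

def scale_time(rtime, total, time):
    # Python model of the _scale_time() sketch above.
    zero_bits = clzll(time) + clzll(rtime)
    if zero_bits < 64:
        # time * rtime would overflow 64 bits: drop precision.
        drop_bits = 64 - zero_bits
        time >>= drop_bits
        rtime >>= drop_bits
        total >>= 2 * drop_bits
        if total == 0:
            return time
    return (time * rtime) // total

# Values from the failure report above.
rtime, total, stime = 1954463459156, 1771603722423, 354320744484
exact = stime * rtime // total    # what full 128-bit math would give
print(scale_time(rtime, total, stime), exact)
```

The full product stime * rtime exceeds 2^64 here, so the model has to drop
bits and its result drifts from the exact quotient, matching the kernel vs.
python discrepancy in the report.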

> I really prefer robust kernel side accounting/instrumentation.

We have CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN.
Perhaps we can change to use one of those options by default. I wonder
if the additional performance cost related to them is really something
that we should care about. Are there any measurements that show they
will make performance worse?

Stanislaw

2013-04-11 08:36:53

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling

On Wed, Apr 10, 2013 at 07:29:21AM -0700, H. Peter Anvin wrote:
> I have a patch that does scaling by multiply for 64-bit architectures. I probably should clean it up and send it in. I need to see if it fixes this problem.
Interesting. Could you attach the draft patch, so I could look at it?

Stanislaw

2013-04-11 15:06:44

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling

On Thu, Apr 11, 2013 at 10:36:35AM +0200, Stanislaw Gruszka wrote:
> > I really prefer robust kernel side accounting/instrumentation.
>
> We have CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN.
> Perhaps we can change to use one of those options by default. I wonder
> if the additional performance cost related with them is really something
> that we should care about. Are there any measurement that show those
> will make performance worse ?

CONFIG_IRQ_TIME_ACCOUNTING also makes use of scaling. And CONFIG_VIRT_CPU_ACCOUNTING_GEN
involves too much overhead on IO-bound workloads. It's mostly good for
undisturbed userspace-bound workloads (few IRQs, few exceptions, few syscalls).

2013-04-11 15:20:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] cputime: remove scaling

On 04/11/2013 01:37 AM, Stanislaw Gruszka wrote:
> On Wed, Apr 10, 2013 at 07:29:21AM -0700, H. Peter Anvin wrote:
>> I have a patch that does scaling by multiply for 64-bit architectures. I probably should clean it up and send it in. I need to see if it fixes this problem.
> Interesting. Could you attach draft patch, so I could look at it ?
>

It's just the trivial extension of the code already in kernel/time.c to
the 64-bit case.

However, it looks like in your case the scaling factor stime/total is
dynamic. This means that doing scaling by multiplication isn't practical.
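Scaling by multiplication amounts to precomputing a fixed-point reciprocal
once so each subsequent scaling is a multiply and a shift; a minimal sketch
(hypothetical helper names, in the style of the kernel's mult/shift
conversions):

```python
def make_scaler(numer, denom, shift=32):
    """Precompute a fixed-point factor so scaling becomes multiply+shift.

    Worth it only when numer/denom stays constant across many calls;
    a per-sample factor like stime/total forces a fresh division each
    time, so nothing is saved.
    """
    mult = ((numer << shift) + denom // 2) // denom  # rounded reciprocal
    return lambda x: (x * mult) >> shift
```

For example, make_scaler(3, 4) scales by 3/4 with a single multiply per
call, accurate to within one unit in the last place for 64-bit-sized inputs.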

I guess I don't understand what function the scaling is supposed to
serve there.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.