2013-04-04 09:10:05

by Stanislaw Gruszka

Subject: [PATCH -tip 0/4] do not make cputime scaling in kernel

This patch series removes cputime scaling from the kernel. Scaling can
easily be done in user space using floating point if we provide
sum_exec_runtime, which is what patches 2/4 and 3/4 do. I have a procps
patch which utilizes that:

http://people.redhat.com/sgruszka/procps-use-sum_exec_runtime.patch

I will post it if this patch set is queued.

The change also affects the getrusage() and times() syscalls, but I don't
think the kernel gives guarantees about utime/stime precision; as a matter
of fact, before commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa we did not
perform any scaling and provided raw cputime values to user space.

Providing sum_exec_runtime via proc guards against malware that utilizes a
lot of CPU time but hides itself from the top program.

This affects kernels not compiled with CONFIG_VIRT_CPU_ACCOUNTING_{GEN,NATIVE};
if the user chooses to compile the kernel with one of those options, he/she
will have more precise cputime accounting, as documented in Kconfig.


2013-04-04 09:10:16

by Stanislaw Gruszka

Subject: [PATCH -tip 3/4] sched,proc: add csum_sched_runtime to /proc/PID/stat

Account precise CPU time used by finished children and export that value
via procfs.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
Documentation/filesystems/proc.txt | 1 +
fs/proc/array.c | 3 +++
include/linux/sched.h | 11 +++++++----
kernel/exit.c | 2 ++
4 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index ffd012b..a0e9162 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -320,6 +320,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
env_end address below which program environment is placed
exit_code the thread's exit_code in the form reported by the waitpid system call
exec_time total process time spent on the CPU, in nanoseconds
+ cexec_time total finished children time spent on the CPU, in nanoseconds
..............................................................................

The /proc/PID/maps file containing the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index ee47b29..1444dc5 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -404,6 +404,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
char tcomm[sizeof(task->comm)];
unsigned long flags;
u64 sum_exec_runtime = 0;
+ u64 csum_exec_runtime = 0;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -442,6 +443,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
cutime = sig->cutime;
cstime = sig->cstime;
cgtime = sig->cgtime;
+ csum_exec_runtime = sig->csum_sched_runtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);

/* add up live thread stats at the group level */
@@ -558,6 +560,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
seq_put_decimal_ll(m, ' ', 0);

seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+ seq_put_decimal_ull(m, ' ', csum_exec_runtime);

seq_putc(m, '\n');
if (mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 01fc6d4..c25772d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -568,14 +568,17 @@ struct signal_struct {
struct task_io_accounting ioac;

/*
- * Cumulative ns of schedule CPU time fo dead threads in the
- * group, not including a zombie group leader, (This only differs
- * from jiffies_to_ns(utime + stime) if sched_clock uses something
- * other than jiffies.)
+ * Cumulative ns of schedule CPU time of dead threads in the
+ * group, not including a zombie group leader.
*/
unsigned long long sum_sched_runtime;

/*
+ * Cumulative ns of schedule CPU time of all finished child processes.
+ */
+ unsigned long long csum_sched_runtime;
+
+ /*
* We don't bother to synchronize most readers of this at all,
* because there is no reader checking a limit that actually needs
* to get both rlim_cur and rlim_max atomically, and either one
diff --git a/kernel/exit.c b/kernel/exit.c
index 27f0907..fb158f1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1094,6 +1094,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
sig = p->signal;
psig->cutime += tg_cputime.utime + sig->cutime;
psig->cstime += tg_cputime.stime + sig->cstime;
+ psig->csum_sched_runtime +=
+ tg_cputime.sum_exec_runtime + sig->csum_sched_runtime;
psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
psig->cmin_flt +=
p->min_flt + sig->min_flt + sig->cmin_flt;
--
1.7.1

2013-04-04 09:10:31

by Stanislaw Gruszka

Subject: [PATCH -tip 4/4] cputime: remove scaling

Scaling cputime causes problems; a bunch of them were fixed, but it is still
possible to hit a multiplication overflow issue, which makes {u,s}time values
incorrect. This problem has no good solution in the kernel.

This patch removes the scaling code and exports raw values of {u,s}time.
Procps programs can use the newly introduced sum_exec_runtime to find out the
precisely calculated process CPU time and scale utime/stime values
accordingly.

Unfortunately the times(2) syscall has no such option.

This change affects kernels compiled without CONFIG_VIRT_CPU_ACCOUNTING_*.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 4 +-
include/linux/sched.h | 20 --------
kernel/exit.c | 4 +-
kernel/fork.c | 3 -
kernel/sched/cputime.c | 117 +-----------------------------------------------
kernel/sys.c | 6 +-
6 files changed, 8 insertions(+), 146 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 1444dc5..5feadc4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -459,7 +459,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime_adjusted(task, &cputime);
+ thread_group_cputime(task, &cputime);
utime = cputime.utime;
stime = cputime.stime;
sum_exec_runtime = cputime.sum_exec_runtime;
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
- task_cputime_adjusted(task, &utime, &stime);
+ task_cputime(task, &utime, &stime);
sum_exec_runtime = task->se.sum_exec_runtime;
gtime = task_gtime(task);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c25772d..23c8ac3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -397,18 +397,6 @@ struct cpu_itimer {
};

/**
- * struct cputime - snaphsot of system and user cputime
- * @utime: time spent in user mode
- * @stime: time spent in system mode
- *
- * Gathers a generic snapshot of user and system time.
- */
-struct cputime {
- cputime_t utime;
- cputime_t stime;
-};
-
-/**
* struct task_cputime - collected CPU time counts
* @utime: time spent in user mode, in &cputime_t units
* @stime: time spent in kernel mode, in &cputime_t units
@@ -558,9 +546,6 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1161,9 +1146,6 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_t vtime_seqlock;
unsigned long long vtime_snap;
@@ -1597,8 +1579,6 @@ static inline cputime_t task_gtime(struct task_struct *t)
return t->gtime;
}
#endif
-extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index fb158f1..b6bd7ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1084,11 +1084,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* as other threads in the parent group can be right
* here reaping other children at the same time.
*
- * We use thread_group_cputime_adjusted() to get times for the thread
+ * We use thread_group_cputime() to get times for the thread
* group, which consolidates times for all threads in the
* group including the group leader.
*/
- thread_group_cputime_adjusted(p, &tg_cputime);
+ thread_group_cputime(p, &tg_cputime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
diff --git a/kernel/fork.c b/kernel/fork.c
index 339f60d..2ae1706 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,9 +1233,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,

p->utime = p->stime = p->gtime = 0;
p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_init(&p->vtime_seqlock);
p->vtime_snap = 0;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a600f7f..23df74b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -448,19 +448,7 @@ EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
#endif /* __ARCH_HAS_VTIME_ACCOUNT */
#endif /* CONFIG_VIRT_CPU_ACCOUNTING */

-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
- *ut = p->utime;
- *st = p->stime;
-}
-
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
- thread_group_cputime(p, cputime);
-}
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
/*
* Account a single tick of cpu time.
* @p: the process that the cpu time gets accounted to
@@ -516,109 +504,6 @@ void account_idle_ticks(unsigned long ticks)
account_idle_time(jiffies_to_cputime(ticks));
}

-/*
- * Perform (stime * rtime) / total with reduced chances
- * of multiplication overflows by using smaller factors
- * like quotient and remainders of divisions between
- * rtime and total.
- */
-static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
-{
- u64 rem, res, scaled;
-
- if (rtime >= total) {
- /*
- * Scale up to rtime / total then add
- * the remainder scaled to stime / total.
- */
- res = div64_u64_rem(rtime, total, &rem);
- scaled = stime * res;
- scaled += div64_u64(stime * rem, total);
- } else {
- /*
- * Same in reverse: scale down to total / rtime
- * then substract that result scaled to
- * to the remaining part.
- */
- res = div64_u64_rem(total, rtime, &rem);
- scaled = div64_u64(stime, res);
- scaled -= div64_u64(scaled * rem, total);
- }
-
- return (__force cputime_t) scaled;
-}
-
-/*
- * Adjust tick based cputime random precision against scheduler
- * runtime accounting.
- */
-static void cputime_adjust(struct task_cputime *curr,
- struct cputime *prev,
- cputime_t *ut, cputime_t *st)
-{
- cputime_t rtime, stime, total;
-
- if (vtime_accounting_enabled()) {
- *ut = curr->utime;
- *st = curr->stime;
- return;
- }
-
- stime = curr->stime;
- total = stime + curr->utime;
-
- /*
- * Tick based cputime accounting depend on random scheduling
- * timeslices of a task to be interrupted or not by the timer.
- * Depending on these circumstances, the number of these interrupts
- * may be over or under-optimistic, matching the real user and system
- * cputime with a variable precision.
- *
- * Fix this by scaling these tick based values against the total
- * runtime accounted by the CFS scheduler.
- */
- rtime = nsecs_to_cputime(curr->sum_exec_runtime);
-
- if (!rtime) {
- stime = 0;
- } else if (!total) {
- stime = rtime;
- } else {
- stime = scale_stime((__force u64)stime,
- (__force u64)rtime, (__force u64)total);
- }
-
- /*
- * If the tick based count grows faster than the scheduler one,
- * the result of the scaling may go backward.
- * Let's enforce monotonicity.
- */
- prev->stime = max(prev->stime, stime);
- prev->utime = max(prev->utime, rtime - prev->stime);
-
- *ut = prev->utime;
- *st = prev->stime;
-}
-
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
- struct task_cputime cputime = {
- .sum_exec_runtime = p->se.sum_exec_runtime,
- };
-
- task_cputime(p, &cputime.utime, &cputime.stime);
- cputime_adjust(&cputime, &p->prev_cputime, ut, st);
-}
-
-/*
- * Must be called with siglock held.
- */
-void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
-{
- thread_group_cputime(p, cputime);
- cputime_adjust(cputime, &p->signal->prev_cputime,
- &cputime->utime, &cputime->stime);
-}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/sys.c b/kernel/sys.c
index 2f555c1..00f143e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1049,7 +1049,7 @@ void do_sys_times(struct tms *tms)
struct task_cputime tg_cputime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime_adjusted(current, &tg_cputime);
+ thread_group_cputime(current, &tg_cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
@@ -1708,7 +1708,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
utime = stime = 0;

if (who == RUSAGE_THREAD) {
- task_cputime_adjusted(current, &utime, &stime);
+ task_cputime(current, &utime, &stime);
accumulate_thread_rusage(p, r);
maxrss = p->signal->maxrss;
goto out;
@@ -1734,7 +1734,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime_adjusted(p, &tg_cputime);
+ thread_group_cputime(p, &tg_cputime);
utime += tg_cputime.utime;
stime += tg_cputime.stime;
r->ru_nvcsw += p->signal->nvcsw;
--
1.7.1

2013-04-04 09:10:29

by Stanislaw Gruszka

Subject: [PATCH -tip 2/4] procfs: add sum_exec_runtime to /proc/PID/stat

Allow user space (e.g. top) to read the precise CPU time usage of
a process.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
Documentation/filesystems/proc.txt | 1 +
fs/proc/array.c | 5 +++++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fd8d0d5..ffd012b 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -319,6 +319,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
env_start address above which program environment is placed
env_end address below which program environment is placed
exit_code the thread's exit_code in the form reported by the waitpid system call
+ exec_time total process time spent on the CPU, in nanoseconds
..............................................................................

The /proc/PID/maps file containing the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index fef7850..ee47b29 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -403,6 +403,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
+ u64 sum_exec_runtime = 0;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -459,6 +460,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
thread_group_cputime_adjusted(task, &cputime);
utime = cputime.utime;
stime = cputime.stime;
+ sum_exec_runtime = cputime.sum_exec_runtime;
gtime += sig->gtime;
}

@@ -475,6 +477,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
min_flt = task->min_flt;
maj_flt = task->maj_flt;
task_cputime_adjusted(task, &utime, &stime);
+ sum_exec_runtime = task->se.sum_exec_runtime;
gtime = task_gtime(task);
}

@@ -554,6 +557,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
else
seq_put_decimal_ll(m, ' ', 0);

+ seq_put_decimal_ull(m, ' ', sum_exec_runtime);
+
seq_putc(m, '\n');
if (mm)
mmput(mm);
--
1.7.1

2013-04-04 09:10:26

by Stanislaw Gruszka

Subject: [PATCH -tip 1/4] cputime: change parameter of thread_group_cputime_adjusted

This is preparation for further changes in cputime code.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
fs/proc/array.c | 5 ++++-
include/linux/sched.h | 2 +-
kernel/exit.c | 8 ++++----
kernel/sched/cputime.c | 18 ++++++------------
kernel/sys.c | 18 ++++++++++--------
5 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..fef7850 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -446,6 +446,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
/* add up live thread stats at the group level */
if (whole) {
struct task_struct *t = task;
+ struct task_cputime cputime;
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
@@ -455,7 +456,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime_adjusted(task, &utime, &stime);
+ thread_group_cputime_adjusted(task, &cputime);
+ utime = cputime.utime;
+ stime = cputime.stime;
gtime += sig->gtime;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4c3925e..01fc6d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1595,7 +1595,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
}
#endif
extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *ct);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 60bc027..27f0907 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1067,7 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
- cputime_t tgutime, tgstime;
+ struct task_cputime tg_cputime;

/*
* The resource counters for the group leader are in its
@@ -1088,12 +1088,12 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* group, which consolidates times for all threads in the
* group including the group leader.
*/
- thread_group_cputime_adjusted(p, &tgutime, &tgstime);
+ thread_group_cputime_adjusted(p, &tg_cputime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
- psig->cutime += tgutime + sig->cutime;
- psig->cstime += tgstime + sig->cstime;
+ psig->cutime += tg_cputime.utime + sig->cutime;
+ psig->cstime += tg_cputime.stime + sig->cstime;
psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
psig->cmin_flt +=
p->min_flt + sig->min_flt + sig->cmin_flt;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4006b42..a600f7f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -456,14 +456,9 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
*st = p->stime;
}

-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
{
- struct task_cputime cputime;
-
- thread_group_cputime(p, &cputime);
-
- *ut = cputime.utime;
- *st = cputime.stime;
+ thread_group_cputime(p, cputime);
}
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
/*
@@ -618,12 +613,11 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
/*
* Must be called with siglock held.
*/
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_cputime_adjusted(struct task_struct *p, struct task_cputime *cputime)
{
- struct task_cputime cputime;
-
- thread_group_cputime(p, &cputime);
- cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
+ thread_group_cputime(p, cputime);
+ cputime_adjust(cputime, &p->signal->prev_cputime,
+ &cputime->utime, &cputime->stime);
}
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..2f555c1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1045,15 +1045,16 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- cputime_t tgutime, tgstime, cutime, cstime;
+ cputime_t cutime, cstime;
+ struct task_cputime tg_cputime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime_adjusted(current, &tgutime, &tgstime);
+ thread_group_cputime_adjusted(current, &tg_cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(tgutime);
- tms->tms_stime = cputime_to_clock_t(tgstime);
+ tms->tms_utime = cputime_to_clock_t(tg_cputime.utime);
+ tms->tms_stime = cputime_to_clock_t(tg_cputime.stime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1699,8 +1700,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t tgutime, tgstime, utime, stime;
+ cputime_t utime, stime;
unsigned long maxrss = 0;
+ struct task_cputime tg_cputime;

memset((char *) r, 0, sizeof *r);
utime = stime = 0;
@@ -1732,9 +1734,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime_adjusted(p, &tgutime, &tgstime);
- utime += tgutime;
- stime += tgstime;
+ thread_group_cputime_adjusted(p, &tg_cputime);
+ utime += tg_cputime.utime;
+ stime += tg_cputime.stime;
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.7.1

2013-04-04 12:31:44

by Frederic Weisbecker

Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel

2013/4/4 Stanislaw Gruszka <[email protected]>:
> This patch series removes cputime scaling from the kernel. Scaling can
> easily be done in user space using floating point if we provide
> sum_exec_runtime, which is what patches 2/4 and 3/4 do. I have a procps
> patch which utilizes that:
>
> http://people.redhat.com/sgruszka/procps-use-sum_exec_runtime.patch
>
> I will post it if this patch set is queued.
>
> The change also affects the getrusage() and times() syscalls, but I don't
> think the kernel gives guarantees about utime/stime precision; as a matter
> of fact, before commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa we did not
> perform any scaling and provided raw cputime values to user space.
>
> Providing sum_exec_runtime via proc guards against malware that utilizes a
> lot of CPU time but hides itself from the top program.
>
> This affects kernels not compiled with CONFIG_VIRT_CPU_ACCOUNTING_{GEN,NATIVE};
> if the user chooses to compile the kernel with one of those options, he/she
> will have more precise cputime accounting, as documented in Kconfig.
>

I don't know. I'm not convinced userland is the right place to perform
this kind of check. The kernel perhaps doesn't give guarantees about
utime/stime precision, but now users may have gotten used to that scaled
behaviour. It's also a matter of security: a malicious app can hide
from the tick to make its activity less visible to tools like top.

It's sort of an ABI breakage to remove such an implicit protection. And
fixing that from userspace with a lib or so won't change that fact.

How about that 128-bit based idea? I'm adding Paul Turner in Cc
because he seemed to agree with doing it using 128-bit maths.

2013-04-04 13:09:29

by Stanislaw Gruszka

Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel

On Thu, Apr 04, 2013 at 02:31:42PM +0200, Frederic Weisbecker wrote:
> 2013/4/4 Stanislaw Gruszka <[email protected]>:
> > This patch series removes cputime scaling from the kernel. Scaling can
> > easily be done in user space using floating point if we provide
> > sum_exec_runtime, which is what patches 2/4 and 3/4 do. I have a procps
> > patch which utilizes that:
> >
> > http://people.redhat.com/sgruszka/procps-use-sum_exec_runtime.patch
> >
> > I will post it if this patch set is queued.
> >
> > The change also affects the getrusage() and times() syscalls, but I don't
> > think the kernel gives guarantees about utime/stime precision; as a matter
> > of fact, before commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa we did not
> > perform any scaling and provided raw cputime values to user space.
> >
> > Providing sum_exec_runtime via proc guards against malware that utilizes a
> > lot of CPU time but hides itself from the top program.
> >
> > This affects kernels not compiled with CONFIG_VIRT_CPU_ACCOUNTING_{GEN,NATIVE};
> > if the user chooses to compile the kernel with one of those options, he/she
> > will have more precise cputime accounting, as documented in Kconfig.
> >
>
> I don't know. I'm not convinced userland is the right place to perform
> this kind of check. The kernel perhaps doesn't give guarantees about
> utime/stime precision, but now users may have gotten used to that scaled
> behaviour. It's also a matter of security: a malicious app can hide
> from the tick to make its activity less visible to tools like top.
>
> It's sort of an ABI breakage to remove such an implicit protection. And
> fixing that from userspace with a lib or so won't change that fact.

I think the number of fields in /proc/PID/stat is not part of the ABI. For
example, commit 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 added various
new fields at the end of the file. What is important for keeping the ABI
unchanged is not changing the order or meaning of the fields we already have.

Regarding top, I added those additional fields to allow top to detect
such malicious software. A patched top will work well with both old and new
(patched) kernels. The problem is old top with a new kernel, but I believe
users who care about security update their software regularly.

Besides, for most cases (not counting hostile software), this
statistical stime/utime accounting gives a good approximation of the CPU
time utilized by each process.

> How about that 128-bit based idea? I'm adding Paul Turner in Cc
> because he seemed to agree with doing it using 128-bit maths.

For the problem that I am trying to solve, 128-bit math is not necessary,
assuming we can do the multiplication in user space. Taking into account how
easily things can be done in user space using floating point math, I prefer
not to add complexity to the kernel. This solution makes the kernel simpler
and faster.

Stanislaw

2013-04-04 13:47:23

by Frederic Weisbecker

Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel

2013/4/4 Stanislaw Gruszka <[email protected]>:
> On Thu, Apr 04, 2013 at 02:31:42PM +0200, Frederic Weisbecker wrote:
>> I don't know. I'm not convinced userland is the right place to perform
>> this kind of check. The kernel perhaps doesn't give guarantees about
>> utime/stime precision, but now users may have gotten used to that scaled
>> behaviour. It's also a matter of security: a malicious app can hide
>> from the tick to make its activity less visible to tools like top.
>>
>> It's sort of an ABI breakage to remove such an implicit protection. And
>> fixing that from userspace with a lib or so won't change that fact.
>
> I think the number of fields in /proc/PID/stat is not part of the ABI. For
> example, commit 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 added various
> new fields at the end of the file. What is important for keeping the ABI
> unchanged is not changing the order or meaning of the fields we already have.

Oh I wasn't considering the layout of the proc file but the semantic
change in its utime/stime fields.

> Regarding top, I added those additional fields to allow top to detect
> such malicious software. A patched top will work well with both old and new
> (patched) kernels. The problem is old top with a new kernel, but I believe
> users who care about security update their software regularly.

The usual rule is that you can't remove a feature from the kernel
and tell userspace to fix it itself. That's basically an ABI breakage.
A semantic one. We do it sometimes for some cases. But the more we are
dealing with a central ABI component, the harder it is to change it.
And here it is quite a central component.

>
> Besides, for most cases (not counting hostile software), this
> statistical stime/utime accounting gives a good approximation of the CPU
> time utilized by each process.

Yeah, but still, users are now expecting that result to be scaled.

>
>> How about that 128-bit based idea? I'm adding Paul Turner in Cc
>> because he seemed to agree with doing it using 128-bit maths.
>
> For the problem that I am trying to solve, 128-bit math is not necessary,
> assuming we can do the multiplication in user space. Taking into account how
> easily things can be done in user space using floating point math, I prefer
> not to add complexity to the kernel. This solution makes the kernel simpler
> and faster.

I'm always all for making the kernel simpler and faster, as long as we
don't break userspace.

2013-04-05 12:55:21

by Stanislaw Gruszka

Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel

On Thu, Apr 04, 2013 at 03:47:19PM +0200, Frederic Weisbecker wrote:
> 2013/4/4 Stanislaw Gruszka <[email protected]>:
> > On Thu, Apr 04, 2013 at 02:31:42PM +0200, Frederic Weisbecker wrote:
> >> I don't know. I'm not convinced userland is the right place to perform
> >> this kind of check. The kernel perhaps doesn't give guarantees about
> >> utime/stime precision, but now users may have gotten used to that scaled
> >> behaviour. It's also a matter of security: a malicious app can hide
> >> from the tick to make its activity less visible to tools like top.
> >>
> >> It's sort of an ABI breakage to remove such an implicit protection. And
> >> fixing that from userspace with a lib or so won't change that fact.
> >
> > I think the number of fields in /proc/PID/stat is not part of the ABI. For
> > example, commit 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 added various
> > new fields at the end of the file. What is important for keeping the ABI
> > unchanged is not changing the order or meaning of the fields we already have.
>
> Oh I wasn't considering the layout of the proc file but the semantic
> change in its utime/stime fields.

Please see below.

> > Regarding top, I added those additional fields to allow top to detect
> > such malicious software. A patched top will work well with both old and new
> > (patched) kernels. The problem is old top with a new kernel, but I believe
> > users who care about security update their software regularly.
>
> The usual rule is that you can't remove a feature from the kernel
> and tell userspace to fix it itself. That's basically an ABI breakage.
> A semantic one. We do it sometimes for some cases. But the more we are
> dealing with a central ABI component, the harder it is to change it.
> And here it is quite a central component.

I don't think the patches change the ABI. The semantics (meaning) of
utime/stime are the same with and without those patches: they are user mode
and kernel mode jiffies of the process, and this does not change. What
changes is the kernel's calculation of those values, i.e. the implementation.

Of course there is this top-hiding malware, and that is a "tell userspace
to fix it itself" thing. But I consider this a special case.
I comment more on that at the end of the email.

> > Besides, for most cases (not counting hostile software), this
> > statistical stime/utime accounting gives a good approximation of the CPU
> > time utilized by each process.
>
> Yeah, but still, users are now expecting that result to be scaled.

I'm not sure about that, though I'm sure users expect other things. For
example, that the values will increase when a process utilizes the CPU, which
is now broken for long-running CPU-bound processes. They also expect that the
values will not decrease, which was broken by adding cputime scaling but is
now fixed. Or that the values will not increase if a process does not utilize
the CPU, which was another kernel bug, also now fixed.

My point here is that, both in the past and now, cputime scaling has
caused a lot of trouble, and getting rid of it would be a good thing to do.

> >> How about that 128bits based idea? I'm adding Paul Turner in Cc
> >> because he seemed to agree with doing it using 128bits maths.
> >
> > For the problem that I am trying to solve, 128-bit math is not necessary,
> > assuming we can do the multiplication in user space. Taking into account
> > how easily things can be done in user space using floating point math,
> > I prefer not to add complexity in the kernel. This solution makes the
> > kernel simpler and faster.
>
> I'm always all for making the kernel simpler and faster, as long as we
> don't break userspace.

I don't think the change breaks user space in general, except for the
special case of this top-hiding malware. But since this is _top_-hiding
malware, it can be fixed in _top_. Consider it this way: the kernel
changed the way it calculates utime/stime to help top against hiding
malware. But since that calculation is complicated and affects kernel
performance, we will provide another mechanism that top can use to
protect itself against the malware. Then we can remove the complicated
utime/stime calculation.

Note that times(2) and getrusage(2) also use scaled cputime values, but
only for the calling process (getrusage(2) also for terminated children),
so there is no top-hiding-like issue for those.

Also note that the CPU timers (setitimer(2) and timer_create(2)) use the
raw {signal,task}->{u,s}time values, not the scaled ones.

In general I agree with Frederic that we should not remove a feature that
was added to the kernel. But in this case I'm not sure it is worth keeping
it and making the kernel more complex, since the main (and only) problem
that the feature solves can easily be handled by top.

Would it help if I wrote that top-hiding exploit (for the patched kernel)
and asked for a CVE number to be assigned (not sure whether that would be
accepted), then made sure that the CVE was fixed and released in top, and
then patched and released the kernel with this patch set? Does that sound
reasonable?

If not, I can continue trying to solve the problem by creating a patch
with 128-bit math (or some other solution, though we have already tried
different approaches with no luck).

Stanislaw

2013-04-08 15:32:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel


* Frederic Weisbecker <[email protected]> wrote:

> 2013/4/4 Stanislaw Gruszka <[email protected]>:
> > On Thu, Apr 04, 2013 at 02:31:42PM +0200, Frederic Weisbecker wrote:
> >> I don't know. I'm not convinced userland is the right place to perform
> >> this kind of check. The kernel perhaps doesn't give guarantee about
> >> utime/stime precision but now users may have got used to that scaled
> >> behaviour. It's also a matter of security, a malicious app can hide
> >> from the tick to make its activity less visible from tools like top.
> >>
> >> It's sort of an ABI breakage to remove such an implicit protection. And
> >> fixing that from userspace with a lib or so won't change that fact.
> >
> > I think the number of fields in /proc/PID/stat is not part of the ABI. For
> > example commit 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 adds various
> > new fields at the end of the file. What is important for keeping the
> > ABI unchanged is not changing the order or meaning of fields we already have.
>
> Oh I wasn't considering the layout of the proc file but the semantic
> change in its utime/stime fields.

Btw., even the ordering of fields in /proc/PID/stat might be an ABI, if an
application relies on it and breaks if we change it.

What matters is what applications do, not what we think they do or what we think
they should do in an ideal world.

> > Regarding top, I added those additional fields to allow top to detect such
> > malicious software. A patched top will work well with both old and new
> > (patched) kernels. The problem is old top with a new kernel, but I believe
> > users who care about security update their software regularly.
>
> The usual rule is that you can't remove a feature from the kernel and tell
> userspace to fix it itself. That's basically an ABI breakage.

Correct.

Thanks,

Ingo

2013-04-11 15:17:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH -tip 0/4] do not make cputime scaling in kernel

On Mon, Apr 08, 2013 at 05:32:50PM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > 2013/4/4 Stanislaw Gruszka <[email protected]>:
> > > On Thu, Apr 04, 2013 at 02:31:42PM +0200, Frederic Weisbecker wrote:
> > >> I don't know. I'm not convinced userland is the right place to perform
> > >> this kind of check. The kernel perhaps doesn't give guarantee about
> > >> utime/stime precision but now users may have got used to that scaled
> > >> behaviour. It's also a matter of security, a malicious app can hide
> > >> from the tick to make its activity less visible from tools like top.
> > >>
> > >> It's sort of an ABI breakage to remove such an implicit protection. And
> > >> fixing that from userspace with a lib or so won't change that fact.
> > >
> > > I think the number of fields in /proc/PID/stat is not part of the ABI. For
> > > example commit 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 adds various
> > > new fields at the end of the file. What is important for keeping the
> > > ABI unchanged is not changing the order or meaning of fields we already have.
> >
> > Oh I wasn't considering the layout of the proc file but the semantic
> > change in its utime/stime fields.
>
> Btw., even the ordering of fields in /proc/PID/stat might be an ABI, if an
> application relies on it and breaks if we change it.

Sure, but it seems there are exceptions, as with the above-mentioned commit.

>
> What matters is what applications do, not what we think they do or what we think
> they should do in an ideal world.

Agreed.