This follows up on Martin Schwidefsky's patch, which proposes delaying
cputime accounting to the tick in order to minimize the calls to
account_system_time() and the like, as these functions can carry quite
some overhead:
http://lkml.kernel.org/r/20161121111728.13a0a3db@mschwide
The set includes Martin's patch, rebased on top of tip:sched/core and
the latest s390 changes, and extends it to the other implementations of
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE (powerpc and ia64), along with a few
core changes to tie the whole together.
Only build-tested though, as I don't have access to any of these archs.
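To make the idea more concrete, here is a condensed sketch of the pattern
the series applies. This is hypothetical composite code, not taken
verbatim from any arch: vtime_delta() stands for the raw time elapsed
since the last snapshot, and the _old/_new/_sketch suffixes only serve
to contrast the two schemes. The context-boundary hook becomes a cheap
accumulation while the costly accounting calls run once per tick or
task switch:

/* Old scheme: every boundary (e.g. twice per IRQ) pays the full cost */
void vtime_account_irq_enter_old(struct task_struct *tsk)
{
	account_system_time(tsk, 0, vtime_delta(tsk));	/* costly */
}

/* New scheme: the boundary hook just accumulates a raw delta */
void vtime_account_irq_enter_new(struct task_struct *tsk)
{
	get_accounting(tsk)->hardirq_time += vtime_delta(tsk);
}

/* Tick or task switch: flush the accumulator through the costly path once */
void vtime_flush_sketch(struct task_struct *tsk)
{
	struct cpu_accounting_data *acct = get_accounting(tsk);

	if (acct->hardirq_time)
		account_system_index_time(tsk, acct->hardirq_time, CPUTIME_IRQ);
	acct->hardirq_time = 0;
}

The per-arch patches below implement this split, with additional buckets
for user, guest, softirq, idle and steal time.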
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
vtime/acc
HEAD: ee6c393b212193bc01818c7cf4ae9cba3f469f00
Thanks,
Frederic
---
Frederic Weisbecker (9):
powerpc32: Fix stale scaled stime on context switch
ia64: Fix wrong start cputime assignment on task switch
cputime: Allow accounting system time using cpustat index
cputime: Export account_guest_time
powerpc: Prepare accounting structure for cputime flush on tick
powerpc: Migrate stolen_time field to accounting structure
powerpc/vtime: Accumulate cputime and account only on tick/task switch
ia64: Accumulate cputime and account only on tick/task switch
vtime: Rename vtime_account_user() to vtime_flush()
Martin Schwidefsky (1):
s390/cputime: delayed accounting of system time
arch/ia64/include/asm/thread_info.h | 6 ++
arch/ia64/kernel/time.c | 66 +++++++++++-----
arch/powerpc/include/asm/accounting.h | 14 +++-
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/kernel/asm-offsets.c | 8 +-
arch/powerpc/kernel/time.c | 138 +++++++++++++++++++++-------------
arch/powerpc/xmon/xmon.c | 8 +-
arch/s390/include/asm/lowcore.h | 65 ++++++++--------
arch/s390/include/asm/processor.h | 3 +
arch/s390/kernel/vtime.c | 116 +++++++++++++++++-----------
include/linux/kernel_stat.h | 7 +-
include/linux/vtime.h | 7 +-
kernel/sched/cputime.c | 25 +++---
13 files changed, 295 insertions(+), 169 deletions(-)
On powerpc32, the cputime is accumulated in the thread_info struct on
context switch. So the switching-in task must move its start time
snapshot forward to the current time in order to later compute the
delta spent in system mode.
This is what we do for the normal cputime by initializing the starttime
field to the value of the previous task's starttime which got freshly
updated.
But we are missing the update of the scaled cputime start time. As a
result we may be accounting too much scaled cputime later.
Fix this by initializing the scaled cputime the same way we do for
normal cputime.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/powerpc/kernel/time.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index be9751f..bca2781 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -407,6 +407,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
struct cpu_accounting_data *acct = get_accounting(current);
acct->starttime = get_accounting(prev)->starttime;
+ acct->startspurr = get_accounting(prev)->startspurr;
acct->system_time = 0;
acct->user_time = 0;
}
--
2.7.4
Move the stolen_time field into the accounting structure in order to
gather all cputime accumulation in the same place.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/kernel/time.c | 6 +++---
arch/powerpc/xmon/xmon.c | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 6a6792b..708c3e5 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -187,7 +187,6 @@ struct paca_struct {
/* Stuff for accurate time accounting */
struct cpu_accounting_data accounting;
- u64 stolen_time; /* TB ticks taken by hypervisor */
u64 dtl_ridx; /* read index in dispatch log */
struct dtl_entry *dtl_curr; /* pointer corresponding to dtl_ridx */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 613acb4..9625fec 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -273,7 +273,7 @@ void accumulate_stolen_time(void)
ust = scan_dispatch_log(acct->starttime);
acct->stime -= sst;
acct->utime -= ust;
- local_paca->stolen_time += ust + sst;
+ acct->steal_time += ust + sst;
local_paca->soft_enabled = save_soft_enabled;
}
@@ -288,8 +288,8 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
acct->stime -= stolen;
}
- stolen += get_paca()->stolen_time;
- get_paca()->stolen_time = 0;
+ stolen += acct->steal_time;
+ acct->steal_time = 0;
return stolen;
}
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 0a3a3e8..fd7b2b0 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2291,7 +2291,7 @@ static void dump_one_paca(int cpu)
DUMP(p, accounting.starttime_user, "llx");
DUMP(p, accounting.startspurr, "llx");
DUMP(p, accounting.utime_sspurr, "llx");
- DUMP(p, stolen_time, "llx");
+ DUMP(p, accounting.steal_time, "llx");
#undef DUMP
catch_memory_errors = 0;
--
2.7.4
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to delay
cputime accounting to the tick, let's provide APIs to account system
time to precise contexts: hardirq, softirq, pure system, ...
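For illustration only (not part of the patch), an arch flush path could
consume the new API as in the following sketch, where scale_vtime()
stands in for an arch scaling helper such as the one the s390 patch
adds later in the series:

#include <linux/kernel_stat.h>

static void flush_softirq_time(struct task_struct *tsk, cputime_t delta)
{
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
	/* One call updates p->stime, p->stimescaled and CPUTIME_SOFTIRQ */
	account_system_index_scaled(tsk, delta, scale_vtime(delta),
				    CPUTIME_SOFTIRQ);
#else
	account_system_index_time(tsk, delta, CPUTIME_SOFTIRQ);
#endif
}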
Inspired-by: Martin Schwidefsky <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/kernel_stat.h | 4 ++++
kernel/sched/cputime.c | 19 ++++++++++++++-----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 00f7768..544aa86 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -80,10 +80,14 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
extern void account_user_time(struct task_struct *, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t);
+extern void account_system_index_time(struct task_struct *, cputime_t,
+ enum cpu_usage_stat);
extern void account_steal_time(cputime_t);
extern void account_idle_time(cputime_t);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void account_system_index_scaled(struct task_struct *, cputime_t,
+ cputime_t, enum cpu_usage_stat);
static inline void account_process_tick(struct task_struct *tsk, int user)
{
vtime_account_user(tsk);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7700a9c..0e6ea60 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -176,8 +176,8 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
* @cputime: the cpu time spent in kernel space since the last update
* @index: pointer to cpustat field that has to be updated
*/
-static inline
-void __account_system_time(struct task_struct *p, cputime_t cputime, int index)
+void account_system_index_time(struct task_struct *p,
+ cputime_t cputime, enum cpu_usage_stat index)
{
/* Add system time to process. */
p->stime += cputime;
@@ -190,6 +190,15 @@ void __account_system_time(struct task_struct *p, cputime_t cputime, int index)
acct_account_cputime(p);
}
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
+void account_system_index_scaled(struct task_struct *p, cputime_t cputime,
+ cputime_t scaled, enum cpu_usage_stat index)
+{
+ p->stimescaled += scaled;
+ account_system_index_time(p, cputime, index);
+}
+#endif
+
/*
* Account system cpu time to a process.
* @p: the process that the cpu time gets accounted to
@@ -213,7 +222,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
else
index = CPUTIME_SYSTEM;
- __account_system_time(p, cputime, index);
+ account_system_index_time(p, cputime, index);
}
/*
@@ -400,7 +409,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
* So, we have to handle it separately here.
* Also, p->stime needs to be updated for ksoftirqd.
*/
- __account_system_time(p, cputime, CPUTIME_SOFTIRQ);
+ account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
} else if (user_tick) {
account_user_time(p, cputime);
} else if (p == rq->idle) {
@@ -408,7 +417,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
} else if (p->flags & PF_VCPU) { /* System time or guest time */
account_guest_time(p, cputime);
} else {
- __account_system_time(p, cputime, CPUTIME_SYSTEM);
+ account_system_index_time(p, cputime, CPUTIME_SYSTEM);
}
}
--
2.7.4
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to delay
cputime accounting to the tick, let's allow archs to account cputime
directly to gtime.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/kernel_stat.h | 1 +
kernel/sched/cputime.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 544aa86..1cee310 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -79,6 +79,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
}
extern void account_user_time(struct task_struct *, cputime_t);
+extern void account_guest_time(struct task_struct *, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t);
extern void account_system_index_time(struct task_struct *, cputime_t,
enum cpu_usage_stat);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0e6ea60..dd67770 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -151,7 +151,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
* @p: the process that the cpu time gets accounted to
* @cputime: the cpu time spent in virtual machine since the last update
*/
-static void account_guest_time(struct task_struct *p, cputime_t cputime)
+void account_guest_time(struct task_struct *p, cputime_t cputime)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
--
2.7.4
Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE accounts the cputime on
any context boundary: irq entry/exit, guest entry/exit, context switch,
etc.
Calling functions such as account_system_time() and account_user_time()
can be costly, especially when they are called on many fastpaths, such
as twice per IRQ. Those functions do more than just account to kcpustat
and task cputime: depending on the config, some subsystems can perform
unpleasant multiplications and divisions, among other things.
So let's accumulate the cputime instead and delay the accounting to
ticks and context switches only.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/powerpc/kernel/time.c | 120 +++++++++++++++++++++++++++++----------------
1 file changed, 77 insertions(+), 43 deletions(-)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9625fec..093588a 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -280,17 +280,10 @@ void accumulate_stolen_time(void)
static inline u64 calculate_stolen_time(u64 stop_tb)
{
- u64 stolen = 0;
- struct cpu_accounting_data *acct = &local_paca->accounting;
+ if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
+ return scan_dispatch_log(stop_tb);
- if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx)) {
- stolen = scan_dispatch_log(stop_tb);
- acct->stime -= stolen;
- }
-
- stolen += acct->steal_time;
- acct->steal_time = 0;
- return stolen;
+ return 0;
}
#else /* CONFIG_PPC_SPLPAR */
@@ -306,27 +299,26 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
* or soft irq state.
*/
static unsigned long vtime_delta(struct task_struct *tsk,
- unsigned long *sys_scaled,
- unsigned long *stolen)
+ unsigned long *stime_scaled,
+ unsigned long *steal_time)
{
unsigned long now, nowscaled, deltascaled;
- unsigned long udelta, delta, user_scaled;
+ unsigned long stime;
+ unsigned long utime, utime_scaled;
struct cpu_accounting_data *acct = get_accounting(tsk);
WARN_ON_ONCE(!irqs_disabled());
now = mftb();
nowscaled = read_spurr(now);
- acct->stime += now - acct->starttime;
+ stime = now - acct->starttime;
acct->starttime = now;
deltascaled = nowscaled - acct->startspurr;
acct->startspurr = nowscaled;
- *stolen = calculate_stolen_time(now);
+ *steal_time = calculate_stolen_time(now);
- delta = acct->stime;
- acct->stime = 0;
- udelta = acct->utime - acct->utime_sspurr;
+ utime = acct->utime - acct->utime_sspurr;
acct->utime_sspurr = acct->utime;
/*
@@ -339,39 +331,54 @@ static unsigned long vtime_delta(struct task_struct *tsk,
* the user ticks get saved up in paca->user_time_scaled to be
* used by account_process_tick.
*/
- *sys_scaled = delta;
- user_scaled = udelta;
- if (deltascaled != delta + udelta) {
- if (udelta) {
- *sys_scaled = deltascaled * delta / (delta + udelta);
- user_scaled = deltascaled - *sys_scaled;
+ *stime_scaled = stime;
+ utime_scaled = utime;
+ if (deltascaled != stime + utime) {
+ if (utime) {
+ *stime_scaled = deltascaled * stime / (stime + utime);
+ utime_scaled = deltascaled - *stime_scaled;
} else {
- *sys_scaled = deltascaled;
+ *stime_scaled = deltascaled;
}
}
- acct->utime_scaled += user_scaled;
+ acct->utime_scaled += utime_scaled;
- return delta;
+ return stime;
}
void vtime_account_system(struct task_struct *tsk)
{
- unsigned long delta, sys_scaled, stolen;
+ unsigned long stime, stime_scaled, steal_time;
+ struct cpu_accounting_data *acct = get_accounting(tsk);
- delta = vtime_delta(tsk, &sys_scaled, &stolen);
- account_system_time(tsk, 0, delta);
- tsk->stimescaled += sys_scaled;
- if (stolen)
- account_steal_time(stolen);
+ stime = vtime_delta(tsk, &stime_scaled, &steal_time);
+
+ stime -= min(stime, steal_time);
+ acct->steal_time += steal_time;
+
+ if ((tsk->flags & PF_VCPU) && !irq_count()) {
+ acct->gtime += stime;
+ acct->utime_scaled += stime_scaled;
+ } else {
+ if (hardirq_count())
+ acct->hardirq_time += stime;
+ else if (in_serving_softirq())
+ acct->softirq_time += stime;
+ else
+ acct->stime += stime;
+
+ acct->stime_scaled += stime_scaled;
+ }
}
EXPORT_SYMBOL_GPL(vtime_account_system);
void vtime_account_idle(struct task_struct *tsk)
{
- unsigned long delta, sys_scaled, stolen;
+ unsigned long stime, stime_scaled, steal_time;
+ struct cpu_accounting_data *acct = get_accounting(tsk);
- delta = vtime_delta(tsk, &sys_scaled, &stolen);
- account_idle_time(delta + stolen);
+ stime = vtime_delta(tsk, &stime_scaled, &steal_time);
+ acct->idle_time += stime + steal_time;
}
/*
@@ -385,16 +392,45 @@ void vtime_account_idle(struct task_struct *tsk)
*/
void vtime_account_user(struct task_struct *tsk)
{
- cputime_t utime, utimescaled;
struct cpu_accounting_data *acct = get_accounting(tsk);
- utime = acct->utime;
- utimescaled = acct->utime_scaled;
+ if (acct->utime)
+ account_user_time(tsk, acct->utime);
+
+ if (acct->utime_scaled)
+ tsk->utimescaled += acct->utime_scaled;
+
+ if (acct->gtime)
+ account_guest_time(tsk, acct->gtime);
+
+ if (acct->steal_time)
+ account_steal_time(acct->steal_time);
+
+ if (acct->idle_time)
+ account_idle_time(acct->idle_time);
+
+ if (acct->stime)
+ account_system_index_time(tsk, acct->stime, CPUTIME_SYSTEM);
+
+ if (acct->stime_scaled)
+ tsk->stimescaled += acct->stime_scaled;
+
+ if (acct->hardirq_time)
+ account_system_index_time(tsk, acct->hardirq_time, CPUTIME_IRQ);
+
+ if (acct->softirq_time)
+ account_system_index_time(tsk, acct->softirq_time, CPUTIME_SOFTIRQ);
+
acct->utime = 0;
acct->utime_scaled = 0;
acct->utime_sspurr = 0;
- account_user_time(tsk, utime);
- tsk->utimescaled += utimescaled;
+ acct->gtime = 0;
+ acct->steal_time = 0;
+ acct->idle_time = 0;
+ acct->stime = 0;
+ acct->stime_scaled = 0;
+ acct->hardirq_time = 0;
+ acct->softirq_time = 0;
}
#ifdef CONFIG_PPC32
@@ -409,8 +445,6 @@ void arch_vtime_task_switch(struct task_struct *prev)
acct->starttime = get_accounting(prev)->starttime;
acct->startspurr = get_accounting(prev)->startspurr;
- acct->stime = 0;
- acct->utime = 0;
}
#endif /* CONFIG_PPC32 */
--
2.7.4
From: Martin Schwidefsky <[email protected]>
The account_system_time() function is called with a cputime that
occurred while running in the kernel. The function detects which
context the CPU is currently running in and accounts the time to
the correct bucket. This forces the arch code to account the
cputime for hardirq and softirq immediately.
Such an accounting function can be costly and performs unwelcome
divisions and multiplications, among other things.
The arch code can delay the accounting for system time. For s390
the accounting is done once per timer tick and for each task switch.
Signed-off-by: Martin Schwidefsky <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
[rebase against latest cputime tree, massaged changelog accordingly]
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/s390/include/asm/lowcore.h | 65 +++++++++++-----------
arch/s390/include/asm/processor.h | 3 +
arch/s390/kernel/vtime.c | 112 +++++++++++++++++++++++---------------
3 files changed, 106 insertions(+), 74 deletions(-)
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 9bfad2a..61261e0e 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -85,53 +85,56 @@ struct lowcore {
__u64 mcck_enter_timer; /* 0x02c0 */
__u64 exit_timer; /* 0x02c8 */
__u64 user_timer; /* 0x02d0 */
- __u64 system_timer; /* 0x02d8 */
- __u64 steal_timer; /* 0x02e0 */
- __u64 last_update_timer; /* 0x02e8 */
- __u64 last_update_clock; /* 0x02f0 */
- __u64 int_clock; /* 0x02f8 */
- __u64 mcck_clock; /* 0x0300 */
- __u64 clock_comparator; /* 0x0308 */
+ __u64 guest_timer; /* 0x02d8 */
+ __u64 system_timer; /* 0x02e0 */
+ __u64 hardirq_timer; /* 0x02e8 */
+ __u64 softirq_timer; /* 0x02f0 */
+ __u64 steal_timer; /* 0x02f8 */
+ __u64 last_update_timer; /* 0x0300 */
+ __u64 last_update_clock; /* 0x0308 */
+ __u64 int_clock; /* 0x0310 */
+ __u64 mcck_clock; /* 0x0318 */
+ __u64 clock_comparator; /* 0x0320 */
/* Current process. */
- __u64 current_task; /* 0x0310 */
- __u8 pad_0x318[0x320-0x318]; /* 0x0318 */
- __u64 kernel_stack; /* 0x0320 */
+ __u64 current_task; /* 0x0328 */
+ __u8 pad_0x318[0x320-0x318]; /* 0x0330 */
+ __u64 kernel_stack; /* 0x0338 */
/* Interrupt, panic and restart stack. */
- __u64 async_stack; /* 0x0328 */
- __u64 panic_stack; /* 0x0330 */
- __u64 restart_stack; /* 0x0338 */
+ __u64 async_stack; /* 0x0340 */
+ __u64 panic_stack; /* 0x0348 */
+ __u64 restart_stack; /* 0x0350 */
/* Restart function and parameter. */
- __u64 restart_fn; /* 0x0340 */
- __u64 restart_data; /* 0x0348 */
- __u64 restart_source; /* 0x0350 */
+ __u64 restart_fn; /* 0x0358 */
+ __u64 restart_data; /* 0x0360 */
+ __u64 restart_source; /* 0x0368 */
/* Address space pointer. */
- __u64 kernel_asce; /* 0x0358 */
- __u64 user_asce; /* 0x0360 */
+ __u64 kernel_asce; /* 0x0370 */
+ __u64 user_asce; /* 0x0378 */
/*
* The lpp and current_pid fields form a
* 64-bit value that is set as program
* parameter with the LPP instruction.
*/
- __u32 lpp; /* 0x0368 */
- __u32 current_pid; /* 0x036c */
+ __u32 lpp; /* 0x0380 */
+ __u32 current_pid; /* 0x0384 */
/* SMP info area */
- __u32 cpu_nr; /* 0x0370 */
- __u32 softirq_pending; /* 0x0374 */
- __u64 percpu_offset; /* 0x0378 */
- __u64 vdso_per_cpu_data; /* 0x0380 */
- __u64 machine_flags; /* 0x0388 */
- __u32 preempt_count; /* 0x0390 */
- __u8 pad_0x0394[0x0398-0x0394]; /* 0x0394 */
- __u64 gmap; /* 0x0398 */
- __u32 spinlock_lockval; /* 0x03a0 */
- __u32 fpu_flags; /* 0x03a4 */
- __u8 pad_0x03a8[0x0400-0x03a8]; /* 0x03a8 */
+ __u32 cpu_nr; /* 0x0388 */
+ __u32 softirq_pending; /* 0x038c */
+ __u64 percpu_offset; /* 0x0390 */
+ __u64 vdso_per_cpu_data; /* 0x0398 */
+ __u64 machine_flags; /* 0x03a0 */
+ __u32 preempt_count; /* 0x03a8 */
+ __u8 pad_0x03ac[0x03b0-0x03ac]; /* 0x03ac */
+ __u64 gmap; /* 0x03b0 */
+ __u32 spinlock_lockval; /* 0x03b8 */
+ __u32 fpu_flags; /* 0x03bc */
+ __u8 pad_0x03c0[0x0400-0x03c0]; /* 0x03c0 */
/* Per cpu primary space access list */
__u32 paste[16]; /* 0x0400 */
diff --git a/arch/s390/include/asm/processor.h b/arch/s390/include/asm/processor.h
index 9c00351..6f07907 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -111,7 +111,10 @@ struct thread_struct {
unsigned int acrs[NUM_ACRS];
unsigned long ksp; /* kernel stack pointer */
unsigned long user_timer; /* task cputime in user space */
+ unsigned long guest_timer; /* task cputime in kvm guest */
unsigned long system_timer; /* task cputime in kernel space */
+ unsigned long hardirq_timer; /* task cputime in hardirq context */
+ unsigned long softirq_timer; /* task cputime in softirq context */
unsigned long sys_call_table; /* system call table address */
mm_segment_t mm_segment;
unsigned long gmap_addr; /* address of last gmap fault. */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 6b246aa..0fdcaca 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -90,14 +90,23 @@ static void update_mt_scaling(void)
__this_cpu_write(mt_scaling_jiffies, jiffies_64);
}
+static inline u64 scale_vtime(u64 vtime)
+{
+ u64 mult = __this_cpu_read(mt_scaling_mult);
+ u64 div = __this_cpu_read(mt_scaling_div);
+
+ if (smp_cpu_mtid)
+ return vtime * mult / div;
+ return vtime;
+}
+
/*
* Update process times based on virtual cpu times stored by entry.S
* to the lowcore fields user_timer, system_timer & steal_clock.
*/
static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
{
- u64 timer, clock, user, system, steal;
- u64 user_scaled, system_scaled;
+ u64 timer, clock, user, guest, system, hardirq, softirq, steal;
timer = S390_lowcore.last_update_timer;
clock = S390_lowcore.last_update_clock;
@@ -110,36 +119,57 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
#endif
: "=m" (S390_lowcore.last_update_timer),
"=m" (S390_lowcore.last_update_clock));
- S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
- S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
+ clock = S390_lowcore.last_update_clock - clock;
+ timer -= S390_lowcore.last_update_timer;
+
+ if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
+ S390_lowcore.guest_timer += timer;
+ else if (hardirq_count() - hardirq_offset)
+ S390_lowcore.hardirq_timer += timer;
+ else if (in_serving_softirq())
+ S390_lowcore.softirq_timer += timer;
+ else
+ S390_lowcore.system_timer += timer;
/* Update MT utilization calculation */
if (smp_cpu_mtid &&
time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
update_mt_scaling();
+ /* Calculate cputime delta */
user = S390_lowcore.user_timer - tsk->thread.user_timer;
- S390_lowcore.steal_timer -= user;
tsk->thread.user_timer = S390_lowcore.user_timer;
-
+ guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
+ tsk->thread.guest_timer = S390_lowcore.guest_timer;
system = S390_lowcore.system_timer - tsk->thread.system_timer;
- S390_lowcore.steal_timer -= system;
tsk->thread.system_timer = S390_lowcore.system_timer;
+ hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
+ tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
+ softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
+ tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
+ S390_lowcore.steal_timer +=
+ clock - user - guest - system - hardirq - softirq;
- user_scaled = user;
- system_scaled = system;
- /* Do MT utilization scaling */
- if (smp_cpu_mtid) {
- u64 mult = __this_cpu_read(mt_scaling_mult);
- u64 div = __this_cpu_read(mt_scaling_div);
+ /* Push account value */
+ if (user) {
+ account_user_time(tsk, user);
+ tsk->utimescaled += scale_vtime(user);
+ }
- user_scaled = (user_scaled * mult) / div;
- system_scaled = (system_scaled * mult) / div;
+ if (guest) {
+ account_guest_time(tsk, guest);
+ tsk->utimescaled += scale_vtime(guest);
}
- account_user_time(tsk, user);
- tsk->utimescaled += user_scaled;
- account_system_time(tsk, hardirq_offset, system);
- tsk->stimescaled += system_scaled;
+
+ if (system)
+ account_system_index_scaled(tsk, system, scale_vtime(system),
+ CPUTIME_SYSTEM);
+ if (hardirq)
+ account_system_index_scaled(tsk, hardirq, scale_vtime(hardirq),
+ CPUTIME_IRQ);
+ if (softirq)
+ account_system_index_scaled(tsk, softirq, scale_vtime(softirq),
+ CPUTIME_SOFTIRQ);
steal = S390_lowcore.steal_timer;
if ((s64) steal > 0) {
@@ -147,16 +177,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
account_steal_time(steal);
}
- return virt_timer_forward(user + system);
+ return virt_timer_forward(user + guest + system + hardirq + softirq);
}
void vtime_task_switch(struct task_struct *prev)
{
do_account_vtime(prev, 0);
prev->thread.user_timer = S390_lowcore.user_timer;
+ prev->thread.guest_timer = S390_lowcore.guest_timer;
prev->thread.system_timer = S390_lowcore.system_timer;
+ prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
+ prev->thread.softirq_timer = S390_lowcore.softirq_timer;
S390_lowcore.user_timer = current->thread.user_timer;
+ S390_lowcore.guest_timer = current->thread.guest_timer;
S390_lowcore.system_timer = current->thread.system_timer;
+ S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
+ S390_lowcore.softirq_timer = current->thread.softirq_timer;
}
/*
@@ -176,32 +212,22 @@ void vtime_account_user(struct task_struct *tsk)
*/
void vtime_account_irq_enter(struct task_struct *tsk)
{
- u64 timer, system, system_scaled;
+ u64 timer;
timer = S390_lowcore.last_update_timer;
S390_lowcore.last_update_timer = get_vtimer();
- S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
-
- /* Update MT utilization calculation */
- if (smp_cpu_mtid &&
- time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
- update_mt_scaling();
-
- system = S390_lowcore.system_timer - tsk->thread.system_timer;
- S390_lowcore.steal_timer -= system;
- tsk->thread.system_timer = S390_lowcore.system_timer;
- system_scaled = system;
- /* Do MT utilization scaling */
- if (smp_cpu_mtid) {
- u64 mult = __this_cpu_read(mt_scaling_mult);
- u64 div = __this_cpu_read(mt_scaling_div);
-
- system_scaled = (system_scaled * mult) / div;
- }
- account_system_time(tsk, 0, system);
- tsk->stimescaled += system_scaled;
-
- virt_timer_forward(system);
+ timer -= S390_lowcore.last_update_timer;
+
+ if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
+ S390_lowcore.guest_timer += timer;
+ else if (hardirq_count())
+ S390_lowcore.hardirq_timer += timer;
+ else if (in_serving_softirq())
+ S390_lowcore.softirq_timer += timer;
+ else
+ S390_lowcore.system_timer += timer;
+
+ virt_timer_forward(timer);
}
EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
--
2.7.4
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to delay
cputime accounting to the tick, provide fine-grained accumulators to
powerpc in order to store the cputime until it gets flushed.
While at it, normalize the names of several fields to follow common
cputime naming.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/powerpc/include/asm/accounting.h | 14 +++++++++++---
arch/powerpc/kernel/asm-offsets.c | 8 ++++----
arch/powerpc/kernel/time.c | 31 ++++++++++++++++---------------
arch/powerpc/xmon/xmon.c | 6 +++---
4 files changed, 34 insertions(+), 25 deletions(-)
diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/include/asm/accounting.h
index c133246..3abcf98 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -12,9 +12,17 @@
/* Stuff for accurate time accounting */
struct cpu_accounting_data {
- unsigned long user_time; /* accumulated usermode TB ticks */
- unsigned long system_time; /* accumulated system TB ticks */
- unsigned long user_time_scaled; /* accumulated usermode SPURR ticks */
+	/* Accumulated cputime values to flush on ticks */
+ unsigned long utime;
+ unsigned long stime;
+ unsigned long utime_scaled;
+ unsigned long stime_scaled;
+ unsigned long gtime;
+ unsigned long hardirq_time;
+ unsigned long softirq_time;
+ unsigned long steal_time;
+ unsigned long idle_time;
+ /* Internal counters */
unsigned long starttime; /* TB value snapshot */
unsigned long starttime_user; /* TB value on exit to usermode */
unsigned long startspurr; /* SPURR value snapshot */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index caec7bf..170319d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -249,9 +249,9 @@ int main(void)
DEFINE(ACCOUNT_STARTTIME_USER,
offsetof(struct paca_struct, accounting.starttime_user));
DEFINE(ACCOUNT_USER_TIME,
- offsetof(struct paca_struct, accounting.user_time));
+ offsetof(struct paca_struct, accounting.utime));
DEFINE(ACCOUNT_SYSTEM_TIME,
- offsetof(struct paca_struct, accounting.system_time));
+ offsetof(struct paca_struct, accounting.stime));
DEFINE(PACA_TRAP_SAVE, offsetof(struct paca_struct, trap_save));
DEFINE(PACA_NAPSTATELOST, offsetof(struct paca_struct, nap_state_lost));
DEFINE(PACA_SPRG_VDSO, offsetof(struct paca_struct, sprg_vdso));
@@ -262,9 +262,9 @@ int main(void)
DEFINE(ACCOUNT_STARTTIME_USER,
offsetof(struct thread_info, accounting.starttime_user));
DEFINE(ACCOUNT_USER_TIME,
- offsetof(struct thread_info, accounting.user_time));
+ offsetof(struct thread_info, accounting.utime));
DEFINE(ACCOUNT_SYSTEM_TIME,
- offsetof(struct thread_info, accounting.system_time));
+ offsetof(struct thread_info, accounting.stime));
#endif
#endif /* CONFIG_PPC64 */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index bca2781..613acb4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -271,8 +271,8 @@ void accumulate_stolen_time(void)
sst = scan_dispatch_log(acct->starttime_user);
ust = scan_dispatch_log(acct->starttime);
- acct->system_time -= sst;
- acct->user_time -= ust;
+ acct->stime -= sst;
+ acct->utime -= ust;
local_paca->stolen_time += ust + sst;
local_paca->soft_enabled = save_soft_enabled;
@@ -281,10 +281,11 @@ void accumulate_stolen_time(void)
static inline u64 calculate_stolen_time(u64 stop_tb)
{
u64 stolen = 0;
+ struct cpu_accounting_data *acct = &local_paca->accounting;
if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx)) {
stolen = scan_dispatch_log(stop_tb);
- get_paca()->accounting.system_time -= stolen;
+ acct->stime -= stolen;
}
stolen += get_paca()->stolen_time;
@@ -316,17 +317,17 @@ static unsigned long vtime_delta(struct task_struct *tsk,
now = mftb();
nowscaled = read_spurr(now);
- acct->system_time += now - acct->starttime;
+ acct->stime += now - acct->starttime;
acct->starttime = now;
deltascaled = nowscaled - acct->startspurr;
acct->startspurr = nowscaled;
*stolen = calculate_stolen_time(now);
- delta = acct->system_time;
- acct->system_time = 0;
- udelta = acct->user_time - acct->utime_sspurr;
- acct->utime_sspurr = acct->user_time;
+ delta = acct->stime;
+ acct->stime = 0;
+ udelta = acct->utime - acct->utime_sspurr;
+ acct->utime_sspurr = acct->utime;
/*
* Because we don't read the SPURR on every kernel entry/exit,
@@ -348,7 +349,7 @@ static unsigned long vtime_delta(struct task_struct *tsk,
*sys_scaled = deltascaled;
}
}
- acct->user_time_scaled += user_scaled;
+ acct->utime_scaled += user_scaled;
return delta;
}
@@ -387,10 +388,10 @@ void vtime_account_user(struct task_struct *tsk)
cputime_t utime, utimescaled;
struct cpu_accounting_data *acct = get_accounting(tsk);
- utime = acct->user_time;
- utimescaled = acct->user_time_scaled;
- acct->user_time = 0;
- acct->user_time_scaled = 0;
+ utime = acct->utime;
+ utimescaled = acct->utime_scaled;
+ acct->utime = 0;
+ acct->utime_scaled = 0;
acct->utime_sspurr = 0;
account_user_time(tsk, utime);
tsk->utimescaled += utimescaled;
@@ -408,8 +409,8 @@ void arch_vtime_task_switch(struct task_struct *prev)
acct->starttime = get_accounting(prev)->starttime;
acct->startspurr = get_accounting(prev)->startspurr;
- acct->system_time = 0;
- acct->user_time = 0;
+ acct->stime = 0;
+ acct->utime = 0;
}
#endif /* CONFIG_PPC32 */
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 7605455..0a3a3e8 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2284,9 +2284,9 @@ static void dump_one_paca(int cpu)
DUMP(p, subcore_sibling_mask, "x");
#endif
- DUMP(p, accounting.user_time, "llx");
- DUMP(p, accounting.system_time, "llx");
- DUMP(p, accounting.user_time_scaled, "llx");
+ DUMP(p, accounting.utime, "llx");
+ DUMP(p, accounting.stime, "llx");
+ DUMP(p, accounting.utime_scaled, "llx");
DUMP(p, accounting.starttime, "llx");
DUMP(p, accounting.starttime_user, "llx");
DUMP(p, accounting.startspurr, "llx");
--
2.7.4
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.
Now this model has been generalized to the three archs for all kinds of
cputime (system, irq, ...), and all the cputime flushing happens under
vtime_account_user().
So let's rename this function to better reflect its new role.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/ia64/kernel/time.c | 2 +-
arch/powerpc/kernel/time.c | 6 ++----
arch/s390/kernel/vtime.c | 2 +-
include/linux/kernel_stat.h | 2 +-
include/linux/vtime.h | 7 +++++--
kernel/sched/cputime.c | 4 +---
6 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 8de59ae..9d3e452 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -61,7 +61,7 @@ static struct clocksource *itc_clocksource;
extern cputime_t cycle_to_cputime(u64 cyc);
-void vtime_account_user(struct task_struct *tsk)
+void vtime_flush(struct task_struct *tsk)
{
struct thread_info *ti = task_thread_info(tsk);
cputime_t delta;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 093588a..9e82afb 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -382,15 +382,13 @@ void vtime_account_idle(struct task_struct *tsk)
}
/*
- * Transfer the user time accumulated in the paca
- * by the exception entry and exit code to the generic
- * process user time records.
+ * Account the whole cputime accumulated in the paca
* Must be called with interrupts disabled.
* Assumes that vtime_account_system/idle() has been called
* recently (i.e. since the last entry from usermode) so that
* get_paca()->user_time_scaled is up to date.
*/
-void vtime_account_user(struct task_struct *tsk)
+void vtime_flush(struct task_struct *tsk)
{
struct cpu_accounting_data *acct = get_accounting(tsk);
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 0fdcaca..8a5c803 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -200,7 +200,7 @@ void vtime_task_switch(struct task_struct *prev)
* accounting system time in order to correctly compute
* the stolen time accounting.
*/
-void vtime_account_user(struct task_struct *tsk)
+void vtime_flush(struct task_struct *tsk)
{
if (do_account_vtime(tsk, HARDIRQ_OFFSET))
virt_timer_expire();
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1cee310..53bc37f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -91,7 +91,7 @@ extern void account_system_index_scaled(struct task_struct *, cputime_t,
cputime_t, enum cpu_usage_stat);
static inline void account_process_tick(struct task_struct *tsk, int user)
{
- vtime_account_user(tsk);
+ vtime_flush(tsk);
}
#else
extern void account_process_tick(struct task_struct *, int user);
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index aa9bfea..0681fe2 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -58,27 +58,28 @@ static inline void vtime_task_switch(struct task_struct *prev)
extern void vtime_account_system(struct task_struct *tsk);
extern void vtime_account_idle(struct task_struct *tsk);
-extern void vtime_account_user(struct task_struct *tsk);
#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
static inline void vtime_task_switch(struct task_struct *prev) { }
static inline void vtime_account_system(struct task_struct *tsk) { }
-static inline void vtime_account_user(struct task_struct *tsk) { }
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern void arch_vtime_task_switch(struct task_struct *tsk);
+extern void vtime_account_user(struct task_struct *tsk);
extern void vtime_user_enter(struct task_struct *tsk);
static inline void vtime_user_exit(struct task_struct *tsk)
{
vtime_account_user(tsk);
}
+
extern void vtime_guest_enter(struct task_struct *tsk);
extern void vtime_guest_exit(struct task_struct *tsk);
extern void vtime_init_idle(struct task_struct *tsk, int cpu);
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
+static inline void vtime_account_user(struct task_struct *tsk) { }
static inline void vtime_user_enter(struct task_struct *tsk) { }
static inline void vtime_user_exit(struct task_struct *tsk) { }
static inline void vtime_guest_enter(struct task_struct *tsk) { }
@@ -93,9 +94,11 @@ static inline void vtime_account_irq_exit(struct task_struct *tsk)
/* On hard|softirq exit we always account to hard|softirq cputime */
vtime_account_system(tsk);
}
+extern void vtime_flush(struct task_struct *tsk);
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
+static inline void vtime_flush(struct task_struct *tsk) { }
#endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index dd67770..3dc6b11 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -446,9 +446,7 @@ void vtime_common_task_switch(struct task_struct *prev)
else
vtime_account_system(prev);
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- vtime_account_user(prev);
-#endif
+ vtime_flush(prev);
arch_vtime_task_switch(prev);
}
#endif
--
2.7.4
Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE accounts the cputime on
any context boundary: irq entry/exit, guest entry/exit, context switch,
etc.
Calling functions such as account_system_time() and account_user_time()
can be costly, especially when they are called on many fastpaths, such
as twice per IRQ. Those functions do more than just account to kcpustat
and task cputime: depending on the config, some subsystems can perform
unpleasant multiplications and divisions, among other things.
So let's accumulate the cputime instead and delay the accounting to
ticks and context switches only.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/ia64/include/asm/thread_info.h | 6 ++++
arch/ia64/kernel/time.c | 60 ++++++++++++++++++++++++++++---------
2 files changed, 52 insertions(+), 14 deletions(-)
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index c702642..8742d74 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -27,6 +27,12 @@ struct thread_info {
mm_segment_t addr_limit; /* user-level address space limit */
int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ __u64 utime;
+ __u64 stime;
+ __u64 gtime;
+ __u64 hardirq_time;
+ __u64 softirq_time;
+ __u64 idle_time;
__u64 ac_stamp;
__u64 ac_leave;
__u64 ac_stime;
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 88f0984..8de59ae 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -63,14 +63,39 @@ extern cputime_t cycle_to_cputime(u64 cyc);
void vtime_account_user(struct task_struct *tsk)
{
- cputime_t delta_utime;
struct thread_info *ti = task_thread_info(tsk);
+ cputime_t delta;
- if (ti->ac_utime) {
- delta_utime = cycle_to_cputime(ti->ac_utime);
- account_user_time(tsk, delta_utime);
- ti->ac_utime = 0;
+ if (ti->utime)
+ account_user_time(tsk, cycle_to_cputime(ti->utime));
+
+ if (ti->gtime)
+ account_guest_time(tsk, cycle_to_cputime(ti->gtime));
+
+ if (ti->idle_time)
+ account_idle_time(cycle_to_cputime(ti->idle_time));
+
+ if (ti->stime) {
+ delta = cycle_to_cputime(ti->stime);
+ account_system_index_time(tsk, delta, CPUTIME_SYSTEM);
+ }
+
+ if (ti->hardirq_time) {
+ delta = cycle_to_cputime(ti->hardirq_time);
+ account_system_index_time(tsk, delta, CPUTIME_IRQ);
+ }
+
+ if (ti->softirq_time) {
+ delta = cycle_to_cputime(ti->softirq_time);
+ account_system_index_time(tsk, delta, CPUTIME_SOFTIRQ);
}
+
+ ti->utime = 0;
+ ti->gtime = 0;
+ ti->idle_time = 0;
+ ti->stime = 0;
+ ti->hardirq_time = 0;
+ ti->softirq_time = 0;
}
/*
@@ -91,18 +116,15 @@ void arch_vtime_task_switch(struct task_struct *prev)
* Account time for a transition between system, hard irq or soft irq state.
* Note that this function is called with interrupts enabled.
*/
-static cputime_t vtime_delta(struct task_struct *tsk)
+static __u64 vtime_delta(struct task_struct *tsk)
{
struct thread_info *ti = task_thread_info(tsk);
- cputime_t delta_stime;
- __u64 now;
+ __u64 now, delta_stime;
WARN_ON_ONCE(!irqs_disabled());
now = ia64_get_itc();
-
- delta_stime = cycle_to_cputime(ti->ac_stime + (now - ti->ac_stamp));
- ti->ac_stime = 0;
+ delta_stime = now - ti->ac_stamp;
ti->ac_stamp = now;
return delta_stime;
@@ -110,15 +132,25 @@ static cputime_t vtime_delta(struct task_struct *tsk)
void vtime_account_system(struct task_struct *tsk)
{
- cputime_t delta = vtime_delta(tsk);
+ struct thread_info *ti = task_thread_info(tsk);
+ __u64 stime = vtime_delta(tsk);
- account_system_time(tsk, 0, delta);
+ if ((tsk->flags & PF_VCPU) && !irq_count())
+ ti->gtime += stime;
+ else if (hardirq_count())
+ ti->hardirq_time += stime;
+ else if (in_serving_softirq())
+ ti->softirq_time += stime;
+ else
+ ti->stime += stime;
}
EXPORT_SYMBOL_GPL(vtime_account_system);
void vtime_account_idle(struct task_struct *tsk)
{
- account_idle_time(vtime_delta(tsk));
+ struct thread_info *ti = task_thread_info(tsk);
+
+ ti->idle_time += vtime_delta(tsk);
}
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
--
2.7.4
On task switch we must initialize the start cputime snapshot of the next
task using the freshly updated value of the previous task.
But the code does the opposite, copying the next task's value into the
previous one, which results in wrong cputime accounting.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/ia64/kernel/time.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 021f44a..88f0984 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -83,7 +83,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
struct thread_info *pi = task_thread_info(prev);
struct thread_info *ni = task_thread_info(current);
- pi->ac_stamp = ni->ac_stamp;
+ ni->ac_stamp = pi->ac_stamp;
ni->ac_stime = ni->ac_utime = 0;
}
--
2.7.4
On Tue, Dec 06, 2016 at 03:32:13AM +0100, Frederic Weisbecker wrote:
> This follows up on Martin Schwidefsky's patch, which proposes delaying
> cputime accounting to the tick in order to minimize the calls to
> account_system_time() and the like, as these functions can carry quite
> some overhead:
>
> http://lkml.kernel.org/r/20161121111728.13a0a3db@mschwide
>
> The set includes Martin's patch, rebased on top of tip:sched/core and
> the latest s390 changes, and extends it to the other implementations of
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE (powerpc and ia64), along with a few
> core changes to tie the whole together.
>
> Only build-tested though, as I don't have access to any of these archs.
The patches look reasonable at a quick look. I assume that to test
them, we would want to run a guest in an overcommitted system, so as
to get some steal time. Do you have any more specific suggestions as
to what to run as a test? Just run some benchmark and see if the
user/system/irq times look reasonable? Or do you have something more
quantitative?
Paul.
On Tue, 6 Dec 2016 15:20:55 +1100
Paul Mackerras <[email protected]> wrote:
> On Tue, Dec 06, 2016 at 03:32:13AM +0100, Frederic Weisbecker wrote:
> > This follows up on Martin Schwidefsky's patch, which proposes delaying
> > cputime accounting to the tick in order to minimize the calls to
> > account_system_time() and the like, as these functions can carry quite
> > some overhead:
> >
> > http://lkml.kernel.org/r/20161121111728.13a0a3db@mschwide
> >
> > The set includes Martin's patch, rebased on top of tip:sched/core and
> > the latest s390 changes, and extends it to the other implementations of
> > CONFIG_VIRT_CPU_ACCOUNTING_NATIVE (powerpc and ia64), along with a few
> > core changes to tie the whole together.
> >
> > Only build-tested though, as I don't have access to any of these archs.
>
> The patches look reasonable at a quick look. I assume that to test
> them, we would want to run a guest in an overcommitted system, so as
> to get some steal time. Do you have any more specific suggestions as
> to what to run as a test? Just run some benchmark and see if the
> user/system/irq times look reasonable? Or do you have something more
> quantitative?
My quick test this morning showed that for s390 it works quite well.
I have yet to do more detailed tests to see if the numbers really
add up.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On 12/06/2016 03:32 AM, Frederic Weisbecker wrote:
> This follows up on Martin Schwidefsky's patch, which proposes delaying
> cputime accounting to the tick in order to minimize the calls to
> account_system_time() and the like, as these functions can carry quite
> some overhead:
>
> http://lkml.kernel.org/r/20161121111728.13a0a3db@mschwide
>
> The set includes Martin's patch, rebased on top of tip:sched/core and
> the latest s390 changes, and extends it to the other implementations of
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE (powerpc and ia64), along with a few
> core changes to tie the whole together.
>
> Only build-tested though, as I don't have access to any of these archs.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> vtime/acc
>
I have not checked the results yet (1), but this branch reduces the overhead
of calculating the times a lot. I have a microbenchmark that does some simple
KVM hypercalls under s390, and the average time shrinks from 280ns to 260ns
(we call accounting at guest entry and exit).
About 50% of the savings seems to come from Martin's changes in
vtime_account_irq_enter, the other 50% from the common code accounting doing
less work.
These savings of ~20ns per guest exit will also help virtio, page faults and
any other host activity on behalf of KVM guests. Nice.
(1) The values seem to make sense but I have not done any calculation yet.
> HEAD: ee6c393b212193bc01818c7cf4ae9cba3f469f00
>
> Thanks,
> Frederic
> ---
>
> Frederic Weisbecker (9):
> powerpc32: Fix stale scaled stime on context switch
> ia64: Fix wrong start cputime assignment on task switch
> cputime: Allow accounting system time using cpustat index
> cputime: Export account_guest_time
> powerpc: Prepare accounting structure for cputime flush on tick
> powerpc: Migrate stolen_time field to accounting structure
> powerpc/vtime: Accumulate cputime and account only on tick/task switch
> ia64: Accumulate cputime and account only on tick/task switch
> vtime: Rename vtime_account_user() to vtime_flush()
>
> Martin Schwidefsky (1):
> s390/cputime: delayed accounting of system time
>
>
> arch/ia64/include/asm/thread_info.h | 6 ++
> arch/ia64/kernel/time.c | 66 +++++++++++-----
> arch/powerpc/include/asm/accounting.h | 14 +++-
> arch/powerpc/include/asm/paca.h | 1 -
> arch/powerpc/kernel/asm-offsets.c | 8 +-
> arch/powerpc/kernel/time.c | 138 +++++++++++++++++++++-------------
> arch/powerpc/xmon/xmon.c | 8 +-
> arch/s390/include/asm/lowcore.h | 65 ++++++++--------
> arch/s390/include/asm/processor.h | 3 +
> arch/s390/kernel/vtime.c | 116 +++++++++++++++++-----------
> include/linux/kernel_stat.h | 7 +-
> include/linux/vtime.h | 7 +-
> kernel/sched/cputime.c | 25 +++---
> 13 files changed, 295 insertions(+), 169 deletions(-)
>
On Tue, Dec 06, 2016 at 03:20:55PM +1100, Paul Mackerras wrote:
> On Tue, Dec 06, 2016 at 03:32:13AM +0100, Frederic Weisbecker wrote:
> > This follows up on Martin Schwidefsky's patch, which proposes delaying
> > cputime accounting to the tick in order to minimize the calls to
> > account_system_time() and the like, as these functions can carry quite
> > some overhead:
> >
> > http://lkml.kernel.org/r/20161121111728.13a0a3db@mschwide
> >
> > The set includes Martin's patch, rebased on top of tip:sched/core and
> > the latest s390 changes, and extends it to the other implementations of
> > CONFIG_VIRT_CPU_ACCOUNTING_NATIVE (powerpc and ia64), along with a few
> > core changes to tie the whole together.
> >
> > Only build-tested though, as I don't have access to any of these archs.
>
> The patches look reasonable at a quick look. I assume that to test
> them, we would want to run a guest in an overcommitted system, so as
> to get some steal time. Do you have any more specific suggestions as
> to what to run as a test? Just run some benchmark and see if the
> user/system/irq times look reasonable? Or do you have something more
> quantitative?
So I guess we want to test both correctness and performance.
To check correctness I use two little programs, one that does a userspace
loop:
int main(int argc, char **argv)
{
while (1);
return 0;
}
And another that does a kernelspace loop. The latter
is not a 100% kernel loop but spends most of its time in
kernel mode.
int main(int argc, char **argv)
{
void *addr = sbrk(0);
while (1) {
brk(addr + 4096);
brk(addr);
}
return 0;
}
Testing idle time just consists of checking the difference between two
"cat /proc/stat" snapshots over a given interval on an idle CPU.
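For reference, here is a minimal userspace sketch of that check (the 4th
field of the aggregate "cpu" line in /proc/stat is idle time in USER_HZ
ticks; parse a "cpuN" line instead to watch a single CPU):

#include <stdio.h>
#include <unistd.h>

/* Return the idle tick count from the aggregate "cpu" line */
static unsigned long long read_idle(void)
{
	unsigned long long user, nice, sys, idle = 0;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 0;
	if (fscanf(f, "cpu %llu %llu %llu %llu",
		   &user, &nice, &sys, &idle) != 4)
		idle = 0;
	fclose(f);
	return idle;
}

int main(void)
{
	unsigned long long before = read_idle();

	sleep(10);
	/* On an idle CPU this should be close to 10s worth of ticks */
	printf("idle ticks elapsed: %llu\n", read_idle() - before);
	return 0;
}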
For irqs it gets harder. There you just need to check if the numbers are
reasonable.
Now in order to measure performance, I think you need a workload that either
does a lot of guest/host switches or raises a lot of IRQs. Maybe just something
that involves networking. Then comparing stime, hardirq and softirq should
show some better numbers. In order to increase the effect, you can set a very
low HZ value (100?).
Thanks.
On Tue, Dec 06, 2016 at 03:32:22AM +0100, Frederic Weisbecker wrote:
> From: Martin Schwidefsky <[email protected]>
>
> The account_system_time() function is called with a cputime that
> occurred while running in the kernel. The function detects which
> context the CPU is currently running in and accounts the time to
> the correct bucket. This forces the arch code to account the
> cputime for hardirq and softirq immediately.
>
> Such an accounting function can be costly and performs unwelcome
> divisions and multiplications, among other things.
>
> The arch code can delay the accounting for system time. For s390
> the accounting is done once per timer tick and for each task switch.
>
> Signed-off-by: Martin Schwidefsky <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Fenghua Yu <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Stanislaw Gruszka <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> [rebase against latest cputime tree, massaged changelog accordingly]
> Signed-off-by: Frederic Weisbecker <[email protected]>
Looking at this patch again, I think I need to do another pass on it.
Comments below:
> /*
> * Update process times based on virtual cpu times stored by entry.S
> * to the lowcore fields user_timer, system_timer & steal_clock.
> */
> static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> {
> - u64 timer, clock, user, system, steal;
> - u64 user_scaled, system_scaled;
> + u64 timer, clock, user, guest, system, hardirq, softirq, steal;
>
> timer = S390_lowcore.last_update_timer;
> clock = S390_lowcore.last_update_clock;
> @@ -110,36 +119,57 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> #endif
> : "=m" (S390_lowcore.last_update_timer),
> "=m" (S390_lowcore.last_update_clock));
> - S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> - S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> + clock = S390_lowcore.last_update_clock - clock;
> + timer -= S390_lowcore.last_update_timer;
> +
> + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> + S390_lowcore.guest_timer += timer;
> + else if (hardirq_count() - hardirq_offset)
> + S390_lowcore.hardirq_timer += timer;
We should get rid of the hardirq_offset argument; it doesn't really make sense
anymore. It also makes the accounting buggy now: do_account_vtime() is called from
the tick through vtime_account_user() with hardirq_offset=1, so the irq time is
incorrectly accumulated as system time. Guest time may be incorrect too.
In fact it may have been buggy even before this patchset, because vtime_account_user()
isn't only called from the tick but also from the task switch, and hardirq_offset
remains 1 in both cases. Not good.
> + else if (in_serving_softirq())
> + S390_lowcore.softirq_timer += timer;
> + else
> + S390_lowcore.system_timer += timer;
>
> /* Update MT utilization calculation */
> if (smp_cpu_mtid &&
> time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> update_mt_scaling();
>
> + /* Calculate cputime delta */
> user = S390_lowcore.user_timer - tsk->thread.user_timer;
> - S390_lowcore.steal_timer -= user;
> tsk->thread.user_timer = S390_lowcore.user_timer;
> -
> + guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
> + tsk->thread.guest_timer = S390_lowcore.guest_timer;
> system = S390_lowcore.system_timer - tsk->thread.system_timer;
> - S390_lowcore.steal_timer -= system;
> tsk->thread.system_timer = S390_lowcore.system_timer;
> + hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> + tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> + softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> + tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> + S390_lowcore.steal_timer +=
> + clock - user - guest - system - hardirq - softirq;
>
> - user_scaled = user;
> - system_scaled = system;
> - /* Do MT utilization scaling */
> - if (smp_cpu_mtid) {
> - u64 mult = __this_cpu_read(mt_scaling_mult);
> - u64 div = __this_cpu_read(mt_scaling_div);
> + /* Push account value */
> + if (user) {
> + account_user_time(tsk, user);
> + tsk->utimescaled += scale_vtime(user);
> + }
>
> - user_scaled = (user_scaled * mult) / div;
> - system_scaled = (system_scaled * mult) / div;
> + if (guest) {
> + account_guest_time(tsk, guest);
> + tsk->utimescaled += scale_vtime(guest);
> }
> - account_user_time(tsk, user);
> - tsk->utimescaled += user_scaled;
> - account_system_time(tsk, hardirq_offset, system);
> - tsk->stimescaled += system_scaled;
> +
> + if (system)
> + account_system_index_scaled(tsk, system, scale_vtime(system),
> + CPUTIME_SYSTEM);
> + if (hardirq)
> + account_system_index_scaled(tsk, hardirq, scale_vtime(hardirq),
> + CPUTIME_IRQ);
> + if (softirq)
> + account_system_index_scaled(tsk, softirq, scale_vtime(softirq),
> + CPUTIME_SOFTIRQ);
>
> steal = S390_lowcore.steal_timer;
> if ((s64) steal > 0) {
> @@ -147,16 +177,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> account_steal_time(steal);
> }
>
> - return virt_timer_forward(user + system);
> + return virt_timer_forward(user + guest + system + hardirq + softirq);
> }
>
> void vtime_task_switch(struct task_struct *prev)
> {
> do_account_vtime(prev, 0);
This call should be removed, the task switch already calls vtime_account_user().
> prev->thread.user_timer = S390_lowcore.user_timer;
> + prev->thread.guest_timer = S390_lowcore.guest_timer;
> prev->thread.system_timer = S390_lowcore.system_timer;
> + prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> + prev->thread.softirq_timer = S390_lowcore.softirq_timer;
> S390_lowcore.user_timer = current->thread.user_timer;
> + S390_lowcore.guest_timer = current->thread.guest_timer;
> S390_lowcore.system_timer = current->thread.system_timer;
> + S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> + S390_lowcore.softirq_timer = current->thread.softirq_timer;
> }
On Sat, 10 Dec 2016 02:48:06 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Tue, Dec 06, 2016 at 03:32:22AM +0100, Frederic Weisbecker wrote:
> > From: Martin Schwidefsky <[email protected]>
> >
> > The account_system_time() function is called with a cputime that
> > occurred while running in the kernel. The function detects which
> > context the CPU is currently running in and accounts the time to
> > the correct bucket. This forces the arch code to account the
> > cputime for hardirq and softirq immediately.
> >
> > Such an accounting function can be costly and perform unwelcome divisions
> > and multiplications, among others.
> >
> > The arch code can delay the accounting for system time. For s390
> > the accounting is done once per timer tick and for each task switch.
> >
> > Signed-off-by: Martin Schwidefsky <[email protected]>
> > Cc: Benjamin Herrenschmidt <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: Michael Ellerman <[email protected]>
> > Cc: Heiko Carstens <[email protected]>
> > Cc: Martin Schwidefsky <[email protected]>
> > Cc: Tony Luck <[email protected]>
> > Cc: Fenghua Yu <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Stanislaw Gruszka <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > [rebase against latest cputime tree, massaged changelog accordingly]
> > Signed-off-by: Frederic Weisbecker <[email protected]>
>
> Looking at this patch again, I think I need to do another pass on it.
> Comments below:
>
> > /*
> > * Update process times based on virtual cpu times stored by entry.S
> > * to the lowcore fields user_timer, system_timer & steal_clock.
> > */
> > static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> > {
> > - u64 timer, clock, user, system, steal;
> > - u64 user_scaled, system_scaled;
> > + u64 timer, clock, user, guest, system, hardirq, softirq, steal;
> >
> > timer = S390_lowcore.last_update_timer;
> > clock = S390_lowcore.last_update_clock;
> > @@ -110,36 +119,57 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> > #endif
> > : "=m" (S390_lowcore.last_update_timer),
> > "=m" (S390_lowcore.last_update_clock));
> > - S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> > - S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> > + clock = S390_lowcore.last_update_clock - clock;
> > + timer -= S390_lowcore.last_update_timer;
> > +
> > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > + S390_lowcore.guest_timer += timer;
> > + else if (hardirq_count() - hardirq_offset)
> > + S390_lowcore.hardirq_timer += timer;
>
> We should get rid of the hardirq_offset argument; it doesn't really make sense
> anymore. It also makes the accounting buggy now: do_account_vtime() is called from
> the tick through vtime_account_user() with hardirq_offset=1, so the irq time is
> incorrectly accumulated as system time. Guest time may be incorrect too.
>
> In fact it may have been buggy even before this patchset, because vtime_account_user()
> isn't only called from the tick but also from the task switch, and hardirq_offset
> remains 1 in both cases. Not good.
For s390 the do_account_vtime function is called from vtime_task_switch and vtime_flush.
1) vtime_task_switch is exclusively called from finish_task_switch outside of irq context.
The call to do_account_vtime with hardirq_offset==0 from vtime_task_switch is correct.
2) The call to vtime_flush in vtime_common_task_switch is irrelevant for s390 as we
define __ARCH_HAS_VTIME_TASK_SWITCH.
3) The call to vtime_flush in account_process_tick is done in irq context from
update_process_times. hardirq_offset==1 is also correct.
As far as s390 is concerned that looks good.
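To spell out the call paths described above (a sketch, using the names from
this series):

	/*
	 * finish_task_switch()             <- never in irq context
	 *   vtime_task_switch()            <- __ARCH_HAS_VTIME_TASK_SWITCH
	 *     do_account_vtime(prev, 0)
	 *
	 * update_process_times()           <- hard irq context
	 *   account_process_tick()
	 *     vtime_flush()
	 *       do_account_vtime(tsk, HARDIRQ_OFFSET)
	 */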
> > + else if (in_serving_softirq())
> > + S390_lowcore.softirq_timer += timer;
> > + else
> > + S390_lowcore.system_timer += timer;
> >
> > /* Update MT utilization calculation */
> > if (smp_cpu_mtid &&
> > time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> > update_mt_scaling();
> >
> > + /* Calculate cputime delta */
> > user = S390_lowcore.user_timer - tsk->thread.user_timer;
> > - S390_lowcore.steal_timer -= user;
> > tsk->thread.user_timer = S390_lowcore.user_timer;
> > -
> > + guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
> > + tsk->thread.guest_timer = S390_lowcore.guest_timer;
> > system = S390_lowcore.system_timer - tsk->thread.system_timer;
> > - S390_lowcore.steal_timer -= system;
> > tsk->thread.system_timer = S390_lowcore.system_timer;
> > + hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> > + tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> > + softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> > + tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> > + S390_lowcore.steal_timer +=
> > + clock - user - guest - system - hardirq - softirq;
> >
> > - user_scaled = user;
> > - system_scaled = system;
> > - /* Do MT utilization scaling */
> > - if (smp_cpu_mtid) {
> > - u64 mult = __this_cpu_read(mt_scaling_mult);
> > - u64 div = __this_cpu_read(mt_scaling_div);
> > + /* Push account value */
> > + if (user) {
> > + account_user_time(tsk, user);
> > + tsk->utimescaled += scale_vtime(user);
> > + }
> >
> > - user_scaled = (user_scaled * mult) / div;
> > - system_scaled = (system_scaled * mult) / div;
> > + if (guest) {
> > + account_guest_time(tsk, guest);
> > + tsk->utimescaled += scale_vtime(guest);
> > }
> > - account_user_time(tsk, user);
> > - tsk->utimescaled += user_scaled;
> > - account_system_time(tsk, hardirq_offset, system);
> > - tsk->stimescaled += system_scaled;
> > +
> > + if (system)
> > + account_system_index_scaled(tsk, system, scale_vtime(system),
> > + CPUTIME_SYSTEM);
> > + if (hardirq)
> > + account_system_index_scaled(tsk, hardirq, scale_vtime(hardirq),
> > + CPUTIME_IRQ);
> > + if (softirq)
> > + account_system_index_scaled(tsk, softirq, scale_vtime(softirq),
> > + CPUTIME_SOFTIRQ);
> >
> > steal = S390_lowcore.steal_timer;
> > if ((s64) steal > 0) {
> > @@ -147,16 +177,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> > account_steal_time(steal);
> > }
> >
> > - return virt_timer_forward(user + system);
> > + return virt_timer_forward(user + guest + system + hardirq + softirq);
> > }
> >
> > void vtime_task_switch(struct task_struct *prev)
> > {
> > do_account_vtime(prev, 0);
>
> This call should be removed, the task switch already calls vtime_account_user().
The vtime_account_user function is empty for s390..
> > prev->thread.user_timer = S390_lowcore.user_timer;
> > + prev->thread.guest_timer = S390_lowcore.guest_timer;
> > prev->thread.system_timer = S390_lowcore.system_timer;
> > + prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> > + prev->thread.softirq_timer = S390_lowcore.softirq_timer;
> > S390_lowcore.user_timer = current->thread.user_timer;
> > + S390_lowcore.guest_timer = current->thread.guest_timer;
> > S390_lowcore.system_timer = current->thread.system_timer;
> > + S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> > + S390_lowcore.softirq_timer = current->thread.softirq_timer;
> > }
>
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> On Sat, 10 Dec 2016 02:48:06 +0100
> Frederic Weisbecker <[email protected]> wrote:
> > We should get rid of the hardirq_offset argument; it doesn't really make sense
> > anymore. It also makes the accounting buggy now: do_account_vtime() is called from
> > the tick through vtime_account_user() with hardirq_offset=1, so the irq time is
> > incorrectly accumulated as system time. Guest time may be incorrect too.
> >
> > In fact it may have been buggy even before this patchset, because vtime_account_user()
> > isn't only called from the tick but also from the task switch, and hardirq_offset
> > remains 1 in both cases. Not good.
>
> For s390 the do_account_vtime function is called from vtime_task_switch and vtime_flush.
> 1) vtime_task_switch is exclusively called from finish_task_switch outside of irq context.
> The call to do_account_vtime with hardirq_offset==0 from vtime_task_switch is correct.
Yes that one is fine.
> 2) The call to vtime_flush in vtime_common_task_switch is irrelevant for s390 as we
> define __ARCH_HAS_VTIME_TASK_SWITCH.
That's right, I missed that. And now I remember that special case remains because s390 has its
own way to account idle time.
> 3) The call to vtime_flush in account_process_tick is done in irq context from
> update_process_times. hardirq_offset==1 is also correct.
Let's see this for example:
+ if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
+ S390_lowcore.guest_timer += timer;
If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
it's actually IRQ time.
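Spelled out (a sketch of the arithmetic, with hardirq_offset passed as
HARDIRQ_OFFSET like the old tick path did):

	/*
	 * In the middle of the tick handler:
	 *   irq_count()    == HARDIRQ_OFFSET  (one hardirq level: the tick)
	 *   hardirq_offset == HARDIRQ_OFFSET
	 * so irq_count() - hardirq_offset == 0 and, if the tick interrupted
	 * a guest (PF_VCPU set), the time accumulated since irq_enter()
	 * lands in guest_timer although it is really IRQ time.
	 */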
> > > void vtime_task_switch(struct task_struct *prev)
> > > {
> > > do_account_vtime(prev, 0);
> >
> > This call should be removed, the task switch already calls vtime_account_user().
>
> The vtime_account_user function is empty for s390..
That's right.
On Mon, 12 Dec 2016 16:02:30 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > update_process_times. hardirq_offset==1 is also correct.
>
> Let's see this for example:
>
> + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> + S390_lowcore.guest_timer += timer;
>
> If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> it's actually IRQ time.
Hmm, you got me there. The system time from irq_enter until account_process_tick
is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
would be to rip out the accounting of the system time from account_process_tick
as irq_enter / irq_exit will do system time accounting anyway. To do that
do_account_vtime needs to be split, because for the task switch we need to
account the system time of the previous task.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, 13 Dec 2016 12:13:22 +0100
Martin Schwidefsky <[email protected]> wrote:
> On Mon, 12 Dec 2016 16:02:30 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > > update_process_times. hardirq_offset==1 is also correct.
> >
> > Let's see this for example:
> >
> > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > + S390_lowcore.guest_timer += timer;
> >
> > If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> > Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> > are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> > it's actually IRQ time.
>
> Hmm, you got me there. The system time from irq_enter until account_process_tick
> is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
> would be to rip out the accounting of the system time from account_process_tick
> as irq_enter / irq_exit will do system time accounting anyway. To do that
> do_account_vtime needs to be split, because for the task switch we need to
> account the system time of the previous task.
New patch for the delayed cputime accounting. I cannot just rip out system time
accounting from account_process_tick after all; I need a sync point for the
steal time calculation. It is basically the same patch as before but with a new
helper update_tsk_timer, the removal of hardirq_offset and a simplification
for do_account_vtime: the last accounting delta is either hardirq time for
the tick or system time for the task switch.
Keeping my fingers crossed..
--
From 2d7ca61909c438f947d27110b7b2a2b67a44607a Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <[email protected]>
Date: Sat, 3 Dec 2016 15:56:50 +0100
Subject: [PATCH 1/2] s390/cputime: delayed accounting of system time
The account_system_time() function is called with a cputime that
occurred while running in the kernel. The function detects which
context the CPU is currently running in and accounts the time to
the correct bucket. This forces the arch code to account the
cputime for hardirq and softirq immediately.
Such an accounting function can be costly and perform unwelcome divisions
and multiplications, among others.
The arch code can delay the accounting for system time. For s390
the accounting is done once per timer tick and for each task switch.
Signed-off-by: Martin Schwidefsky <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Wanpeng Li <[email protected]>
[rebase against latest cputime tree, massaged changelog accordingly]
---
arch/s390/include/asm/lowcore.h | 65 +++++++++---------
arch/s390/include/asm/processor.h | 3 +
arch/s390/kernel/vtime.c | 134 +++++++++++++++++++++++---------------
3 files changed, 120 insertions(+), 82 deletions(-)
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 9bfad2a..61261e0 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -85,53 +85,56 @@ struct lowcore {
__u64 mcck_enter_timer; /* 0x02c0 */
__u64 exit_timer; /* 0x02c8 */
__u64 user_timer; /* 0x02d0 */
- __u64 system_timer; /* 0x02d8 */
- __u64 steal_timer; /* 0x02e0 */
- __u64 last_update_timer; /* 0x02e8 */
- __u64 last_update_clock; /* 0x02f0 */
- __u64 int_clock; /* 0x02f8 */
- __u64 mcck_clock; /* 0x0300 */
- __u64 clock_comparator; /* 0x0308 */
+ __u64 guest_timer; /* 0x02d8 */
+ __u64 system_timer; /* 0x02e0 */
+ __u64 hardirq_timer; /* 0x02e8 */
+ __u64 softirq_timer; /* 0x02f0 */
+ __u64 steal_timer; /* 0x02f8 */
+ __u64 last_update_timer; /* 0x0300 */
+ __u64 last_update_clock; /* 0x0308 */
+ __u64 int_clock; /* 0x0310 */
+ __u64 mcck_clock; /* 0x0318 */
+ __u64 clock_comparator; /* 0x0320 */
/* Current process. */
- __u64 current_task; /* 0x0310 */
- __u8 pad_0x318[0x320-0x318]; /* 0x0318 */
- __u64 kernel_stack; /* 0x0320 */
+ __u64 current_task; /* 0x0328 */
+ __u8 pad_0x318[0x320-0x318]; /* 0x0330 */
+ __u64 kernel_stack; /* 0x0338 */
/* Interrupt, panic and restart stack. */
- __u64 async_stack; /* 0x0328 */
- __u64 panic_stack; /* 0x0330 */
- __u64 restart_stack; /* 0x0338 */
+ __u64 async_stack; /* 0x0340 */
+ __u64 panic_stack; /* 0x0348 */
+ __u64 restart_stack; /* 0x0350 */
/* Restart function and parameter. */
- __u64 restart_fn; /* 0x0340 */
- __u64 restart_data; /* 0x0348 */
- __u64 restart_source; /* 0x0350 */
+ __u64 restart_fn; /* 0x0358 */
+ __u64 restart_data; /* 0x0360 */
+ __u64 restart_source; /* 0x0368 */
/* Address space pointer. */
- __u64 kernel_asce; /* 0x0358 */
- __u64 user_asce; /* 0x0360 */
+ __u64 kernel_asce; /* 0x0370 */
+ __u64 user_asce; /* 0x0378 */
/*
* The lpp and current_pid fields form a
* 64-bit value that is set as program
* parameter with the LPP instruction.
*/
- __u32 lpp; /* 0x0368 */
- __u32 current_pid; /* 0x036c */
+ __u32 lpp; /* 0x0380 */
+ __u32 current_pid; /* 0x0384 */
/* SMP info area */
- __u32 cpu_nr; /* 0x0370 */
- __u32 softirq_pending; /* 0x0374 */
- __u64 percpu_offset; /* 0x0378 */
- __u64 vdso_per_cpu_data; /* 0x0380 */
- __u64 machine_flags; /* 0x0388 */
- __u32 preempt_count; /* 0x0390 */
- __u8 pad_0x0394[0x0398-0x0394]; /* 0x0394 */
- __u64 gmap; /* 0x0398 */
- __u32 spinlock_lockval; /* 0x03a0 */
- __u32 fpu_flags; /* 0x03a4 */
- __u8 pad_0x03a8[0x0400-0x03a8]; /* 0x03a8 */
+ __u32 cpu_nr; /* 0x0388 */
+ __u32 softirq_pending; /* 0x038c */
+ __u64 percpu_offset; /* 0x0390 */
+ __u64 vdso_per_cpu_data; /* 0x0398 */
+ __u64 machine_flags; /* 0x03a0 */
+ __u32 preempt_count; /* 0x03a8 */
+ __u8 pad_0x03ac[0x03b0-0x03ac]; /* 0x03ac */
+ __u64 gmap; /* 0x03b0 */
+ __u32 spinlock_lockval; /* 0x03b8 */
+ __u32 fpu_flags; /* 0x03bc */
+ __u8 pad_0x03c0[0x0400-0x03c0]; /* 0x03c0 */
/* Per cpu primary space access list */
__u32 paste[16]; /* 0x0400 */
diff --git a/arch/s390/include/asm/processor.h b/arch/s390/include/asm/processor.h
index 9c00351..6f07907 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -111,7 +111,10 @@ struct thread_struct {
unsigned int acrs[NUM_ACRS];
unsigned long ksp; /* kernel stack pointer */
unsigned long user_timer; /* task cputime in user space */
+ unsigned long guest_timer; /* task cputime in kvm guest */
unsigned long system_timer; /* task cputime in kernel space */
+ unsigned long hardirq_timer; /* task cputime in hardirq context */
+ unsigned long softirq_timer; /* task cputime in softirq context */
unsigned long sys_call_table; /* system call table address */
mm_segment_t mm_segment;
unsigned long gmap_addr; /* address of last gmap fault. */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 6b246aa..65dc7b6 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -90,14 +90,33 @@ static void update_mt_scaling(void)
__this_cpu_write(mt_scaling_jiffies, jiffies_64);
}
+static inline u64 update_tsk_timer(unsigned long *tsk_vtime, u64 new)
+{
+ u64 delta;
+
+ delta = new - *tsk_vtime;
+ *tsk_vtime = new;
+ return delta;
+}
+
+
+static inline u64 scale_vtime(u64 vtime)
+{
+ u64 mult = __this_cpu_read(mt_scaling_mult);
+ u64 div = __this_cpu_read(mt_scaling_div);
+
+ if (smp_cpu_mtid)
+ return vtime * mult / div;
+ return vtime;
+}
+
/*
* Update process times based on virtual cpu times stored by entry.S
* to the lowcore fields user_timer, system_timer & steal_clock.
*/
-static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
+static int do_account_vtime(struct task_struct *tsk)
{
- u64 timer, clock, user, system, steal;
- u64 user_scaled, system_scaled;
+ u64 timer, clock, user, guest, system, hardirq, softirq, steal;
timer = S390_lowcore.last_update_timer;
clock = S390_lowcore.last_update_clock;
@@ -110,36 +129,53 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
#endif
: "=m" (S390_lowcore.last_update_timer),
"=m" (S390_lowcore.last_update_clock));
- S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
- S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
+ clock = S390_lowcore.last_update_clock - clock;
+ timer -= S390_lowcore.last_update_timer;
+
+ if (hardirq_count())
+ S390_lowcore.hardirq_timer += timer;
+ else
+ S390_lowcore.system_timer += timer;
/* Update MT utilization calculation */
if (smp_cpu_mtid &&
time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
update_mt_scaling();
- user = S390_lowcore.user_timer - tsk->thread.user_timer;
- S390_lowcore.steal_timer -= user;
- tsk->thread.user_timer = S390_lowcore.user_timer;
-
- system = S390_lowcore.system_timer - tsk->thread.system_timer;
- S390_lowcore.steal_timer -= system;
- tsk->thread.system_timer = S390_lowcore.system_timer;
-
- user_scaled = user;
- system_scaled = system;
- /* Do MT utilization scaling */
- if (smp_cpu_mtid) {
- u64 mult = __this_cpu_read(mt_scaling_mult);
- u64 div = __this_cpu_read(mt_scaling_div);
+ /* Calculate cputime delta */
+ user = update_tsk_timer(&tsk->thread.user_timer,
+ READ_ONCE(S390_lowcore.user_timer));
+ guest = update_tsk_timer(&tsk->thread.guest_timer,
+ READ_ONCE(S390_lowcore.guest_timer));
+ system = update_tsk_timer(&tsk->thread.system_timer,
+ READ_ONCE(S390_lowcore.system_timer));
+ hardirq = update_tsk_timer(&tsk->thread.hardirq_timer,
+ READ_ONCE(S390_lowcore.hardirq_timer));
+ softirq = update_tsk_timer(&tsk->thread.softirq_timer,
+ READ_ONCE(S390_lowcore.softirq_timer));
+ S390_lowcore.steal_timer +=
+ clock - user - guest - system - hardirq - softirq;
+
+ /* Push account value */
+ if (user) {
+ account_user_time(tsk, user);
+ tsk->utimescaled += scale_vtime(user);
+ }
- user_scaled = (user_scaled * mult) / div;
- system_scaled = (system_scaled * mult) / div;
+ if (guest) {
+ account_guest_time(tsk, guest);
+ tsk->utimescaled += scale_vtime(guest);
}
- account_user_time(tsk, user);
- tsk->utimescaled += user_scaled;
- account_system_time(tsk, hardirq_offset, system);
- tsk->stimescaled += system_scaled;
+
+ if (system)
+ account_system_index_scaled(tsk, system, scale_vtime(system),
+ CPUTIME_SYSTEM);
+ if (hardirq)
+ account_system_index_scaled(tsk, hardirq, scale_vtime(hardirq),
+ CPUTIME_IRQ);
+ if (softirq)
+ account_system_index_scaled(tsk, softirq, scale_vtime(softirq),
+ CPUTIME_SOFTIRQ);
steal = S390_lowcore.steal_timer;
if ((s64) steal > 0) {
@@ -147,16 +183,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
account_steal_time(steal);
}
- return virt_timer_forward(user + system);
+ return virt_timer_forward(user + guest + system + hardirq + softirq);
}
void vtime_task_switch(struct task_struct *prev)
{
- do_account_vtime(prev, 0);
+ do_account_vtime(prev);
prev->thread.user_timer = S390_lowcore.user_timer;
+ prev->thread.guest_timer = S390_lowcore.guest_timer;
prev->thread.system_timer = S390_lowcore.system_timer;
+ prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
+ prev->thread.softirq_timer = S390_lowcore.softirq_timer;
S390_lowcore.user_timer = current->thread.user_timer;
+ S390_lowcore.guest_timer = current->thread.guest_timer;
S390_lowcore.system_timer = current->thread.system_timer;
+ S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
+ S390_lowcore.softirq_timer = current->thread.softirq_timer;
}
/*
@@ -166,7 +208,7 @@ void vtime_task_switch(struct task_struct *prev)
*/
void vtime_account_user(struct task_struct *tsk)
{
- if (do_account_vtime(tsk, HARDIRQ_OFFSET))
+ if (do_account_vtime(tsk))
virt_timer_expire();
}
@@ -176,32 +218,22 @@ void vtime_account_user(struct task_struct *tsk)
*/
void vtime_account_irq_enter(struct task_struct *tsk)
{
- u64 timer, system, system_scaled;
+ u64 timer;
timer = S390_lowcore.last_update_timer;
S390_lowcore.last_update_timer = get_vtimer();
- S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
-
- /* Update MT utilization calculation */
- if (smp_cpu_mtid &&
- time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
- update_mt_scaling();
-
- system = S390_lowcore.system_timer - tsk->thread.system_timer;
- S390_lowcore.steal_timer -= system;
- tsk->thread.system_timer = S390_lowcore.system_timer;
- system_scaled = system;
- /* Do MT utilization scaling */
- if (smp_cpu_mtid) {
- u64 mult = __this_cpu_read(mt_scaling_mult);
- u64 div = __this_cpu_read(mt_scaling_div);
-
- system_scaled = (system_scaled * mult) / div;
- }
- account_system_time(tsk, 0, system);
- tsk->stimescaled += system_scaled;
-
- virt_timer_forward(system);
+ timer -= S390_lowcore.last_update_timer;
+
+ if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
+ S390_lowcore.guest_timer += timer;
+ else if (hardirq_count())
+ S390_lowcore.hardirq_timer += timer;
+ else if (in_serving_softirq())
+ S390_lowcore.softirq_timer += timer;
+ else
+ S390_lowcore.system_timer += timer;
+
+ virt_timer_forward(timer);
}
EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
--
2.8.4
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, Dec 13, 2016 at 12:13:22PM +0100, Martin Schwidefsky wrote:
> On Mon, 12 Dec 2016 16:02:30 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > > update_process_times. hardirq_offset==1 is also correct.
> >
> > Let's see this for example:
> >
> > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > + S390_lowcore.guest_timer += timer;
> >
> > If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> > Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> > are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> > it's actually IRQ time.
>
> Hmm, you got me there. The system time from irq_enter until account_process_tick
> is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
> would be to rip out the accounting of the system time from account_process_tick
> as irq_enter / irq_exit will do system time accounting anyway. To do that
> do_account_vtime needs to be split, because for the task switch we need to
> account the system time of the previous task.
Exactly!
On Tue, Dec 13, 2016 at 02:21:21PM +0100, Martin Schwidefsky wrote:
> On Tue, 13 Dec 2016 12:13:22 +0100
> Martin Schwidefsky <[email protected]> wrote:
>
> > On Mon, 12 Dec 2016 16:02:30 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > > > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > > > update_process_times. hardirq_offset==1 is also correct.
> > >
> > > Let's see this for example:
> > >
> > > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > > + S390_lowcore.guest_timer += timer;
> > >
> > > If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> > > Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> > > are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> > > it's actually IRQ time.
> >
> > Hmm, you got me there. The system time from irq_enter until account_process_tick
> > is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
> > would be to rip out the accounting of the system time from account_process_tick
> > as irq_enter / irq_exit will do system time accounting anyway. To do that
> > do_account_vtime needs to be split, because for the task switch we need to
> > account the system time of the previous task.
>
> New patch for the delayed cputime accounting. I cannot just rip out system time
> accounting from account_process_tick after all; I need a sync point for the
> steal time calculation. It is basically the same patch as before but with a new
> helper update_tsk_timer, the removal of hardirq_offset and a simplification
> for do_account_vtime: the last accounting delta is either hardirq time for
> the tick or system time for the task switch.
>
> Keeping my fingers crossed..
The patch looks good. But you might want to remove hardirq_offset in a
separate patch to queue for this merge window (I'm not sure if it needs a
stable tag; the argument may have been there since the beginning).
The rest depends on the series, which is unlikely to be queued in this
merge window at this stage.
Thanks!
On Wed, 14 Dec 2016 02:44:46 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Tue, Dec 13, 2016 at 02:21:21PM +0100, Martin Schwidefsky wrote:
> > On Tue, 13 Dec 2016 12:13:22 +0100
> > Martin Schwidefsky <[email protected]> wrote:
> >
> > > On Mon, 12 Dec 2016 16:02:30 +0100
> > > Frederic Weisbecker <[email protected]> wrote:
> > >
> > > > On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > > > > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > > > > update_process_times. hardirq_offset==1 is also correct.
> > > >
> > > > Let's see this for example:
> > > >
> > > > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > > > + S390_lowcore.guest_timer += timer;
> > > >
> > > > If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> > > > Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> > > > are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> > > > it's actually IRQ time.
> > >
> > > Hmm, you got me there. The system time from irq_enter until account_process_tick
> > > is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
> > > would be to rip out the accounting of the system time from account_process_tick
> > > as irq_enter / irq_exit will do system time accounting anyway. To do that
> > > do_account_vtime needs to be split, because for the task switch we need to
> > > account the system time of the previous task.
> >
> > New patch for the delayed cputime accounting. I cannot just rip out system time
> > accounting from account_process_tick after all; I need a sync point for the
> > steal time calculation. It is basically the same patch as before but with a new
> > helper update_tsk_timer, the removal of hardirq_offset and a simplification
> > for do_account_vtime: the last accounting delta is either hardirq time for
> > the tick or system time for the task switch.
> >
> > Keeping my fingers crossed..
>
> The patch looks good. But you might want to remove hardirq_offset in a
> separate patch to queue for this merge window (I'm not sure if it needs a
> stable tag; the argument may have been there since the beginning).
>
> The rest depends on the series, which is unlikely to be queued in this
> merge window at this stage.
I just pushed two fixes to the linux-s390:fixes tree which will eventually be
merged after the first -rc candidate for 4.10 is released.
This includes "s390/vtime: correct system time accounting" which fixes the
hardirq_offset bug for s390.
The link to the fixes-tree in case you want to look at the patch:
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git fixes
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, Dec 20, 2016 at 03:13:53PM +0100, Martin Schwidefsky wrote:
> On Wed, 14 Dec 2016 02:44:46 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > On Tue, Dec 13, 2016 at 02:21:21PM +0100, Martin Schwidefsky wrote:
> > > On Tue, 13 Dec 2016 12:13:22 +0100
> > > Martin Schwidefsky <[email protected]> wrote:
> > >
> > > > On Mon, 12 Dec 2016 16:02:30 +0100
> > > > Frederic Weisbecker <[email protected]> wrote:
> > > >
> > > > > On Mon, Dec 12, 2016 at 11:27:54AM +0100, Martin Schwidefsky wrote:
> > > > > > 3) The call to vtime_flush in account_process_tick is done in irq context from
> > > > > > update_process_times. hardirq_offset==1 is also correct.
> > > > >
> > > > > Let's see this for example:
> > > > >
> > > > > + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > > > > + S390_lowcore.guest_timer += timer;
> > > > >
> > > > > If the tick is interrupting guest, we have accounted the guest time on tick IRQ entry.
> > > > > Now we are in the middle of the tick interrupt and since hardirq_offset is 1, we
> > > > > are taking the above path by accounting half of the tick-IRQ time as guest, which is wrong,
> > > > > it's actually IRQ time.
> > > >
> > > > Hmm, you got me there. The system time from irq_enter until account_process_tick
> > > > is reached is indeed IRQ time. It is not much but it is incorrect. The best fix
> > > > would be to rip out the accounting of the system time from account_process_tick
> > > > as irq_enter / irq_exit will do system time accounting anyway. To do that
> > > > do_account_vtime needs to be split, because for the task switch we need to
> > > > account the system time of the previous task.
> > >
> > > New patch for the delayed cputime accounting. I cannot just rip out system time
> > > accounting from account_process_tick after all; I need a sync point for the
> > > steal time calculation. It is basically the same patch as before but with a new
> > > helper update_tsk_timer, the removal of hardirq_offset and a simplification
> > > for do_account_vtime: the last accounting delta is either hardirq time for
> > > the tick or system time for the task switch.
> > >
> > > Keeping my fingers crossed..
> >
> > The patch looks good. But you might want to remove hardirq_offset in a
> > separate patch to queue for this merge window (I'm not sure if it needs a
> > stable tag; the argument may have been there since the beginning).
> >
> > The rest depends on the series, which is unlikely to be queued in this
> > merge window at this stage.
>
> I just pushed two fixes to the linux-s390:fixes tree which will eventually be
> merged after the first -rc candidate for 4.10 is released.
>
> This includes "s390/vtime: correct system time accounting" which fixes the
> hardirq_offset bug for s390.
>
> The link to the fixes-tree in case you want to look at the patch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git fixes
Thanks a lot!