2010-12-22 01:09:43

by Venkatesh Pallipadi

Subject: [PATCH 0/5] Proper kernel irq time reporting -v2

This is Part 2 of
"Proper kernel irq time accounting -v4"
http://lkml.indiana.edu/hypermail//linux/kernel/1010.0/01175.html

and applies over 2.6.37-rc7.

Part 1 fixed the way irq time is accounted in the scheduler and to tasks. This
patchset fixes how irq times are reported in /proc/stat, and also stops
including irq time in task->stime, etc.

Example:
Running a cpu-intensive loop and a network-intensive nc on a 4-CPU system
and looking at 'top' output.

With vanilla kernel:
Cpu0 : 0.0% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 1.3% us, 27.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 71.4% si
Cpu3 : 1.6% us, 1.3% sy, 0.0% ni, 96.7% id, 0.0% wa, 0.0% hi, 0.3% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7555 root 20 0 1760 528 436 R 100 0.0 0:15.79 nc
7563 root 20 0 3632 268 204 R 100 0.0 0:13.13 loop

Notes:
* Both tasks show 100% CPU, even when one of them is stuck on a CPU that is
processing 70% softirq.
* no hardirq time.


With "Part 1" patches:
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.0% us, 30.6% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 67.4% si
Cpu3 : 0.7% us, 0.7% sy, 0.3% ni, 98.3% id, 0.0% wa, 0.0% hi, 0.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6289 root 20 0 3632 268 204 R 100 0.0 2:18.67 loop
5737 root 20 0 1760 528 436 R 33 0.0 0:26.72 nc

Notes:
* Tasks show 100% CPU and 33% CPU that correspond to their non-irq exec time.
* no hardirq time.


With "Part 1 + Part 2" patches:
Cpu0 : 1.3% us, 1.0% sy, 0.3% ni, 97.0% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 99.3% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.7% hi, 0.0% si
Cpu2 : 1.3% us, 31.5% sy, 0.0% ni, 0.0% id, 0.0% wa, 8.3% hi, 58.9% si
Cpu3 : 1.0% us, 2.0% sy, 0.3% ni, 95.0% id, 0.0% wa, 0.7% hi, 1.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20929 root 20 0 3632 268 204 R 99 0.0 3:48.25 loop
20796 root 20 0 1760 528 436 R 33 0.0 2:38.65 nc

Notes:
* Both task exec time and hardirq time are reported correctly.
* hi and si time are based on fine-granularity info and not on samples.
* getrusage would give a proper utime/stime split that does not include
irq times in that ratio.
* Other places that report user/sys time, such as cgroup cpuacct.stat, will
now include only non-irq exec time.


Changes since v1:
( v1 - http://kerneltrap.com/mailarchive/linux-kernel/2010/10/25/4637170 )
* Rebase over other related code changes since v1

Signed-off-by: Venkatesh Pallipadi <[email protected]>


2010-12-22 01:09:57

by Venkatesh Pallipadi

Subject: [PATCH 1/5] Free up pf flag PF_KSOFTIRQD -v2

Patchset:
This is Part 2 of
"Proper kernel irq time accounting -v4"
http://lkml.indiana.edu/hypermail//linux/kernel/1010.0/01175.html

and applies over 2.6.37-rc7.

Part 1 fixed the way irq time is accounted in the scheduler and to tasks. This
patchset fixes how irq times are reported in /proc/stat, and also stops
including irq time in task->stime, etc.

Example:
Running a cpu-intensive loop and a network-intensive nc on a 4-CPU system
and looking at 'top' output.

With vanilla kernel:
Cpu0 : 0.0% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 1.3% us, 27.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 71.4% si
Cpu3 : 1.6% us, 1.3% sy, 0.0% ni, 96.7% id, 0.0% wa, 0.0% hi, 0.3% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7555 root 20 0 1760 528 436 R 100 0.0 0:15.79 nc
7563 root 20 0 3632 268 204 R 100 0.0 0:13.13 loop

Notes:
* Both tasks show 100% CPU, even when one of them is stuck on a CPU that is
processing 70% softirq.
* no hardirq time.


With "Part 1" patches:
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.0% us, 30.6% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 67.4% si
Cpu3 : 0.7% us, 0.7% sy, 0.3% ni, 98.3% id, 0.0% wa, 0.0% hi, 0.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6289 root 20 0 3632 268 204 R 100 0.0 2:18.67 loop
5737 root 20 0 1760 528 436 R 33 0.0 0:26.72 nc

Notes:
* Tasks show 100% CPU and 33% CPU that correspond to their non-irq exec time.
* no hardirq time.


With "Part 1 + Part 2" patches:
Cpu0 : 1.3% us, 1.0% sy, 0.3% ni, 97.0% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 99.3% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.7% hi, 0.0% si
Cpu2 : 1.3% us, 31.5% sy, 0.0% ni, 0.0% id, 0.0% wa, 8.3% hi, 58.9% si
Cpu3 : 1.0% us, 2.0% sy, 0.3% ni, 95.0% id, 0.0% wa, 0.7% hi, 1.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20929 root 20 0 3632 268 204 R 99 0.0 3:48.25 loop
20796 root 20 0 1760 528 436 R 33 0.0 2:38.65 nc

Notes:
* Both task exec time and hardirq time are reported correctly.
* hi and si time are based on fine-granularity info and not on samples.
* getrusage would give a proper utime/stime split that does not include
irq times in that ratio.
* Other places that report user/sys time, such as cgroup cpuacct.stat, will
now include only non-irq exec time.

This patch:

Cleanup patch freeing up PF_KSOFTIRQD, using a per_cpu ksoftirqd pointer
instead, as suggested by Eric Dumazet.

Tested-by: Shaun Ruffell <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
include/linux/interrupt.h | 7 +++++++
include/linux/sched.h | 1 -
kernel/sched.c | 2 +-
kernel/softirq.c | 3 +--
4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..3802fac 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -426,6 +426,13 @@ extern void raise_softirq(unsigned int nr);
*/
DECLARE_PER_CPU(struct list_head [NR_SOFTIRQS], softirq_work_list);

+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
+static inline struct task_struct *this_cpu_ksoftirqd(void)
+{
+ return this_cpu_read(ksoftirqd);
+}
+
/* Try to send a softirq to a remote cpu. If this cannot be done, the
* work will be queued to the local cpu.
*/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2238745..86924ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1699,7 +1699,6 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
/*
* Per process flags
*/
-#define PF_KSOFTIRQD 0x00000001 /* I am ksoftirqd */
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */
diff --git a/kernel/sched.c b/kernel/sched.c
index 297d1a0..bfc9646 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2011,7 +2011,7 @@ void account_system_vtime(struct task_struct *curr)
*/
if (hardirq_count())
__this_cpu_add(cpu_hardirq_time, delta);
- else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
+ else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
__this_cpu_add(cpu_softirq_time, delta);

irq_time_write_end();
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 18f4be0..b904be8 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -54,7 +54,7 @@ EXPORT_SYMBOL(irq_stat);

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

-static DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

char *softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -721,7 +721,6 @@ static int run_ksoftirqd(void * __bind_cpu)
{
set_current_state(TASK_INTERRUPTIBLE);

- current->flags |= PF_KSOFTIRQD;
while (!kthread_should_stop()) {
preempt_disable();
if (!local_softirq_pending()) {
--
1.7.3.1
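The substitution this patch makes — comparing the current task against a
per-cpu ksoftirqd pointer instead of testing a task flag — can be sketched
in plain userspace C. All names below are simplified stand-ins for
illustration, not the kernel API; the per-cpu variable is modeled as an
array indexed by cpu number:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct task { const char *comm; };

/* stands in for DEFINE_PER_CPU(struct task_struct *, ksoftirqd) */
static struct task *ksoftirqd[NR_CPUS];

static struct task *this_cpu_ksoftirqd(int cpu)
{
	return ksoftirqd[cpu];	/* kernel code uses this_cpu_read() */
}

/* charge softirq time only when it did not run in ksoftirqd context,
 * mirroring: in_serving_softirq() && curr != this_cpu_ksoftirqd() */
static int charge_softirq_time(struct task *curr, int cpu)
{
	return curr != this_cpu_ksoftirqd(cpu);
}
```

The check stays O(1) while the scarce PF_* flag bit is freed for other use.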

2010-12-22 01:10:16

by Venkatesh Pallipadi

Subject: [PATCH 2/5] Add nsecs_to_cputime64 interface for asm-generic -v2

Add an nsecs_to_cputime64 interface. This is used by the following patches
that update cpu irq stat based on ns-granularity info in IRQ_TIME_ACCOUNTING.

Tested-by: Shaun Ruffell <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
include/asm-generic/cputime.h | 3 +++
include/linux/jiffies.h | 1 +
kernel/time.c | 23 +++++++++++++++++++++--
3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
index 2bcc5c7..61e03dd 100644
--- a/include/asm-generic/cputime.h
+++ b/include/asm-generic/cputime.h
@@ -30,6 +30,9 @@ typedef u64 cputime64_t;
#define cputime64_to_jiffies64(__ct) (__ct)
#define jiffies64_to_cputime64(__jif) (__jif)
#define cputime_to_cputime64(__ct) ((u64) __ct)
+#define cputime64_gt(__a, __b) ((__a) > (__b))
+
+#define nsecs_to_cputime64(__ct) nsecs_to_jiffies64(__ct)


/*
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 6811f4b..922aa31 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -307,6 +307,7 @@ extern clock_t jiffies_to_clock_t(long x);
extern unsigned long clock_t_to_jiffies(unsigned long x);
extern u64 jiffies_64_to_clock_t(u64 x);
extern u64 nsec_to_clock_t(u64 x);
+extern u64 nsecs_to_jiffies64(u64 n);
extern unsigned long nsecs_to_jiffies(u64 n);

#define TIMESTAMP_SIZE 30
diff --git a/kernel/time.c b/kernel/time.c
index ba9b338..fde4691 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -645,7 +645,7 @@ u64 nsec_to_clock_t(u64 x)
}

/**
- * nsecs_to_jiffies - Convert nsecs in u64 to jiffies
+ * nsecs_to_jiffies64 - Convert nsecs in u64 to jiffies64
*
* @n: nsecs in u64
*
@@ -657,7 +657,7 @@ u64 nsec_to_clock_t(u64 x)
* NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512)
* ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years
*/
-unsigned long nsecs_to_jiffies(u64 n)
+u64 nsecs_to_jiffies64(u64 n)
{
#if (NSEC_PER_SEC % HZ) == 0
/* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */
@@ -674,6 +674,25 @@ unsigned long nsecs_to_jiffies(u64 n)
#endif
}

+
+/**
+ * nsecs_to_jiffies - Convert nsecs in u64 to jiffies
+ *
+ * @n: nsecs in u64
+ *
+ * Unlike {m,u}secs_to_jiffies, type of input is not unsigned int but u64.
+ * And this doesn't return MAX_JIFFY_OFFSET since this function is designed
+ * for scheduler, not for use in device drivers to calculate timeout value.
+ *
+ * note:
+ * NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512)
+ * ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years
+ */
+unsigned long nsecs_to_jiffies(u64 n)
+{
+ return (unsigned long)nsecs_to_jiffies64(n);
+}
+
#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void)
{
--
1.7.3.1
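The common case handled by nsecs_to_jiffies64() above — HZ dividing
NSEC_PER_SEC evenly, so one jiffy is a fixed number of nanoseconds — can be
sketched in userspace C. HZ=1000 is an assumption here for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define HZ 1000ULL

/* common case of nsecs_to_jiffies64(): (NSEC_PER_SEC % HZ) == 0, so the
 * conversion is a single division by the length of a jiffy in ns */
static uint64_t nsecs_to_jiffies64(uint64_t n)
{
	return n / (NSEC_PER_SEC / HZ);
}
```

Note that, as the kerneldoc says, this deliberately truncates rather than
clamping to MAX_JIFFY_OFFSET: it is meant for scheduler statistics, not
driver timeout computation.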

2010-12-22 01:10:20

by Venkatesh Pallipadi

Subject: [PATCH 3/5] Refactor account_system_time separating id-update -v2

Refactor account_system_time, separating the logic that identifies which
update is needed from the code that does the actual update.

This is used by a following patch for IRQ_TIME_ACCOUNTING, which has
different identification logic but the same update logic.

Tested-by: Shaun Ruffell <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched.c | 46 +++++++++++++++++++++++++++++++---------------
1 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index bfc9646..c428017 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3701,6 +3701,32 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
}

/*
+ * Account system cpu time to a process and desired cpustat field
+ * @p: the process that the cpu time gets accounted to
+ * @cputime: the cpu time spent in kernel space since the last update
+ * @cputime_scaled: cputime scaled by cpu frequency
+ * @target_cputime64: pointer to cpustat field that has to be updated
+ */
+static inline
+void __account_system_time(struct task_struct *p, cputime_t cputime,
+ cputime_t cputime_scaled, cputime64_t *target_cputime64)
+{
+ cputime64_t tmp = cputime_to_cputime64(cputime);
+
+ /* Add system time to process. */
+ p->stime = cputime_add(p->stime, cputime);
+ p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
+ account_group_system_time(p, cputime);
+
+ /* Add system time to cpustat. */
+ *target_cputime64 = cputime64_add(*target_cputime64, tmp);
+ cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
+
+ /* Account for system time used */
+ acct_update_integrals(p);
+}
+
+/*
* Account system cpu time to a process.
* @p: the process that the cpu time gets accounted to
* @hardirq_offset: the offset to subtract from hardirq_count()
@@ -3711,31 +3737,21 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
cputime_t cputime, cputime_t cputime_scaled)
{
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t tmp;
+ cputime64_t *target_cputime64;

if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
return;
}

- /* Add system time to process. */
- p->stime = cputime_add(p->stime, cputime);
- p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
- account_group_system_time(p, cputime);
-
- /* Add system time to cpustat. */
- tmp = cputime_to_cputime64(cputime);
if (hardirq_count() - hardirq_offset)
- cpustat->irq = cputime64_add(cpustat->irq, tmp);
+ target_cputime64 = &cpustat->irq;
else if (in_serving_softirq())
- cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+ target_cputime64 = &cpustat->softirq;
else
- cpustat->system = cputime64_add(cpustat->system, tmp);
+ target_cputime64 = &cpustat->system;

- cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
-
- /* Account for system time used */
- acct_update_integrals(p);
+ __account_system_time(p, cputime, cputime_scaled, target_cputime64);
}

/*
--
1.7.3.1
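The refactoring pattern above — the caller only selects *which* cpustat
field receives the time, and one shared helper performs the update — can be
sketched in userspace C. The struct and accounting are heavily simplified
stand-ins (the real helper also updates p->stime, group times and cpuacct):

```c
#include <assert.h>
#include <stdint.h>

struct cpu_usage_stat { uint64_t system, irq, softirq; };

/* shared update logic, analogous to __account_system_time() */
static void account_to(uint64_t delta, uint64_t *target)
{
	*target += delta;
}

/* identification logic only: pick the target field, then delegate */
static void account_system_time(struct cpu_usage_stat *cpustat,
				int in_hardirq, int in_softirq,
				uint64_t delta)
{
	uint64_t *target;

	if (in_hardirq)
		target = &cpustat->irq;
	else if (in_softirq)
		target = &cpustat->softirq;
	else
		target = &cpustat->system;

	account_to(delta, target);
}
```

This split is what lets the later IRQ_TIME_ACCOUNTING tick path reuse the
update half with its own, different identification half.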

2010-12-22 01:10:28

by Venkatesh Pallipadi

Subject: [PATCH 5/5] Account ksoftirqd time as cpustat softirq -v2

Softirq time spent in ksoftirqd context is not accounted in the
ns-granularity per-cpu softirq stats, as we want that time to be part of
ksoftirqd's exec_runtime.

Account it separately as softirq time in /proc/stat.

Tested-by: Shaun Ruffell <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 193b1d0..4ff6d1a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3819,6 +3819,14 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
cpustat->irq = cputime64_add(cpustat->irq, tmp);
} else if (irqtime_account_si_update()) {
cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+ } else if (this_cpu_ksoftirqd() == p) {
+ /*
+ * ksoftirqd time do not get accounted in cpu_softirq_time.
+ * So, we have to handle it separately here.
+ * Also, p->stime needs to be updated for ksoftirqd.
+ */
+ __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
+ &cpustat->softirq);
} else if (user_tick) {
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else if (p == rq->idle) {
--
1.7.3.1
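With this patch applied, the tick demultiplexing order in
irqtime_account_process_tick() can be sketched as a pure classification
function in userspace C (the enum and struct are illustrative stand-ins,
not kernel types):

```c
#include <assert.h>

enum bucket { B_IRQ, B_SOFTIRQ, B_USER, B_IDLE, B_GUEST, B_SYSTEM };

struct tick_ctx {
	int hi_pending;		/* accumulated ns hardirq time to fold in */
	int si_pending;		/* accumulated ns softirq time to fold in */
	int curr_is_ksoftirqd;
	int user_tick;
	int is_idle;
	int is_vcpu;
};

static enum bucket classify_tick(const struct tick_ctx *c)
{
	if (c->hi_pending)
		return B_IRQ;
	if (c->si_pending)
		return B_SOFTIRQ;
	if (c->curr_is_ksoftirqd)
		return B_SOFTIRQ;	/* the branch added by this patch */
	if (c->user_tick)
		return B_USER;
	if (c->is_idle)
		return B_IDLE;
	if (c->is_vcpu)
		return B_GUEST;
	return B_SYSTEM;
}
```

The new ksoftirqd branch sits after the pending-irq folds but before the
user/idle/guest checks, so ksoftirqd ticks land in the softirq column of
/proc/stat while still updating p->stime via __account_system_time().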

2010-12-22 01:10:49

by Venkatesh Pallipadi

Subject: [PATCH 4/5] Export ns irqtimes through /proc/stat -v2

CONFIG_IRQ_TIME_ACCOUNTING adds ns-granularity irq time on each CPU.
This info is already used in the scheduler to do proper task chargeback
(earlier patches). This patch retrofits this ns-granularity
hardirq and softirq information to the /proc/stat irq and softirq fields.

The update is still done on the timer tick, where we look at the accumulated
ns hardirq/softirq time and account the tick to user/system/irq/softirq/guest
accordingly.

No new interface is added.

Earlier versions looked at adding this as new fields in some /proc
files. This one seems to be the best in terms of impact on existing
apps, even though it has somewhat more kernel code than earlier versions.

Tested-by: Shaun Ruffell <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c428017..193b1d0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2051,8 +2051,40 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
sched_rt_avg_update(rq, irq_delta);
}

+static int irqtime_account_hi_update(void)
+{
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ unsigned long flags;
+ u64 latest_ns;
+ int ret = 0;
+
+ local_irq_save(flags);
+ latest_ns = this_cpu_read(cpu_hardirq_time);
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+ ret = 1;
+ local_irq_restore(flags);
+ return ret;
+}
+
+static int irqtime_account_si_update(void)
+{
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ unsigned long flags;
+ u64 latest_ns;
+ int ret = 0;
+
+ local_irq_save(flags);
+ latest_ns = this_cpu_read(cpu_softirq_time);
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+ ret = 1;
+ local_irq_restore(flags);
+ return ret;
+}
+
#else /* CONFIG_IRQ_TIME_ACCOUNTING */

+#define sched_clock_irqtime (0)
+
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
rq->clock_task += delta;
@@ -3754,6 +3786,65 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
__account_system_time(p, cputime, cputime_scaled, target_cputime64);
}

+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * Account a tick to a process and cpustat
+ * @p: the process that the cpu time gets accounted to
+ * @user_tick: is the tick from userspace
+ * @rq: the pointer to rq
+ *
+ * Tick demultiplexing follows the order
+ * - pending hardirq update
+ * - pending softirq update
+ * - user_time
+ * - idle_time
+ * - system time
+ * - check for guest_time
+ * - else account as system_time
+ *
+ * Check for hardirq is done both for system and user time as there is
+ * no timer going off while we are on hardirq and hence we may never get an
+ * opportunity to update it solely in system time.
+ * p->stime and friends are only updated on system time and not on irq
+ * softirq as those do not count in task exec_runtime any more.
+ */
+static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+ struct rq *rq)
+{
+ cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+ cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+
+ if (irqtime_account_hi_update()) {
+ cpustat->irq = cputime64_add(cpustat->irq, tmp);
+ } else if (irqtime_account_si_update()) {
+ cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+ } else if (user_tick) {
+ account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+ } else if (p == rq->idle) {
+ account_idle_time(cputime_one_jiffy);
+ } else if (p->flags & PF_VCPU) { /* System time or guest time */
+ account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
+ } else {
+ __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
+ &cpustat->system);
+ }
+}
+
+static void irqtime_account_idle_ticks(int ticks)
+{
+ int i;
+ struct rq *rq = this_rq();
+
+ for (i = 0; i < ticks; i++)
+ irqtime_account_process_tick(current, 0, rq);
+}
+#else
+static void irqtime_account_idle_ticks(int ticks) {}
+static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+ struct rq *rq) {}
+#endif
+
/*
* Account for involuntary wait time.
* @steal: the cpu time spent in involuntary wait
@@ -3794,6 +3885,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
struct rq *rq = this_rq();

+ if (sched_clock_irqtime) {
+ irqtime_account_process_tick(p, user_tick, rq);
+ return;
+ }
+
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
@@ -3819,6 +3915,12 @@ void account_steal_ticks(unsigned long ticks)
*/
void account_idle_ticks(unsigned long ticks)
{
+
+ if (sched_clock_irqtime) {
+ irqtime_account_idle_ticks(ticks);
+ return;
+ }
+
account_idle_time(jiffies_to_cputime(ticks));
}

--
1.7.3.1
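The detection idea behind irqtime_account_hi_update() can be sketched in
userspace C: keep the accumulated hardirq time at ns granularity, and on
each tick check whether the jiffies-granularity /proc/stat irq field has
fallen behind it; if so, charge this tick to irq. HZ=1000 is an assumption
for illustration, and the kernel version runs with interrupts disabled:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define HZ 1000ULL

static uint64_t hardirq_time_ns;	/* updated at ns granularity */
static uint64_t cpustat_irq;		/* /proc/stat irq field, jiffies */

/* returns nonzero when the irq field is behind the ns accumulator,
 * i.e. the current tick should be accounted as hardirq time */
static int irqtime_account_hi_update(void)
{
	return hardirq_time_ns / (NSEC_PER_SEC / HZ) > cpustat_irq;
}
```

Accounting at most one jiffy per tick this way lets /proc/stat catch up to
the fine-grained accumulator without ever over-reporting irq time.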

2010-12-22 08:30:40

by Martin Schwidefsky

Subject: Re: [PATCH 2/5] Add nsecs_to_cputime64 interface for asm-generic -v2

On Tue, 21 Dec 2010 17:09:01 -0800
Venkatesh Pallipadi <[email protected]> wrote:

> Add nsecs_to_cputime64 interface. This is used in following patches that
> updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING.
>
> Tested-by: Shaun Ruffell <[email protected]>
> Signed-off-by: Venkatesh Pallipadi <[email protected]>
> ---
> include/asm-generic/cputime.h | 3 +++
> include/linux/jiffies.h | 1 +
> kernel/time.c | 23 +++++++++++++++++++++--
> 3 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
> index 2bcc5c7..61e03dd 100644
> --- a/include/asm-generic/cputime.h
> +++ b/include/asm-generic/cputime.h
> @@ -30,6 +30,9 @@ typedef u64 cputime64_t;
> #define cputime64_to_jiffies64(__ct) (__ct)
> #define jiffies64_to_cputime64(__jif) (__jif)
> #define cputime_to_cputime64(__ct) ((u64) __ct)
> +#define cputime64_gt(__a, __b) ((__a) > (__b))
> +
> +#define nsecs_to_cputime64(__ct) nsecs_to_jiffies64(__ct)
>
>
> /*

If you define a new cputime primitive it is not enough to add it to the
generic cputime header. You have to add it to the arch-specific cputime
header for ia64, powerpc and s390 as well. Otherwise the build will
break with CONFIG_VIRT_CPU_ACCOUNTING=y on these architectures once the
new primitive is used.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2010-12-22 09:17:37

by Peter Zijlstra

Subject: Re: [PATCH 1/5] Free up pf flag PF_KSOFTIRQD -v2

On Tue, 2010-12-21 at 17:09 -0800, Venkatesh Pallipadi wrote:
> Patchset:
> This is Part 2 of
> "Proper kernel irq time accounting -v4"
> http://lkml.indiana.edu/hypermail//linux/kernel/1010.0/01175.html
>
> and applies 2.6.37-rc7.
>
> Part 1 solves the way irqs are accounted in scheduler and tasks. This
> patchset solves how irq times are reported in /proc/stat and also not
> to include irq time in task->stime, etc.
>
> Example:
> Running a cpu intensive loop and network intensive nc on a 4 CPU system
> and looking at 'top' output.
>
> With vanilla kernel:
> Cpu0 : 0.0% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.3% si
> Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu2 : 1.3% us, 27.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 71.4% si
> Cpu3 : 1.6% us, 1.3% sy, 0.0% ni, 96.7% id, 0.0% wa, 0.0% hi, 0.3% si
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 7555 root 20 0 1760 528 436 R 100 0.0 0:15.79 nc
> 7563 root 20 0 3632 268 204 R 100 0.0 0:13.13 loop
>
> Notes:
> * Both tasks show 100% CPU, even when one of them is stuck on a CPU thats
> processing 70% softirq.
> * no hardirq time.
>
>
> With "Part 1" patches:
> Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu2 : 2.0% us, 30.6% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 67.4% si
> Cpu3 : 0.7% us, 0.7% sy, 0.3% ni, 98.3% id, 0.0% wa, 0.0% hi, 0.0% si
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 6289 root 20 0 3632 268 204 R 100 0.0 2:18.67 loop
> 5737 root 20 0 1760 528 436 R 33 0.0 0:26.72 nc
>
> Notes:
> * Tasks show 100% CPU and 33% CPU that correspond to their non-irq exec time.
> * no hardirq time.
>
>
> With "Part 1 + Part 2" patches:
> Cpu0 : 1.3% us, 1.0% sy, 0.3% ni, 97.0% id, 0.0% wa, 0.0% hi, 0.3% si
> Cpu1 : 99.3% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.7% hi, 0.0% si
> Cpu2 : 1.3% us, 31.5% sy, 0.0% ni, 0.0% id, 0.0% wa, 8.3% hi, 58.9% si
> Cpu3 : 1.0% us, 2.0% sy, 0.3% ni, 95.0% id, 0.0% wa, 0.7% hi, 1.0% si
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 20929 root 20 0 3632 268 204 R 99 0.0 3:48.25 loop
> 20796 root 20 0 1760 528 436 R 33 0.0 2:38.65 nc
>
> Notes:
> * Both task exec time and hard irq time reported correctly.
> * hi and si time are based on fine granularity info and not on samples.
> * getrusage would give proper utime/stime split not including irq times
> in that ratio.
> * Other places that report user/sys time like, cgroup cpuacct.stat will
> now include only non-irq exectime.
>
> This patch:

Your 0/x text seems to be repeated in here for some reason... I would
expect only the little bit below.

> Cleanup patch, freeing up PF_KSOFTIRQD and use per_cpu ksoftirqd pointer
> instead, as suggested by Eric Dumazet.
>
> Tested-by: Shaun Ruffell <[email protected]>
> Signed-off-by: Venkatesh Pallipadi <[email protected]>
> ---

2010-12-22 09:20:25

by Peter Zijlstra

Subject: Re: [PATCH 5/5] Account ksoftirqd time as cpustat softirq -v2

On Tue, 2010-12-21 at 17:09 -0800, Venkatesh Pallipadi wrote:
> softirq time in ksoftirqd context is not accounted in ns granularity
> per cpu softirq stats, as we want that to be a part of ksoftirqd
> exec_runtime.
>
> Accounting them as softirq on /proc/stat separately.
>
> Tested-by: Shaun Ruffell <[email protected]>
> Signed-off-by: Venkatesh Pallipadi <[email protected]>
> ---
> kernel/sched.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 193b1d0..4ff6d1a 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3819,6 +3819,14 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> cpustat->irq = cputime64_add(cpustat->irq, tmp);
> } else if (irqtime_account_si_update()) {
> cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
> + } else if (this_cpu_ksoftirqd() == p) {
> + /*
> + * ksoftirqd time do not get accounted in cpu_softirq_time.
> + * So, we have to handle it separately here.
> + * Also, p->stime needs to be updated for ksoftirqd.
> + */
> + __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> + &cpustat->softirq);
> } else if (user_tick) {
> account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
> } else if (p == rq->idle) {

So you add the ksoftirqd runtime as softirq time in the /proc/stat
output?

That seems dubious...

2010-12-22 14:00:10

by Venkatesh Pallipadi

Subject: Re: [PATCH 5/5] Account ksoftirqd time as cpustat softirq -v2

On Wed, Dec 22, 2010 at 1:20 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-12-21 at 17:09 -0800, Venkatesh Pallipadi wrote:
>> softirq time in ksoftirqd context is not accounted in ns granularity
>> per cpu softirq stats, as we want that to be a part of ksoftirqd
>> exec_runtime.
>>
>> Accounting them as softirq on /proc/stat separately.
>>
>> Tested-by: Shaun Ruffell <[email protected]>
>> Signed-off-by: Venkatesh Pallipadi <[email protected]>
>> ---
>>  kernel/sched.c |    8 ++++++++
>>  1 files changed, 8 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 193b1d0..4ff6d1a 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3819,6 +3819,14 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>>  		cpustat->irq = cputime64_add(cpustat->irq, tmp);
>>  	} else if (irqtime_account_si_update()) {
>>  		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
>> +	} else if (this_cpu_ksoftirqd() == p) {
>> +		/*
>> +		 * ksoftirqd time do not get accounted in cpu_softirq_time.
>> +		 * So, we have to handle it separately here.
>> +		 * Also, p->stime needs to be updated for ksoftirqd.
>> +		 */
>> +		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> +					&cpustat->softirq);
>>  	} else if (user_tick) {
>>  		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>>  	} else if (p == rq->idle) {
>
> So you add the ksoftirqd runtime as softirq time in the /proc/stat
> output?
>
> That seems dubious...
>

Yes. That's similar to what we do today. As long as we are in softirq,
whether in a bottom half or in ksoftirqd, we account the tick as softirq
time in /proc/stat.
I thought it also makes sense from the user's perspective. They want to
know how much time we are currently spending serving softirqs and
hardirqs. It is better to get that information in one place than to,
say, look at /proc/stat and add per-cpu ksoftirqd time to it. I mean,
the user/administrator should not have to care whether ksoftirqd was
involved to know how much time was spent serving softirqs.

Thanks,
Venki

2010-12-22 14:06:28

by Peter Zijlstra

Subject: Re: [PATCH 5/5] Account ksoftirqd time as cpustat softirq -v2

On Wed, 2010-12-22 at 05:59 -0800, Venkatesh Pallipadi wrote:
> > So you add the ksoftirqd runtime as softirq time in the /proc/stat
> > output?
> >
> > That seems dubious...
> >
>
> Yes. Thats similar to what we do today. As long as we are in softirq,
> whether bottomhalf or ksoftirqd, we account the tick as softirq time
> in /proc/stat.
> I thought it also makes sense from user perspective. They want to know
> how much time are we currently spending serving softirqs and hardirqs.
> It is better to get that information at one place, instead of say look
> at /proc/stat and add percpu ksoftirqd time to it. I mean,
> user/administrator does not have to care whether ksoftirqd is involved
> or not to know how much time was spend serving softirqs.
>
But ksoftirqd time is also counted as sys time, so somewhere this won't
add up to 100%.

2010-12-22 14:18:04

by Venkatesh Pallipadi

Subject: Re: [PATCH 5/5] Account ksoftirqd time as cpustat softirq -v2

On Wed, Dec 22, 2010 at 6:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-12-22 at 05:59 -0800, Venkatesh Pallipadi wrote:
>> > So you add the ksoftirqd runtime as softirq time in the /proc/stat
>> > output?
>> >
>> > That seems dubious...
>> >
>>
>> Yes. Thats similar to what we do today. As long as we are in softirq,
>> whether bottomhalf or ksoftirqd, we account the tick as softirq time
>> in /proc/stat.
>> I thought it also makes sense from user perspective. They want to know
>> how much time are we currently spending serving softirqs and hardirqs.
>> It is better to get that information at one place, instead of say look
>> at /proc/stat and add percpu ksoftirqd time to it. I mean,
>> user/administrator does not have to care whether ksoftirqd is involved
>> or not to know how much time was spend serving softirqs.
>>
> But ksoftirqd time is also counted as sys time, so somewhere this won't
> add up to 100%.
>

It won't get added into cpu kstat system time. It will get added into
task stime, so looking at ksoftirqd task time, this will be seen as
system time. So, per-cpu kstat should be consistent when added up. The
problem will be that, with ksoftirqd, cpu stats will report the time as
softirq while task stats will account the time to the ksoftirqd task as
stime, which is different from what we do with other tasks. I guess
that should be OK, as ksoftirqd threads are special in some sense.

Thanks,
Venki

2010-12-22 14:23:58

by Venkatesh Pallipadi

Subject: Re: [PATCH 2/5] Add nsecs_to_cputime64 interface for asm-generic -v2

On Wed, Dec 22, 2010 at 12:30 AM, Martin Schwidefsky
<[email protected]> wrote:
> On Tue, 21 Dec 2010 17:09:01 -0800
> Venkatesh Pallipadi <[email protected]> wrote:
>
>> Add nsecs_to_cputime64 interface. This is used in following patches that
>> updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING.
>>
>> Tested-by: Shaun Ruffell <[email protected]>
>> Signed-off-by: Venkatesh Pallipadi <[email protected]>
>> ---
>>  include/asm-generic/cputime.h |    3 +++
>>  include/linux/jiffies.h       |    1 +
>>  kernel/time.c                 |   23 +++++++++++++++++++++--
>>  3 files changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
>> index 2bcc5c7..61e03dd 100644
>> --- a/include/asm-generic/cputime.h
>> +++ b/include/asm-generic/cputime.h
>> @@ -30,6 +30,9 @@ typedef u64 cputime64_t;
>>  #define cputime64_to_jiffies64(__ct) (__ct)
>>  #define jiffies64_to_cputime64(__jif)	(__jif)
>>  #define cputime_to_cputime64(__ct)	((u64) __ct)
>> +#define cputime64_gt(__a, __b)		((__a) > (__b))
>> +
>> +#define nsecs_to_cputime64(__ct)	nsecs_to_jiffies64(__ct)
>>
>>
>>  /*
>
> If you define a new cputime primitive it is not enough to add it to the
> generic cputime header. You have to add it to the arch-specific cputime
> header for ia64, powerpc and s390 as well. Otherwise the build will
> break with CONFIG_VIRT_CPU_ACCOUNTING=y on these architectures once the
> new primitive is used.
>

But those archs have arch-specific cputime as well, and so need a
different macro for nsecs_to_cputime64 instead of nsecs_to_jiffies64,
right? As of now the new primitive is not used on those archs.
I guess this is something that will be needed when the primitive is
used there in the future?

Thanks,
Venki

2010-12-22 15:26:04

by Martin Schwidefsky

Subject: Re: [PATCH 2/5] Add nsecs_to_cputime64 interface for asm-generic -v2

On Wed, 22 Dec 2010 06:23:55 -0800
Venkatesh Pallipadi <[email protected]> wrote:

> On Wed, Dec 22, 2010 at 12:30 AM, Martin Schwidefsky
> <[email protected]> wrote:
> > On Tue, 21 Dec 2010 17:09:01 -0800
> > Venkatesh Pallipadi <[email protected]> wrote:
> >
> >> Add nsecs_to_cputime64 interface. This is used in following patches that
> >> updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING.
> >>
> >> Tested-by: Shaun Ruffell <[email protected]>
> >> Signed-off-by: Venkatesh Pallipadi <[email protected]>
> >> ---
> >>  include/asm-generic/cputime.h |    3 +++
> >>  include/linux/jiffies.h       |    1 +
> >>  kernel/time.c                 |   23 +++++++++++++++++++++--
> >>  3 files changed, 25 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
> >> index 2bcc5c7..61e03dd 100644
> >> --- a/include/asm-generic/cputime.h
> >> +++ b/include/asm-generic/cputime.h
> >> @@ -30,6 +30,9 @@ typedef u64 cputime64_t;
> >>  #define cputime64_to_jiffies64(__ct) (__ct)
> >>  #define jiffies64_to_cputime64(__jif)        (__jif)
> >>  #define cputime_to_cputime64(__ct)   ((u64) __ct)
> >> +#define cputime64_gt(__a, __b)               ((__a) >  (__b))
> >> +
> >> +#define nsecs_to_cputime64(__ct)     nsecs_to_jiffies64(__ct)
> >>
> >>
> >>  /*
> >
> > If you define a new cputime primitive it is not enough to add it to the
> > generic cputime header. You have to add it to the arch-specific cputime
> > header for ia64, powerpc and s390 as well. Otherwise the build will
> > break with CONFIG_VIRT_CPU_ACCOUNTING=y on these architectures once the
> > new primitive is used.
> >
>
> But, those archs have arch specific cputime as well and so need a
> different macro for nsecs_to_cputime64, instead of nsec_to_jiffies64.
> Right? As of now this new primitive should not be used in those archs.
> I guess this is something needed when using the primitive in future?

Yes to both parts of the question. You need an independent
implementation of the nsecs_to_cputime64 primitive for the
architectures that have their own definition of cputime, and that
definition will be needed when the primitive is used in the future.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
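Martin's point can be illustrated with a sketch of what an arch-specific
nsecs_to_cputime64 would look like on an architecture whose cputime64_t is
not jiffies. The conversion factor below (4096 CPU-timer units per
microsecond, loosely modeled on s390) is an assumption for illustration,
not the actual arch header:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cputime64_t;

/* hypothetical arch-specific primitive: with VIRT_CPU_ACCOUNTING the
 * cputime unit is a hardware timer tick, not a jiffy, so the generic
 * nsecs_to_jiffies64() mapping would be wrong here */
static cputime64_t nsecs_to_cputime64(uint64_t ns)
{
	return ns * 4096 / 1000;	/* ns -> assumed 1/4096-usec units */
}
```

Each such architecture would supply its own factor; only the name and
signature of the primitive are common across the generic and arch headers.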