2016-03-22 08:40:11

by Zhao Lei

[permalink] [raw]
Subject: [PATCH v4 0/4] cpuacct: split usage into user_usage and sys_usage

Sometimes, cpuacct.usage is not detialed enough to user
to see how much usage a group used. We want to know how
much time it used in user mode and how much in kernel mode.

This patch introduce some more files to tell user these information.
# ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
/sys/fs/cgroup/cpuacct/cpuacct.usage_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

Changelog v3->v4:
1: Add a patch to use for_each_possible_cpu to iterate all cpus,
Suggested-by: Peter Zijlstra <[email protected]>

Changelog v2->v3:
1: Remove some unnecessary locks and so some cleanup based on
suggestion from Peter Zijlstra <[email protected]>

Changelog v1->v2:
1: Rebase on top of 4.5-rc6
2: Fix little spelling typo in description.
3: Fix line over 80 characters

Yang Dongsheng (2):
cpuacct: rename parameter in cpuusage_write for readability
cpuacct: split usage into user_usage and sys_usage.


Zhao Lei (2):
cpuacct: small restruct for cpuacct
cpuacct: Show possible_cpu in cpuacct

kernel/sched/cpuacct.c | 173 ++++++++++++++++++++++++++++++++++---------------
kernel/sched/cpuacct.h | 4 +-
2 files changed, 124 insertions(+), 53 deletions(-)

--
1.8.5.1




2016-03-22 08:39:26

by Zhao Lei

[permalink] [raw]
Subject: [PATCH v4 3/4] cpuacct: Show possible_cpu in cpuacct

Current code show stats of online cpus in cpuacct.statcpus,
show stats of present cpus in cpuacct.usage(_percpu), and using
present cpus for setting cpuacct.usage.

It will cause inconsistent result when a cpu is online or offline
or hotpluged.

We should always use possible cpus to avoid above problem.

Content of cpuacct.usage_percpu changed after this patch:
(On a 4 cpu system with maxcpus=32)
Before patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 832584
After patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Zhao Lei <[email protected]>
---
kernel/sched/cpuacct.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 434c2fa..4399850 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -138,7 +138,7 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
u64 totalcpuusage = 0;
int i;

- for_each_present_cpu(i)
+ for_each_possible_cpu(i)
totalcpuusage += cpuacct_cpuusage_read(ca, i);

return totalcpuusage;
@@ -159,7 +159,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
goto out;
}

- for_each_present_cpu(i)
+ for_each_possible_cpu(i)
cpuacct_cpuusage_write(ca, i, 0);

out:
@@ -172,7 +172,7 @@ static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
u64 percpu;
int i;

- for_each_present_cpu(i) {
+ for_each_possible_cpu(i) {
percpu = cpuacct_cpuusage_read(ca, i);
seq_printf(m, "%llu ", (unsigned long long) percpu);
}
@@ -191,7 +191,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
int cpu;
s64 val = 0;

- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
val += kcpustat->cpustat[CPUTIME_USER];
val += kcpustat->cpustat[CPUTIME_NICE];
@@ -200,7 +200,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_USER], val);

val = 0;
- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
val += kcpustat->cpustat[CPUTIME_SYSTEM];
val += kcpustat->cpustat[CPUTIME_IRQ];
--
1.8.5.1



2016-03-22 08:39:36

by Zhao Lei

[permalink] [raw]
Subject: [PATCH v4 1/4] cpuacct: rename parameter in cpuusage_write for readability

From: Yang Dongsheng <[email protected]>

The name of 'reset' makes a little confusion in reading, we would
say, if we want to reset usage, return -EINVAL. That's not true.

Actually, we want to say, we only allow user to do a reset. This
patch rename reset to val and add a comment here, making the code
more readable.

From: Dongsheng Yang <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Zhao Lei <[email protected]>
---
kernel/sched/cpuacct.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index dd7cbb5..9c2bbf7 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
}

static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 reset)
+ u64 val)
{
struct cpuacct *ca = css_ca(css);
int err = 0;
int i;

- if (reset) {
+ /*
+ * Only allow '0' here to do a reset.
+ */
+ if (val) {
err = -EINVAL;
goto out;
}
--
1.8.5.1



2016-03-22 08:39:44

by Zhao Lei

[permalink] [raw]
Subject: [PATCH v4 2/4] cpuacct: small restruct for cpuacct

1: Use for() instead of while() loop in some functions
to make code simple.
2: Use this_cpu_ptr() instead of per_cpu_ptr() to make code
clean with little performance up.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Zhao Lei <[email protected]>
---
kernel/sched/cpuacct.c | 28 +++++-----------------------
kernel/sched/cpuacct.h | 4 ++--
2 files changed, 7 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9c2bbf7..434c2fa 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -238,23 +238,10 @@ static struct cftype files[] = {
void cpuacct_charge(struct task_struct *tsk, u64 cputime)
{
struct cpuacct *ca;
- int cpu;
-
- cpu = task_cpu(tsk);

rcu_read_lock();
-
- ca = task_ca(tsk);
-
- while (true) {
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- *cpuusage += cputime;
-
- ca = parent_ca(ca);
- if (!ca)
- break;
- }
-
+ for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
+ *this_cpu_ptr(ca->cpuusage) += cputime;
rcu_read_unlock();
}

@@ -263,18 +250,13 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)
*
* Note: it's the caller that updates the account of the root cgroup.
*/
-void cpuacct_account_field(struct task_struct *p, int index, u64 val)
+void cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
{
- struct kernel_cpustat *kcpustat;
struct cpuacct *ca;

rcu_read_lock();
- ca = task_ca(p);
- while (ca != &root_cpuacct) {
- kcpustat = this_cpu_ptr(ca->cpustat);
- kcpustat->cpustat[index] += val;
- ca = parent_ca(ca);
- }
+ for (ca = task_ca(tsk); ca != &root_cpuacct; ca = parent_ca(ca))
+ this_cpu_ptr(ca->cpustat)->cpustat[index] += val;
rcu_read_unlock();
}

diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ed60562..ba72807 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -1,7 +1,7 @@
#ifdef CONFIG_CGROUP_CPUACCT

extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-extern void cpuacct_account_field(struct task_struct *p, int index, u64 val);
+extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);

#else

@@ -10,7 +10,7 @@ static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
}

static inline void
-cpuacct_account_field(struct task_struct *p, int index, u64 val)
+cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
{
}

--
1.8.5.1



2016-03-22 08:40:06

by Zhao Lei

[permalink] [raw]
Subject: [PATCH v4 4/4] cpuacct: split usage into user_usage and sys_usage

From: Dongsheng Yang <[email protected]>

Sometimes, cpuacct.usage is not detialed enough to user
to see how much usage a group used. We want to know how
much time it used in user mode and how much in kernel mode.

This patch introduce some more files to tell user these information.
# ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
/sys/fs/cgroup/cpuacct/cpuacct.usage_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

From: Dongsheng Yang <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Zhao Lei <[email protected]>
---
kernel/sched/cpuacct.c | 140 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 113 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 4399850..6a4de22 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
};

+enum cpuacct_usage_index {
+ CPUACCT_USAGE_USER, /* ... user mode */
+ CPUACCT_USAGE_SYSTEM, /* ... kernel mode */
+
+ CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+ u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
/* track cpu usage of a group of tasks and its child groups */
struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
- u64 __percpu *cpuusage;
+ struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
};

@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
}

-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
static struct cpuacct root_cpuacct = {
.cpustat = &kernel_cpustat,
.cpuusage = &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;

- ca->cpuusage = alloc_percpu(u64);
+ ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;

@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
kfree(ca);
}

-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+ enum cpuacct_usage_index index)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;

+ /*
+ * We allow index == CPUACCT_USAGE_NRUSAGE here to read
+ * the sum of suages.
+ */
+ BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
#ifndef CONFIG_64BIT
/*
* Take rq->lock to make 64-bit read safe on 32-bit platforms.
*/
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
+#endif
+
+ if (index == CPUACCT_USAGE_NRUSAGE) {
+ int i = 0;
+
+ data = 0;
+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+ data += cpuusage->usages[i];
+ } else {
+ data = cpuusage->usages[index];
+ }
+
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- data = *cpuusage;
#endif

return data;
@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)

static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ int i;

#ifndef CONFIG_64BIT
/*
* Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- *cpuusage = val;
+#endif
+
+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+ cpuusage->usages[i] = val;
+
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
#endif
}

/* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+static u64 __cpuusage_read(struct cgroup_subsys_state *css,
+ enum cpuacct_usage_index index)
{
struct cpuacct *ca = css_ca(css);
u64 totalcpuusage = 0;
int i;

for_each_possible_cpu(i)
- totalcpuusage += cpuacct_cpuusage_read(ca, i);
+ totalcpuusage += cpuacct_cpuusage_read(ca, i, index);

return totalcpuusage;
}

+static u64 cpuusage_user_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_USER);
+}
+
+static u64 cpuusage_sys_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_SYSTEM);
+}
+
+static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_NRUSAGE);
+}
+
static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
u64 val)
{
struct cpuacct *ca = css_ca(css);
- int err = 0;
- int i;
+ int cpu;

/*
* Only allow '0' here to do a reset.
*/
- if (val) {
- err = -EINVAL;
- goto out;
- }
+ if (val)
+ return -EINVAL;

- for_each_possible_cpu(i)
- cpuacct_cpuusage_write(ca, i, 0);
+ for_each_possible_cpu(cpu)
+ cpuacct_cpuusage_write(ca, cpu, 0);

-out:
- return err;
+ return 0;
}

-static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
+static int __cpuacct_percpu_seq_show(struct seq_file *m,
+ enum cpuacct_usage_index index)
{
struct cpuacct *ca = css_ca(seq_css(m));
u64 percpu;
int i;

for_each_possible_cpu(i) {
- percpu = cpuacct_cpuusage_read(ca, i);
+ percpu = cpuacct_cpuusage_read(ca, i, index);
seq_printf(m, "%llu ", (unsigned long long) percpu);
}
seq_printf(m, "\n");
return 0;
}

+static int cpuacct_percpu_user_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_USER);
+}
+
+static int cpuacct_percpu_sys_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_SYSTEM);
+}
+
+static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_NRUSAGE);
+}
+
static const char * const cpuacct_stat_desc[] = {
[CPUACCT_STAT_USER] = "user",
[CPUACCT_STAT_SYSTEM] = "system",
@@ -220,10 +282,26 @@ static struct cftype files[] = {
.write_u64 = cpuusage_write,
},
{
+ .name = "usage_user",
+ .read_u64 = cpuusage_user_read,
+ },
+ {
+ .name = "usage_sys",
+ .read_u64 = cpuusage_sys_read,
+ },
+ {
.name = "usage_percpu",
.seq_show = cpuacct_percpu_seq_show,
},
{
+ .name = "usage_percpu_user",
+ .seq_show = cpuacct_percpu_user_seq_show,
+ },
+ {
+ .name = "usage_percpu_sys",
+ .seq_show = cpuacct_percpu_sys_seq_show,
+ },
+ {
.name = "stat",
.seq_show = cpuacct_stats_show,
},
@@ -238,10 +316,18 @@ static struct cftype files[] = {
void cpuacct_charge(struct task_struct *tsk, u64 cputime)
{
struct cpuacct *ca;
+ int index;
+
+ if (user_mode(task_pt_regs(tsk)))
+ index = CPUACCT_USAGE_USER;
+ else
+ index = CPUACCT_USAGE_SYSTEM;

rcu_read_lock();
+
for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
- *this_cpu_ptr(ca->cpuusage) += cputime;
+ this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
+
rcu_read_unlock();
}

--
1.8.5.1



2016-03-22 09:44:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] cpuacct: split usage into user_usage and sys_usage

On Tue, Mar 22, 2016 at 04:37:04PM +0800, Zhao Lei wrote:
> Sometimes, cpuacct.usage is not detialed enough to user
> to see how much usage a group used. We want to know how
> much time it used in user mode and how much in kernel mode.
>
> This patch introduce some more files to tell user these information.
> # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
> /sys/fs/cgroup/cpuacct/cpuacct.usage
> /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
> /sys/fs/cgroup/cpuacct/cpuacct.usage_user
> /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
> /sys/fs/cgroup/cpuacct/cpuacct.usage_sys
> /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys
>
> Changelog v3->v4:
> 1: Add a patch to use for_each_possible_cpu to iterate all cpus,
> Suggested-by: Peter Zijlstra <[email protected]>

Thanks!

Subject: [tip:sched/core] sched/cpuacct: Show all possible CPUs in cpuacct output

Commit-ID: 5ca3726af7f66a8cc71ce4414cfeb86deb784491
Gitweb: http://git.kernel.org/tip/5ca3726af7f66a8cc71ce4414cfeb86deb784491
Author: Zhao Lei <[email protected]>
AuthorDate: Tue, 22 Mar 2016 16:37:07 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 31 Mar 2016 10:45:56 +0200

sched/cpuacct: Show all possible CPUs in cpuacct output

Current code show stats of online CPUs in cpuacct.statcpus,
show stats of present cpus in cpuacct.usage(_percpu), and using
present CPUs for setting cpuacct.usage.

It will cause inconsistent result when a CPU is online or offline
or hotpluged.

We should always use possible CPUs to avoid above problem.

Here are the contents of a cpuacct.usage_percpu sysfs file,
on a 4 CPU system with maxcpus=32:

Before the patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 832584

After the patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Zhao Lei <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/a11d56cef12d0b4807f8be3a46bf9798c3014d59.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpuacct.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 4a81120..e39bd4f 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -138,7 +138,7 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
u64 totalcpuusage = 0;
int i;

- for_each_present_cpu(i)
+ for_each_possible_cpu(i)
totalcpuusage += cpuacct_cpuusage_read(ca, i);

return totalcpuusage;
@@ -159,7 +159,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
goto out;
}

- for_each_present_cpu(i)
+ for_each_possible_cpu(i)
cpuacct_cpuusage_write(ca, i, 0);

out:
@@ -172,7 +172,7 @@ static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
u64 percpu;
int i;

- for_each_present_cpu(i) {
+ for_each_possible_cpu(i) {
percpu = cpuacct_cpuusage_read(ca, i);
seq_printf(m, "%llu ", (unsigned long long) percpu);
}
@@ -191,7 +191,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
int cpu;
s64 val = 0;

- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
val += kcpustat->cpustat[CPUTIME_USER];
val += kcpustat->cpustat[CPUTIME_NICE];
@@ -200,7 +200,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_USER], val);

val = 0;
- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
val += kcpustat->cpustat[CPUTIME_SYSTEM];
val += kcpustat->cpustat[CPUTIME_IRQ];

Subject: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55
Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
Author: Dongsheng Yang <[email protected]>
AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 31 Mar 2016 10:48:54 +0200

sched/cpuacct: Split usage accounting into user_usage and sys_usage

Sometimes, cpuacct.usage is not detailed enough to see how much CPU
usage a group had. We want to know how much time it used in user mode
and how much in kernel mode.

This patch introduces more files to give this information:

# ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
/sys/fs/cgroup/cpuacct/cpuacct.usage_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

... while keeping the ABI with the existing counter.

Signed-off-by: Dongsheng Yang <[email protected]>
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpuacct.c | 140 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 113 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index e39bd4f..df947e0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
};

+enum cpuacct_usage_index {
+ CPUACCT_USAGE_USER, /* ... user mode */
+ CPUACCT_USAGE_SYSTEM, /* ... kernel mode */
+
+ CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+ u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
/* track cpu usage of a group of tasks and its child groups */
struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
- u64 __percpu *cpuusage;
+ struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
};

@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
}

-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
static struct cpuacct root_cpuacct = {
.cpustat = &kernel_cpustat,
.cpuusage = &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;

- ca->cpuusage = alloc_percpu(u64);
+ ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;

@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
kfree(ca);
}

-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+ enum cpuacct_usage_index index)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;

+ /*
+ * We allow index == CPUACCT_USAGE_NRUSAGE here to read
+ * the sum of suages.
+ */
+ BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
#ifndef CONFIG_64BIT
/*
* Take rq->lock to make 64-bit read safe on 32-bit platforms.
*/
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
+#endif
+
+ if (index == CPUACCT_USAGE_NRUSAGE) {
+ int i = 0;
+
+ data = 0;
+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+ data += cpuusage->usages[i];
+ } else {
+ data = cpuusage->usages[index];
+ }
+
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- data = *cpuusage;
#endif

return data;
@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)

static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ int i;

#ifndef CONFIG_64BIT
/*
* Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- *cpuusage = val;
+#endif
+
+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+ cpuusage->usages[i] = val;
+
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
#endif
}

/* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+static u64 __cpuusage_read(struct cgroup_subsys_state *css,
+ enum cpuacct_usage_index index)
{
struct cpuacct *ca = css_ca(css);
u64 totalcpuusage = 0;
int i;

for_each_possible_cpu(i)
- totalcpuusage += cpuacct_cpuusage_read(ca, i);
+ totalcpuusage += cpuacct_cpuusage_read(ca, i, index);

return totalcpuusage;
}

+static u64 cpuusage_user_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_USER);
+}
+
+static u64 cpuusage_sys_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_SYSTEM);
+}
+
+static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ return __cpuusage_read(css, CPUACCT_USAGE_NRUSAGE);
+}
+
static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
u64 val)
{
struct cpuacct *ca = css_ca(css);
- int err = 0;
- int i;
+ int cpu;

/*
* Only allow '0' here to do a reset.
*/
- if (val) {
- err = -EINVAL;
- goto out;
- }
+ if (val)
+ return -EINVAL;

- for_each_possible_cpu(i)
- cpuacct_cpuusage_write(ca, i, 0);
+ for_each_possible_cpu(cpu)
+ cpuacct_cpuusage_write(ca, cpu, 0);

-out:
- return err;
+ return 0;
}

-static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
+static int __cpuacct_percpu_seq_show(struct seq_file *m,
+ enum cpuacct_usage_index index)
{
struct cpuacct *ca = css_ca(seq_css(m));
u64 percpu;
int i;

for_each_possible_cpu(i) {
- percpu = cpuacct_cpuusage_read(ca, i);
+ percpu = cpuacct_cpuusage_read(ca, i, index);
seq_printf(m, "%llu ", (unsigned long long) percpu);
}
seq_printf(m, "\n");
return 0;
}

+static int cpuacct_percpu_user_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_USER);
+}
+
+static int cpuacct_percpu_sys_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_SYSTEM);
+}
+
+static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
+{
+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_NRUSAGE);
+}
+
static const char * const cpuacct_stat_desc[] = {
[CPUACCT_STAT_USER] = "user",
[CPUACCT_STAT_SYSTEM] = "system",
@@ -220,10 +282,26 @@ static struct cftype files[] = {
.write_u64 = cpuusage_write,
},
{
+ .name = "usage_user",
+ .read_u64 = cpuusage_user_read,
+ },
+ {
+ .name = "usage_sys",
+ .read_u64 = cpuusage_sys_read,
+ },
+ {
.name = "usage_percpu",
.seq_show = cpuacct_percpu_seq_show,
},
{
+ .name = "usage_percpu_user",
+ .seq_show = cpuacct_percpu_user_seq_show,
+ },
+ {
+ .name = "usage_percpu_sys",
+ .seq_show = cpuacct_percpu_sys_seq_show,
+ },
+ {
.name = "stat",
.seq_show = cpuacct_stats_show,
},
@@ -238,10 +316,18 @@ static struct cftype files[] = {
void cpuacct_charge(struct task_struct *tsk, u64 cputime)
{
struct cpuacct *ca;
+ int index;
+
+ if (user_mode(task_pt_regs(tsk)))
+ index = CPUACCT_USAGE_USER;
+ else
+ index = CPUACCT_USAGE_SYSTEM;

rcu_read_lock();
+
for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
- *this_cpu_ptr(ca->cpuusage) += cputime;
+ this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
+
rcu_read_unlock();
}


2016-04-04 14:04:28

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

* tip-bot for Dongsheng Yang <[email protected]> [2016-03-31 02:27:39]:

> Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55
> Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
> Author: Dongsheng Yang <[email protected]>
> AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 31 Mar 2016 10:48:54 +0200
>
> sched/cpuacct: Split usage accounting into user_usage and sys_usage
>
> Sometimes, cpuacct.usage is not detailed enough to see how much CPU
> usage a group had. We want to know how much time it used in user mode
> and how much in kernel mode.
>

Unfortunately this commit causes boot to fail on my power 7 box.

> @@ -238,10 +316,18 @@ static struct cftype files[] = {
> void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> {
> struct cpuacct *ca;
> + int index;
> +
> + if (user_mode(task_pt_regs(tsk)))
> + index = CPUACCT_USAGE_USER;
> + else
> + index = CPUACCT_USAGE_SYSTEM;
>
> rcu_read_lock();
> +
> for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
> - *this_cpu_ptr(ca->cpuusage) += cputime;
> + this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;

The above line seems to be the cause of oops. Attached the complete console log below.

Unable to handle kernel paging request for data at address 0x00000108
Faulting instruction address: 0xc0000000000f8fbc
cpu 0x0: Vector: 300 (Data Access) at [c000000f1f607720]
pc: c0000000000f8fbc: .cpuacct_charge+0x2c/0x110
lr: c0000000000dd184: .update_curr+0xa4/0x1d0
sp: c000000f1f6079a0
msr: 9000000002009032
dar: 108
dsisr: 40000000
current = 0xc000000f1f582480
paca = 0xc00000000fe00000 softe: 0 irq_happened: 0x01
pid = 2, comm = kthreadd
Linux version 4.6.0-rc1-master+ (srikar@llmjuno) (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7) (GCC) ) #38 SMP Tue Apr 5 00:20:26 IST 2016
enter ? for help
[c000000f1f607a30] c0000000000dd184 .update_curr+0xa4/0x1d0
[c000000f1f607ad0] c0000000000dfc2c .dequeue_task_fair+0xbc/0x1110
[c000000f1f607be0] c0000000000d3c70 .do_set_cpus_allowed+0x1b0/0x1c0
[c000000f1f607c70] c0000000000d4424 .__set_cpus_allowed_ptr+0x194/0x2e0
[c000000f1f607d50] c0000000000c70b8 .kthreadd+0x68/0x2c0
[c000000f1f607e30] c0000000000095e4 .ret_from_kernel_thread+0x58/0x74

git bisect shows this as the commit that causes the problem.
I verifed by booting with the tip + revert of this commit.

> +
> rcu_read_unlock();
> }
>

--
Thanks and Regards
Srikar Dronamraju


phys_mem_size = 0x1000000000
cpu_features = 0x0b7e7ae518500049
possible = 0x3fffffff18500649
always = 0x0000000018100040
cpu_user_features = 0xdc0065c7 0x20000000
mmu_features = 0x7c000003
firmware_features = 0x0000000010000000
htab_address = 0xc000000ffc000000
htab_hash_mask = 0x7ffff
-----------------------------------------------------
<- setup_system()
Linux version 4.6.0-rc1-master+ (srikar@llmjuno) (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7) (GCC) ) #38 SMP Tue Apr 5 00:20:26 IST 2016
Node 0 Memory:
Node 1 Memory: 0x0-0x1000000000
numa: Initmem setup node 0
numa: NODE_DATA [mem 0xf2ed9e100-0xf2eda7fff]
numa: NODE_DATA(0) on node 1
numa: Initmem setup node 1 [mem 0x00000000-0xfffffffff]
numa: NODE_DATA [mem 0xf2ed94200-0xf2ed9e0ff]
Section 3884 and 3886 (node 1) have a circular dependency on usemap and pgdat allocations
Probing IODA IO-Hub /io-hub@3efe00000000
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe00080000
PCI host bridge /io-hub@3efe00000000/pciex@3efe00080000 (primary) ranges:
IO 0x00003efe01000000..0x00003efe017fffff -> 0x0000000000000000
MEM 0x00003da080000000..0x00003da0fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0x600)
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe00090000
PCI host bridge /io-hub@3efe00000000/pciex@3efe00090000 ranges:
IO 0x00003efe01800000..0x00003efe01ffffff -> 0x0000000000000000
MEM 0x00003da180000000..0x00003da1fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0xa00)
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe000a0000
PCI host bridge /io-hub@3efe00000000/pciex@3efe000a0000 ranges:
IO 0x00003efe02000000..0x00003efe027fffff -> 0x0000000000000000
MEM 0x00003da280000000..0x00003da2fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0xe00)
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe000b0000
PCI host bridge /io-hub@3efe00000000/pciex@3efe000b0000 ranges:
IO 0x00003efe02800000..0x00003efe02ffffff -> 0x0000000000000000
MEM 0x00003da380000000..0x00003da3fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0x1200)
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe000c0000
PCI host bridge /io-hub@3efe00000000/pciex@3efe000c0000 ranges:
IO 0x00003efe03000000..0x00003efe037fffff -> 0x0000000000000000
MEM 0x00003da480000000..0x00003da4fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0x1600)
Initializing IODA0 OPAL PHB /io-hub@3efe00000000/pciex@3efe000d0000
PCI host bridge /io-hub@3efe00000000/pciex@3efe000d0000 ranges:
IO 0x00003efe03800000..0x00003efe03ffffff -> 0x0000000000000000
MEM 0x00003da580000000..0x00003da5fffeffff -> 0x0000000080000000
Not support M64 window
128 (127) PE's M32: 0x80000000 [segment=0x1000000]
IO: 0x800000 [segment=0x10000]
Allocated bitmap for 256 MSIs (base IRQ 0x1a00)
OPAL nvram setup, 1048576 bytes
Top of RAM: 0x1000000000, Total RAM: 0x1000000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x0000000fffffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 1: [mem 0x0000000000000000-0x0000000fffffffff]
Could not find start_pfn for node 0
Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
On node 0 totalpages: 0
Initmem setup node 1 [mem 0x0000000000000000-0x0000000fffffffff]
On node 1 totalpages: 1048576
DMA zone: 1024 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 1048576 pages, LIFO batch:1
percpu: Embedded 3 pages/cpu @c000000f2dc00000 s133656 r0 d62952 u262144
pcpu-alloc: s133656 r0 d62952 u262144 alloc=1*1048576
pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
pcpu-alloc: [0] 48 49 50 51 [0] 52 53 54 55
pcpu-alloc: [0] 56 57 58 59 [0] 60 61 62 63
Built 2 zonelists in Node order, mobility grouping on. Total pages: 1047552
Policy zone: DMA
Kernel command line: BOOT_IMAGE=/boot/vmlinux-4.6.0-rc1-master+ root=/dev/mapper/fedora_llmjuno03b-root ro rd.md=0 rd.dm=0 vconsole.keymap=us rd.luks=0 vconsole.font=latarcyrheb-sun16 rd.lvm.lv=fedora_llmjuno03b/swap rd.lvm.lv=fedora_llmjuno03b/root nmi_watchdog=0 nohpet ignore_loglevel log_buf_len=10M print_fatal_signals=1 loglevel=8 panic=30
log_buf_len: 16777216 bytes
early log buf free: 122804(93%)
PID hash table entries: 4096 (order: -1, 32768 bytes)
Sorting __ex_table...
Memory: 63497024K/67108864K available (9152K kernel code, 1728K rwdata, 3160K rodata, 832K init, 1040K bss, 253120K reserved, 3358720K cma-reserved)
SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=64, Nodes=2
Hierarchical RCU implementation.
Build-time adjustment of leaf fanout to 64.
RCU restricting CPUs from NR_CPUS=2048 to nr_cpu_ids=64.
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=64
NR_IRQS:512 nr_irqs:512 16
ICS OPAL backend registered
time_init: decrementer frequency = 512.000000 MHz
time_init: processor frequency = 3300.000000 MHz
clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
clocksource: timebase mult[1f40000] shift[24] registered
clockevent: decrementer mult[83126e98] shift[32] cpu[0]
Console: colour dummy device 80x25
console [hvc0] enabled
console [hvc0] enabled
bootconsole [udbg0] disabled
bootconsole [udbg0] disabled
mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
pid_max: default: 65536 minimum: 512
Dentry cache hash table entries: 8388608 (order: 10, 67108864 bytes)
Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes)
Mount-cache hash table entries: 131072 (order: 4, 1048576 bytes)
Mountpoint-cache hash table entries: 131072 (order: 4, 1048576 bytes)
Unable to handle kernel paging request for data at address 0x00000108
Faulting instruction address: 0xc0000000000f8fbc
cpu 0x0: Vector: 300 (Data Access) at [c000000f1f607720]
pc: c0000000000f8fbc: .cpuacct_charge+0x2c/0x110
lr: c0000000000dd184: .update_curr+0xa4/0x1d0
sp: c000000f1f6079a0
msr: 9000000002009032
dar: 108
dsisr: 40000000
current = 0xc000000f1f582480
paca = 0xc00000000fe00000 softe: 0 irq_happened: 0x01
pid = 2, comm = kthreadd
Linux version 4.6.0-rc1-master+ (srikar@llmjuno) (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7) (GCC) ) #38 SMP Tue Apr 5 00:20:26 IST 2016
enter ? for help
[c000000f1f607a30] c0000000000dd184 .update_curr+0xa4/0x1d0
[c000000f1f607ad0] c0000000000dfc2c .dequeue_task_fair+0xbc/0x1110
[c000000f1f607be0] c0000000000d3c70 .do_set_cpus_allowed+0x1b0/0x1c0
[c000000f1f607c70] c0000000000d4424 .__set_cpus_allowed_ptr+0x194/0x2e0
[c000000f1f607d50] c0000000000c70b8 .kthreadd+0x68/0x2c0
[c000000f1f607e30] c0000000000095e4 .ret_from_kernel_thread+0x58/0x74

2016-04-06 06:54:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage


* Srikar Dronamraju <[email protected]> wrote:

> * tip-bot for Dongsheng Yang <[email protected]> [2016-03-31 02:27:39]:
>
> > Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55
> > Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
> > Author: Dongsheng Yang <[email protected]>
> > AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Thu, 31 Mar 2016 10:48:54 +0200
> >
> > sched/cpuacct: Split usage accounting into user_usage and sys_usage
> >
> > Sometimes, cpuacct.usage is not detailed enough to see how much CPU
> > usage a group had. We want to know how much time it used in user mode
> > and how much in kernel mode.
> >
>
> Unfortunately this commit causes boot to fail on my power 7 box.
>
> > @@ -238,10 +316,18 @@ static struct cftype files[] = {
> > void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> > {
> > struct cpuacct *ca;
> > + int index;
> > +
> > + if (user_mode(task_pt_regs(tsk)))
> > + index = CPUACCT_USAGE_USER;
> > + else
> > + index = CPUACCT_USAGE_SYSTEM;
> >
> > rcu_read_lock();
> > +
> > for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
> > - *this_cpu_ptr(ca->cpuusage) += cputime;
> > + this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
>
> The above line seems to be the cause of oops. Attached the complete console log below.

Weird - not much changed wrt. the cpuusage logic, we only increased its size.

If you change the above loop to something like:

for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
if (WARN_ON_ONCE(!ca->cpuusage))
continue;
this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
}

then do you get the warning and the bootup succeeds?

Thanks,

Ingo

2016-04-06 07:47:26

by Zhao Lei

[permalink] [raw]
Subject: RE: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

Hi, Ingo and Srikar,

> -----Original Message-----
> From: Ingo Molnar [mailto:[email protected]] On Behalf Of Ingo
> Molnar
> Sent: Wednesday, April 06, 2016 2:55 PM
> To: Srikar Dronamraju <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into
> user_usage and sys_usage
>
>
> * Srikar Dronamraju <[email protected]> wrote:
>
> > * tip-bot for Dongsheng Yang <[email protected]> [2016-03-31 02:27:39]:
> >
> > > Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55
> > > Gitweb:
> http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
> > > Author: Dongsheng Yang <[email protected]>
> > > AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
> > > Committer: Ingo Molnar <[email protected]>
> > > CommitDate: Thu, 31 Mar 2016 10:48:54 +0200
> > >
> > > sched/cpuacct: Split usage accounting into user_usage and sys_usage
> > >
> > > Sometimes, cpuacct.usage is not detailed enough to see how much CPU
> > > usage a group had. We want to know how much time it used in user mode
> > > and how much in kernel mode.
> > >
> >
> > Unfortunately this commit causes boot to fail on my power 7 box.
> >
> > > @@ -238,10 +316,18 @@ static struct cftype files[] = {
> > > void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> > > {
> > > struct cpuacct *ca;
> > > + int index;
> > > +
> > > + if (user_mode(task_pt_regs(tsk)))
> > > + index = CPUACCT_USAGE_USER;
> > > + else
> > > + index = CPUACCT_USAGE_SYSTEM;
> > >
> > > rcu_read_lock();
> > > +
> > > for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
> > > - *this_cpu_ptr(ca->cpuusage) += cputime;
> > > + this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
> >
> > The above line seems to be the cause of oops. Attached the complete console
> log below.
>
Thanks for reporting this bug.
I'll take over this patch.

> Weird - not much changed wrt. the cpuusage logic, we only increased its size.
>
Thanks for concern.
I trying to review this patch, but had not found something strange, except:

> If you change the above loop to something like:
>
> for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
> if (WARN_ON_ONCE(!ca->cpuusage))
> continue;
> this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;

Or s/ this_cpu_ptr/get_cpu_var to avoid preempt?

But preempt problem should not cause problem in every boot,
And Srikar's bisect means it failed in every boot.

> }
>
> then do you get the warning and the bootup succeeds?
>
I also building a power7 vm for test, still in building...

Thanks
Zhaolei




2016-04-06 07:50:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

On Wed, Apr 06, 2016 at 03:47:18PM +0800, Zhao Lei wrote:
> > If you change the above loop to something like:
> >
> > for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
> > if (WARN_ON_ONCE(!ca->cpuusage))
> > continue;
> > this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
>
> Or s/ this_cpu_ptr/get_cpu_var to avoid preempt?

This is all called with rq->lock held, preemption is firmly disabled.

2016-04-06 10:33:21

by Anton Blanchard

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

Hi,

> > > void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> > > {
> > > struct cpuacct *ca;
> > > + int index;
> > > +
> > > + if (user_mode(task_pt_regs(tsk)))
> > > + index = CPUACCT_USAGE_USER;
> > > + else
> > > + index = CPUACCT_USAGE_SYSTEM;

This is oopsing because PowerPC task_pt_regs() returns NULL for
kernel threads.

Anton

2016-04-06 11:08:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

On Wed, Apr 06, 2016 at 08:32:19PM +1000, Anton Blanchard wrote:
> Hi,
>
> > > > void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> > > > {
> > > > struct cpuacct *ca;
> > > > + int index;
> > > > +
> > > > + if (user_mode(task_pt_regs(tsk)))
> > > > + index = CPUACCT_USAGE_USER;
> > > > + else
> > > > + index = CPUACCT_USAGE_SYSTEM;
>
> This is oopsing because PowerPC task_pt_regs() returns NULL for
> kernel threads.

Ah, so sometihng like:

struct pt_regs *regs = task_pt_regs();
int index = CPUACCT_USAGE_SYSTEM;

if (regs && user_mode(regs))
index = CPUACCT_USAGE_USER;

should work, right?


2016-04-06 11:44:24

by Zhao Lei

[permalink] [raw]
Subject: RE: [tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

Hi, Anton and Peter

> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Wednesday, April 06, 2016 7:08 PM
> To: Anton Blanchard <[email protected]>
> Cc: Ingo Molnar <[email protected]>; Srikar Dronamraju
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Stephen Rothwell
> <[email protected]>; Michael Ellerman <[email protected]>
> Subject: Re: [tip:sched/core] sched/cpuacct: Split usage accounting into
> user_usage and sys_usage
>
> On Wed, Apr 06, 2016 at 08:32:19PM +1000, Anton Blanchard wrote:
> > Hi,
> >
> > > > > void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> > > > > {
> > > > > struct cpuacct *ca;
> > > > > + int index;
> > > > > +
> > > > > + if (user_mode(task_pt_regs(tsk)))
> > > > > + index = CPUACCT_USAGE_USER;
> > > > > + else
> > > > > + index = CPUACCT_USAGE_SYSTEM;
> >
> > This is oopsing because PowerPC task_pt_regs() returns NULL for
> > kernel threads.
>
Thanks for notice it.

> Ah, so sometihng like:
>
> struct pt_regs *regs = task_pt_regs();
> int index = CPUACCT_USAGE_SYSTEM;
>
> if (regs && user_mode(regs))
> index = CPUACCT_USAGE_USER;
>
> should work, right?
>
Thanks, I'll confirm it in a Power7 VM from buildroot.
(maybe need some hours or till tomorrow in a poor pc...)

Btw, I reproduced this bug in above vm.

Thanks
Zhaolei


>




2016-04-06 12:00:37

by Anton Blanchard

[permalink] [raw]
Subject: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

Hi Peter,

> Ah, so sometihng like:
>
> struct pt_regs *regs = task_pt_regs();
> int index = CPUACCT_USAGE_SYSTEM;
>
> if (regs && user_mode(regs))
> index = CPUACCT_USAGE_USER;
>
> should work, right?

Looks good, and the patch below does fix the oops for me.

Anton
--

task_pt_regs() can return NULL for kernel threads, so add a check.
This fixes an oops at boot on ppc64.

Signed-off-by: Anton Blanchard <[email protected]>
---

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index df947e0..41f85c4 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -316,12 +316,11 @@ static struct cftype files[] = {
void cpuacct_charge(struct task_struct *tsk, u64 cputime)
{
struct cpuacct *ca;
- int index;
+ int index = CPUACCT_USAGE_SYSTEM;
+ struct pt_regs *regs = task_pt_regs(tsk);

- if (user_mode(task_pt_regs(tsk)))
+ if (regs && user_mode(regs))
index = CPUACCT_USAGE_USER;
- else
- index = CPUACCT_USAGE_SYSTEM;

rcu_read_lock();


2016-04-06 14:38:16

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

* Anton Blanchard <[email protected]> [2016-04-06 21:59:50]:

> Looks good, and the patch below does fix the oops for me.
>
> Anton
> --
>
> task_pt_regs() can return NULL for kernel threads, so add a check.
> This fixes an oops at boot on ppc64.
>
> Signed-off-by: Anton Blanchard <[email protected]>

Works for me too.

Reported-and-Tested-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju

2016-04-06 17:06:06

by Zhao Lei

[permalink] [raw]
Subject: RE: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()



> -----Original Message-----
> From: Srikar Dronamraju [mailto:[email protected]]
> Sent: Wednesday, April 06, 2016 9:27 PM
> To: Anton Blanchard <[email protected]>
> Cc: Peter Zijlstra <[email protected]>; Ingo Molnar <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Stephen
> Rothwell <[email protected]>; Michael Ellerman <[email protected]>;
> [email protected]
> Subject: Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()
>
> * Anton Blanchard <[email protected]> [2016-04-06 21:59:50]:
>
> > Looks good, and the patch below does fix the oops for me.
> >
> > Anton
> > --
> >
> > task_pt_regs() can return NULL for kernel threads, so add a check.
> > This fixes an oops at boot on ppc64.
> >
> > Signed-off-by: Anton Blanchard <[email protected]>
>
> Works for me too.
>
> Reported-and-Tested-by: Srikar Dronamraju <[email protected]>
>
Thanks, all

I tested it in the vm(can not boot before fix):

[single cpu]
# for f in cpuacct.*; do echo "$f: "; cat $f; done
cpuacct.stat:
user 260
system 451
cpuacct.usage:
6923456654
cpuacct.usage_percpu:
6963446178
cpuacct.usage_percpu_sys:
2610934362
cpuacct.usage_percpu_user:
4451667140
cpuacct.usage_sys:
2612684054
cpuacct.usage_user:
4540275322
#

[2 cpu wih 8 maxcpus]
# for f in cpuacct.*; do echo "$f: "; cat $f; done
cpuacct.stat:
user 205
system 536
cpuacct.usage:
7293688020
cpuacct.usage_percpu:
3785674990 3551323200 0 0 0 0 0 0
cpuacct.usage_percpu_sys:
2227281124 1635060584 0 0 0 0 0 0
cpuacct.usage_percpu_user:
1567487176 1992278818 0 0 0 0 0 0
cpuacct.usage_sys:
3863445982
cpuacct.usage_user:
3643874038
#

# cat /proc/cpuinfo
processor : 0
cpu : POWER7 (raw), altivec supported
clock : 1000.000000MHz
revision : 2.3 (pvr 003f 0203)

processor : 1
cpu : POWER7 (raw), altivec supported
clock : 1000.000000MHz
revision : 2.3 (pvr 003f 0203)

timebase : 512000000
platform : pSeries
model : IBM pSeries (emulated by qemu)
machine : CHRP IBM pSeries (emulated by qemu)
#

Thanks
Zhaolei

> --
> Thanks and Regards
> Srikar Dronamraju
>




2016-04-11 06:16:22

by Michael Ellerman

[permalink] [raw]
Subject: Ping? Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

Hi Peter/Ingo,

On Wed, 2016-04-06 at 21:59 +1000, Anton Blanchard wrote:
> Hi Peter,
>
> > Ah, so sometihng like:
> >
> > struct pt_regs *regs = task_pt_regs();
> > int index = CPUACCT_USAGE_SYSTEM;
> >
> > if (regs && user_mode(regs))
> > index = CPUACCT_USAGE_USER;
> >
> > should work, right?
>
> Looks good, and the patch below does fix the oops for me.

Can we get this merged please? It's blocking all my boot tests.

If anyone's bothered:

Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")

cheers

> task_pt_regs() can return NULL for kernel threads, so add a check.
> This fixes an oops at boot on ppc64.
>
> Signed-off-by: Anton Blanchard <[email protected]>
> ---
>
> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index df947e0..41f85c4 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -316,12 +316,11 @@ static struct cftype files[] = {
> void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> {
> struct cpuacct *ca;
> - int index;
> + int index = CPUACCT_USAGE_SYSTEM;
> + struct pt_regs *regs = task_pt_regs(tsk);
>
> - if (user_mode(task_pt_regs(tsk)))
> + if (regs && user_mode(regs))
> index = CPUACCT_USAGE_USER;
> - else
> - index = CPUACCT_USAGE_SYSTEM;
>
> rcu_read_lock();
>
>

2016-04-13 07:43:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()


* Srikar Dronamraju <[email protected]> wrote:

> * Anton Blanchard <[email protected]> [2016-04-06 21:59:50]:
>
> > Looks good, and the patch below does fix the oops for me.
> >
> > Anton
> > --
> >
> > task_pt_regs() can return NULL for kernel threads, so add a check.
> > This fixes an oops at boot on ppc64.
> >
> > Signed-off-by: Anton Blanchard <[email protected]>
>
> Works for me too.
>
> Reported-and-Tested-by: Srikar Dronamraju <[email protected]>

Could someone please re-send the fix, because it has not reached me nor lkml.

Thanks,

Ingo

2016-04-13 11:02:09

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

On Wed, 2016-04-13 at 09:43 +0200, Ingo Molnar wrote:
> * Srikar Dronamraju <[email protected]> wrote:
>
> > * Anton Blanchard <[email protected]> [2016-04-06 21:59:50]:
> >
> > > Looks good, and the patch below does fix the oops for me.
> > >
> > > Anton
> > > --
> > >
> > > task_pt_regs() can return NULL for kernel threads, so add a check.
> > > This fixes an oops at boot on ppc64.
> > >
> > > Signed-off-by: Anton Blanchard <[email protected]>
> >
> > Works for me too.
> >
> > Reported-and-Tested-by: Srikar Dronamraju <[email protected]>
>
> Could someone please re-send the fix, because it has not reached me nor lkml.

It did hit LKML:

http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten

But that did have some verbiage at the top.

Anton's also resent it directly To you.

cheers

2016-04-13 11:21:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()


* Michael Ellerman <[email protected]> wrote:

> On Wed, 2016-04-13 at 09:43 +0200, Ingo Molnar wrote:
> > * Srikar Dronamraju <[email protected]> wrote:
> >
> > > * Anton Blanchard <[email protected]> [2016-04-06 21:59:50]:
> > >
> > > > Looks good, and the patch below does fix the oops for me.
> > > >
> > > > Anton
> > > > --
> > > >
> > > > task_pt_regs() can return NULL for kernel threads, so add a check.
> > > > This fixes an oops at boot on ppc64.
> > > >
> > > > Signed-off-by: Anton Blanchard <[email protected]>
> > >
> > > Works for me too.
> > >
> > > Reported-and-Tested-by: Srikar Dronamraju <[email protected]>
> >
> > Could someone please re-send the fix, because it has not reached me nor lkml.
>
> It did hit LKML:
>
> http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten
>
> But that did have some verbiage at the top.
>
> Anton's also resent it directly To you.

So it was in my Spam folder, due to the following SPF softfail:

Received-SPF: softfail (google.com: domain of transitioning [email protected] does not designate 198.145.29.136 as permitted sender) client-ip=198.145.29.136;

have the patch now.

Thanks,

Ingo

Subject: [tip:sched/core] sched/cpuacct: Check for NULL when using task_pt_regs()

Commit-ID: bd92883051a0228cc34996b8e766111ba10c9aac
Gitweb: http://git.kernel.org/tip/bd92883051a0228cc34996b8e766111ba10c9aac
Author: Anton Blanchard <[email protected]>
AuthorDate: Wed, 6 Apr 2016 21:59:50 +1000
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 13 Apr 2016 13:22:37 +0200

sched/cpuacct: Check for NULL when using task_pt_regs()

task_pt_regs() can return NULL for kernel threads, so add a check.
This fixes an oops at boot on ppc64.

Reported-and-Tested-by: Srikar Dronamraju <[email protected]>
Tested-by: Zhao Lei <[email protected]>
Signed-off-by: Anton Blanchard <[email protected]>
Acked-by: Zhao Lei <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpuacct.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index df947e0..41f85c4 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -316,12 +316,11 @@ static struct cftype files[] = {
void cpuacct_charge(struct task_struct *tsk, u64 cputime)
{
struct cpuacct *ca;
- int index;
+ int index = CPUACCT_USAGE_SYSTEM;
+ struct pt_regs *regs = task_pt_regs(tsk);

- if (user_mode(task_pt_regs(tsk)))
+ if (regs && user_mode(regs))
index = CPUACCT_USAGE_USER;
- else
- index = CPUACCT_USAGE_SYSTEM;

rcu_read_lock();