2019-11-21 02:46:03

by Frederic Weisbecker

Subject: [PATCH 0/6] sched/nohz: Make the rest of kcpustat vtime aware v3

(For the record, see v2 at
https://lore.kernel.org/lkml/[email protected]/ )

This set addresses reviews from Ingo on the previous take:

* Fix comment nit
* Add a reminder in a comment about the reschedule IPI that may happen on renice
* Constify input
* Indent a few blocks of assignments
* Use struct kernel_cpustat as the output for the kcpustat accessor. It
  makes things easier and cleaner than independent variables.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/kcpustat-v5

HEAD: 8941f0df7da73e1d95f0c0bed0121a14cdb52873

Thanks,
Frederic
---

Frederic Weisbecker (6):
sched/cputime: Support other fields on kcpustat_field()
sched/vtime: Bring up complete kcpustat accessor
procfs: Use all-in-one vtime aware kcpustat accessor
cpufreq: Use vtime aware kcpustat accessors for user time
leds: Use all-in-one vtime aware kcpustat accessor
rackmeter: Use vtime aware kcpustat accessor


drivers/cpufreq/cpufreq.c | 17 +--
drivers/cpufreq/cpufreq_governor.c | 6 +-
drivers/leds/trigger/ledtrig-activity.c | 14 ++-
drivers/macintosh/rack-meter.c | 7 +-
fs/proc/stat.c | 56 +++++-----
include/linux/kernel_stat.h | 7 ++
kernel/sched/cputime.c | 190 ++++++++++++++++++++++++++------
7 files changed, 223 insertions(+), 74 deletions(-)


2019-11-21 02:46:26

by Frederic Weisbecker

Subject: [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor

Now that we can also read user and guest time safely under vtime, use
the relevant accessor to fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
fs/proc/stat.c | 54 ++++++++++++++++++++++++++++----------------------
1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 5c6bd0ae3802..37bdbec5b402 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -120,20 +120,23 @@ static int show_stat(struct seq_file *p, void *v)
getboottime64(&boottime);

for_each_possible_cpu(i) {
- struct kernel_cpustat *kcs = &kcpustat_cpu(i);
+ struct kernel_cpustat kcpustat;
+ u64 *cpustat = kcpustat.cpustat;

- user += kcs->cpustat[CPUTIME_USER];
- nice += kcs->cpustat[CPUTIME_NICE];
- system += kcpustat_field(kcs, CPUTIME_SYSTEM, i);
- idle += get_idle_time(kcs, i);
- iowait += get_iowait_time(kcs, i);
- irq += kcs->cpustat[CPUTIME_IRQ];
- softirq += kcs->cpustat[CPUTIME_SOFTIRQ];
- steal += kcs->cpustat[CPUTIME_STEAL];
- guest += kcs->cpustat[CPUTIME_GUEST];
- guest_nice += kcs->cpustat[CPUTIME_GUEST_NICE];
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
+ kcpustat_cpu_fetch(&kcpustat, i);
+
+ user += cpustat[CPUTIME_USER];
+ nice += cpustat[CPUTIME_NICE];
+ system += cpustat[CPUTIME_SYSTEM];
+ idle += get_idle_time(&kcpustat, i);
+ iowait += get_iowait_time(&kcpustat, i);
+ irq += cpustat[CPUTIME_IRQ];
+ softirq += cpustat[CPUTIME_SOFTIRQ];
+ steal += cpustat[CPUTIME_STEAL];
+ guest += cpustat[CPUTIME_GUEST];
+ guest_nice += cpustat[CPUTIME_USER];
+ sum += kstat_cpu_irqs_sum(i);
+ sum += arch_irq_stat_cpu(i);

for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -157,19 +160,22 @@ static int show_stat(struct seq_file *p, void *v)
seq_putc(p, '\n');

for_each_online_cpu(i) {
- struct kernel_cpustat *kcs = &kcpustat_cpu(i);
+ struct kernel_cpustat kcpustat;
+ u64 *cpustat = kcpustat.cpustat;
+
+ kcpustat_cpu_fetch(&kcpustat, i);

/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kcs->cpustat[CPUTIME_USER];
- nice = kcs->cpustat[CPUTIME_NICE];
- system = kcpustat_field(kcs, CPUTIME_SYSTEM, i);
- idle = get_idle_time(kcs, i);
- iowait = get_iowait_time(kcs, i);
- irq = kcs->cpustat[CPUTIME_IRQ];
- softirq = kcs->cpustat[CPUTIME_SOFTIRQ];
- steal = kcs->cpustat[CPUTIME_STEAL];
- guest = kcs->cpustat[CPUTIME_GUEST];
- guest_nice = kcs->cpustat[CPUTIME_GUEST_NICE];
+ user = cpustat[CPUTIME_USER];
+ nice = cpustat[CPUTIME_NICE];
+ system = cpustat[CPUTIME_SYSTEM];
+ idle = get_idle_time(&kcpustat, i);
+ iowait = get_iowait_time(&kcpustat, i);
+ irq = cpustat[CPUTIME_IRQ];
+ softirq = cpustat[CPUTIME_SOFTIRQ];
+ steal = cpustat[CPUTIME_STEAL];
+ guest = cpustat[CPUTIME_GUEST];
+ guest_nice = cpustat[CPUTIME_USER];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
--
2.23.0

2019-11-21 02:46:33

by Frederic Weisbecker

Subject: [PATCH 4/6] cpufreq: Use vtime aware kcpustat accessors for user time

We can now safely read user and guest kcpustat fields on nohz_full CPUs.
Use the appropriate accessors.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/cpufreq.c | 17 ++++++++++-------
drivers/cpufreq/cpufreq_governor.c | 6 +++---
2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 527fd068dc12..ee23eaf20f35 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -113,18 +113,21 @@ EXPORT_SYMBOL_GPL(get_governor_parent_kobj);

static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
{
- u64 idle_time;
+ struct kernel_cpustat kcpustat;
u64 cur_wall_time;
+ u64 idle_time;
u64 busy_time;

cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());

- busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
- busy_time += kcpustat_field(&kcpustat_cpu(cpu), CPUTIME_SYSTEM, cpu);
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+ kcpustat_cpu_fetch(&kcpustat, cpu);
+
+ busy_time = kcpustat.cpustat[CPUTIME_USER];
+ busy_time += kcpustat.cpustat[CPUTIME_SYSTEM];
+ busy_time += kcpustat.cpustat[CPUTIME_IRQ];
+ busy_time += kcpustat.cpustat[CPUTIME_SOFTIRQ];
+ busy_time += kcpustat.cpustat[CPUTIME_STEAL];
+ busy_time += kcpustat.cpustat[CPUTIME_NICE];

idle_time = cur_wall_time - busy_time;
if (wall)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 4bb054d0cb43..f99ae45efaea 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
dbs_data->io_is_busy);
if (dbs_data->ignore_nice_load)
- j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}
}
}
@@ -149,7 +149,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
j_cdbs->prev_cpu_idle = cur_idle_time;

if (ignore_nice) {
- u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);

idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
j_cdbs->prev_cpu_nice = cur_nice;
@@ -530,7 +530,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
j_cdbs->prev_load = 0;

if (ignore_nice)
- j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}

gov->start(policy);
--
2.23.0

2019-11-21 02:46:59

by Frederic Weisbecker

Subject: [PATCH 1/6] sched/cputime: Support other fields on kcpustat_field()

Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or non-nice field is decided
from the nice value snapshot taken at the time kcpustat_field() is
called. If the nice value of the task has changed since the last vtime
update, the distribution between nice and non-nice cputime may be
inaccurate.

However this is considered a minor issue compared to the proper fix,
which would involve interrupting the target on nice updates, something
undesired on nohz_full CPUs.
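
For illustration only, here is a minimal caller-side sketch of the
extended accessor. The helper name below is hypothetical; only
kcpustat_field(), kcpustat_cpu() and the CPUTIME_* fields come from
this series:

	#include <linux/kernel_stat.h>
	#include <linux/cpumask.h>

	/* Hypothetical caller: sum vtime-aware nice time over all CPUs. */
	static u64 total_nice_time(void)
	{
		u64 nice = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			nice += kcpustat_field(&kcpustat_cpu(cpu),
					       CPUTIME_NICE, cpu);

		return nice;
	}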

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/cputime.c | 54 +++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e0cd20693ef5..27b5406222fc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,11 +912,21 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
} while (read_seqcount_retry(&vtime->seqcount, seq));
}

+static u64 kcpustat_user_vtime(struct vtime *vtime)
+{
+ if (vtime->state == VTIME_USER)
+ return vtime->utime + vtime_delta(vtime);
+ else if (vtime->state == VTIME_GUEST)
+ return vtime->gtime + vtime_delta(vtime);
+ return 0;
+}
+
static int kcpustat_field_vtime(u64 *cpustat,
- struct vtime *vtime,
+ struct task_struct *tsk,
enum cpu_usage_stat usage,
int cpu, u64 *val)
{
+ struct vtime *vtime = &tsk->vtime;
unsigned int seq;
int err;

@@ -946,9 +956,37 @@ static int kcpustat_field_vtime(u64 *cpustat,

*val = cpustat[usage];

- if (vtime->state == VTIME_SYS)
- *val += vtime->stime + vtime_delta(vtime);
-
+ /*
+ * Nice VS unnice cputime accounting may be inaccurate if
+ * the nice value has changed since the last vtime update.
+ * But proper fix would involve interrupting target on nice
+ * updates which is a no go on nohz_full (although the scheduler
+ * may still interrupt the target if rescheduling is needed...)
+ */
+ switch (usage) {
+ case CPUTIME_SYSTEM:
+ if (vtime->state == VTIME_SYS)
+ *val += vtime->stime + vtime_delta(vtime);
+ break;
+ case CPUTIME_USER:
+ if (task_nice(tsk) <= 0)
+ *val += kcpustat_user_vtime(vtime);
+ break;
+ case CPUTIME_NICE:
+ if (task_nice(tsk) > 0)
+ *val += kcpustat_user_vtime(vtime);
+ break;
+ case CPUTIME_GUEST:
+ if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0)
+ *val += vtime->gtime + vtime_delta(vtime);
+ break;
+ case CPUTIME_GUEST_NICE:
+ if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0)
+ *val += vtime->gtime + vtime_delta(vtime);
+ break;
+ default:
+ break;
+ }
} while (read_seqcount_retry(&vtime->seqcount, seq));

return 0;
@@ -965,15 +1003,10 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
if (!vtime_accounting_enabled_cpu(cpu))
return cpustat[usage];

- /* Only support sys vtime for now */
- if (usage != CPUTIME_SYSTEM)
- return cpustat[usage];
-
rq = cpu_rq(cpu);

for (;;) {
struct task_struct *curr;
- struct vtime *vtime;

rcu_read_lock();
curr = rcu_dereference(rq->curr);
@@ -982,8 +1015,7 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
return cpustat[usage];
}

- vtime = &curr->vtime;
- err = kcpustat_field_vtime(cpustat, vtime, usage, cpu, &val);
+ err = kcpustat_field_vtime(cpustat, curr, usage, cpu, &val);
rcu_read_unlock();

if (!err)
--
2.23.0

2019-11-21 02:47:52

by Frederic Weisbecker

Subject: [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor

We can now safely read user kcpustat fields on nohz_full CPUs.
Use the appropriate accessor.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jacek Anaszewski <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
drivers/leds/trigger/ledtrig-activity.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
index ddfc5edd07c8..6901e3631c22 100644
--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -57,11 +57,15 @@ static void led_activity_function(struct timer_list *t)
curr_used = 0;

for_each_possible_cpu(i) {
- curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
- + kcpustat_cpu(i).cpustat[CPUTIME_NICE]
- + kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
- + kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
- + kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
+ struct kernel_cpustat kcpustat;
+
+ kcpustat_fetch_cpu(&kcpustat, i);
+
+ curr_used += kcpustat.cpustat[CPUTIME_USER]
+ + kcpustat.cpustat[CPUTIME_NICE]
+ + kcpustat.cpustat[CPUTIME_SYSTEM]
+ + kcpustat.cpustat[CPUTIME_SOFTIRQ]
+ + kcpustat.cpustat[CPUTIME_IRQ];
cpus++;
}

--
2.23.0

2019-11-21 02:48:24

by Frederic Weisbecker

Subject: [PATCH 6/6] rackmeter: Use vtime aware kcpustat accessor

Now that we have a vtime-safe kcpustat accessor, use it to fetch
CPUTIME_NICE and fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
drivers/macintosh/rack-meter.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 4134e580f786..60311e8d6240 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -81,13 +81,14 @@ static int rackmeter_ignore_nice;
*/
static inline u64 get_cpu_idle_time(unsigned int cpu)
{
+ struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
u64 retval;

- retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
- kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+ retval = kcpustat->cpustat[CPUTIME_IDLE] +
+ kcpustat->cpustat[CPUTIME_IOWAIT];

if (rackmeter_ignore_nice)
- retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+ retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);

return retval;
}
--
2.23.0

2019-11-21 02:50:04

by Frederic Weisbecker

Subject: [PATCH 2/6] sched/vtime: Bring up complete kcpustat accessor

Many callsites want to fetch the system, user, user_nice, guest or
guest_nice kcpustat fields all at once, or at least a pair of them.

In that case, calling kcpustat_field() for each requested field brings
unnecessary overhead when we could fetch all of them in one go.

So provide kcpustat_cpu_fetch(), which fetches the whole kcpustat array
in a vtime-safe way under a single RCU and seqcount block.
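
For illustration only, a reader that needs several fields can then take
a single coherent snapshot instead of calling kcpustat_field() once per
field (the helper name below is hypothetical):

	#include <linux/kernel_stat.h>

	/* Hypothetical caller: one vtime-aware snapshot, several fields. */
	static u64 cpu_busy_time(int cpu)
	{
		struct kernel_cpustat kcpustat;
		u64 *cpustat = kcpustat.cpustat;

		kcpustat_cpu_fetch(&kcpustat, cpu);

		return cpustat[CPUTIME_USER] +
		       cpustat[CPUTIME_NICE] +
		       cpustat[CPUTIME_SYSTEM];
	}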

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/kernel_stat.h | 7 ++
kernel/sched/cputime.c | 136 ++++++++++++++++++++++++++++++------
2 files changed, 123 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 79781196eb25..89f0745c096d 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -81,12 +81,19 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
enum cpu_usage_stat usage, int cpu);
+extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
#else
static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
enum cpu_usage_stat usage, int cpu)
{
return kcpustat->cpustat[usage];
}
+
+static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
+{
+ *dst = kcpustat_cpu(cpu);
+}
+
#endif

extern void account_user_time(struct task_struct *, u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 27b5406222fc..d43318a489f2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,6 +912,30 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
} while (read_seqcount_retry(&vtime->seqcount, seq));
}

+static int vtime_state_check(struct vtime *vtime, int cpu)
+{
+ /*
+ * We raced against a context switch, fetch the
+ * kcpustat task again.
+ */
+ if (vtime->cpu != cpu && vtime->cpu != -1)
+ return -EAGAIN;
+
+ /*
+ * Two possible things here:
+ * 1) We are seeing the scheduling out task (prev) or any past one.
+ * 2) We are seeing the scheduling in task (next) but it hasn't
+ * passed though vtime_task_switch() yet so the pending
+ * cputime of the prev task may not be flushed yet.
+ *
+ * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
+ */
+ if (vtime->state == VTIME_INACTIVE)
+ return -EAGAIN;
+
+ return 0;
+}
+
static u64 kcpustat_user_vtime(struct vtime *vtime)
{
if (vtime->state == VTIME_USER)
@@ -933,26 +957,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
do {
seq = read_seqcount_begin(&vtime->seqcount);

- /*
- * We raced against context switch, fetch the
- * kcpustat task again.
- */
- if (vtime->cpu != cpu && vtime->cpu != -1)
- return -EAGAIN;
-
- /*
- * Two possible things here:
- * 1) We are seeing the scheduling out task (prev) or any past one.
- * 2) We are seeing the scheduling in task (next) but it hasn't
- * passed though vtime_task_switch() yet so the pending
- * cputime of the prev task may not be flushed yet.
- *
- * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
- */
- if (vtime->state == VTIME_INACTIVE)
- return -EAGAIN;
-
- err = 0;
+ err = vtime_state_check(vtime, cpu);
+ if (err < 0)
+ return err;

*val = cpustat[usage];

@@ -1025,4 +1032,93 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
}
}
EXPORT_SYMBOL_GPL(kcpustat_field);
+
+static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
+ const struct kernel_cpustat *src,
+ struct task_struct *tsk, int cpu)
+{
+ struct vtime *vtime = &tsk->vtime;
+ unsigned int seq;
+ int err;
+
+ do {
+ u64 *cpustat;
+ u64 delta;
+
+ seq = read_seqcount_begin(&vtime->seqcount);
+
+ err = vtime_state_check(vtime, cpu);
+ if (err < 0)
+ return err;
+
+ *dst = *src;
+ cpustat = dst->cpustat;
+
+ /* Task is sleeping, dead or idle, nothing to add */
+ if (vtime->state < VTIME_SYS)
+ continue;
+
+ delta = vtime_delta(vtime);
+
+ /*
+ * Task runs either in user (including guest) or kernel space,
+ * add pending nohz time to the right place.
+ */
+ if (vtime->state == VTIME_SYS) {
+ cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
+ } else if (vtime->state == VTIME_USER) {
+ if (task_nice(tsk) > 0)
+ cpustat[CPUTIME_NICE] += vtime->utime + delta;
+ else
+ cpustat[CPUTIME_USER] += vtime->utime + delta;
+ } else {
+ WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+ if (task_nice(tsk) > 0) {
+ cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
+ cpustat[CPUTIME_NICE] += vtime->gtime + delta;
+ } else {
+ cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
+ cpustat[CPUTIME_USER] += vtime->gtime + delta;
+ }
+ }
+ } while (read_seqcount_retry(&vtime->seqcount, seq));
+
+ return err;
+}
+
+void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
+{
+ const struct kernel_cpustat *src = &kcpustat_cpu(cpu);
+ struct rq *rq;
+ int err;
+
+ if (!vtime_accounting_enabled_cpu(cpu)) {
+ *dst = *src;
+ return;
+ }
+
+ rq = cpu_rq(cpu);
+
+ for (;;) {
+ struct task_struct *curr;
+
+ rcu_read_lock();
+ curr = rcu_dereference(rq->curr);
+ if (WARN_ON_ONCE(!curr)) {
+ rcu_read_unlock();
+ *dst = *src;
+ return;
+ }
+
+ err = kcpustat_cpu_fetch_vtime(dst, src, curr, cpu);
+ rcu_read_unlock();
+
+ if (!err)
+ return;
+
+ cpu_relax();
+ }
+}
+EXPORT_SYMBOL_GPL(kcpustat_cpu_fetch);
+
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
--
2.23.0

2019-11-21 07:02:59

by Ingo Molnar

Subject: Re: [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor


* Frederic Weisbecker <[email protected]> wrote:

> We can now safely read user kcpustat fields on nohz_full CPUs.
> Use the appropriate accessor.
>
> Reported-by: Yauheni Kaliuta <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Jacek Anaszewski <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: Yauheni Kaliuta <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> drivers/leds/trigger/ledtrig-activity.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
> index ddfc5edd07c8..6901e3631c22 100644
> --- a/drivers/leds/trigger/ledtrig-activity.c
> +++ b/drivers/leds/trigger/ledtrig-activity.c
> @@ -57,11 +57,15 @@ static void led_activity_function(struct timer_list *t)
> curr_used = 0;
>
> for_each_possible_cpu(i) {
> - curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
> - + kcpustat_cpu(i).cpustat[CPUTIME_NICE]
> - + kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
> - + kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
> - + kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
> + struct kernel_cpustat kcpustat;
> +
> + kcpustat_fetch_cpu(&kcpustat, i);
> +
> + curr_used += kcpustat.cpustat[CPUTIME_USER]
> + + kcpustat.cpustat[CPUTIME_NICE]
> + + kcpustat.cpustat[CPUTIME_SYSTEM]
> + + kcpustat.cpustat[CPUTIME_SOFTIRQ]
> + + kcpustat.cpustat[CPUTIME_IRQ];

Not the best tested series:

--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -59,7 +59,7 @@ static void led_activity_function(struct timer_list *t)
for_each_possible_cpu(i) {
struct kernel_cpustat kcpustat;

- kcpustat_fetch_cpu(&kcpustat, i);
+ kcpustat_cpu_fetch(&kcpustat, i);

curr_used += kcpustat.cpustat[CPUTIME_USER]
+ kcpustat.cpustat[CPUTIME_NICE]


:-)

Thanks,

Ingo

2019-11-21 14:16:40

by Frederic Weisbecker

Subject: Re: [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor

On Thu, Nov 21, 2019 at 07:58:26AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > We can now safely read user kcpustat fields on nohz_full CPUs.
> > Use the appropriate accessor.
> >
> > Reported-by: Yauheni Kaliuta <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Jacek Anaszewski <[email protected]>
> > Cc: Pavel Machek <[email protected]>
> > Cc: Yauheni Kaliuta <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > ---
> > drivers/leds/trigger/ledtrig-activity.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
> > index ddfc5edd07c8..6901e3631c22 100644
> > --- a/drivers/leds/trigger/ledtrig-activity.c
> > +++ b/drivers/leds/trigger/ledtrig-activity.c
> > @@ -57,11 +57,15 @@ static void led_activity_function(struct timer_list *t)
> > curr_used = 0;
> >
> > for_each_possible_cpu(i) {
> > - curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
> > - + kcpustat_cpu(i).cpustat[CPUTIME_NICE]
> > - + kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
> > - + kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
> > - + kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
> > + struct kernel_cpustat kcpustat;
> > +
> > + kcpustat_fetch_cpu(&kcpustat, i);
> > +
> > + curr_used += kcpustat.cpustat[CPUTIME_USER]
> > + + kcpustat.cpustat[CPUTIME_NICE]
> > + + kcpustat.cpustat[CPUTIME_SYSTEM]
> > + + kcpustat.cpustat[CPUTIME_SOFTIRQ]
> > + + kcpustat.cpustat[CPUTIME_IRQ];
>
> Not the best tested series:
>
> --- a/drivers/leds/trigger/ledtrig-activity.c
> +++ b/drivers/leds/trigger/ledtrig-activity.c
> @@ -59,7 +59,7 @@ static void led_activity_function(struct timer_list *t)
> for_each_possible_cpu(i) {
> struct kernel_cpustat kcpustat;
>
> - kcpustat_fetch_cpu(&kcpustat, i);
> + kcpustat_cpu_fetch(&kcpustat, i);
>
> curr_used += kcpustat.cpustat[CPUTIME_USER]
> + kcpustat.cpustat[CPUTIME_NICE]
>
>
> :-)

Oops, I tested with vtime on and off but that one slipped under my config.
Do you want me to resend?

Thanks.

Subject: [tip: sched/core] rackmeter: Use vtime aware kcpustat accessor

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8392853e964c025b0616bd54c0cdf9cbc3c9a769
Gitweb: https://git.kernel.org/tip/8392853e964c025b0616bd54c0cdf9cbc3c9a769
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:30 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:59:00 +01:00

rackmeter: Use vtime aware kcpustat accessor

Now that we have a vtime-safe kcpustat accessor, use it to fetch
CPUTIME_NICE and fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/macintosh/rack-meter.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 4134e58..60311e8 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -81,13 +81,14 @@ static int rackmeter_ignore_nice;
*/
static inline u64 get_cpu_idle_time(unsigned int cpu)
{
+ struct kernel_cpustat *kcpustat = &kcpustat_cpu(cpu);
u64 retval;

- retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
- kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+ retval = kcpustat->cpustat[CPUTIME_IDLE] +
+ kcpustat->cpustat[CPUTIME_IOWAIT];

if (rackmeter_ignore_nice)
- retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+ retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);

return retval;
}

Subject: [tip: sched/core] procfs: Use all-in-one vtime aware kcpustat accessor

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 26dae145a76c3615588f263885904c6e567ff116
Gitweb: https://git.kernel.org/tip/26dae145a76c3615588f263885904c6e567ff116
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:27 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:33:24 +01:00

procfs: Use all-in-one vtime aware kcpustat accessor

Now that we can also read user and guest time safely under vtime, use
the relevant accessor to fix frozen kcpustat values on nohz_full CPUs.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
fs/proc/stat.c | 56 +++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 5c6bd0a..37bdbec 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -120,20 +120,23 @@ static int show_stat(struct seq_file *p, void *v)
getboottime64(&boottime);

for_each_possible_cpu(i) {
- struct kernel_cpustat *kcs = &kcpustat_cpu(i);
-
- user += kcs->cpustat[CPUTIME_USER];
- nice += kcs->cpustat[CPUTIME_NICE];
- system += kcpustat_field(kcs, CPUTIME_SYSTEM, i);
- idle += get_idle_time(kcs, i);
- iowait += get_iowait_time(kcs, i);
- irq += kcs->cpustat[CPUTIME_IRQ];
- softirq += kcs->cpustat[CPUTIME_SOFTIRQ];
- steal += kcs->cpustat[CPUTIME_STEAL];
- guest += kcs->cpustat[CPUTIME_GUEST];
- guest_nice += kcs->cpustat[CPUTIME_GUEST_NICE];
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
+ struct kernel_cpustat kcpustat;
+ u64 *cpustat = kcpustat.cpustat;
+
+ kcpustat_cpu_fetch(&kcpustat, i);
+
+ user += cpustat[CPUTIME_USER];
+ nice += cpustat[CPUTIME_NICE];
+ system += cpustat[CPUTIME_SYSTEM];
+ idle += get_idle_time(&kcpustat, i);
+ iowait += get_iowait_time(&kcpustat, i);
+ irq += cpustat[CPUTIME_IRQ];
+ softirq += cpustat[CPUTIME_SOFTIRQ];
+ steal += cpustat[CPUTIME_STEAL];
+ guest += cpustat[CPUTIME_GUEST];
+ guest_nice += cpustat[CPUTIME_USER];
+ sum += kstat_cpu_irqs_sum(i);
+ sum += arch_irq_stat_cpu(i);

for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -157,19 +160,22 @@ static int show_stat(struct seq_file *p, void *v)
seq_putc(p, '\n');

for_each_online_cpu(i) {
- struct kernel_cpustat *kcs = &kcpustat_cpu(i);
+ struct kernel_cpustat kcpustat;
+ u64 *cpustat = kcpustat.cpustat;
+
+ kcpustat_cpu_fetch(&kcpustat, i);

/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kcs->cpustat[CPUTIME_USER];
- nice = kcs->cpustat[CPUTIME_NICE];
- system = kcpustat_field(kcs, CPUTIME_SYSTEM, i);
- idle = get_idle_time(kcs, i);
- iowait = get_iowait_time(kcs, i);
- irq = kcs->cpustat[CPUTIME_IRQ];
- softirq = kcs->cpustat[CPUTIME_SOFTIRQ];
- steal = kcs->cpustat[CPUTIME_STEAL];
- guest = kcs->cpustat[CPUTIME_GUEST];
- guest_nice = kcs->cpustat[CPUTIME_GUEST_NICE];
+ user = cpustat[CPUTIME_USER];
+ nice = cpustat[CPUTIME_NICE];
+ system = cpustat[CPUTIME_SYSTEM];
+ idle = get_idle_time(&kcpustat, i);
+ iowait = get_iowait_time(&kcpustat, i);
+ irq = cpustat[CPUTIME_IRQ];
+ softirq = cpustat[CPUTIME_SOFTIRQ];
+ steal = cpustat[CPUTIME_STEAL];
+ guest = cpustat[CPUTIME_GUEST];
+ guest_nice = cpustat[CPUTIME_USER];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));

Subject: [tip: sched/core] sched/vtime: Bring up complete kcpustat accessor

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 74722bb223d0f236303b60c9509ff924a9713780
Gitweb: https://git.kernel.org/tip/74722bb223d0f236303b60c9509ff924a9713780
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:26 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:33:24 +01:00

sched/vtime: Bring up complete kcpustat accessor

Many callsites want to fetch the system, user, user_nice, guest or
guest_nice kcpustat fields all at once, or at least a pair of them.

In that case, calling kcpustat_field() for each requested field brings
unnecessary overhead when we could fetch all of them in one go.

So provide kcpustat_cpu_fetch(), which fetches the whole kcpustat array
in a vtime-safe way under a single RCU and seqcount block.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/kernel_stat.h | 7 ++-
kernel/sched/cputime.c | 136 +++++++++++++++++++++++++++++------
2 files changed, 123 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 7978119..89f0745 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -81,12 +81,19 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
enum cpu_usage_stat usage, int cpu);
+extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
#else
static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
enum cpu_usage_stat usage, int cpu)
{
return kcpustat->cpustat[usage];
}
+
+static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
+{
+ *dst = kcpustat_cpu(cpu);
+}
+
#endif

extern void account_user_time(struct task_struct *, u64);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 27b5406..d43318a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,6 +912,30 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
} while (read_seqcount_retry(&vtime->seqcount, seq));
}

+static int vtime_state_check(struct vtime *vtime, int cpu)
+{
+ /*
+ * We raced against a context switch, fetch the
+ * kcpustat task again.
+ */
+ if (vtime->cpu != cpu && vtime->cpu != -1)
+ return -EAGAIN;
+
+ /*
+ * Two possible things here:
+ * 1) We are seeing the scheduling out task (prev) or any past one.
+ * 2) We are seeing the scheduling in task (next) but it hasn't
+ * passed though vtime_task_switch() yet so the pending
+ * cputime of the prev task may not be flushed yet.
+ *
+ * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
+ */
+ if (vtime->state == VTIME_INACTIVE)
+ return -EAGAIN;
+
+ return 0;
+}
+
static u64 kcpustat_user_vtime(struct vtime *vtime)
{
if (vtime->state == VTIME_USER)
@@ -933,26 +957,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
do {
seq = read_seqcount_begin(&vtime->seqcount);

- /*
- * We raced against context switch, fetch the
- * kcpustat task again.
- */
- if (vtime->cpu != cpu && vtime->cpu != -1)
- return -EAGAIN;
-
- /*
- * Two possible things here:
- * 1) We are seeing the scheduling out task (prev) or any past one.
- * 2) We are seeing the scheduling in task (next) but it hasn't
- * passed though vtime_task_switch() yet so the pending
- * cputime of the prev task may not be flushed yet.
- *
- * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
- */
- if (vtime->state == VTIME_INACTIVE)
- return -EAGAIN;
-
- err = 0;
+ err = vtime_state_check(vtime, cpu);
+ if (err < 0)
+ return err;

*val = cpustat[usage];

@@ -1025,4 +1032,93 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
}
}
EXPORT_SYMBOL_GPL(kcpustat_field);
+
+static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
+ const struct kernel_cpustat *src,
+ struct task_struct *tsk, int cpu)
+{
+ struct vtime *vtime = &tsk->vtime;
+ unsigned int seq;
+ int err;
+
+ do {
+ u64 *cpustat;
+ u64 delta;
+
+ seq = read_seqcount_begin(&vtime->seqcount);
+
+ err = vtime_state_check(vtime, cpu);
+ if (err < 0)
+ return err;
+
+ *dst = *src;
+ cpustat = dst->cpustat;
+
+ /* Task is sleeping, dead or idle, nothing to add */
+ if (vtime->state < VTIME_SYS)
+ continue;
+
+ delta = vtime_delta(vtime);
+
+ /*
+ * Task runs either in user (including guest) or kernel space,
+ * add pending nohz time to the right place.
+ */
+ if (vtime->state == VTIME_SYS) {
+ cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
+ } else if (vtime->state == VTIME_USER) {
+ if (task_nice(tsk) > 0)
+ cpustat[CPUTIME_NICE] += vtime->utime + delta;
+ else
+ cpustat[CPUTIME_USER] += vtime->utime + delta;
+ } else {
+ WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+ if (task_nice(tsk) > 0) {
+ cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
+ cpustat[CPUTIME_NICE] += vtime->gtime + delta;
+ } else {
+ cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
+ cpustat[CPUTIME_USER] += vtime->gtime + delta;
+ }
+ }
+ } while (read_seqcount_retry(&vtime->seqcount, seq));
+
+ return err;
+}
+
+void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
+{
+ const struct kernel_cpustat *src = &kcpustat_cpu(cpu);
+ struct rq *rq;
+ int err;
+
+ if (!vtime_accounting_enabled_cpu(cpu)) {
+ *dst = *src;
+ return;
+ }
+
+ rq = cpu_rq(cpu);
+
+ for (;;) {
+ struct task_struct *curr;
+
+ rcu_read_lock();
+ curr = rcu_dereference(rq->curr);
+ if (WARN_ON_ONCE(!curr)) {
+ rcu_read_unlock();
+ *dst = *src;
+ return;
+ }
+
+ err = kcpustat_cpu_fetch_vtime(dst, src, curr, cpu);
+ rcu_read_unlock();
+
+ if (!err)
+ return;
+
+ cpu_relax();
+ }
+}
+EXPORT_SYMBOL_GPL(kcpustat_cpu_fetch);
+
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */

Subject: [tip: sched/core] leds: Use all-in-one vtime aware kcpustat accessor

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8688f2fb671b2ed59b1b16083136b6edc3750435
Gitweb: https://git.kernel.org/tip/8688f2fb671b2ed59b1b16083136b6edc3750435
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:29 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:58:48 +01:00

leds: Use all-in-one vtime aware kcpustat accessor

We can now safely read user kcpustat fields on nohz_full CPUs.
Use the appropriate accessor.

[ mingo: Fixed build failure. ]

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jacek Anaszewski <[email protected]> (maintainer:LED SUBSYSTEM)
Cc: Pavel Machek <[email protected]> (maintainer:LED SUBSYSTEM)
Cc: Dan Murphy <[email protected]> (reviewer:LED SUBSYSTEM)
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/leds/trigger/ledtrig-activity.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
index ddfc5ed..14ba7fa 100644
--- a/drivers/leds/trigger/ledtrig-activity.c
+++ b/drivers/leds/trigger/ledtrig-activity.c
@@ -57,11 +57,15 @@ static void led_activity_function(struct timer_list *t)
curr_used = 0;

for_each_possible_cpu(i) {
- curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
- + kcpustat_cpu(i).cpustat[CPUTIME_NICE]
- + kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
- + kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
- + kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
+ struct kernel_cpustat kcpustat;
+
+ kcpustat_cpu_fetch(&kcpustat, i);
+
+ curr_used += kcpustat.cpustat[CPUTIME_USER]
+ + kcpustat.cpustat[CPUTIME_NICE]
+ + kcpustat.cpustat[CPUTIME_SYSTEM]
+ + kcpustat.cpustat[CPUTIME_SOFTIRQ]
+ + kcpustat.cpustat[CPUTIME_IRQ];
cpus++;
}

Subject: [tip: sched/core] sched/cputime: Support other fields on kcpustat_field()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5a1c95580f1d89c8a736bb02ecd82a8858388b8a
Gitweb: https://git.kernel.org/tip/5a1c95580f1d89c8a736bb02ecd82a8858388b8a
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:25 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:33:23 +01:00

sched/cputime: Support other fields on kcpustat_field()

Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or non-nice field is decided
from the nice value snapshot taken at the time kcpustat_field() is
called. If the nice value of the task has changed since the last vtime
update, the distribution between nice and non-nice cputime may be
inaccurate.

However this is considered a minor issue compared to the proper fix,
which would involve interrupting the target on nice updates, something
undesired on nohz_full CPUs.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yauheni Kaliuta <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cputime.c | 54 ++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e0cd206..27b5406 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -912,11 +912,21 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
} while (read_seqcount_retry(&vtime->seqcount, seq));
}

+static u64 kcpustat_user_vtime(struct vtime *vtime)
+{
+ if (vtime->state == VTIME_USER)
+ return vtime->utime + vtime_delta(vtime);
+ else if (vtime->state == VTIME_GUEST)
+ return vtime->gtime + vtime_delta(vtime);
+ return 0;
+}
+
static int kcpustat_field_vtime(u64 *cpustat,
- struct vtime *vtime,
+ struct task_struct *tsk,
enum cpu_usage_stat usage,
int cpu, u64 *val)
{
+ struct vtime *vtime = &tsk->vtime;
unsigned int seq;
int err;

@@ -946,9 +956,37 @@ static int kcpustat_field_vtime(u64 *cpustat,

*val = cpustat[usage];

- if (vtime->state == VTIME_SYS)
- *val += vtime->stime + vtime_delta(vtime);
-
+ /*
+ * Nice VS unnice cputime accounting may be inaccurate if
+ * the nice value has changed since the last vtime update.
+ * But proper fix would involve interrupting target on nice
+ * updates which is a no go on nohz_full (although the scheduler
+ * may still interrupt the target if rescheduling is needed...)
+ */
+ switch (usage) {
+ case CPUTIME_SYSTEM:
+ if (vtime->state == VTIME_SYS)
+ *val += vtime->stime + vtime_delta(vtime);
+ break;
+ case CPUTIME_USER:
+ if (task_nice(tsk) <= 0)
+ *val += kcpustat_user_vtime(vtime);
+ break;
+ case CPUTIME_NICE:
+ if (task_nice(tsk) > 0)
+ *val += kcpustat_user_vtime(vtime);
+ break;
+ case CPUTIME_GUEST:
+ if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0)
+ *val += vtime->gtime + vtime_delta(vtime);
+ break;
+ case CPUTIME_GUEST_NICE:
+ if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0)
+ *val += vtime->gtime + vtime_delta(vtime);
+ break;
+ default:
+ break;
+ }
} while (read_seqcount_retry(&vtime->seqcount, seq));

return 0;
@@ -965,15 +1003,10 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
if (!vtime_accounting_enabled_cpu(cpu))
return cpustat[usage];

- /* Only support sys vtime for now */
- if (usage != CPUTIME_SYSTEM)
- return cpustat[usage];
-
rq = cpu_rq(cpu);

for (;;) {
struct task_struct *curr;
- struct vtime *vtime;

rcu_read_lock();
curr = rcu_dereference(rq->curr);
@@ -982,8 +1015,7 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat,
return cpustat[usage];
}

- vtime = &curr->vtime;
- err = kcpustat_field_vtime(cpustat, vtime, usage, cpu, &val);
+ err = kcpustat_field_vtime(cpustat, curr, usage, cpu, &val);
rcu_read_unlock();

if (!err)

Subject: [tip: sched/core] cpufreq: Use vtime aware kcpustat accessors for user time

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5720821ba1d845f0b91a9278137e9005c5f6931d
Gitweb: https://git.kernel.org/tip/5720821ba1d845f0b91a9278137e9005c5f6931d
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Thu, 21 Nov 2019 03:44:28 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 21 Nov 2019 07:33:25 +01:00

cpufreq: Use vtime aware kcpustat accessors for user time

We can now safely read user and guest kcpustat fields on nohz_full CPUs.
Use the appropriate accessors.

Reported-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wanpeng Li <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/cpufreq/cpufreq.c | 17 ++++++++++-------
drivers/cpufreq/cpufreq_governor.c | 6 +++---
2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 527fd06..ee23eaf 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -113,18 +113,21 @@ EXPORT_SYMBOL_GPL(get_governor_parent_kobj);

static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
{
- u64 idle_time;
+ struct kernel_cpustat kcpustat;
u64 cur_wall_time;
+ u64 idle_time;
u64 busy_time;

cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());

- busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
- busy_time += kcpustat_field(&kcpustat_cpu(cpu), CPUTIME_SYSTEM, cpu);
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
- busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+ kcpustat_cpu_fetch(&kcpustat, cpu);
+
+ busy_time = kcpustat.cpustat[CPUTIME_USER];
+ busy_time += kcpustat.cpustat[CPUTIME_SYSTEM];
+ busy_time += kcpustat.cpustat[CPUTIME_IRQ];
+ busy_time += kcpustat.cpustat[CPUTIME_SOFTIRQ];
+ busy_time += kcpustat.cpustat[CPUTIME_STEAL];
+ busy_time += kcpustat.cpustat[CPUTIME_NICE];

idle_time = cur_wall_time - busy_time;
if (wall)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 4bb054d..f99ae45 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
dbs_data->io_is_busy);
if (dbs_data->ignore_nice_load)
- j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}
}
}
@@ -149,7 +149,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
j_cdbs->prev_cpu_idle = cur_idle_time;

if (ignore_nice) {
- u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);

idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
j_cdbs->prev_cpu_nice = cur_nice;
@@ -530,7 +530,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
j_cdbs->prev_load = 0;

if (ignore_nice)
- j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+ j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}

gov->start(policy);

2019-11-21 16:54:03

by Ingo Molnar

Subject: Re: [PATCH 5/6] leds: Use all-in-one vtime aware kcpustat accessor


* Frederic Weisbecker <[email protected]> wrote:

> On Thu, Nov 21, 2019 at 07:58:26AM +0100, Ingo Molnar wrote:
> >
> > * Frederic Weisbecker <[email protected]> wrote:
> >
> > > We can now safely read user kcpustat fields on nohz_full CPUs.
> > > Use the appropriate accessor.
> > >
> > > Reported-by: Yauheni Kaliuta <[email protected]>
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > Cc: Jacek Anaszewski <[email protected]>
> > > Cc: Pavel Machek <[email protected]>
> > > Cc: Yauheni Kaliuta <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > Cc: Rik van Riel <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Wanpeng Li <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > ---
> > > drivers/leds/trigger/ledtrig-activity.c | 14 +++++++++-----
> > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c
> > > index ddfc5edd07c8..6901e3631c22 100644
> > > --- a/drivers/leds/trigger/ledtrig-activity.c
> > > +++ b/drivers/leds/trigger/ledtrig-activity.c
> > > @@ -57,11 +57,15 @@ static void led_activity_function(struct timer_list *t)
> > > curr_used = 0;
> > >
> > > for_each_possible_cpu(i) {
> > > - curr_used += kcpustat_cpu(i).cpustat[CPUTIME_USER]
> > > - + kcpustat_cpu(i).cpustat[CPUTIME_NICE]
> > > - + kcpustat_field(&kcpustat_cpu(i), CPUTIME_SYSTEM, i)
> > > - + kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]
> > > - + kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
> > > + struct kernel_cpustat kcpustat;
> > > +
> > > + kcpustat_fetch_cpu(&kcpustat, i);
> > > +
> > > + curr_used += kcpustat.cpustat[CPUTIME_USER]
> > > + + kcpustat.cpustat[CPUTIME_NICE]
> > > + + kcpustat.cpustat[CPUTIME_SYSTEM]
> > > + + kcpustat.cpustat[CPUTIME_SOFTIRQ]
> > > + + kcpustat.cpustat[CPUTIME_IRQ];
> >
> > Not the best tested series:
> >
> > --- a/drivers/leds/trigger/ledtrig-activity.c
> > +++ b/drivers/leds/trigger/ledtrig-activity.c
> > @@ -59,7 +59,7 @@ static void led_activity_function(struct timer_list *t)
> > for_each_possible_cpu(i) {
> > struct kernel_cpustat kcpustat;
> >
> > - kcpustat_fetch_cpu(&kcpustat, i);
> > + kcpustat_cpu_fetch(&kcpustat, i);
> >
> > curr_used += kcpustat.cpustat[CPUTIME_USER]
> > + kcpustat.cpustat[CPUTIME_NICE]
> >
> >
> > :-)
>
> Oops, I tested with vtime on and off but that one slipped under my config.
> Do you want me to resend?

No need, I suspect this slipped in via my last minute request for that
interface cleanup - so I just applied the fix and tested it - it all
passed testing today and I just pushed it out into tip:sched/core.

So all's good so far.

Thanks Frederic!

Ingo

2019-12-08 16:09:21

by Paul Orlyk

Subject: Re: [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor

Looks like a copy-paste error. I think it should be:

- guest_nice += cpustat[CPUTIME_USER];
+ guest_nice += cpustat[CPUTIME_GUEST_NICE];

and

- guest_nice = cpustat[CPUTIME_USER];
+ guest_nice = cpustat[CPUTIME_GUEST_NICE];

With best regards,
Paul Orlyk

On 11/21/19 4:44 AM, Frederic Weisbecker wrote:
> + guest += cpustat[CPUTIME_GUEST];
> + guest_nice += cpustat[CPUTIME_USER];
> + sum += kstat_cpu_irqs_sum(i);
>
> + guest = cpustat[CPUTIME_GUEST];
> + guest_nice = cpustat[CPUTIME_USER];
> seq_printf(p, "cpu%d", i);

2019-12-09 15:53:08

by Frederic Weisbecker

Subject: Re: [PATCH 3/6] procfs: Use all-in-one vtime aware kcpustat accessor

On Sun, Dec 08, 2019 at 05:57:47PM +0200, Paul Orlyk wrote:
> Looks like a copy-paste error. I think it should be:
>
> - guest_nice += cpustat[CPUTIME_USER];
> + guest_nice += cpustat[CPUTIME_GUEST_NICE];
>
> and
>
> - guest_nice = cpustat[CPUTIME_USER];
> + guest_nice = cpustat[CPUTIME_GUEST_NICE];
>
> With best regards,
> Paul Orlyk

Yes, the fix should be applied soonish:

https://lore.kernel.org/lkml/[email protected]/

Thanks.

>
> On 11/21/19 4:44 AM, Frederic Weisbecker wrote:
> > + guest += cpustat[CPUTIME_GUEST];
> > + guest_nice += cpustat[CPUTIME_USER];
> > + sum += kstat_cpu_irqs_sum(i);
> >
> > + guest = cpustat[CPUTIME_GUEST];
> > + guest_nice = cpustat[CPUTIME_USER];
> > seq_printf(p, "cpu%d", i);

2019-12-28 20:58:08

by Chris Wilson

Subject: Re: [PATCH 2/6] sched/vtime: Bring up complete kcpustat accessor

Quoting Frederic Weisbecker (2019-11-21 02:44:26)
> +static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> + const struct kernel_cpustat *src,
> + struct task_struct *tsk, int cpu)
> +{
> + struct vtime *vtime = &tsk->vtime;
> + unsigned int seq;
> + int err;
> +
> + do {
> + u64 *cpustat;
> + u64 delta;
> +
> + seq = read_seqcount_begin(&vtime->seqcount);
> +
> + err = vtime_state_check(vtime, cpu);
> + if (err < 0)
> + return err;
> +
> + *dst = *src;
> + cpustat = dst->cpustat;
> +
> + /* Task is sleeping, dead or idle, nothing to add */
> + if (vtime->state < VTIME_SYS)
> + continue;
> +
> + delta = vtime_delta(vtime);
> +
> + /*
> + * Task runs either in user (including guest) or kernel space,
> + * add pending nohz time to the right place.
> + */
> + if (vtime->state == VTIME_SYS) {
> + cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
> + } else if (vtime->state == VTIME_USER) {
> + if (task_nice(tsk) > 0)
> + cpustat[CPUTIME_NICE] += vtime->utime + delta;
> + else
> + cpustat[CPUTIME_USER] += vtime->utime + delta;
> + } else {
> + WARN_ON_ONCE(vtime->state != VTIME_GUEST);

I'm randomly hitting this WARN on a non-virtualised system reading
/proc/stat.

vtime->state is updated under the write_seqcount, so the access here is
deliberately racy, and a change in vtime->state would be picked up by
the seqcount_retry.

Quick suggestion would be something along the lines of

static int vtime_state_check(struct vtime *vtime, int cpu)
{
+ int state = READ_ONCE(vtime->state);
+
/*
* We raced against a context switch, fetch the
* kcpustat task again.
@@ -930,10 +932,10 @@ static int vtime_state_check(struct vtime *vtime, int cpu)
*
* Case 1) is ok but 2) is not. So wait for a safe VTIME state.
*/
- if (vtime->state == VTIME_INACTIVE)
+ if (state == VTIME_INACTIVE)
return -EAGAIN;

- return 0;
+ return state;
}

static u64 kcpustat_user_vtime(struct vtime *vtime)
@@ -1055,7 +1057,7 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
cpustat = dst->cpustat;

/* Task is sleeping, dead or idle, nothing to add */
- if (vtime->state < VTIME_SYS)
+ if (err < VTIME_SYS)
continue;

delta = vtime_delta(vtime);
@@ -1064,15 +1066,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
* Task runs either in user (including guest) or kernel space,
* add pending nohz time to the right place.
*/
- if (vtime->state == VTIME_SYS) {
+ if (err == VTIME_SYS) {
cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
- } else if (vtime->state == VTIME_USER) {
+ } else if (err == VTIME_USER) {
if (task_nice(tsk) > 0)
cpustat[CPUTIME_NICE] += vtime->utime + delta;
else
cpustat[CPUTIME_USER] += vtime->utime + delta;
} else {
- WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+ WARN_ON_ONCE(err != VTIME_GUEST);
if (task_nice(tsk) > 0) {
cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
cpustat[CPUTIME_NICE] += vtime->gtime + delta;

Or drop the warn.
-Chris

2019-12-30 01:09:32

by Frederic Weisbecker

Subject: Re: [PATCH 2/6] sched/vtime: Bring up complete kcpustat accessor

On Sat, Dec 28, 2019 at 08:56:19PM +0000, Chris Wilson wrote:
> I'm randomly hitting this WARN on a non-virtualised system reading
> /proc/stat.
>
> vtime->state is updated under the write_seqcount, so the access here is
> deliberately racy, and a change in vtime->state would be picked up by
> the seqcount_retry.
>
> Quick suggestion would be something along the lines of
>
> static int vtime_state_check(struct vtime *vtime, int cpu)
> {
> + int state = READ_ONCE(vtime->state);
> +
> /*
> * We raced against a context switch, fetch the
> * kcpustat task again.
> @@ -930,10 +932,10 @@ static int vtime_state_check(struct vtime *vtime, int cpu)
> *
> * Case 1) is ok but 2) is not. So wait for a safe VTIME state.
> */
> - if (vtime->state == VTIME_INACTIVE)
> + if (state == VTIME_INACTIVE)
> return -EAGAIN;
>
> - return 0;
> + return state;
> }
>
> static u64 kcpustat_user_vtime(struct vtime *vtime)
> @@ -1055,7 +1057,7 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> cpustat = dst->cpustat;
>
> /* Task is sleeping, dead or idle, nothing to add */
> - if (vtime->state < VTIME_SYS)
> + if (err < VTIME_SYS)
> continue;
>
> delta = vtime_delta(vtime);
> @@ -1064,15 +1066,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> * Task runs either in user (including guest) or kernel space,
> * add pending nohz time to the right place.
> */
> - if (vtime->state == VTIME_SYS) {
> + if (err == VTIME_SYS) {
> cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
> - } else if (vtime->state == VTIME_USER) {
> + } else if (err == VTIME_USER) {
> if (task_nice(tsk) > 0)
> cpustat[CPUTIME_NICE] += vtime->utime + delta;
> else
> cpustat[CPUTIME_USER] += vtime->utime + delta;
> } else {
> - WARN_ON_ONCE(vtime->state != VTIME_GUEST);
> + WARN_ON_ONCE(err != VTIME_GUEST);
> if (task_nice(tsk) > 0) {
> cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
> cpustat[CPUTIME_NICE] += vtime->gtime + delta;
>
> Or drop the warn.

Good catch, can I use your Signed-off-by?

Thanks.

2019-12-30 09:06:19

by Chris Wilson

Subject: [PATCH] sched/vtime: Prevent unstable evaluation of WARN(vtime->state)

As the vtime is sampled under loose seqcount protection by kcpustat, the
vtime fields may change as the code flows. Where logic dictates a field
has a static value, use a READ_ONCE.

Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/cputime.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index d43318a489f2..96bbd52f43ae 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -914,6 +914,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)

static int vtime_state_check(struct vtime *vtime, int cpu)
{
+ int state = READ_ONCE(vtime->state);
+
/*
* We raced against a context switch, fetch the
* kcpustat task again.
@@ -930,10 +932,10 @@ static int vtime_state_check(struct vtime *vtime, int cpu)
*
* Case 1) is ok but 2) is not. So wait for a safe VTIME state.
*/
- if (vtime->state == VTIME_INACTIVE)
+ if (state == VTIME_INACTIVE)
return -EAGAIN;

- return 0;
+ return state;
}

static u64 kcpustat_user_vtime(struct vtime *vtime)
@@ -952,14 +954,15 @@ static int kcpustat_field_vtime(u64 *cpustat,
{
struct vtime *vtime = &tsk->vtime;
unsigned int seq;
- int err;

do {
+ int state;
+
seq = read_seqcount_begin(&vtime->seqcount);

- err = vtime_state_check(vtime, cpu);
- if (err < 0)
- return err;
+ state = vtime_state_check(vtime, cpu);
+ if (state < 0)
+ return state;

*val = cpustat[usage];

@@ -972,7 +975,7 @@ static int kcpustat_field_vtime(u64 *cpustat,
*/
switch (usage) {
case CPUTIME_SYSTEM:
- if (vtime->state == VTIME_SYS)
+ if (state == VTIME_SYS)
*val += vtime->stime + vtime_delta(vtime);
break;
case CPUTIME_USER:
@@ -984,11 +987,11 @@ static int kcpustat_field_vtime(u64 *cpustat,
*val += kcpustat_user_vtime(vtime);
break;
case CPUTIME_GUEST:
- if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0)
+ if (state == VTIME_GUEST && task_nice(tsk) <= 0)
*val += vtime->gtime + vtime_delta(vtime);
break;
case CPUTIME_GUEST_NICE:
- if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0)
+ if (state == VTIME_GUEST && task_nice(tsk) > 0)
*val += vtime->gtime + vtime_delta(vtime);
break;
default:
@@ -1039,23 +1042,23 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
{
struct vtime *vtime = &tsk->vtime;
unsigned int seq;
- int err;

do {
u64 *cpustat;
u64 delta;
+ int state;

seq = read_seqcount_begin(&vtime->seqcount);

- err = vtime_state_check(vtime, cpu);
- if (err < 0)
- return err;
+ state = vtime_state_check(vtime, cpu);
+ if (state < 0)
+ return state;

*dst = *src;
cpustat = dst->cpustat;

/* Task is sleeping, dead or idle, nothing to add */
- if (vtime->state < VTIME_SYS)
+ if (state < VTIME_SYS)
continue;

delta = vtime_delta(vtime);
@@ -1064,15 +1067,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
* Task runs either in user (including guest) or kernel space,
* add pending nohz time to the right place.
*/
- if (vtime->state == VTIME_SYS) {
+ if (state == VTIME_SYS) {
cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
- } else if (vtime->state == VTIME_USER) {
+ } else if (state == VTIME_USER) {
if (task_nice(tsk) > 0)
cpustat[CPUTIME_NICE] += vtime->utime + delta;
else
cpustat[CPUTIME_USER] += vtime->utime + delta;
} else {
- WARN_ON_ONCE(vtime->state != VTIME_GUEST);
+ WARN_ON_ONCE(state != VTIME_GUEST);
if (task_nice(tsk) > 0) {
cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
cpustat[CPUTIME_NICE] += vtime->gtime + delta;
@@ -1083,7 +1086,7 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
}
} while (read_seqcount_retry(&vtime->seqcount, seq));

- return err;
+ return 0;
}

void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
--
2.25.0.rc0

2019-12-30 17:41:34

by Frederic Weisbecker

Subject: Re: [PATCH] sched/vtime: Prevent unstable evaluation of WARN(vtime->state)

On Mon, Dec 30, 2019 at 09:04:36AM +0000, Chris Wilson wrote:
> As the vtime is sampled under loose seqcount protection by kcpustat, the
> vtime fields may change as the code flows. Where logic dictates a field
> has a static value, use a READ_ONCE.
>
> Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor")
> Signed-off-by: Chris Wilson <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>

Oh thanks for taking the time to make it a proper patch!
Looks good, I'm relaying it.