2022-09-28 03:46:50

by zhengzucheng

Subject: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat

From: Zheng Zucheng <[email protected]>

The cputime of cpuN read from /proc/stat can go backwards between two
reads. For example, the user field of cpu1 drops from 319 to 318:
first:
cat /proc/stat | grep cpu1
cpu1 319 0 496 41665 0 0 0 0 0 0
again:
cat /proc/stat | grep cpu1
cpu1 318 0 497 41674 0 0 0 0 0 0

The values read from /proc/stat should be monotonically increasing;
otherwise userspace tools that compute usage from the deltas between
reads will report incorrect CPU usage.
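
To make that concrete, here is a minimal user-space sketch (illustrative
only, not part of this patch) of how a monitoring tool typically turns
two consecutive /proc/stat samples into a usage percentage; the numbers
are the cpu1 samples above:

#include <stdio.h>

int main(void)
{
	/* first sample:  cpu1 319 0 496 41665 ... */
	/* second sample: cpu1 318 0 497 41674 ... */
	long long user[2] = { 319, 318 };
	long long sys[2]  = { 496, 497 };
	long long idle[2] = { 41665, 41674 };

	long long duser  = user[1] - user[0];		/* -1: impossible */
	long long dtotal = (user[1] + sys[1] + idle[1])
			 - (user[0] + sys[0] + idle[0]);	/* 9 ticks */

	/* prints "user: -11%" -- garbage caused by the backwards step */
	printf("user: %lld%%\n", duser * 100 / dtotal);
	return 0;
}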

The root cause is that kcpustat_cpu_fetch_vtime() adds the current
task's pending vtime (vtime->utime + delta) to the on-stack cpustat
snapshot on the fly. If the CPU switches tasks between two reads, the
pending vtime added on the second read belongs to a different task and
may be smaller than what was added on the first read.

CPU0                                               CPU1
First:
show_stat()
 ->kcpustat_cpu_fetch()
  ->kcpustat_cpu_fetch_vtime()
   ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
                             + vtime->utime + delta    (rq->curr is task A)

                                                   A switches to B, and A's
                                                   vtime->utime is less than
                                                   1 tick
Then:
show_stat()
 ->kcpustat_cpu_fetch()
  ->kcpustat_cpu_fetch_vtime()
   ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
                             + vtime->utime + delta    (rq->curr is task B)
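
The same failure mode as a compilable user-space model (illustrative
only; kcpustat_user and curr_pending are made-up stand-ins for the
per-CPU counter and the current task's un-banked vtime + delta):

#include <stdio.h>

/* Per-CPU user time already banked in whole-tick chunks. */
static unsigned long long kcpustat_user = 3189000000ULL;

/* Pending (un-banked) utime + delta of whichever task is current. */
static unsigned long long curr_pending;

static unsigned long long snapshot_user(void)
{
	/* what kcpustat_cpu_fetch_vtime() effectively reports */
	return kcpustat_user + curr_pending;
}

int main(void)
{
	curr_pending = 1000001;	/* task A: just over 0.1 tick pending */
	printf("first: %llu\n", snapshot_user());	/* 3190000001 -> 319 ticks */

	curr_pending = 990000;	/* switch to task B: A's pending time
				   is no longer visible to the reader */
	printf("again: %llu\n", snapshot_user());	/* 3189990000 -> 318 ticks */
	return 0;
}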

Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor")
Signed-off-by: Li Hua <[email protected]>
Signed-off-by: Zheng Zucheng <[email protected]>
---
kernel/sched/cputime.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 95fc77853743..c7a812ff1fb7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -1060,9 +1060,17 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
return 0;
}

+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat_reverse);
+DEFINE_PER_CPU(raw_spinlock_t, kernel_cpustat_reverse_lock);
+
void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
{
const struct kernel_cpustat *src = &kcpustat_cpu(cpu);
+ struct kernel_cpustat *reverse = &per_cpu(kernel_cpustat_reverse, cpu);
+ raw_spinlock_t *cpustat_lock = &per_cpu(kernel_cpustat_reverse_lock, cpu);
+ u64 *dstat = dst->cpustat;
+ u64 *restat = reverse->cpustat;
+ unsigned long flags;
struct rq *rq;
int err;

@@ -1087,8 +1095,22 @@ void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
err = kcpustat_cpu_fetch_vtime(dst, src, curr, cpu);
rcu_read_unlock();

- if (!err)
+ if (!err) {
+ raw_spin_lock_irqsave(cpustat_lock, flags);
+ if (dstat[CPUTIME_USER] < restat[CPUTIME_USER])
+ dstat[CPUTIME_USER] = restat[CPUTIME_USER];
+ if (dstat[CPUTIME_SYSTEM] < restat[CPUTIME_SYSTEM])
+ dstat[CPUTIME_SYSTEM] = restat[CPUTIME_SYSTEM];
+ if (dstat[CPUTIME_NICE] < restat[CPUTIME_NICE])
+ dstat[CPUTIME_NICE] = restat[CPUTIME_NICE];
+ if (dstat[CPUTIME_GUEST] < restat[CPUTIME_GUEST])
+ dstat[CPUTIME_GUEST] = restat[CPUTIME_GUEST];
+ if (dstat[CPUTIME_GUEST_NICE] < restat[CPUTIME_GUEST_NICE])
+ dstat[CPUTIME_GUEST_NICE] = restat[CPUTIME_GUEST_NICE];
+ *reverse = *dst;
+ raw_spin_unlock_irqrestore(cpustat_lock, flags);
return;
+ }

cpu_relax();
}
--
2.18.0.huawei.25


2022-09-28 09:05:15

by Peter Zijlstra

Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat

On Wed, Sep 28, 2022 at 11:34:02AM +0800, Zucheng Zheng wrote:
> From: Zheng Zucheng <[email protected]>
>
> The cputime of cpuN read from /proc/stat can go backwards between two
> reads. For example, the user field of cpu1 drops from 319 to 318:
>
> first:
> cat /proc/stat | grep cpu1
> cpu1 319 0 496 41665 0 0 0 0 0 0
> again:
> cat /proc/stat | grep cpu1
> cpu1 318 0 497 41674 0 0 0 0 0 0
>
> The values read from /proc/stat should be monotonically increasing;
> otherwise userspace tools that compute usage from the deltas between
> reads will report incorrect CPU usage.
>
> The root cause is that kcpustat_cpu_fetch_vtime() adds the current
> task's pending vtime (vtime->utime + delta) to the on-stack cpustat
> snapshot on the fly. If the CPU switches tasks between two reads, the
> pending vtime added on the second read belongs to a different task and
> may be smaller than what was added on the first read.
>
> CPU0                                               CPU1
> First:
> show_stat()
>  ->kcpustat_cpu_fetch()
>   ->kcpustat_cpu_fetch_vtime()
>    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
>                              + vtime->utime + delta    (rq->curr is task A)
>
>                                                    A switches to B, and A's
>                                                    vtime->utime is less than
>                                                    1 tick
> Then:
> show_stat()
>  ->kcpustat_cpu_fetch()
>   ->kcpustat_cpu_fetch_vtime()
>    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
>                              + vtime->utime + delta    (rq->curr is task B)

You're still not explaining where the time gets lost. And the patch is a
horrible band-aid.

What I think you're saying, after staring at this for a while, is that:

vtime_task_switch_generic()
__vtime_account_kernel(prev, vtime)
vtime_account_{guest,system}(tsk, vtime)
vtime->*time += get_vtime_delta()
if (vtime->*time >= TICK_NSEC)
account_*_time()
account_system_index_time()
task_group_account_field()
__this_cpu_add(kernel_cpustat.cpustat[index], tmp); <---- here

is not folding time into kernel_cpustat when the task vtime isn't at
least a tick's worth. And then when we switch to another task, we leak
time.
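
Condensed into a compilable user-space sketch (only the TICK_NSEC
threshold logic is the point here; everything else is simplified):

#include <stdio.h>

#define TICK_NSEC 10000000ULL	/* CONFIG_HZ=100 */

struct vtime { unsigned long long utime; };

static unsigned long long kcpustat_user;	/* per-CPU counter (model) */

/* Model of vtime_account_user(): fold into kernel_cpustat only once
 * at least a tick's worth of vtime has accumulated. */
static void vtime_account_user(struct vtime *vtime, unsigned long long delta)
{
	vtime->utime += delta;
	if (vtime->utime >= TICK_NSEC) {
		kcpustat_user += vtime->utime;	/* task_group_account_field() */
		vtime->utime = 0;
	}
}

int main(void)
{
	struct vtime a = { 0 };

	/* A runs for 0.9 tick, then switches out: nothing is banked and
	 * the 9000000 ns stay stranded in A's vtime. */
	vtime_account_user(&a, 9000000);
	printf("banked=%llu stranded=%llu\n", kcpustat_user, a.utime);
	return 0;
}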

There's another problem here, vtime_task_switch_generic() should use a
single call to sched_clock() to compute the old vtime_delta and set the
new vtime->starttime, otherwise there's a time hole there as well.
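
A sketch of the suggested single-read pattern (user-space model; the
names and shape here are assumptions, not the actual kernel code):

#include <stdio.h>
#include <time.h>

struct vtime { unsigned long long starttime; unsigned long long acc; };

static unsigned long long sched_clock(void)	/* stand-in for the kernel's */
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void vtime_task_switch(struct vtime *prev, struct vtime *next)
{
	unsigned long long now = sched_clock();	/* single read... */

	prev->acc += now - prev->starttime;	/* ...closes prev here */
	next->starttime = now;			/* ...and opens next here */
	/* With two separate sched_clock() calls, the nanoseconds elapsed
	 * between them would belong to neither task. */
}

int main(void)
{
	struct vtime a = { sched_clock(), 0 }, b = { 0, 0 };

	vtime_task_switch(&a, &b);
	printf("a.acc=%llu b.starttime=%llu\n", a.acc, b.starttime);
	return 0;
}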

This is all quite the maze and it really wants cleaning up, not be made
worse.

So I think you want to do two things:

- pull kernel_cpustat updates out of task_group_account_field()
and put them into vtime_task_switch_generic() to be purely
vtime->starttime based.

- make vtime_task_switch_generic() use a single sched_clock() call.

I did not audit all the flavours of cputime; there might be fallout, be
sure to cross compile a lot.

Frederic, you agree?

2022-09-28 13:00:22

by Frederic Weisbecker

Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat

On Wed, Sep 28, 2022 at 10:12:33AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2022 at 11:34:02AM +0800, Zucheng Zheng wrote:
> > From: Zheng Zucheng <[email protected]>
> >
> > The cputime of cpuN read from /proc/stat can go backwards between two
> > reads. For example, the user field of cpu1 drops from 319 to 318:
> >
> > first:
> > cat /proc/stat | grep cpu1
> > cpu1 319 0 496 41665 0 0 0 0 0 0
> > again:
> > cat /proc/stat | grep cpu1
> > cpu1 318 0 497 41674 0 0 0 0 0 0
> >
> > The values read from /proc/stat should be monotonically increasing;
> > otherwise userspace tools that compute usage from the deltas between
> > reads will report incorrect CPU usage.
> >
> > The root cause is that kcpustat_cpu_fetch_vtime() adds the current
> > task's pending vtime (vtime->utime + delta) to the on-stack cpustat
> > snapshot on the fly. If the CPU switches tasks between two reads, the
> > pending vtime added on the second read belongs to a different task and
> > may be smaller than what was added on the first read.
> >
> > CPU0                                               CPU1
> > First:
> > show_stat()
> >  ->kcpustat_cpu_fetch()
> >   ->kcpustat_cpu_fetch_vtime()
> >    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
> >                              + vtime->utime + delta    (rq->curr is task A)
> >
> >                                                    A switches to B, and A's
> >                                                    vtime->utime is less than
> >                                                    1 tick
> > Then:
> > show_stat()
> >  ->kcpustat_cpu_fetch()
> >   ->kcpustat_cpu_fetch_vtime()
> >    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
> >                              + vtime->utime + delta    (rq->curr is task B)
>
> You're still not explaining where the time gets lost. And the patch is a
> horrible band-aid.
>
> What I think you're saying, after staring at this for a while, is that:
>
> vtime_task_switch_generic()
> __vtime_account_kernel(prev, vtime)
> vtime_account_{guest,system}(tsk, vtime)
> vtime->*time += get_vtime_delta()
> if (vtime->*time >= TICK_NSEC)
> account_*_time()
> account_system_index_time()
> task_group_account_field()
> __this_cpu_add(kernel_cpustat.cpustat[index], tmp); <---- here
>
> is not folding time into kernel_cpustat when the task vtime isn't at
> least a tick's worth. And then when we switch to another task, we leak
> time.

Looks right. Last time the patch was posted I misunderstood the issue.

>
> There's another problem here, vtime_task_switch_generic() should use a
> single call to sched_clock() to compute the old vtime_delta and set the
> new vtime->starttime, otherwise there's a time hole there as well.

Right, but does it really matter? It's just a few nanosecs ignored
between two tasks switching.

>
> This is all quite the maze and it really wants cleaning up, not be made
> worse.
>
> So I think you want to do two things:
>
> - pull kernel_cpustat updates out of task_group_account_field()
> and put them into vtime_task_switch_generic() to be purely
> vtime->starttime based.

So you want to force the update on all context switches? We used that TICK_NSEC
threshold before updating in order to keep the overhead down.

There are also user <-> kernel transitions involved.

How about handling that from the read side? (below untested)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 78a233d43757..f0f1af337e49 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -814,9 +814,9 @@ u64 task_gtime(struct task_struct *t)
do {
seq = read_seqcount_begin(&vtime->seqcount);

- gtime = t->gtime;
+ gtime = t->gtime + vtime->gtime;
if (vtime->state == VTIME_GUEST)
- gtime += vtime->gtime + vtime_delta(vtime);
+ gtime += vtime_delta(vtime);

} while (read_seqcount_retry(&vtime->seqcount, seq));

@@ -845,8 +845,8 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
ret = false;
seq = read_seqcount_begin(&vtime->seqcount);

- *utime = t->utime;
- *stime = t->stime;
+ *utime = t->utime + vtime->utime;
+ *stime = t->stime + vtime->stime;

/* Task is sleeping or idle, nothing to add */
if (vtime->state < VTIME_SYS)
@@ -860,9 +860,9 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
* add pending nohz time to the right place.
*/
if (vtime->state == VTIME_SYS)
- *stime += vtime->stime + delta;
+ *stime += delta;
else
- *utime += vtime->utime + delta;
+ *utime += delta;
} while (read_seqcount_retry(&vtime->seqcount, seq));

return ret;
@@ -896,11 +896,22 @@ static int vtime_state_fetch(struct vtime *vtime, int cpu)

static u64 kcpustat_user_vtime(struct vtime *vtime)
{
- if (vtime->state == VTIME_USER)
- return vtime->utime + vtime_delta(vtime);
- else if (vtime->state == VTIME_GUEST)
- return vtime->gtime + vtime_delta(vtime);
- return 0;
+ u64 delta = vtime->utime + vtime->gtime;
+
+ if (vtime->state == VTIME_USER || vtime->state == VTIME_GUEST)
+ delta += vtime_delta(vtime);
+
+ return delta;
+}
+
+static u64 kcpustat_guest_vtime(struct vtime *vtime)
+{
+ u64 delta = vtime->gtime;
+
+ if (vtime->state == VTIME_GUEST)
+ delta += vtime_delta(vtime);
+
+ return delta;
}

static int kcpustat_field_vtime(u64 *cpustat,
@@ -931,8 +942,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
*/
switch (usage) {
case CPUTIME_SYSTEM:
+ *val += vtime->stime;
if (state == VTIME_SYS)
- *val += vtime->stime + vtime_delta(vtime);
+ *val += vtime_delta(vtime);
break;
case CPUTIME_USER:
if (task_nice(tsk) <= 0)
@@ -943,12 +955,12 @@ static int kcpustat_field_vtime(u64 *cpustat,
*val += kcpustat_user_vtime(vtime);
break;
case CPUTIME_GUEST:
- if (state == VTIME_GUEST && task_nice(tsk) <= 0)
- *val += vtime->gtime + vtime_delta(vtime);
+ if (task_nice(tsk) <= 0)
+ *val += kcpustat_guest_vtime(vtime);
break;
case CPUTIME_GUEST_NICE:
- if (state == VTIME_GUEST && task_nice(tsk) > 0)
- *val += vtime->gtime + vtime_delta(vtime);
+ if (task_nice(tsk) > 0)
+ *val += kcpustat_guest_vtime(vtime);
break;
default:
break;
@@ -1013,6 +1025,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
*dst = *src;
cpustat = dst->cpustat;

+ cpustat[CPUTIME_SYSTEM] += vtime->stime;
+ if (task_nice(tsk) > 0) {
+ cpustat[CPUTIME_NICE] += vtime->utime + vtime->gtime;
+ cpustat[CPUTIME_GUEST_NICE] += vtime->gtime;
+ } else {
+ cpustat[CPUTIME_USER] += vtime->utime + vtime->gtime;
+ cpustat[CPUTIME_GUEST] += vtime->gtime;
+ }
+
/* Task is sleeping, dead or idle, nothing to add */
if (state < VTIME_SYS)
continue;
@@ -1024,20 +1045,20 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
* add pending nohz time to the right place.
*/
if (state == VTIME_SYS) {
- cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
+ cpustat[CPUTIME_SYSTEM] += delta;
} else if (state == VTIME_USER) {
if (task_nice(tsk) > 0)
- cpustat[CPUTIME_NICE] += vtime->utime + delta;
+ cpustat[CPUTIME_NICE] += delta;
else
- cpustat[CPUTIME_USER] += vtime->utime + delta;
+ cpustat[CPUTIME_USER] += delta;
} else {
WARN_ON_ONCE(state != VTIME_GUEST);
if (task_nice(tsk) > 0) {
- cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
- cpustat[CPUTIME_NICE] += vtime->gtime + delta;
+ cpustat[CPUTIME_GUEST_NICE] += delta;
+ cpustat[CPUTIME_NICE] += delta;
} else {
- cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
- cpustat[CPUTIME_USER] += vtime->gtime + delta;
+ cpustat[CPUTIME_GUEST] += delta;
+ cpustat[CPUTIME_USER] += delta;
}
}
} while (read_seqcount_retry(&vtime->seqcount, seq));

2022-09-30 03:21:18

by zhengzucheng

Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat


On 2022/9/28 20:11, Frederic Weisbecker wrote:
> On Wed, Sep 28, 2022 at 10:12:33AM +0200, Peter Zijlstra wrote:
>> On Wed, Sep 28, 2022 at 11:34:02AM +0800, Zucheng Zheng wrote:
>>> From: Zheng Zucheng <[email protected]>
>>>
>>> The cputime of cpuN read from /proc/stat can go backwards between two
>>> reads. For example, the user field of cpu1 drops from 319 to 318:
>>>
>>> first:
>>> cat /proc/stat | grep cpu1
>>> cpu1 319 0 496 41665 0 0 0 0 0 0
>>> again:
>>> cat /proc/stat | grep cpu1
>>> cpu1 318 0 497 41674 0 0 0 0 0 0
>>>
>>> The values read from /proc/stat should be monotonically increasing;
>>> otherwise userspace tools that compute usage from the deltas between
>>> reads will report incorrect CPU usage.
>>>
>>> The root cause is that kcpustat_cpu_fetch_vtime() adds the current
>>> task's pending vtime (vtime->utime + delta) to the on-stack cpustat
>>> snapshot on the fly. If the CPU switches tasks between two reads, the
>>> pending vtime added on the second read belongs to a different task and
>>> may be smaller than what was added on the first read.
>>>
>>> CPU0                                               CPU1
>>> First:
>>> show_stat()
>>>  ->kcpustat_cpu_fetch()
>>>   ->kcpustat_cpu_fetch_vtime()
>>>    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
>>>                              + vtime->utime + delta    (rq->curr is task A)
>>>
>>>                                                    A switches to B, and A's
>>>                                                    vtime->utime is less than
>>>                                                    1 tick
>>> Then:
>>> show_stat()
>>>  ->kcpustat_cpu_fetch()
>>>   ->kcpustat_cpu_fetch_vtime()
>>>    ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu)
>>>                              + vtime->utime + delta    (rq->curr is task B)
>> You're still not explaining where the time gets lost. And the patch is a
>> horrible band-aid.
>>
>> What I think you're saying, after staring at this for a while, is that:
>>
>> vtime_task_switch_generic()
>> __vtime_account_kernel(prev, vtime)
>> vtime_account_{guest,system}(tsk, vtime)
>> vtime->*time += get_vtime_delta()
>> if (vtime->*time >= TICK_NSEC)
>> account_*_time()
>> account_system_index_time()
>> task_group_account_field()
>> __this_cpu_add(kernel_cpustat.cpustat[index], tmp); <---- here
>>
>> is not folding time into kernel_cpustat when the task vtime isn't at
>> least a tick's worth. And then when we switch to another task, we leak
>> time.
> Looks right. Last time the patch was posted I misunderstood the issue.
>
>> There's another problem here, vtime_task_switch_generic() should use a
>> single call to sched_clock() to compute the old vtime_delta and set the
>> new vtime->starttime, otherwise there's a time hole there as well.
> Right, but does it really matter? It's just a few nanosecs ignored
> between two tasks switching.
>
>> This is all quite the maze and it really wants cleaning up, not be made
>> worse.
>>
>> So I think you want to do two things:
>>
>> - pull kernel_cpustat updates out of task_group_account_field()
>> and put them into vtime_task_switch_generic() to be purely
>> vtime->starttime based.
> So you want to force the update on all context switches? We used that TICK_NSEC
> threshold before updating in order to keep the overhead down.
>
> There are also user <-> kernel transitions involved.
>
> How about handling that from the read side? (below untested)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 78a233d43757..f0f1af337e49 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -814,9 +814,9 @@ u64 task_gtime(struct task_struct *t)
> do {
> seq = read_seqcount_begin(&vtime->seqcount);
>
> - gtime = t->gtime;
> + gtime = t->gtime + vtime->gtime;
> if (vtime->state == VTIME_GUEST)
> - gtime += vtime->gtime + vtime_delta(vtime);
> + gtime += vtime_delta(vtime);
>
> } while (read_seqcount_retry(&vtime->seqcount, seq));
>
> @@ -845,8 +845,8 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
> ret = false;
> seq = read_seqcount_begin(&vtime->seqcount);
>
> - *utime = t->utime;
> - *stime = t->stime;
> + *utime = t->utime + vtime->utime;
> + *stime = t->stime + vtime->stime;
>
> /* Task is sleeping or idle, nothing to add */
> if (vtime->state < VTIME_SYS)
> @@ -860,9 +860,9 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
> * add pending nohz time to the right place.
> */
> if (vtime->state == VTIME_SYS)
> - *stime += vtime->stime + delta;
> + *stime += delta;
> else
> - *utime += vtime->utime + delta;
> + *utime += delta;
> } while (read_seqcount_retry(&vtime->seqcount, seq));
>
> return ret;
> @@ -896,11 +896,22 @@ static int vtime_state_fetch(struct vtime *vtime, int cpu)
>
> static u64 kcpustat_user_vtime(struct vtime *vtime)
> {
> - if (vtime->state == VTIME_USER)
> - return vtime->utime + vtime_delta(vtime);
> - else if (vtime->state == VTIME_GUEST)
> - return vtime->gtime + vtime_delta(vtime);
> - return 0;
> + u64 delta = vtime->utime + vtime->gtime;
> +
> + if (vtime->state == VTIME_USER || vtime->state == VTIME_GUEST)
> + delta += vtime_delta(vtime);
> +
> + return delta;
> +}
> +
> +static u64 kcpustat_guest_vtime(struct vtime *vtime)
> +{
> + u64 delta = vtime->gtime;
> +
> + if (vtime->state == VTIME_GUEST)
> + delta += vtime_delta(vtime);
> +
> + return delta;
> }
>
> static int kcpustat_field_vtime(u64 *cpustat,
> @@ -931,8 +942,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
> */
> switch (usage) {
> case CPUTIME_SYSTEM:
> + *val += vtime->stime;
> if (state == VTIME_SYS)
> - *val += vtime->stime + vtime_delta(vtime);
> + *val += vtime_delta(vtime);
> break;
> case CPUTIME_USER:
> if (task_nice(tsk) <= 0)
> @@ -943,12 +955,12 @@ static int kcpustat_field_vtime(u64 *cpustat,
> *val += kcpustat_user_vtime(vtime);
> break;
> case CPUTIME_GUEST:
> - if (state == VTIME_GUEST && task_nice(tsk) <= 0)
> - *val += vtime->gtime + vtime_delta(vtime);
> + if (task_nice(tsk) <= 0)
> + *val += kcpustat_guest_vtime(vtime);
> break;
> case CPUTIME_GUEST_NICE:
> - if (state == VTIME_GUEST && task_nice(tsk) > 0)
> - *val += vtime->gtime + vtime_delta(vtime);
> + if (task_nice(tsk) > 0)
> + *val += kcpustat_guest_vtime(vtime);
> break;
> default:
> break;
> @@ -1013,6 +1025,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> *dst = *src;
> cpustat = dst->cpustat;
>
> + cpustat[CPUTIME_SYSTEM] += vtime->stime;
> + if (task_nice(tsk) > 0) {
> + cpustat[CPUTIME_NICE] += vtime->utime + vtime->gtime;
> + cpustat[CPUTIME_GUEST_NICE] += vtime->gtime;
> + } else {
> + cpustat[CPUTIME_USER] += vtime->utime + vtime->gtime;
> + cpustat[CPUTIME_GUEST] += vtime->gtime;
> + }
> +
> /* Task is sleeping, dead or idle, nothing to add */
> if (state < VTIME_SYS)
> continue;
> @@ -1024,20 +1045,20 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> * add pending nohz time to the right place.
> */
> if (state == VTIME_SYS) {
> - cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
> + cpustat[CPUTIME_SYSTEM] += delta;
> } else if (state == VTIME_USER) {
> if (task_nice(tsk) > 0)
> - cpustat[CPUTIME_NICE] += vtime->utime + delta;
> + cpustat[CPUTIME_NICE] += delta;
> else
> - cpustat[CPUTIME_USER] += vtime->utime + delta;
> + cpustat[CPUTIME_USER] += delta;
"delta" has the same problem as vtime->utime: it varies with the
current task, so switching between different tasks may cause the time
statistics to go backwards.
> } else {
> WARN_ON_ONCE(state != VTIME_GUEST);
> if (task_nice(tsk) > 0) {
> - cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
> - cpustat[CPUTIME_NICE] += vtime->gtime + delta;
> + cpustat[CPUTIME_GUEST_NICE] += delta;
> + cpustat[CPUTIME_NICE] += delta;
> } else {
> - cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
> - cpustat[CPUTIME_USER] += vtime->gtime + delta;
> + cpustat[CPUTIME_GUEST] += delta;
> + cpustat[CPUTIME_USER] += delta;
> }
> }
> } while (read_seqcount_retry(&vtime->seqcount, seq));

2022-09-30 12:29:42

by Frederic Weisbecker

Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat

On Fri, Sep 30, 2022 at 10:43:58AM +0800, zhengzucheng wrote:
>
> On 2022/9/28 20:11, Frederic Weisbecker wrote:
> > @@ -1024,20 +1045,20 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
> > * add pending nohz time to the right place.
> > */
> > if (state == VTIME_SYS) {
> > - cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
> > + cpustat[CPUTIME_SYSTEM] += delta;
> > } else if (state == VTIME_USER) {
> > if (task_nice(tsk) > 0)
> > - cpustat[CPUTIME_NICE] += vtime->utime + delta;
> > + cpustat[CPUTIME_NICE] += delta;
> > else
> > - cpustat[CPUTIME_USER] += vtime->utime + delta;
> > + cpustat[CPUTIME_USER] += delta;
> "delta" has the same problem as vtime->utime: it varies with the
> current task, so switching between different tasks may cause the time
> statistics to go backwards.

I'm a bit confused, can you provide an example?

Thanks.

> > } else {
> > WARN_ON_ONCE(state != VTIME_GUEST);
> > if (task_nice(tsk) > 0) {
> > - cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
> > - cpustat[CPUTIME_NICE] += vtime->gtime + delta;
> > + cpustat[CPUTIME_GUEST_NICE] += delta;
> > + cpustat[CPUTIME_NICE] += delta;
> > } else {
> > - cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
> > - cpustat[CPUTIME_USER] += vtime->gtime + delta;
> > + cpustat[CPUTIME_GUEST] += delta;
> > + cpustat[CPUTIME_USER] += delta;
> > }
> > }
> > } while (read_seqcount_retry(&vtime->seqcount, seq));

2022-09-30 12:31:42

by kernel test robot

Subject: [sched/cputime] 131c995687: BUG:spinlock_trylock_failure_on_UP_on_CPU

Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 131c99568796ed8af66d1a5fd7e61465d252aeca ("[PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat")
url: https://github.com/intel-lab-lkp/linux/commits/Zucheng-Zheng/sched-cputime-Fix-the-time-backward-issue-about-proc-stat/20220928-113916
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[ 53.618563][ T321] BUG: spinlock trylock failure on UP on CPU#0, cp/321
[ 53.620281][ T321] CPU: 0 PID: 321 Comm: cp Tainted: G T 6.0.0-rc7-next-20220927-00001-g131c99568796 #33
[ 53.621323][ T321] Call Trace:
[ 53.621647][ T321] <TASK>
[ 53.621952][ T321] dump_stack_lvl (??:?)
[ 53.622362][ T321] dump_stack (??:?)
[ 53.622715][ T321] spin_bug (spinlock_debug.c:?)
[ 53.623068][ T321] do_raw_spin_trylock (??:?)
[ 53.623489][ T321] __raw_spin_lock_irqsave (spinlock.c:?)
[ 53.623951][ T321] ? _raw_spin_lock_irqsave (??:?)
[ 53.624440][ T321] _raw_spin_lock_irqsave (??:?)
[ 53.624904][ T321] kcpustat_cpu_fetch (??:?)
[ 53.625383][ T321] uptime_proc_show (uptime.c:?)
[ 53.625833][ T321] ? lock_is_held (kvm_main.c:?)
[ 53.626267][ T321] ? show_stat (uptime.c:?)
[ 53.626648][ T321] ? rcu_read_lock_bh_held (??:?)
[ 53.627079][ T321] ? kasan_save_alloc_info (??:?)
[ 53.627540][ T321] ? ____kasan_kmalloc (common.c:?)
[ 53.627956][ T321] ? kvmalloc_node (??:?)
[ 53.628378][ T321] seq_read_iter (??:?)
[ 53.628804][ T321] proc_reg_read_iter (inode.c:?)
[ 53.629291][ T321] call_read_iter (read_write.c:?)
[ 53.629725][ T321] vfs_read (??:?)
[ 53.630125][ T321] ? kernel_read (??:?)
[ 53.630537][ T321] ? __lock_release (lockdep.c:?)
[ 53.630981][ T321] ? __kasan_check_read (??:?)
[ 53.631479][ T321] ? __fdget_pos (??:?)
[ 53.631901][ T321] ksys_read (??:?)
[ 53.632308][ T321] ? vfs_write (??:?)
[ 53.632721][ T321] __x64_sys_read (??:?)
[ 53.633136][ T321] do_syscall_64 (??:?)
[ 53.633558][ T321] entry_SYSCALL_64_after_hwframe (??:?)
[ 53.634127][ T321] RIP: 0033:0x7f3802b01e8e
[ 53.634512][ T321] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 18 0a 00 e8 89 e8 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
All code
========
0: c0 e9 b6 shr $0xb6,%cl
3: fe (bad)
4: ff (bad)
5: ff 50 48 callq *0x48(%rax)
8: 8d 3d 6e 18 0a 00 lea 0xa186e(%rip),%edi # 0xa187c
e: e8 89 e8 01 00 callq 0x1e89c
13: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
1a: 00 00
1c: 64 8b 04 25 18 00 00 mov %fs:0x18,%eax
23: 00
24: 85 c0 test %eax,%eax
26: 75 14 jne 0x3c
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 5a ja 0x8c
32: c3 retq
33: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
3a: 00 00
3c: 48 83 ec 28 sub $0x28,%rsp

Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 5a ja 0x62
8: c3 retq
9: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
10: 00 00
12: 48 83 ec 28 sub $0x28,%rsp
[ 53.636139][ T321] RSP: 002b:00007ffe76036628 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 53.636894][ T321] RAX: ffffffffffffffda RBX: 00007ffe76036c60 RCX: 00007f3802b01e8e
[ 53.637635][ T321] RDX: 0000000000020000 RSI: 00007f3802763000 RDI: 0000000000000003
[ 53.638389][ T321] RBP: 00007ffe76036ae0 R08: 0000000000000000 R09: 0000000000000000
[ 53.639174][ T321] R10: fffffffffffff28d R11: 0000000000000246 R12: 0000000000000000
[ 53.639905][ T321] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 53.640642][ T321] </TASK>


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/r/[email protected]


To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc7-next-20220927-00001-g131c99568796 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
config-6.0.0-rc7-next-20220927-00001-g131c99568796 (138.87 kB)
job-script (5.32 kB)
dmesg.xz (29.44 kB)

2022-10-09 02:52:58

by zhengzucheng

Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat


On 2022/9/30 20:16, Frederic Weisbecker wrote:
> On Fri, Sep 30, 2022 at 10:43:58AM +0800, zhengzucheng wrote:
>> On 2022/9/28 20:11, Frederic Weisbecker wrote:
>>> @@ -1024,20 +1045,20 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
>>> * add pending nohz time to the right place.
>>> */
>>> if (state == VTIME_SYS) {
>>> - cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
>>> + cpustat[CPUTIME_SYSTEM] += delta;
>>> } else if (state == VTIME_USER) {
>>> if (task_nice(tsk) > 0)
>>> - cpustat[CPUTIME_NICE] += vtime->utime + delta;
>>> + cpustat[CPUTIME_NICE] += delta;
>>> else
>>> - cpustat[CPUTIME_USER] += vtime->utime + delta;
>>> + cpustat[CPUTIME_USER] += delta;
>> "delta" has the same problem as vtime->utime: it varies with the
>> current task, so switching between different tasks may cause the time
>> statistics to go backwards.
> I'm a bit confused, can you provide an example?
It was the Chinese National Day holiday; sorry for not replying to you in time.

CONFIG_HZ=100, so one tick is 10,000,000 ns. The relevant read-side code is:

const struct kernel_cpustat *src = &kcpustat_cpu(cpu);
struct vtime *vtime = &tsk->vtime;
cpustat[CPUTIME_USER] += vtime->utime + delta;

first:
cat /proc/stat | grep cpu1
cpu1 319 0 496 41665 0 0 0 0 0 0
Task A is running on CPU 1, so vtime is A's vtime and delta is A's delta:
kcpustat_cpu_fetch_vtime: cpu=1 src->cpustat[CPUTIME_USER]=3189000000
vtime->utime=900000 delta=100001
cpustat[CPUTIME_USER] = 3189000000 + 900000 + 100001 = 3,190,000,001 ns,
i.e. 319 ticks

again:
cat /proc/stat | grep cpu1
cpu1 318 0 497 41674 0 0 0 0 0 0
Task B is now running on CPU 1, so vtime is B's vtime and delta is B's delta:
kcpustat_cpu_fetch_vtime: cpu=1 src->cpustat[CPUTIME_USER]=3189000000
vtime->utime=900000 delta=90000
cpustat[CPUTIME_USER] = 3189000000 + 900000 + 90000 = 3,189,990,000 ns,
i.e. 318 ticks

The root cause is that task B's "vtime->utime + delta" may be smaller
than task A's.
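
The same arithmetic as a compilable check (values from the example above):

#include <stdio.h>

int main(void)
{
	unsigned long long tick_nsec = 1000000000ULL / 100;	/* HZ=100 */
	unsigned long long base = 3189000000ULL;  /* cpustat[CPUTIME_USER] */

	unsigned long long first = base + 900000 + 100001;  /* task A current */
	unsigned long long again = base + 900000 + 90000;   /* task B current */

	printf("first: %llu ticks\n", first / tick_nsec);    /* 319 */
	printf("again: %llu ticks\n", again / tick_nsec);    /* 318 */
	return 0;
}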
>
> Thanks.
>
>>> } else {
>>> WARN_ON_ONCE(state != VTIME_GUEST);
>>> if (task_nice(tsk) > 0) {
>>> - cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
>>> - cpustat[CPUTIME_NICE] += vtime->gtime + delta;
>>> + cpustat[CPUTIME_GUEST_NICE] += delta;
>>> + cpustat[CPUTIME_NICE] += delta;
>>> } else {
>>> - cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
>>> - cpustat[CPUTIME_USER] += vtime->gtime + delta;
>>> + cpustat[CPUTIME_GUEST] += delta;
>>> + cpustat[CPUTIME_USER] += delta;
>>> }
>>> }
>>> } while (read_seqcount_retry(&vtime->seqcount, seq));