In psi_update_stats(), it is possible for period to have a value like
0xXXXXXXXX00000000, where the lower 32 bits are all zero. The code then
calls div_u64(), which truncates the u64 period to u32 and thus ends up
with a zero divisor.
Use div64_u64() instead of div_u64() when the divisor is a u64, to avoid
truncation to 32 bits on 64-bit platforms.
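For reference, the math64.h helpers involved look roughly like the sketch
below on 64-bit kernels (simplified, not the verbatim header); the u32
divisor parameter of div_u64() is where the silent narrowing happens:

	/* include/linux/math64.h, CONFIG_64BIT case (simplified sketch) */
	static inline u64 div_u64(u64 dividend, u32 divisor)
	{
		/* a u64 divisor passed here is narrowed to its low 32 bits */
		return dividend / divisor;
	}

	static inline u64 div64_u64(u64 dividend, u64 divisor)
	{
		/* keeps the full 64-bit divisor */
		return dividend / divisor;
	}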
Signed-off-by: xiejingfeng <[email protected]>
---
kernel/sched/psi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..399d6f106de5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -291,7 +291,7 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
}
/* Sample the most recent active period */
- pct = div_u64(time * 100, period);
+ pct = div64_u64(time * 100, period);
pct *= FIXED_1;
avg[0] = calc_load(avg[0], EXP_10s, pct);
avg[1] = calc_load(avg[1], EXP_60s, pct);
--
2.14.4.44.g2045bb
On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> In psi_update_stats, it is possible that period has value like
> 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
How can this happen? Is that a valid case or should we be avoiding that?
It has happened multiple times on our online machines; the crash call trace is below:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330
[58914.286053] RSP
With full kdump vmcore analysis, R8 is the period value in psi_update_stats() that results in the division-by-zero error.
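If R8 is indeed period, the numbers line up (a back-of-the-envelope check
on the values already quoted above, nothing new):

	R8 = 0x0000359500000000 ns   -> low 32 bits all zero, so div_u64() sees a 0 divisor
	0x3595 * 2^32 ns = 13717 * 4.294967296 s ~= 58914.066 s -> matches the [58914.*] timestamps in the oops above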
On 2019/11/8 at 5:31 PM, "Peter Zijlstra" <[email protected]> wrote:
On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> In psi_update_stats, it is possible that period has value like
> 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
How can this happen? Is that a valid case or should we be avoiding that?
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Nov 08, 2019 at 05:49:01PM +0800, Jingfeng Xie wrote:
> It happens multiple times on our online machines, the crash call trace is like below:
> [58914.066423] divide error: 0000 [#1] SMP
> [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
> [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
> [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
> [58914.102728] Workqueue: events psi_update_work
> [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
> [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
> [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
> [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
> [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
> [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
> [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
> [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
> [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
> [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
> [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [58914.200360] PKRU: 55555554
> [58914.203221] Stack:
> [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
> [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
> [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [58914.228734] Call Trace:
> [58914.231337] [] psi_update_work+0x22/0x60
> [58914.237067] [] process_one_work+0x189/0x420
> [58914.243063] [] worker_thread+0x4e/0x4b0
> [58914.248701] [] ? process_one_work+0x420/0x420
> [58914.254869] [] kthread+0xe6/0x100
> [58914.259994] [] ? kthread_park+0x60/0x60
> [58914.265640] [] ret_from_fork+0x39/0x50
> [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
> [58914.279691] RIP [] psi_update_stats+0x1c1/0x330
> [58914.286053] RSP
>
> With full kdump vmcore analysis, The R8 is period in psi_update_stats which results in the zero division error.
This does not answer either question I asked.
> How can this happen? Is that a valid case or should we be avoiding that?
This does not explain how the period got so large, nor if that is a
valid/expected scenario.
On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> In psi_update_stats, it is possible that period has value like
> 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> truncates u64 period to u32, results in zero divisor.
> Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> truncation to 32-bit on 64-bit platforms.
>
> Signed-off-by: xiejingfeng <[email protected]>
This is legit. When we stop the periodic averaging worker due to an
idle CPU, the period after restart can be much longer than the ~4 sec
in the lower 32 bits. See the missed_periods logic in update_averages.
What is surprising is that you can hit this repeatedly, as the odds
are 1 in 4,294,967,296. An extremely coarse clock source?
Acked-by: Johannes Weiner <[email protected]>
There are several more instances of div_u64 in psi.c. They all look
fine to me except for one in the psi poll() windowing code, where we
divide by the window size, which can be up to 10s. CCing Suren.
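(Rough numbers, for reference: 10 s = 10,000,000,000 ns = 0x2540BE400,
while a u32 divisor can only represent up to 2^32 - 1 ns, i.e. about
4.29 s.)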
---
From 009cece5f37a38f4baeb1bebdcb432ac9ae66ef8 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Tue, 12 Nov 2019 10:35:26 -0500
Subject: [PATCH] psi: fix a division error in psi poll()
The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. But div_u64() is meant
only for 32-bit divisors. Use div64_u64() for the 64-bit divisor.
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/sched/psi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..84af7aa158bf 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -481,7 +481,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
u32 remaining;
remaining = win->size - elapsed;
- growth += div_u64(win->prev_growth * remaining, win->size);
+ growth += div64_u64(win->prev_growth * remaining, win->size);
}
return growth;
--
2.24.0
On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > In psi_update_stats, it is possible that period has value like
> > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > truncates u64 period to u32, results in zero divisor.
> > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > truncation to 32-bit on 64-bit platforms.
> >
> > Signed-off-by: xiejingfeng <[email protected]>
>
> This is legit. When we stop the periodic averaging worker due to an
> idle CPU, the period after restart can be much longer than the ~4 sec
> in the lower 32 bits. See the missed_periods logic in update_averages.
Argh, that's not right. Of course I notice right after hitting send.
missed_periods are subtracted out of the difference between now and
the last update, so period should be not much bigger than 2s.
Something else is going on here.
On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > In psi_update_stats, it is possible that period has value like
> > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > truncates u64 period to u32, results in zero divisor.
> > > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > > truncation to 32-bit on 64-bit platforms.
> > >
> > > Signed-off-by: xiejingfeng <[email protected]>
> >
> > This is legit. When we stop the periodic averaging worker due to an
> > idle CPU, the period after restart can be much longer than the ~4 sec
> > in the lower 32 bits. See the missed_periods logic in update_averages.
>
> Argh, that's not right. Of course I notice right after hitting send.
>
> missed_periods are subtracted out of the difference between now and
> the last update, so period should be not much bigger than 2s.
>
> Something else is going on here.
Tim, does this happen right after boot? I wonder if it's because we're
not initializing avg_last_update, and the initial delta between the
last update (0) and the first scheduled update (sched_clock() + 2s)
ends up bigger than 4 seconds somehow. Later on, the delta between the
last and the scheduled update should always be ~2s. But for that to
happen, it would require a pretty slow boot, or a sched_clock() that
does not start at 0.
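(For what it's worth: for the truncated divisor to be exactly zero, period
has to be an integer multiple of 2^32 ns, i.e. at least ~4.29 s, which is
where the "bigger than 4 seconds" above comes from.)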
Tim, if you have a coredump, can you extract the value of the other
variables printed in the following patch?
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..1b6836d23091 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
*/
avg_next_update = expires + ((1 + missed_periods) * psi_period);
period = now - (group->avg_last_update + (missed_periods * psi_period));
+
+ WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
+ period, now, expires, group->avg_last_update, missed_periods);
+
group->avg_last_update = now;
for (s = 0; s < NR_PSI_STATES - 1; s++) {
And we may need something like this to make the tick initialization
more robust regardless of the reported bug here:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..ce8f6748678a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
- group->avg_next_update = sched_clock() + psi_period;
+ group->avg_last_update = sched_clock();
+ group->avg_next_update = group->avg_last_update + psi_period;
INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
mutex_init(&group->avgs_lock);
/* Init trigger-related members */
On Tue, Nov 12, 2019 at 8:08 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > > In psi_update_stats, it is possible that period has value like
> > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > > truncates u64 period to u32, results in zero divisor.
> > > > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > > > truncation to 32-bit on 64-bit platforms.
> > > >
> > > > Signed-off-by: xiejingfeng <[email protected]>
> > >
> > > This is legit. When we stop the periodic averaging worker due to an
> > > idle CPU, the period after restart can be much longer than the ~4 sec
> > > in the lower 32 bits. See the missed_periods logic in update_averages.
> >
> > Argh, that's not right. Of course I notice right after hitting send.
> >
> > missed_periods are subtracted out of the difference between now and
> > the last update, so period should be not much bigger than 2s.
> >
> > Something else is going on here.
>
> Tim, does this happen right after boot? I wonder if it's because we're
> not initializing avg_last_update, and the initial delta between the
> last update (0) and the first scheduled update (sched_clock() + 2s)
> ends up bigger than 4 seconds somehow. Later on, the delta between the
> last and the scheduled update should always be ~2s. But for that to
> happen, it would require a pretty slow boot, or a sched_clock() that
> does not start at 0.
>
> Tim, if you have a coredump, can you extract the value of the other
> variables printed in the following patch?
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..1b6836d23091 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
> */
> avg_next_update = expires + ((1 + missed_periods) * psi_period);
> period = now - (group->avg_last_update + (missed_periods * psi_period));
> +
> + WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
> + period, now, expires, group->avg_last_update, missed_periods);
> +
> group->avg_last_update = now;
>
> for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
> And we may need something like this to make the tick initialization
> more robust regardless of the reported bug here:
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..ce8f6748678a 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
> for_each_possible_cpu(cpu)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> - group->avg_next_update = sched_clock() + psi_period;
> + group->avg_last_update = sched_clock();
> + group->avg_next_update = group->avg_last_update + psi_period;
> INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> mutex_init(&group->avgs_lock);
> /* Init trigger-related members */
Both fixes, for group_init() and window_update(), make sense to me.
The window_update() division would be reproducible because win->size is
set during trigger setup and does not change afterwards. Since
userspace defines the window size in usecs, this would require doing
some math to find a value that yields zeros in the 32 LSBs after
conversion into nsecs (see: t->win.size = window_us * NSEC_PER_USEC in
psi_trigger_create()). I haven't seen this issue because all my test
cases (1, 2, 10 secs) had non-zero LSBs.
On Tue, Nov 12, 2019 at 7:41 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > In psi_update_stats, it is possible that period has value like
> > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > truncates u64 period to u32, results in zero divisor.
> > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > truncation to 32-bit on 64-bit platforms.
> >
> > Signed-off-by: xiejingfeng <[email protected]>
>
> This is legit. When we stop the periodic averaging worker due to an
> idle CPU, the period after restart can be much longer than the ~4 sec
> in the lower 32 bits. See the missed_periods logic in update_averages.
>
> What is surprising is that you can hit this repeatedly, as the odds
> are 1 in 4,294,967,296. An extremely coarse clock source?
>
> Acked-by: Johannes Weiner <[email protected]>
>
> There are several more instances of div_u64 in psi.c. They all look
> fine to me except for one in the psi poll() windowing code, where we
> divide by the window size, which can be up to 10s. CCing Suren.
>
> ---
> From 009cece5f37a38f4baeb1bebdcb432ac9ae66ef8 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Tue, 12 Nov 2019 10:35:26 -0500
> Subject: [PATCH] psi: fix a division error in psi poll()
>
> The psi window size is a u64 an can be up to 10 seconds right now,
> which exceeds the lower 32 bits of the variable. But div_u64 is meant
> only for 32-bit divisors. Use div64_u64 for the 64-bit divisor.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> kernel/sched/psi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 517e3719027e..84af7aa158bf 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -481,7 +481,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> u32 remaining;
>
> remaining = win->size - elapsed;
> - growth += div_u64(win->prev_growth * remaining, win->size);
> + growth += div64_u64(win->prev_growth * remaining, win->size);
> }
>
> return growth;
> --
> 2.24.0
>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Weiner,
The crash does not happen right after boot; in my case, it happens in the
58914 ~ 815463 second range since boot.
With my coredump, some values are extracted as below:
period = 001df2dc00000000
now = 001df2dc00000000, same as period
expires = group->next_update = rdi = 00003594f700648e
group->avg_last_update could not be known
missed_periods = 0
On 2019/11/13 at 12:08 AM, "Johannes Weiner" <[email protected]> wrote:
On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > In psi_update_stats, it is possible that period has value like
> > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > truncates u64 period to u32, results in zero divisor.
> > > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > > truncation to 32-bit on 64-bit platforms.
> > >
> > > Signed-off-by: xiejingfeng <[email protected]>
> >
> > This is legit. When we stop the periodic averaging worker due to an
> > idle CPU, the period after restart can be much longer than the ~4 sec
> > in the lower 32 bits. See the missed_periods logic in update_averages.
>
> Argh, that's not right. Of course I notice right after hitting send.
>
> missed_periods are subtracted out of the difference between now and
> the last update, so period should be not much bigger than 2s.
>
> Something else is going on here.
Tim, does this happen right after boot? I wonder if it's because we're
not initializing avg_last_update, and the initial delta between the
last update (0) and the first scheduled update (sched_clock() + 2s)
ends up bigger than 4 seconds somehow. Later on, the delta between the
last and the scheduled update should always be ~2s. But for that to
happen, it would require a pretty slow boot, or a sched_clock() that
does not start at 0.
Tim, if you have a coredump, can you extract the value of the other
variables printed in the following patch?
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..1b6836d23091 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
*/
avg_next_update = expires + ((1 + missed_periods) * psi_period);
period = now - (group->avg_last_update + (missed_periods * psi_period));
+
+ WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
+ period, now, expires, group->avg_last_update, missed_periods);
+
group->avg_last_update = now;
for (s = 0; s < NR_PSI_STATES - 1; s++) {
And we may need something like this to make the tick initialization
more robust regardless of the reported bug here:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..ce8f6748678a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
- group->avg_next_update = sched_clock() + psi_period;
+ group->avg_last_update = sched_clock();
+ group->avg_next_update = group->avg_last_update + psi_period;
INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
mutex_init(&group->avgs_lock);
/* Init trigger-related members */
On Thu, Nov 28, 2019 at 10:37 PM Jingfeng Xie
<[email protected]> wrote:
>
> Weiner,
> The crash does not happen right after boot, in my case, it happens in 58914 ~ 815463 seconds range since boot
>
> With my coredump,some values are extracted as below:
>
> period = 001df2dc00000000
> now = 001df2dc00000000, same as period
> expires = group->next_update = rdi = 00003594f700648e
> group->avg_last_update could not be known
> missed_periods = 0
>
Considering that period = now - (group->avg_last_update +
(missed_periods * psi_period)) and the above values (period == now and
missed_periods == 0), group->avg_last_update must be 0, which would
mean this is indeed the first update_averages() call.
I think this can happen if a cgroup is created long after boot.
The following call chain would happen:
cgroup_create->psi_cgroup_alloc->group_init->INIT_DELAYED_WORK->psi_avgs_work->update_averages.
If the cgroup creation is timed so that psi_avgs_work() is called when
sched_clock() returns a value whose lower 32 bits are 0, then we get
this problem.
The patch Johannes posted earlier, which sets group->avg_last_update to
sched_clock() in group_init(), should have fixed this problem. Tim, did
you capture this coredump after applying that patch? If not, please try
applying it and see if it still happens.
> On 2019/11/13 at 12:08 AM, "Johannes Weiner" <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > > In psi_update_stats, it is possible that period has value like
> > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > > truncates u64 period to u32, results in zero divisor.
> > > > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > > > truncation to 32-bit on 64-bit platforms.
> > > >
> > > > Signed-off-by: xiejingfeng <[email protected]>
> > >
> > > This is legit. When we stop the periodic averaging worker due to an
> > > idle CPU, the period after restart can be much longer than the ~4 sec
> > > in the lower 32 bits. See the missed_periods logic in update_averages.
> >
> > Argh, that's not right. Of course I notice right after hitting send.
> >
> > missed_periods are subtracted out of the difference between now and
> > the last update, so period should be not much bigger than 2s.
> >
> > Something else is going on here.
>
> Tim, does this happen right after boot? I wonder if it's because we're
> not initializing avg_last_update, and the initial delta between the
> last update (0) and the first scheduled update (sched_clock() + 2s)
> ends up bigger than 4 seconds somehow. Later on, the delta between the
> last and the scheduled update should always be ~2s. But for that to
> happen, it would require a pretty slow boot, or a sched_clock() that
> does not start at 0.
>
> Tim, if you have a coredump, can you extract the value of the other
> variables printed in the following patch?
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..1b6836d23091 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
> */
> avg_next_update = expires + ((1 + missed_periods) * psi_period);
> period = now - (group->avg_last_update + (missed_periods * psi_period));
> +
> + WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
> + period, now, expires, group->avg_last_update, missed_periods);
> +
> group->avg_last_update = now;
>
> for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
> And we may need something like this to make the tick initialization
> more robust regardless of the reported bug here:
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..ce8f6748678a 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
> for_each_possible_cpu(cpu)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> - group->avg_next_update = sched_clock() + psi_period;
> + group->avg_last_update = sched_clock();
> + group->avg_next_update = group->avg_last_update + psi_period;
> INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> mutex_init(&group->avgs_lock);
> /* Init trigger-related members */
>
>
>
>