2023-05-10 08:45:44

by Hao Jia

Subject: [PATCH v2 0/2] Fix two warnings about rq clock

These two patches fix two warnings about the rq clock.

Patch 1 fixes a warning about using a stale rq clock caused by a
missing rq clock update.

Patch 2 fixes a warning that the rq clock was updated multiple
times while holding the rq lock.

v1->v2:
- Vincent Guittot suggested using rq_clock_start_loop_update()
to prevent multiple calls to update_rq_clock() in unthrottle_cfs_rq()
instead of removing update_rq_clock() from unthrottle_cfs_rq()
and updating the rq clock before unthrottle_cfs_rq() for patch2.

[v1] https://lore.kernel.org/all/[email protected]

Hao Jia (2):
sched/core: Fixed missing rq clock update before calling
set_rq_offline()
sched/core: Avoid double calling update_rq_clock()

kernel/sched/core.c | 7 ++++---
kernel/sched/fair.c | 16 ++++++++++++++++
kernel/sched/sched.h | 21 +++++++++++++++++++++
kernel/sched/topology.c | 10 ++++++----
4 files changed, 47 insertions(+), 7 deletions(-)

--
2.37.0



2023-05-10 09:18:30

by Hao Jia

Subject: [PATCH v2 1/2] sched/core: Fixed missing rq clock update before calling set_rq_offline()

This is triggered during cpu offline when CONFIG_CPU_FREQ is enabled:

------------[ cut here ]------------
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 24 PID: 754 at kernel/sched/sched.h:1496
enqueue_top_rt_rq+0x139/0x160
Call Trace:
<TASK>
? intel_pstate_update_util+0x3b0/0x3b0
rq_offline_rt+0x1b7/0x250
set_rq_offline.part.120+0x28/0x60
rq_attach_root+0xc4/0xd0
cpu_attach_domain+0x3dc/0x7f0
? __schedule+0x65e/0x1310
partition_sched_domains_locked+0x2a5/0x3c0
rebuild_sched_domains_locked+0x477/0x830
? percpu_rwsem_wait+0x140/0x140
rebuild_sched_domains+0x1b/0x30
cpuset_hotplug_workfn+0x2ca/0xc90
? balance_push+0x56/0x120
? _raw_spin_unlock+0x15/0x30
? finish_task_switch+0x98/0x2f0
? __switch_to+0x116/0x410
? __schedule+0x65e/0x1310 ? internal_add_timer+0x42/0x60
? _raw_spin_unlock_irqrestore+0x23/0x40
? add_timer_on+0xd5/0x130
process_one_work+0x1bc/0x3d0
worker_thread+0x4c/0x380
? preempt_count_add+0x56/0xa0
? rescuer_thread+0x310/0x310
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30

Before calling set_rq_offline() we need to update the rq clock to
avoid using a stale rq clock, and we replace
raw_spin_rq_lock_irqsave()/raw_spin_rq_unlock_irqrestore() with
rq_lock_irqsave()/rq_unlock_irqrestore() to ensure that
rq->clock_update_flags is cleared before the rq clock is updated.
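
For readers unfamiliar with the RQCF machinery, the flag check that fires
this warning can be sketched as a small user-space model. The names mirror
the kernel's (kernel/sched/sched.h), but this is an illustrative sketch,
not kernel code:

```c
#include <assert.h>

/*
 * Simplified model of the rq clock debug flags. In the kernel,
 * assert_clock_updated() warns with "rq->clock_update_flags < RQCF_ACT_SKIP"
 * when rq_clock() is read without a preceding update_rq_clock().
 */
#define RQCF_REQ_SKIP	0x01
#define RQCF_ACT_SKIP	0x02
#define RQCF_UPDATED	0x04

struct rq {
	unsigned int clock_update_flags;
	unsigned long long clock;	/* stands in for the real ns clock */
};

/* Taking the lock via rq_lock_irqsave() clears the flags (rq_pin_lock()). */
static void rq_pin_lock(struct rq *rq)
{
	rq->clock_update_flags = 0;
}

static void update_rq_clock(struct rq *rq)
{
	rq->clock++;			/* pretend time advanced */
	rq->clock_update_flags |= RQCF_UPDATED;
}

/* Returns 0 (i.e. "would warn") when the clock is stale after locking. */
static int clock_is_updated(struct rq *rq)
{
	return rq->clock_update_flags >= RQCF_ACT_SKIP;
}
```

In this model, taking the lock and then reaching a clock reader such as
enqueue_top_rt_rq() without an update_rq_clock() call reproduces the warning
condition; updating the clock first, as this patch does before
set_rq_offline(), satisfies the check.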

Steps to reproduce:
1. Enable CONFIG_SMP and CONFIG_CPU_FREQ when compiling the kernel
2. echo 1 > /sys/kernel/debug/clear_warn_once
3. cpupower -c all frequency-set -g powersave
4. Run some rt tasks, e.g. create 5*n rt (100% running) tasks on a
system with n CPUs
5. Offline CPUs one by one until the warning is triggered

Signed-off-by: Hao Jia <[email protected]>
---
kernel/sched/topology.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6682535e37c8..b89497696880 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -487,15 +487,17 @@ static void free_rootdomain(struct rcu_head *rcu)
void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
struct root_domain *old_rd = NULL;
- unsigned long flags;
+ struct rq_flags rf;

- raw_spin_rq_lock_irqsave(rq, flags);
+ rq_lock_irqsave(rq, &rf);

if (rq->rd) {
old_rd = rq->rd;

- if (cpumask_test_cpu(rq->cpu, old_rd->online))
+ if (cpumask_test_cpu(rq->cpu, old_rd->online)) {
+ update_rq_clock(rq);
set_rq_offline(rq);
+ }

cpumask_clear_cpu(rq->cpu, old_rd->span);

@@ -515,7 +517,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);

- raw_spin_rq_unlock_irqrestore(rq, flags);
+ rq_unlock_irqrestore(rq, &rf);

if (old_rd)
call_rcu(&old_rd->rcu, free_rootdomain);
--
2.37.0


2023-05-10 09:21:46

by Hao Jia

Subject: [PATCH v2 2/2] sched/core: Avoid double calling update_rq_clock()

Several double rq clock update warnings are triggered.
------------[ cut here ]------------
rq->clock_update_flags & RQCF_UPDATED
WARNING: CPU: 17 PID: 138 at kernel/sched/core.c:741
update_rq_clock+0xaf/0x180
Call Trace:
<TASK>
__balance_push_cpu_stop+0x146/0x180
? migration_cpu_stop+0x2a0/0x2a0
cpu_stopper_thread+0xa3/0x140
smpboot_thread_fn+0x14f/0x210
? sort_range+0x20/0x20
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30

------------[ cut here ]------------
rq->clock_update_flags & RQCF_UPDATED
WARNING: CPU: 54 PID: 0 at kernel/sched/core.c:741
update_rq_clock+0xaf/0x180
Call Trace:
<TASK>
unthrottle_cfs_rq+0x4b/0x300
__cfsb_csd_unthrottle+0xe0/0x100
__flush_smp_call_function_queue+0xaf/0x1d0
flush_smp_call_function_queue+0x49/0x90
do_idle+0x17c/0x270
cpu_startup_entry+0x19/0x20
start_secondary+0xfa/0x120
secondary_startup_64_no_verify+0xce/0xdb

------------[ cut here ]------------
rq->clock_update_flags & RQCF_UPDATED
WARNING: CPU: 0 PID: 3323 at kernel/sched/core.c:741
update_rq_clock+0xaf/0x180
Call Trace:
<TASK>
unthrottle_cfs_rq+0x4b/0x300
rq_offline_fair+0x89/0x90
set_rq_offline.part.118+0x28/0x60
rq_attach_root+0xc4/0xd0
cpu_attach_domain+0x3dc/0x7f0
partition_sched_domains_locked+0x2a5/0x3c0
rebuild_sched_domains_locked+0x477/0x830
rebuild_sched_domains+0x1b/0x30
cpuset_hotplug_workfn+0x2ca/0xc90
? balance_push+0x56/0xf0
? _raw_spin_unlock+0x15/0x30
? finish_task_switch+0x98/0x2f0
? __switch_to+0x291/0x410
? __schedule+0x65e/0x1310
process_one_work+0x1bc/0x3d0
worker_thread+0x4c/0x380
? preempt_count_add+0x92/0xa0
? rescuer_thread+0x310/0x310
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30

For the __balance_push_cpu_stop() case, we remove update_rq_clock() from
__migrate_task() to avoid updating the rq clock twice. And to avoid
missing an rq clock update, we add an update_rq_clock() call before
migration_cpu_stop() calls __migrate_task().

For the unthrottle_cfs_rq() case, we use rq_clock_start_loop_update() to
prevent multiple calls to update_rq_clock() in unthrottle_cfs_rq().

Note that the rq clock has been updated before the set_rq_offline()
function runs, so we don't need to call update_rq_clock() in
unthrottle_offline_cfs_rqs().
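
The effect of the loop helpers can be sketched with a small user-space
model. The names and flag values mirror the kernel's, but this is an
illustrative sketch of the mechanism, not kernel code:

```c
#include <assert.h>

/*
 * Model of how RQCF_ACT_SKIP suppresses the "rq->clock_update_flags &
 * RQCF_UPDATED" double-update warning across a list iteration.
 */
#define RQCF_ACT_SKIP	0x02
#define RQCF_UPDATED	0x04

struct rq {
	unsigned int clock_update_flags;
	unsigned long long clock;
	int warned;		/* stands in for SCHED_WARN_ON() firing */
};

static void update_rq_clock(struct rq *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;		/* a loop-wide update was already done */
	if (rq->clock_update_flags & RQCF_UPDATED)
		rq->warned = 1;	/* double update detected */
	rq->clock++;
	rq->clock_update_flags |= RQCF_UPDATED;
}

static void rq_clock_start_loop_update(struct rq *rq)
{
	rq->clock_update_flags |= RQCF_ACT_SKIP;
}

static void rq_clock_stop_loop_update(struct rq *rq)
{
	rq->clock_update_flags &= ~RQCF_ACT_SKIP;
}
```

One update_rq_clock() before the loop, bracketed by the start/stop
helpers, turns the per-cfs_rq update_rq_clock() calls inside
unthrottle_cfs_rq() into no-ops instead of warnings.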

Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Hao Jia <[email protected]>
---
kernel/sched/core.c | 7 ++++---
kernel/sched/fair.c | 16 ++++++++++++++++
kernel/sched/sched.h | 21 +++++++++++++++++++++
3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..2bcf6a8e71fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2398,7 +2398,6 @@ static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
if (!is_cpu_allowed(p, dest_cpu))
return rq;

- update_rq_clock(rq);
rq = move_queued_task(rq, rf, p, dest_cpu);

return rq;
@@ -2456,10 +2455,12 @@ static int migration_cpu_stop(void *data)
goto out;
}

- if (task_on_rq_queued(p))
+ if (task_on_rq_queued(p)) {
+ update_rq_clock(rq);
rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
- else
+ } else {
p->wake_cpu = arg->dest_cpu;
+ }

/*
* XXX __migrate_task() can fail, at which point we might end
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..9ec7bde620ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5575,6 +5575,13 @@ static void __cfsb_csd_unthrottle(void *arg)
struct rq_flags rf;

rq_lock(rq, &rf);
+ /*
+ * Iterating over the list can trigger several call to
+ * update_rq_clock() in unthrottle_cfs_rq().
+ * Do it once and skip the potential next ones.
+ */
+ update_rq_clock(rq);
+ rq_clock_start_loop_update(rq);

/*
* Since we hold rq lock we're safe from concurrent manipulation of
@@ -5595,6 +5602,7 @@ static void __cfsb_csd_unthrottle(void *arg)

rcu_read_unlock();

+ rq_clock_stop_loop_update(rq);
rq_unlock(rq, &rf);
}

@@ -6114,6 +6122,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
struct task_group *tg;

lockdep_assert_rq_held(rq);
+ /*
+ * The rq clock has already been updated before the
+ * set_rq_offline() runs, so we should skip updating
+ * the rq clock again in unthrottle_cfs_rq().
+ */
+ rq_clock_start_loop_update(rq);

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6137,6 +6151,8 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
unthrottle_cfs_rq(cfs_rq);
}
rcu_read_unlock();
+
+ rq_clock_stop_loop_update(rq);
}

#else /* CONFIG_CFS_BANDWIDTH */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..9c712f29e5a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1536,6 +1536,27 @@ static inline void rq_clock_skip_update(struct rq *rq)
rq->clock_update_flags |= RQCF_REQ_SKIP;
}

+/*
+ * During cpu offlining and rq wide unthrottling, we can trigger
+ * an update_rq_clock() for several cfs and rt runqueues (Typically
+ * when using list_for_each_entry_*)
+ * rq_clock_start_loop_update() can be called after updating the clock
+ * once and before iterating over the list to prevent multiple update.
+ * After the iterative traversal, we need to call rq_clock_stop_loop_update()
+ * to restore rq->clock_update_flags.
+ */
+static inline void rq_clock_start_loop_update(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ rq->clock_update_flags |= RQCF_ACT_SKIP;
+}
+
+static inline void rq_clock_stop_loop_update(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ rq->clock_update_flags &= ~RQCF_ACT_SKIP;
+}
+
/*
* See rt task throttling, which is the only time a skip
* request is canceled.
--
2.37.0


2023-05-10 14:48:23

by Peter Zijlstra

Subject: Re: [PATCH v2 2/2] sched/core: Avoid double calling update_rq_clock()

On Wed, May 10, 2023 at 04:34:50PM +0800, Hao Jia wrote:
> There are some double rq clock update warnings are triggered.
> ------------[ cut here ]------------
> rq->clock_update_flags & RQCF_UPDATED
> WARNING: CPU: 17 PID: 138 at kernel/sched/core.c:741
> update_rq_clock+0xaf/0x180
> Call Trace:
> <TASK>
> __balance_push_cpu_stop+0x146/0x180
> ? migration_cpu_stop+0x2a0/0x2a0
> cpu_stopper_thread+0xa3/0x140
> smpboot_thread_fn+0x14f/0x210
> ? sort_range+0x20/0x20
> kthread+0xe6/0x110
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
>
> ------------[ cut here ]------------
> rq->clock_update_flags & RQCF_UPDATED
> WARNING: CPU: 54 PID: 0 at kernel/sched/core.c:741
> update_rq_clock+0xaf/0x180
> Call Trace:
> <TASK>
> unthrottle_cfs_rq+0x4b/0x300
> __cfsb_csd_unthrottle+0xe0/0x100
> __flush_smp_call_function_queue+0xaf/0x1d0
> flush_smp_call_function_queue+0x49/0x90
> do_idle+0x17c/0x270
> cpu_startup_entry+0x19/0x20
> start_secondary+0xfa/0x120
> secondary_startup_64_no_verify+0xce/0xdb
>
> ------------[ cut here ]------------
> rq->clock_update_flags & RQCF_UPDATED
> WARNING: CPU: 0 PID: 3323 at kernel/sched/core.c:741
> update_rq_clock+0xaf/0x180
> Call Trace:
> <TASK>
> unthrottle_cfs_rq+0x4b/0x300
> rq_offline_fair+0x89/0x90
> set_rq_offline.part.118+0x28/0x60
> rq_attach_root+0xc4/0xd0
> cpu_attach_domain+0x3dc/0x7f0
> partition_sched_domains_locked+0x2a5/0x3c0
> rebuild_sched_domains_locked+0x477/0x830
> rebuild_sched_domains+0x1b/0x30
> cpuset_hotplug_workfn+0x2ca/0xc90
> ? balance_push+0x56/0xf0
> ? _raw_spin_unlock+0x15/0x30
> ? finish_task_switch+0x98/0x2f0
> ? __switch_to+0x291/0x410
> ? __schedule+0x65e/0x1310
> process_one_work+0x1bc/0x3d0
> worker_thread+0x4c/0x380
> ? preempt_count_add+0x92/0xa0
> ? rescuer_thread+0x310/0x310
> kthread+0xe6/0x110
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
>
> For the __balance_push_cpu_stop() case, we remove update_rq_clock() from
> the __migrate_task() function to avoid double updating the rq clock.
> And in order to avoid missing rq clock update, add update_rq_clock()
> call before migration_cpu_stop() calls __migrate_task().
>
> For the unthrottle_cfs_rq() case, we use rq_clock_start_loop_update() to
> prevent multiple calls to update_rq_clock() in unthrottle_cfs_rq().
>
> Note that the rq clock has been updated before the set_rq_offline()
> function runs, so we don't need to call update_rq_clock() in
> unthrottle_offline_cfs_rqs().

This reads like 3 separate issues -- in which case this ought to be 3
separate patches.


> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..9c712f29e5a4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1536,6 +1536,27 @@ static inline void rq_clock_skip_update(struct rq *rq)
> rq->clock_update_flags |= RQCF_REQ_SKIP;
> }
>
> +/*
> + * During cpu offlining and rq wide unthrottling, we can trigger
> + * an update_rq_clock() for several cfs and rt runqueues (Typically
> + * when using list_for_each_entry_*)
> + * rq_clock_start_loop_update() can be called after updating the clock
> + * once and before iterating over the list to prevent multiple update.
> + * After the iterative traversal, we need to call rq_clock_stop_loop_update()
> + * to restore rq->clock_update_flags.
^^^^^^^
> + */
> +static inline void rq_clock_start_loop_update(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> + rq->clock_update_flags |= RQCF_ACT_SKIP;
> +}
> +
> +static inline void rq_clock_stop_loop_update(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> + rq->clock_update_flags &= ~RQCF_ACT_SKIP;
> +}

This does not restore the flag, it clears the flag.
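
For contrast, a variant that genuinely restored the previous state would
have to save it first, along these lines (hypothetical sketch, not part
of the patch):

```c
#include <assert.h>

#define RQCF_ACT_SKIP	0x02

struct rq {
	unsigned int clock_update_flags;
};

/* Save and return the prior RQCF_ACT_SKIP state while setting the flag. */
static unsigned int rq_clock_start_loop_update(struct rq *rq)
{
	unsigned int prev = rq->clock_update_flags & RQCF_ACT_SKIP;

	rq->clock_update_flags |= RQCF_ACT_SKIP;
	return prev;
}

/* Restore the saved state instead of unconditionally clearing the flag. */
static void rq_clock_stop_loop_update(struct rq *rq, unsigned int prev)
{
	rq->clock_update_flags =
		(rq->clock_update_flags & ~RQCF_ACT_SKIP) | prev;
}
```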

2023-05-10 15:16:18

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] sched/core: Fixed missing rq clock update before calling set_rq_offline()

On Wed, May 10, 2023 at 04:34:49PM +0800, Hao Jia wrote:
> This is triggered during cpu offline when CONFIG_CPU_FREQ is enabled:

I'm failing to see the relevance to CPU_FREQ...

> ------------[ cut here ]------------
> rq->clock_update_flags < RQCF_ACT_SKIP
> WARNING: CPU: 24 PID: 754 at kernel/sched/sched.h:1496
> enqueue_top_rt_rq+0x139/0x160
> Call Trace:
> <TASK>
> ? intel_pstate_update_util+0x3b0/0x3b0

AFAICT this is __disable_runtime() re-queueing throttled RT tasks

Your unwind is dodgy, what CONFIG_UNWINDER do you have?

> rq_offline_rt+0x1b7/0x250
> set_rq_offline.part.120+0x28/0x60
> rq_attach_root+0xc4/0xd0
> cpu_attach_domain+0x3dc/0x7f0
> ? __schedule+0x65e/0x1310
> partition_sched_domains_locked+0x2a5/0x3c0
> rebuild_sched_domains_locked+0x477/0x830
> ? percpu_rwsem_wait+0x140/0x140
> rebuild_sched_domains+0x1b/0x30
> cpuset_hotplug_workfn+0x2ca/0xc90
> ? balance_push+0x56/0x120
> ? _raw_spin_unlock+0x15/0x30
> ? finish_task_switch+0x98/0x2f0
> ? __switch_to+0x116/0x410
> ? __schedule+0x65e/0x1310 ? internal_add_timer+0x42/0x60
> ? _raw_spin_unlock_irqrestore+0x23/0x40
> ? add_timer_on+0xd5/0x130
> process_one_work+0x1bc/0x3d0
> worker_thread+0x4c/0x380
> ? preempt_count_add+0x56/0xa0
> ? rescuer_thread+0x310/0x310
> kthread+0xe6/0x110
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
>
> Before calling set_rq_offline() we need to update the rq clock to avoid
> using the old rq clock, and use rq_lock_irqsave()/rq_unlock_irqrestore()
> to replace raw_spin_rq_lock_irqsave()/raw_spin_rq_unlock_irqrestore() to
> ensure that rq->clock_update_flags are cleared before updating the rq
> clock.
>
> Steps to reproduce:
> 1. Enable CONFIG_SMP and CONFIG_CPU_FREQ when compiling the kernel
> 2. echo 1 > /sys/kernel/debug/clear_warn_once
> 3. cpupower -c all frequency-set -g powersave
> 4. Run some rt tasks e.g. Create 5*n rt (100% running) tasks (on a
> system with n CPUs)
> 5. Offline cpu one by one until the warninng is triggered
>
> Signed-off-by: Hao Jia <[email protected]>
> ---
> kernel/sched/topology.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6682535e37c8..b89497696880 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -487,15 +487,17 @@ static void free_rootdomain(struct rcu_head *rcu)
> void rq_attach_root(struct rq *rq, struct root_domain *rd)
> {
> struct root_domain *old_rd = NULL;
> - unsigned long flags;
> + struct rq_flags rf;
>
> - raw_spin_rq_lock_irqsave(rq, flags);
> + rq_lock_irqsave(rq, &rf);
>
> if (rq->rd) {
> old_rd = rq->rd;
>
> - if (cpumask_test_cpu(rq->cpu, old_rd->online))
> + if (cpumask_test_cpu(rq->cpu, old_rd->online)) {
> + update_rq_clock(rq);
> set_rq_offline(rq);
> + }
>
> cpumask_clear_cpu(rq->cpu, old_rd->span);
>
> @@ -515,7 +517,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
> if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
> set_rq_online(rq);
>
> - raw_spin_rq_unlock_irqrestore(rq, flags);
> + rq_unlock_irqrestore(rq, &rf);
>
> if (old_rd)
> call_rcu(&old_rd->rcu, free_rootdomain);

The patch itself is good though; just the Changelog has me
confused.

2023-05-11 04:07:29

by Hao Jia

Subject: Re: [External] Re: [PATCH v2 1/2] sched/core: Fixed missing rq clock update before calling set_rq_offline()



On 2023/5/10 Peter Zijlstra wrote:
> On Wed, May 10, 2023 at 04:34:49PM +0800, Hao Jia wrote:
>> This is triggered during cpu offline when CONFIG_CPU_FREQ is enabled:
>
> I'm failing to see the relevance to CPU_FREQ...

Sorry for the confusion; here is a more detailed function call graph:


rq_offline_rt()
__disable_runtime()
sched_rt_rq_enqueue()
enqueue_top_rt_rq()
cpufreq_update_util() <-- depends on CONFIG_CPU_FREQ
data->func(data, *rq_clock(rq)*, flags);
intel_pstate_update_util() <-- on my machine, I have
set cpufreq to powersave

Perhaps we need to add the above information to the commit Changelog?


>
>> ------------[ cut here ]------------
>> rq->clock_update_flags < RQCF_ACT_SKIP
>> WARNING: CPU: 24 PID: 754 at kernel/sched/sched.h:1496
>> enqueue_top_rt_rq+0x139/0x160
>> Call Trace:
>> <TASK>
>> ? intel_pstate_update_util+0x3b0/0x3b0
>
> AFAICT this is __disable_runtime() re-queueing throttled RT tasks
>
> Your unwind is dodgy, what CONFIG_UNWINDER do you have?

CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set



I tried another configuration:

# CONFIG_UNWINDER_ORC is not set
CONFIG_UNWINDER_FRAME_POINTER=y

The log is as follows:

[ 217.332774] ------------[ cut here ]------------
[ 217.332780] rq->clock_update_flags < RQCF_ACT_SKIP
[ 217.332788] WARNING: CPU: 0 PID: 817 at kernel/sched/sched.h:1496
enqueue_top_rt_rq+0x135/0x150

[ 217.333049] CPU: 0 PID: 817 Comm: kworker/53:1 Kdump: loaded Tainted:
G S E 6.4.0-rc1.bm.1-amd64+ #73
[ 217.333056] Hardware name: FOXCONN R-5111/GROOT, BIOS IC1B123F 07/16/2020
[ 217.333060] Workqueue: events cpuset_hotplug_workfn
[ 217.333070] RIP: 0010:enqueue_top_rt_rq+0x135/0x150
[ 217.333078] Code: 98 00 00 00 01 00 00 00 eb 83 80 3d d0 b5 69 01 00
0f 85 3f ff ff ff 48 c7 c7 e0 39 2f 9c c6 05 bc b5 69 01 01 e8 0b 7f fa
ff <0f> 0b e9 25 ff ff ff e8 7f 74 07 00 44 8b a3 30 08 00 00 e9 54 ff
[ 217.333084] RSP: 0018:ffffa7c15b94fac0 EFLAGS: 00010086
[ 217.333090] RAX: 0000000000000000 RBX: ffff93ad006ef580 RCX:
0000000000000003
[ 217.333094] RDX: 0000000000000003 RSI: ffffffff9c32eab9 RDI:
00000000ffffffff
[ 217.333097] RBP: ffffa7c15b94fad8 R08: 0000000000000000 R09:
c0000000ffffbfff
[ 217.333101] R10: ffffa7c15b94f9b8 R11: ffffa7c15b94f908 R12:
ffff934f52166608
[ 217.333104] R13: ffffffff9baff3c0 R14: 0000000000000246 R15:
ffff93ad006ef340
[ 217.333107] FS: 0000000000000000(0000) GS:ffff93ad00600000(0000)
knlGS:0000000000000000
[ 217.333112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 217.333115] CR2: 00007f7d2ffa6ba0 CR3: 000000604df16003 CR4:
00000000007706f0
[ 217.333119] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 217.333122] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 217.333125] PKRU: 55555554
[ 217.333128] Call Trace:
[ 217.333131] <TASK>
[ 217.333136] rq_offline_rt+0x1b8/0x250
[ 217.333146] set_rq_offline.part.121+0x2d/0x70
[ 217.333159] set_rq_offline+0x18/0x20
[ 217.333165] rq_attach_root+0xcf/0xe0
[ 217.333176] cpu_attach_domain+0x3cb/0x7d0
[ 217.333186] partition_sched_domains_locked+0x2a4/0x3b0
[ 217.333197] rebuild_sched_domains_locked+0x47e/0x800
[ 217.333202] ? preempt_schedule_common+0x37/0x90
[ 217.333213] rebuild_sched_domains+0x1f/0x40
[ 217.333218] cpuset_hotplug_workfn+0x2df/0xd10
[ 217.333225] ? debug_smp_processor_id+0x17/0x20
[ 217.333233] ? _raw_spin_unlock+0x19/0x30
[ 217.333242] ? raw_spin_rq_unlock+0x10/0x30
[ 217.333247] ? finish_task_switch+0x98/0x2f0
[ 217.333251] ? __switch_to+0x12b/0x440
[ 217.333261] ? __switch_to_asm+0x34/0x60
[ 217.333270] ? __schedule+0x600/0x14e0
[ 217.333276] ? __queue_delayed_work+0x8b/0x90
[ 217.333285] ? queue_delayed_work_on+0x3a/0x50
[ 217.333291] process_one_work+0x1c8/0x3e0
[ 217.333298] worker_thread+0x4d/0x390
[ 217.333305] ? rescuer_thread+0x350/0x350
[ 217.333311] kthread+0xfe/0x130
[ 217.333316] ? kthread_complete_and_exit+0x20/0x20
[ 217.333321] ret_from_fork+0x1f/0x30
[ 217.333331] </TASK>
[ 217.333333] ---[ end trace 0000000000000000 ]---

Tests were done on an Intel Xeon(R) Platinum 8260 [email protected] machine
with 2 NUMA nodes, each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.


>
>> rq_offline_rt+0x1b7/0x250
>> set_rq_offline.part.120+0x28/0x60
>> rq_attach_root+0xc4/0xd0
>> cpu_attach_domain+0x3dc/0x7f0
>> ? __schedule+0x65e/0x1310
>> partition_sched_domains_locked+0x2a5/0x3c0
>> rebuild_sched_domains_locked+0x477/0x830
>> ? percpu_rwsem_wait+0x140/0x140
>> rebuild_sched_domains+0x1b/0x30
>> cpuset_hotplug_workfn+0x2ca/0xc90
>> ? balance_push+0x56/0x120
>> ? _raw_spin_unlock+0x15/0x30
>> ? finish_task_switch+0x98/0x2f0
>> ? __switch_to+0x116/0x410
>> ? __schedule+0x65e/0x1310 ? internal_add_timer+0x42/0x60
>> ? _raw_spin_unlock_irqrestore+0x23/0x40
>> ? add_timer_on+0xd5/0x130
>> process_one_work+0x1bc/0x3d0
>> worker_thread+0x4c/0x380
>> ? preempt_count_add+0x56/0xa0
>> ? rescuer_thread+0x310/0x310
>> kthread+0xe6/0x110
>> ? kthread_complete_and_exit+0x20/0x20
>> ret_from_fork+0x1f/0x30
>>
>> Before calling set_rq_offline() we need to update the rq clock to avoid
>> using the old rq clock, and use rq_lock_irqsave()/rq_unlock_irqrestore()
>> to replace raw_spin_rq_lock_irqsave()/raw_spin_rq_unlock_irqrestore() to
>> ensure that rq->clock_update_flags are cleared before updating the rq
>> clock.
>>
>> Steps to reproduce:
>> 1. Enable CONFIG_SMP and CONFIG_CPU_FREQ when compiling the kernel
>> 2. echo 1 > /sys/kernel/debug/clear_warn_once
>> 3. cpupower -c all frequency-set -g powersave
>> 4. Run some rt tasks e.g. Create 5*n rt (100% running) tasks (on a
>> system with n CPUs)
>> 5. Offline cpu one by one until the warninng is triggered
>>
>> Signed-off-by: Hao Jia <[email protected]>
>> ---
>> kernel/sched/topology.c | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 6682535e37c8..b89497696880 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -487,15 +487,17 @@ static void free_rootdomain(struct rcu_head *rcu)
>> void rq_attach_root(struct rq *rq, struct root_domain *rd)
>> {
>> struct root_domain *old_rd = NULL;
>> - unsigned long flags;
>> + struct rq_flags rf;
>>
>> - raw_spin_rq_lock_irqsave(rq, flags);
>> + rq_lock_irqsave(rq, &rf);
>>
>> if (rq->rd) {
>> old_rd = rq->rd;
>>
>> - if (cpumask_test_cpu(rq->cpu, old_rd->online))
>> + if (cpumask_test_cpu(rq->cpu, old_rd->online)) {
>> + update_rq_clock(rq);
>> set_rq_offline(rq);
>> + }
>>
>> cpumask_clear_cpu(rq->cpu, old_rd->span);
>>
>> @@ -515,7 +517,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
>> if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
>> set_rq_online(rq);
>>
>> - raw_spin_rq_unlock_irqrestore(rq, flags);
>> + rq_unlock_irqrestore(rq, &rf);
>>
>> if (old_rd)
>> call_rcu(&old_rd->rcu, free_rootdomain);
>
> The patch itself is good though; just the Changelog has me
> confused.

Sorry for not explaining clearly and confusing you.

Thanks,
Hao

2023-05-11 04:13:08

by Hao Jia

Subject: Re: [External] Re: [PATCH v2 2/2] sched/core: Avoid double calling update_rq_clock()



On 2023/5/10 Peter Zijlstra wrote:
> On Wed, May 10, 2023 at 04:34:50PM +0800, Hao Jia wrote:

>> For the __balance_push_cpu_stop() case, we remove update_rq_clock() from
>> the __migrate_task() function to avoid double updating the rq clock.
>> And in order to avoid missing rq clock update, add update_rq_clock()
>> call before migration_cpu_stop() calls __migrate_task().
>>
>> For the unthrottle_cfs_rq() case, we use rq_clock_start_loop_update() to
>> prevent multiple calls to update_rq_clock() in unthrottle_cfs_rq().
>>
>> Note that the rq clock has been updated before the set_rq_offline()
>> function runs, so we don't need to call update_rq_clock() in
>> unthrottle_offline_cfs_rqs().
>
> This reads like 3 separate issues -- in which case this ought to be 3
> separate patches.

I will do it in the next version.

Thanks,
Hao

>
>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index ec7b3e0a2b20..9c712f29e5a4 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1536,6 +1536,27 @@ static inline void rq_clock_skip_update(struct rq *rq)
>> rq->clock_update_flags |= RQCF_REQ_SKIP;
>> }
>>
>> +/*
>> + * During cpu offlining and rq wide unthrottling, we can trigger
>> + * an update_rq_clock() for several cfs and rt runqueues (Typically
>> + * when using list_for_each_entry_*)
>> + * rq_clock_start_loop_update() can be called after updating the clock
>> + * once and before iterating over the list to prevent multiple update.
>> + * After the iterative traversal, we need to call rq_clock_stop_loop_update()
>> + * to restore rq->clock_update_flags.
> ^^^^^^^
>> + */
>> +static inline void rq_clock_start_loop_update(struct rq *rq)
>> +{
>> + lockdep_assert_rq_held(rq);
>> + rq->clock_update_flags |= RQCF_ACT_SKIP;
>> +}
>> +
>> +static inline void rq_clock_stop_loop_update(struct rq *rq)
>> +{
>> + lockdep_assert_rq_held(rq);
>> + rq->clock_update_flags &= ~RQCF_ACT_SKIP;
>> +}
>
> This does not restore the flag, it clears the flag.

Thanks for your review.
I will also modify it in the next version.

Thanks,
Hao