2020-12-10 08:11:21

by kernel test robot

Subject: [sched/hotplug] 2558aacff8: will-it-scale.per_thread_ops -1.6% regression


Greetings,

FYI, we noticed a -1.6% regression of will-it-scale.per_thread_ops due to commit:


commit: 2558aacff8586699bcd248b406febb28b0a25de2 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/migrate-disable


in testcase: will-it-scale
on test machine: 144 threads Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory
with following parameters:

nr_task: 100%
mode: thread
test: sched_yield
cpufreq_governor: performance
ucode: 0x700001e

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a thread-based variant of each test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
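
For reference, the sched_yield testcase's hot loop is essentially the snippet below -- a simplified sketch in the will-it-scale style, not the exact upstream source (see the test-url above for that):

#include <sched.h>

/* Simplified sketch: each worker (a process or a thread, depending on
 * "mode") spins on the syscall; the harness samples *iterations
 * periodically to derive per_thread_ops. */
void testcase(unsigned long long *iterations, unsigned long nr)
{
	(void)nr;
	for (;;) {
		sched_yield();		/* the measured operation */
		(*iterations)++;
	}
}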



If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-cpl-4sp1/sched_yield/will-it-scale/0x700001e

commit:
565790d28b ("sched: Fix balance_callback()")
2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")

565790d28b1e33ee 2558aacff8586699bcd248b406f
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
1:4 0% 1:4 perf-profile.children.cycles-pp.error_entry
0:4 0% 0:4 perf-profile.self.cycles-pp.error_entry
%stddev %change %stddev
\ | \
4.011e+08 -1.6% 3.945e+08 will-it-scale.144.threads
2785455 -1.6% 2739520 will-it-scale.per_thread_ops
4.011e+08 -1.6% 3.945e+08 will-it-scale.workload
12.05 +2.1 14.18 mpstat.cpu.all.usr%
1087711 ± 75% -79.0% 228885 ± 7% numa-numastat.node1.local_node
1126029 ± 74% -74.5% 286894 ± 6% numa-numastat.node1.numa_hit
33836 -2.3% 33042 proc-vmstat.nr_slab_reclaimable
74433 -1.5% 73345 proc-vmstat.nr_slab_unreclaimable
86.25 -2.3% 84.25 vmstat.cpu.sy
11.75 ± 3% +17.0% 13.75 ± 3% vmstat.cpu.us
333551 ± 17% -21.7% 261115 ± 5% vmstat.system.cs
329071 ± 3% -15.4% 278535 ± 4% sched_debug.cfs_rq:/.spread0.avg
472614 ± 2% -11.0% 420678 ± 2% sched_debug.cfs_rq:/.spread0.max
17597663 ± 17% -28.5% 12582107 ± 16% sched_debug.cpu.nr_switches.max
1897476 ± 17% -28.4% 1359264 ± 14% sched_debug.cpu.nr_switches.stddev
5628 ± 8% -10.9% 5012 ± 3% slabinfo.files_cache.active_objs
5628 ± 8% -10.9% 5012 ± 3% slabinfo.files_cache.num_objs
3613 ± 2% -10.9% 3219 slabinfo.kmalloc-rcl-512.active_objs
3644 ± 2% -10.9% 3248 slabinfo.kmalloc-rcl-512.num_objs
3967 ± 4% -8.3% 3638 ± 2% slabinfo.sock_inode_cache.active_objs
3967 ± 4% -8.3% 3638 ± 2% slabinfo.sock_inode_cache.num_objs
0.02 ± 9% -14.5% 0.02 ± 2% perf-sched.sch_delay.avg.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
14.28 ± 38% +48.9% 21.26 ± 24% perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
0.02 ± 24% +38.2% 0.03 ± 14% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
0.04 ± 13% -22.8% 0.03 ± 14% perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
47.71 ± 30% +41.6% 67.54 ± 11% perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
3.20 ± 33% -81.5% 0.59 ± 91% perf-sched.wait_time.avg.ms.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
33.43 ± 27% +38.4% 46.27 ± 8% perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
0.05 ± 43% -68.9% 0.02 ± 63% perf-sched.wait_time.avg.ms.wait_for_partner.fifo_open.do_dentry_open.path_openat
8.23 ± 10% -57.9% 3.47 ± 97% perf-sched.wait_time.max.ms.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
35211 ±169% -99.6% 129.50 ± 96% numa-vmstat.node1.nr_active_anon
8060 ± 17% -35.0% 5236 ± 31% numa-vmstat.node1.nr_slab_reclaimable
35211 ±169% -99.6% 129.50 ± 96% numa-vmstat.node1.nr_zone_active_anon
1053733 ± 53% -52.7% 498416 ± 7% numa-vmstat.node1.numa_hit
946160 ± 58% -62.7% 352475 ± 12% numa-vmstat.node1.numa_local
107572 ± 23% +35.7% 145940 ± 5% numa-vmstat.node1.numa_other
5914 ± 23% -28.5% 4226 ± 10% numa-vmstat.node2.nr_slab_reclaimable
18204 ± 3% -14.7% 15522 ± 12% numa-vmstat.node2.nr_slab_unreclaimable
629428 ± 9% -14.2% 540085 ± 7% numa-vmstat.node2.numa_hit
17302 ± 10% +26.0% 21807 ± 7% numa-vmstat.node3.nr_slab_unreclaimable
140785 ±169% -99.6% 520.75 ± 95% numa-meminfo.node1.Active
140785 ±169% -99.6% 520.75 ± 95% numa-meminfo.node1.Active(anon)
32241 ± 17% -35.0% 20948 ± 31% numa-meminfo.node1.KReclaimable
32241 ± 17% -35.0% 20948 ± 31% numa-meminfo.node1.SReclaimable
101007 ± 5% -15.7% 85162 ± 13% numa-meminfo.node1.Slab
23657 ± 23% -28.5% 16906 ± 10% numa-meminfo.node2.KReclaimable
23657 ± 23% -28.5% 16906 ± 10% numa-meminfo.node2.SReclaimable
72823 ± 3% -14.7% 62089 ± 12% numa-meminfo.node2.SUnreclaim
96481 ± 3% -18.1% 78996 ± 12% numa-meminfo.node2.Slab
69210 ± 10% +26.0% 87229 ± 7% numa-meminfo.node3.SUnreclaim
110579 ± 9% +26.7% 140158 ± 10% numa-meminfo.node3.Slab
388.75 ± 74% +1147.8% 4851 ±124% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
1540 ± 69% -78.2% 335.75 ± 51% interrupts.34:PCI-MSI.524292-edge.eth0-TxRx-3
388.75 ± 74% +1147.8% 4851 ±124% interrupts.CPU11.33:PCI-MSI.524291-edge.eth0-TxRx-2
307.50 +66.7% 512.50 ± 50% interrupts.CPU111.RES:Rescheduling_interrupts
1540 ± 69% -78.2% 335.75 ± 51% interrupts.CPU12.34:PCI-MSI.524292-edge.eth0-TxRx-3
350.50 ± 8% -10.3% 314.50 ± 2% interrupts.CPU122.RES:Rescheduling_interrupts
424.50 ± 24% -18.7% 345.00 ± 14% interrupts.CPU128.RES:Rescheduling_interrupts
8496 -50.1% 4241 interrupts.CPU29.NMI:Non-maskable_interrupts
8496 -50.1% 4241 interrupts.CPU29.PMI:Performance_monitoring_interrupts
314.25 +8.7% 341.50 ± 4% interrupts.CPU29.RES:Rescheduling_interrupts
8496 -50.1% 4242 interrupts.CPU30.NMI:Non-maskable_interrupts
8496 -50.1% 4242 interrupts.CPU30.PMI:Performance_monitoring_interrupts
311.50 +13.2% 352.75 ± 8% interrupts.CPU7.RES:Rescheduling_interrupts
21144 ± 15% -25.0% 15858 ± 24% interrupts.CPU72.CAL:Function_call_interrupts
317.75 +39.2% 442.25 ± 32% interrupts.CPU82.RES:Rescheduling_interrupts
8.557e+10 -1.8% 8.399e+10 perf-stat.i.branch-instructions
0.43 +0.4 0.87 perf-stat.i.branch-miss-rate%
3.479e+08 +106.5% 7.186e+08 perf-stat.i.branch-misses
333383 ± 17% -22.1% 259849 ± 6% perf-stat.i.context-switches
1.02 +2.4% 1.05 perf-stat.i.cpi
1.268e+11 -1.9% 1.243e+11 perf-stat.i.dTLB-loads
7.506e+10 -1.9% 7.363e+10 perf-stat.i.dTLB-stores
4.26e+08 ± 2% -31.3% 2.925e+08 perf-stat.i.iTLB-load-misses
538538 ± 36% -79.5% 110207 ± 16% perf-stat.i.iTLB-loads
3.983e+11 -1.9% 3.908e+11 perf-stat.i.instructions
946.16 ± 3% +43.4% 1356 perf-stat.i.instructions-per-iTLB-miss
0.99 -2.4% 0.97 perf-stat.i.ipc
1.22 ± 3% +30.8% 1.60 ± 3% perf-stat.i.metric.K/sec
1996 -1.9% 1958 perf-stat.i.metric.M/sec
0.41 +0.4 0.86 perf-stat.overall.branch-miss-rate%
1.01 +2.5% 1.03 perf-stat.overall.cpi
935.95 ± 2% +42.8% 1336 perf-stat.overall.instructions-per-iTLB-miss
0.99 -2.4% 0.97 perf-stat.overall.ipc
8.527e+10 -1.8% 8.37e+10 perf-stat.ps.branch-instructions
3.467e+08 +106.5% 7.161e+08 perf-stat.ps.branch-misses
334637 ± 17% -21.5% 262794 ± 5% perf-stat.ps.context-switches
1.264e+11 -1.9% 1.239e+11 perf-stat.ps.dTLB-loads
7.48e+10 -1.9% 7.338e+10 perf-stat.ps.dTLB-stores
4.244e+08 ± 2% -31.3% 2.915e+08 perf-stat.ps.iTLB-load-misses
539519 ± 36% -79.5% 110644 ± 16% perf-stat.ps.iTLB-loads
3.969e+11 -1.9% 3.895e+11 perf-stat.ps.instructions
1.2e+14 -2.0% 1.176e+14 perf-stat.total.instructions
0.68 ± 2% -0.1 0.59 ± 3% perf-profile.calltrace.cycles-pp.orc_find.unwind_next_frame.perf_callchain_kernel.get_perf_callchain.perf_callchain
0.93 -0.1 0.88 ± 3% perf-profile.calltrace.cycles-pp.__perf_event_header__init_id.perf_prepare_sample.perf_event_output_forward.__perf_event_overflow.perf_swevent_overflow
1.22 +0.1 1.30 ± 2% perf-profile.calltrace.cycles-pp.__orc_find.unwind_next_frame.perf_callchain_kernel.get_perf_callchain.perf_callchain
1.11 +0.1 1.21 ± 2% perf-profile.calltrace.cycles-pp.orc_find.unwind_next_frame.__unwind_start.perf_callchain_kernel.get_perf_callchain
1.51 -0.1 1.44 ± 3% perf-profile.children.cycles-pp.stack_access_ok
0.37 ± 3% -0.1 0.30 perf-profile.children.cycles-pp.__task_pid_nr_ns
0.47 ± 2% -0.1 0.41 ± 2% perf-profile.children.cycles-pp.perf_event_pid_type
0.30 ± 5% -0.0 0.25 ± 12% perf-profile.children.cycles-pp.__list_del_entry_valid
0.95 -0.0 0.90 ± 3% perf-profile.children.cycles-pp.__perf_event_header__init_id
0.10 ± 14% -0.0 0.07 ± 31% perf-profile.children.cycles-pp.sched_yield@plt
0.42 -0.0 0.38 ± 5% perf-profile.children.cycles-pp.ftrace_graph_ret_addr
0.10 ± 4% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.is_module_text_address
0.11 ± 4% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.is_ftrace_trampoline
0.10 ± 4% -0.0 0.08 ± 6% perf-profile.children.cycles-pp.ftrace_ops_trampoline
0.06 ± 15% +0.0 0.09 ± 7% perf-profile.children.cycles-pp.rcu_qs
0.23 ± 8% +0.1 0.31 ± 7% perf-profile.children.cycles-pp.rcu_note_context_switch
0.35 ± 4% -0.1 0.29 perf-profile.self.cycles-pp.__task_pid_nr_ns
1.24 -0.1 1.18 ± 2% perf-profile.self.cycles-pp.stack_access_ok
0.23 ± 5% -0.0 0.18 ± 12% perf-profile.self.cycles-pp.__list_del_entry_valid
0.34 ± 4% -0.0 0.30 ± 2% perf-profile.self.cycles-pp.perf_tp_event
0.12 ± 4% -0.0 0.10 ± 10% perf-profile.self.cycles-pp.sched_clock_cpu
0.32 -0.0 0.30 ± 5% perf-profile.self.cycles-pp.ftrace_graph_ret_addr
0.08 -0.0 0.06 ± 6% perf-profile.self.cycles-pp.ftrace_ops_trampoline
0.34 -0.0 0.33 ± 2% perf-profile.self.cycles-pp.unwind_get_return_address
0.08 +0.0 0.10 ± 7% perf-profile.self.cycles-pp.rcu_is_watching
0.05 ± 8% +0.0 0.09 ± 7% perf-profile.self.cycles-pp.rcu_qs
0.51 +0.0 0.55 ± 3% perf-profile.self.cycles-pp.bsearch
0.16 ± 13% +0.0 0.20 ± 7% perf-profile.self.cycles-pp.rcu_note_context_switch
1.44 ± 4% +0.3 1.76 ± 12% perf-profile.self.cycles-pp.__sched_yield



will-it-scale.per_thread_ops

3e+06 +-----------------------------------------------------------------+
| : O +.+.+.+.+.++.+.+.+.+.+ O O O O O O O O O O OO O O O O O O O |
2.5e+06 |-: : |
| : : |
| : : |
2e+06 |-+: : |
| : : |
1.5e+06 |-+: : |
| : : |
1e+06 |-+: : |
| : : |
| : |
500000 |-+ : |
| : |
0 +-----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang


Attachments:
config-5.10.0-rc1-00033-g2558aacff858 (174.16 kB)
job-script (7.94 kB)
job.yaml (5.55 kB)
reproduce (353.00 B)

2020-12-10 16:18:28

by Peter Zijlstra

Subject: Re: [sched/hotplug] 2558aacff8: will-it-scale.per_thread_ops -1.6% regression

On Thu, Dec 10, 2020 at 04:18:59PM +0800, kernel test robot wrote:
> FYI, we noticed a -1.6% regression of will-it-scale.per_thread_ops due to commit:
> commit: 2558aacff8586699bcd248b406febb28b0a25de2 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")

Mooo, weird but whatever. Does the below help at all?

---
kernel/sched/core.c | 40 +++++++++++++++-------------------------
kernel/sched/sched.h | 13 +++++--------
2 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7af80c3fce12..f80245c7f903 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3985,15 +3985,20 @@ static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
}
}

+static void balance_push(struct rq *rq);
+
+struct callback_head balance_push_callback = {
+ .next = NULL,
+ .func = (void (*)(struct callback_head *))balance_push,
+};
+
static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
{
struct callback_head *head = rq->balance_callback;

lockdep_assert_held(&rq->lock);
- if (head) {
+ if (head)
rq->balance_callback = NULL;
- rq->balance_flags &= ~BALANCE_WORK;
- }

return head;
}
@@ -4014,21 +4019,6 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
}
}

-static void balance_push(struct rq *rq);
-
-static inline void balance_switch(struct rq *rq)
-{
- if (likely(!rq->balance_flags))
- return;
-
- if (rq->balance_flags & BALANCE_PUSH) {
- balance_push(rq);
- return;
- }
-
- __balance_callbacks(rq);
-}
-
#else

static inline void __balance_callbacks(struct rq *rq)
@@ -4044,10 +4034,6 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
{
}

-static inline void balance_switch(struct rq *rq)
-{
-}
-
#endif

static inline void
@@ -4075,7 +4061,7 @@ static inline void finish_lock_switch(struct rq *rq)
* prev into current:
*/
spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
- balance_switch(rq);
+ __balance_callbacks(rq);
raw_spin_unlock_irq(&rq->lock);
}

@@ -7256,6 +7242,10 @@ static void balance_push(struct rq *rq)

lockdep_assert_held(&rq->lock);
SCHED_WARN_ON(rq->cpu != smp_processor_id());
+ /*
+ * Ensure the thing is persistent until balance_push_set(, on = false);
+ */
+ rq->balance_callback = &balance_push_callback;

/*
* Both the cpu-hotplug and stop task are in this case and are
@@ -7305,9 +7295,9 @@ static void balance_push_set(int cpu, bool on)

rq_lock_irqsave(rq, &rf);
if (on)
- rq->balance_flags |= BALANCE_PUSH;
+ rq->balance_callback = &balance_push_callback;
else
- rq->balance_flags &= ~BALANCE_PUSH;
+ rq->balance_callback = NULL;
rq_unlock_irqrestore(rq, &rf);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f5acb6c5ce49..12ada79d40f3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -975,7 +975,6 @@ struct rq {
unsigned long cpu_capacity_orig;

struct callback_head *balance_callback;
- unsigned char balance_flags;

unsigned char nohz_idle_balance;
unsigned char idle_balance;
@@ -1226,6 +1225,8 @@ struct rq_flags {
#endif
};

+extern struct callback_head balance_push_callback;
+
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
@@ -1243,9 +1244,9 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
-#endif
#ifdef CONFIG_SMP
- SCHED_WARN_ON(rq->balance_callback);
+ SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+#endif
#endif
}

@@ -1408,9 +1409,6 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

#ifdef CONFIG_SMP

-#define BALANCE_WORK 0x01
-#define BALANCE_PUSH 0x02
-
static inline void
queue_balance_callback(struct rq *rq,
struct callback_head *head,
@@ -1418,13 +1416,12 @@ queue_balance_callback(struct rq *rq,
{
lockdep_assert_held(&rq->lock);

- if (unlikely(head->next || (rq->balance_flags & BALANCE_PUSH)))
+ if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
return;

head->func = (void (*)(struct callback_head *))func;
head->next = rq->balance_callback;
rq->balance_callback = head;
- rq->balance_flags |= BALANCE_WORK;
}

#define rcu_dereference_check_sched_domain(p) \
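
The effect of the patch, in one sentence: the BALANCE_PUSH flag is folded into rq->balance_callback itself, with a static sentinel callback (balance_push_callback) standing in for the flag, so the hot path keeps its single pointer test and the separate balance_flags word goes away. Below is a standalone userspace sketch of that sentinel-pointer pattern -- hypothetical names, not the kernel code:

#include <stdio.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

static struct callback_head *pending;	/* stands in for rq->balance_callback */

static void push_work(struct callback_head *head)
{
	/* Like balance_push(): re-arm ourselves, so the sentinel persists
	 * until it is explicitly disarmed. */
	pending = head;
	puts("push mode: move everything that isn't a per-cpu kthread");
}

/* The sentinel: its address doubles as the "push enabled" flag. */
static struct callback_head push_sentinel = { .next = NULL, .func = push_work };

static void queue_callback(struct callback_head *head,
			   void (*func)(struct callback_head *))
{
	/* Like queue_balance_callback(): refuse while push mode is armed. */
	if (head->next || pending == &push_sentinel)
		return;
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Hot path, like __balance_callbacks(): one pointer test, no flags word. */
static void run_callbacks(void)
{
	struct callback_head *head = pending;

	pending = NULL;
	while (head) {
		struct callback_head *next = head->next;

		head->next = NULL;
		head->func(head);
		head = next;
	}
}

static void normal_work(struct callback_head *head)
{
	(void)head;
	puts("regular balance work");
}

int main(void)
{
	struct callback_head work = { NULL, NULL };

	pending = &push_sentinel;		/* balance_push_set(cpu, true)  */
	queue_callback(&work, normal_work);	/* refused: push mode is armed  */
	run_callbacks();			/* sentinel runs and re-arms    */

	pending = NULL;				/* balance_push_set(cpu, false) */
	queue_callback(&work, normal_work);	/* accepted now                 */
	run_callbacks();			/* prints "regular balance work" */
	return 0;
}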

2020-12-15 05:41:55

by Xing Zhengjun

Subject: Re: [LKP] Re: [sched/hotplug] 2558aacff8: will-it-scale.per_thread_ops -1.6% regression



On 12/11/2020 12:14 AM, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 04:18:59PM +0800, kernel test robot wrote:
>> FYI, we noticed a -1.6% regression of will-it-scale.per_thread_ops due to commit:
>> commit: 2558aacff8586699bcd248b406febb28b0a25de2 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
>
> Mooo, weird but whatever. Does the below help at all?

I tested the patch, and the regression is reduced to -0.6%.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:

lkp-cpl-4sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/100%/thread/sched_yield/performance/0x700001e

commit:
565790d28b1e33ee2f77bad5348b99f6dfc366fd
2558aacff8586699bcd248b406febb28b0a25de2
4b26139b8db627a55043183614a32b0aba799d27 (this test patch)

565790d28b1e33ee 2558aacff8586699bcd248b406f 4b26139b8db627a55043183614a
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
4.011e+08 -1.6% 3.945e+08 -0.6% 3.989e+08 will-it-scale.144.threads
2785455 -1.6% 2739520 -0.6% 2769967 will-it-scale.per_thread_ops
4.011e+08 -1.6% 3.945e+08 -0.6% 3.989e+08 will-it-scale.workload

>
> ---
> kernel/sched/core.c | 40 +++++++++++++++-------------------------
> kernel/sched/sched.h | 13 +++++--------
> 2 files changed, 20 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7af80c3fce12..f80245c7f903 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3985,15 +3985,20 @@ static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
> }
> }
>
> +static void balance_push(struct rq *rq);
> +
> +struct callback_head balance_push_callback = {
> + .next = NULL,
> + .func = (void (*)(struct callback_head *))balance_push,
> +};
> +
> static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
> {
> struct callback_head *head = rq->balance_callback;
>
> lockdep_assert_held(&rq->lock);
> - if (head) {
> + if (head)
> rq->balance_callback = NULL;
> - rq->balance_flags &= ~BALANCE_WORK;
> - }
>
> return head;
> }
> @@ -4014,21 +4019,6 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
> }
> }
>
> -static void balance_push(struct rq *rq);
> -
> -static inline void balance_switch(struct rq *rq)
> -{
> - if (likely(!rq->balance_flags))
> - return;
> -
> - if (rq->balance_flags & BALANCE_PUSH) {
> - balance_push(rq);
> - return;
> - }
> -
> - __balance_callbacks(rq);
> -}
> -
> #else
>
> static inline void __balance_callbacks(struct rq *rq)
> @@ -4044,10 +4034,6 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
> {
> }
>
> -static inline void balance_switch(struct rq *rq)
> -{
> -}
> -
> #endif
>
> static inline void
> @@ -4075,7 +4061,7 @@ static inline void finish_lock_switch(struct rq *rq)
> * prev into current:
> */
> spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> - balance_switch(rq);
> + __balance_callbacks(rq);
> raw_spin_unlock_irq(&rq->lock);
> }
>
> @@ -7256,6 +7242,10 @@ static void balance_push(struct rq *rq)
>
> lockdep_assert_held(&rq->lock);
> SCHED_WARN_ON(rq->cpu != smp_processor_id());
> + /*
> + * Ensure the thing is persistent until balance_push_set(, on = false);
> + */
> + rq->balance_callback = &balance_push_callback;
>
> /*
> * Both the cpu-hotplug and stop task are in this case and are
> @@ -7305,9 +7295,9 @@ static void balance_push_set(int cpu, bool on)
>
> rq_lock_irqsave(rq, &rf);
> if (on)
> - rq->balance_flags |= BALANCE_PUSH;
> + rq->balance_callback = &balance_push_callback;
> else
> - rq->balance_flags &= ~BALANCE_PUSH;
> + rq->balance_callback = NULL;
> rq_unlock_irqrestore(rq, &rf);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f5acb6c5ce49..12ada79d40f3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -975,7 +975,6 @@ struct rq {
> unsigned long cpu_capacity_orig;
>
> struct callback_head *balance_callback;
> - unsigned char balance_flags;
>
> unsigned char nohz_idle_balance;
> unsigned char idle_balance;
> @@ -1226,6 +1225,8 @@ struct rq_flags {
> #endif
> };
>
> +extern struct callback_head balance_push_callback;
> +
> /*
> * Lockdep annotation that avoids accidental unlocks; it's like a
> * sticky/continuous lockdep_assert_held().
> @@ -1243,9 +1244,9 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
> #ifdef CONFIG_SCHED_DEBUG
> rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
> rf->clock_update_flags = 0;
> -#endif
> #ifdef CONFIG_SMP
> - SCHED_WARN_ON(rq->balance_callback);
> + SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
> +#endif
> #endif
> }
>
> @@ -1408,9 +1409,6 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>
> #ifdef CONFIG_SMP
>
> -#define BALANCE_WORK 0x01
> -#define BALANCE_PUSH 0x02
> -
> static inline void
> queue_balance_callback(struct rq *rq,
> struct callback_head *head,
> @@ -1418,13 +1416,12 @@ queue_balance_callback(struct rq *rq,
> {
> lockdep_assert_held(&rq->lock);
>
> - if (unlikely(head->next || (rq->balance_flags & BALANCE_PUSH)))
> + if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
> return;
>
> head->func = (void (*)(struct callback_head *))func;
> head->next = rq->balance_callback;
> rq->balance_callback = head;
> - rq->balance_flags |= BALANCE_WORK;
> }
>
> #define rcu_dereference_check_sched_domain(p) \

--
Zhengjun Xing

2020-12-15 08:41:44

by Peter Zijlstra

Subject: Re: [LKP] Re: [sched/hotplug] 2558aacff8: will-it-scale.per_thread_ops -1.6% regression

On Tue, Dec 15, 2020 at 01:35:46PM +0800, Xing Zhengjun wrote:
> On 12/11/2020 12:14 AM, Peter Zijlstra wrote:
> > On Thu, Dec 10, 2020 at 04:18:59PM +0800, kernel test robot wrote:
> > > FYI, we noticed a -1.6% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 2558aacff8586699bcd248b406febb28b0a25de2 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
> >
> > Mooo, weird but whatever. Does the below help at all?
>
> I tested the patch

Thanks!

> , and the regression is reduced to -0.6%.
>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:
>
> lkp-cpl-4sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/100%/thread/sched_yield/performance/0x700001e
>
> commit:
> 565790d28b1e33ee2f77bad5348b99f6dfc366fd
> 2558aacff8586699bcd248b406febb28b0a25de2
> 4b26139b8db627a55043183614a32b0aba799d27 (this test patch)
>
> 565790d28b1e33ee 2558aacff8586699bcd248b406f 4b26139b8db627a55043183614a
> ---------------- --------------------------- ---------------------------
> %stddev %change %stddev %change %stddev
> \ | \ | \
> 4.011e+08 -1.6% 3.945e+08 -0.6% 3.989e+08 will-it-scale.144.threads
> 2785455 -1.6% 2739520 -0.6% 2769967 will-it-scale.per_thread_ops
> 4.011e+08 -1.6% 3.945e+08 -0.6% 3.989e+08 will-it-scale.workload

Well, that's better. But I'm rather confused now, because with this new
patch, the actual hot paths are identical, so I've no idea what is
actually causing the regression :/

The above numbers don't seem to have any variance listed; how sure are we
that the results are stable? The thing is, when I tried reproducing this
locally, I was mostly looking at noise.
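
(As a back-of-the-envelope check: a -0.6% delta is only meaningful if the run-to-run relative stddev sits well below 0.6%. A small standalone helper to compute it from per-run throughput -- the numbers below are hypothetical, not taken from this report:)

#include <math.h>
#include <stdio.h>

/* Relative standard deviation in percent; lkp prints this as the
 * "± N%" column when it is non-negligible. */
static double rel_stddev(const double *runs, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += runs[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (runs[i] - mean) * (runs[i] - mean);
	var /= n;
	return 100.0 * sqrt(var) / mean;
}

int main(void)
{
	/* Hypothetical per_thread_ops for 4 runs of one commit. */
	double runs[4] = { 2769100.0, 2770450.0, 2769800.0, 2770518.0 };

	printf("%%stddev = %.3f%%\n", rel_stddev(runs, 4));
	return 0;
}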

> > ---
> > kernel/sched/core.c | 40 +++++++++++++++-------------------------
> > kernel/sched/sched.h | 13 +++++--------
> > 2 files changed, 20 insertions(+), 33 deletions(-)

Anyway, let me queue this in sched/urgent; it's simpler code and has less
of a regression.