Date: 2023-11-24 01:45:31
From: Oliver Sang

Subject: [tip:sched/core] [sched/numa] 84db47ca71: autonuma-benchmark.numa01_THREAD_ALLOC.seconds -46.2% improvement



Hello,

kernel test robot noticed a -46.2% improvement of autonuma-benchmark.numa01_THREAD_ALLOC.seconds (benchmark runtime, so lower is better) on:


commit: 84db47ca7146d7bd00eb5cf2b93989a971c84650 ("sched/numa: Fix mm numa_scan_seq based unconditional scan")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
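The fix concerns how NUMA balancing decides, based on mm->numa_scan_seq, to scan a task's address space unconditionally; the much larger numa_pte_updates and numa_pages_migrated counts below are consistent with the fixed check doing more scanning and converging placement faster. As a rough illustration only, here is a minimal userspace sketch of sequence-gated scanning; the structs, field names, and the two-pass window are assumptions for illustration, not the kernel's actual code:

/*
 * Minimal userspace sketch of the scan-sequence gating idea named in
 * the commit title.  All identifiers (mm_state, vma_state,
 * start_scan_seq) and the "first two passes" window are illustrative
 * assumptions, not the kernel's actual code.
 */
#include <stdbool.h>
#include <stdio.h>

struct mm_state {
        unsigned int numa_scan_seq;     /* bumped once per full scan pass */
};

struct vma_state {
        unsigned int start_scan_seq;    /* mm sequence when area was first seen */
};

/*
 * Scan an area unconditionally for its first couple of passes.
 * Gating on the raw mm-wide sequence alone only triggers near process
 * start; gating on the delta from the area's own start sequence also
 * covers areas mapped later in the process lifetime.
 */
static bool scan_unconditionally(const struct mm_state *mm,
                                 const struct vma_state *vma)
{
        return (mm->numa_scan_seq - vma->start_scan_seq) < 2;
}

int main(void)
{
        struct mm_state mm = { .numa_scan_seq = 0 };
        struct vma_state early = { .start_scan_seq = 0 }; /* mapped at start */
        struct vma_state late = { .start_scan_seq = 5 };  /* mapped in pass 5 */

        /* (the late area would not exist before pass 5; the unsigned
         * delta keeps it outside the unconditional window there) */
        for (; mm.numa_scan_seq < 8; mm.numa_scan_seq++)
                printf("pass %u: early=%s late=%s\n", mm.numa_scan_seq,
                       scan_unconditionally(&mm, &early) ? "scan" : "conditional",
                       scan_unconditionally(&mm, &late) ? "scan" : "conditional");
        return 0;
}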

testcase: autonuma-benchmark
test machine: 224 threads, 2 sockets, Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
parameters:

iterations: 4x
test: numa01_THREAD_ALLOC
cpufreq_governor: performance






Details are as follows:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231123/[email protected]

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-spr-r02/numa01_THREAD_ALLOC/autonuma-benchmark

commit:
d6111cf45c ("sched: Use WRITE_ONCE() for p->on_rq")
84db47ca71 ("sched/numa: Fix mm numa_scan_seq based unconditional scan")

d6111cf45c578728 84db47ca7146d7bd00eb5cf2b93
---------------- ---------------------------
value ±%stddev      %change      value ±%stddev
1424 -21.6% 1117 ± 2% uptime.boot
0.02 ± 38% +139.7% 0.05 ± 20% vmstat.procs.b
0.01 ± 15% +0.0 0.01 ± 9% mpstat.cpu.all.iowait%
0.09 ± 2% -0.0 0.07 ± 2% mpstat.cpu.all.soft%
1.84 +0.4 2.24 ± 4% mpstat.cpu.all.sys%
9497 ± 17% +37.1% 13024 ± 10% turbostat.C1
3.161e+08 -20.8% 2.503e+08 ± 2% turbostat.IRQ
8.86 ± 8% -1.7 7.16 ± 14% turbostat.PKG_%
646.52 +2.9% 665.41 turbostat.PkgWatt
52.74 +32.6% 69.93 turbostat.RAMWatt
258.20 -16.3% 216.21 ± 2% autonuma-benchmark.numa01.seconds
78.26 -46.2% 42.10 ± 5% autonuma-benchmark.numa01_THREAD_ALLOC.seconds
1381 -22.6% 1069 ± 2% autonuma-benchmark.time.elapsed_time
1381 -22.6% 1069 ± 2% autonuma-benchmark.time.elapsed_time.max
1090459 ± 2% -23.1% 838693 ± 3% autonuma-benchmark.time.involuntary_context_switches
286141 -23.6% 218671 ± 2% autonuma-benchmark.time.user_time
0.00 ±223% +23983.3% 0.24 ±110% perf-sched.sch_delay.avg.ms.__cond_resched.folio_copy.migrate_folio_extra.move_to_new_folio.migrate_pages_batch
0.01 ±223% +1.3e+05% 15.11 ±179% perf-sched.sch_delay.max.ms.__cond_resched.folio_copy.migrate_folio_extra.move_to_new_folio.migrate_pages_batch
3.35 ± 31% -70.7% 0.98 ±149% perf-sched.wait_and_delay.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
167.50 ± 32% -69.8% 50.67 ±102% perf-sched.wait_and_delay.count.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
8.50 ± 39% -80.4% 1.67 ±223% perf-sched.wait_and_delay.count.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.22 ±134% +215.3% 3.83 ± 30% perf-sched.wait_time.avg.ms.__cond_resched.folio_copy.migrate_folio_extra.move_to_new_folio.migrate_pages_batch
1.60 ±136% +2231.7% 37.39 ±156% perf-sched.wait_time.max.ms.__cond_resched.folio_copy.migrate_folio_extra.move_to_new_folio.migrate_pages_batch
1578414 -1.4% 1556033 proc-vmstat.nr_anon_pages
74441 ± 14% +85.7% 138261 ± 6% proc-vmstat.numa_hint_faults
41582 ± 22% +91.1% 79461 ± 9% proc-vmstat.numa_hint_faults_local
34327 ± 6% +148.2% 85213 ± 2% proc-vmstat.numa_huge_pte_updates
5578006 -8.1% 5123719 ± 4% proc-vmstat.numa_local
4649420 ± 4% +126.1% 10511526 ± 3% proc-vmstat.numa_pages_migrated
17706681 ± 6% +147.2% 43778740 ± 2% proc-vmstat.numa_pte_updates
6749037 -10.7% 6027035 proc-vmstat.pgfault
4649420 ± 4% +126.1% 10511526 ± 3% proc-vmstat.pgmigrate_success
236153 -14.4% 202067 ± 3% proc-vmstat.pgreuse
9057 ± 4% +126.3% 20497 ± 3% proc-vmstat.thp_migration_success
30217875 ± 2% -20.4% 24043125 proc-vmstat.unevictable_pgs_scanned
10.73 ± 5% +125.3% 24.17 ± 17% perf-stat.i.MPKI
2.427e+08 +4.0% 2.523e+08 perf-stat.i.branch-instructions
24.45 +3.2 27.63 perf-stat.i.cache-miss-rate%
14364771 ± 5% +122.0% 31896201 ± 18% perf-stat.i.cache-misses
37679065 +70.9% 64408862 ± 10% perf-stat.i.cache-references
545.38 -2.9% 529.43 perf-stat.i.cpi
221.07 +7.7% 238.10 perf-stat.i.cpu-migrations
156883 ± 2% -36.8% 99195 perf-stat.i.cycles-between-cache-misses
3.331e+08 +3.3% 3.443e+08 perf-stat.i.dTLB-loads
1031040 +2.6% 1057642 perf-stat.i.dTLB-store-misses
1.877e+08 +3.5% 1.942e+08 perf-stat.i.dTLB-stores
1.24e+09 +3.7% 1.286e+09 perf-stat.i.instructions
0.00 ± 12% +54.1% 0.00 ± 39% perf-stat.i.ipc
2.07 +12.9% 2.33 ± 2% perf-stat.i.metric.M/sec
5074 ± 2% +12.7% 5718 perf-stat.i.minor-faults
43.56 ± 2% +4.0 47.59 ± 3% perf-stat.i.node-load-miss-rate%
519245 ± 3% +57.1% 815750 ± 3% perf-stat.i.node-load-misses
5074 ± 2% +12.7% 5718 perf-stat.i.page-faults
10.68 ± 4% +124.9% 24.01 ± 18% perf-stat.overall.MPKI
37.27 ± 5% +11.6 48.91 ± 7% perf-stat.overall.cache-miss-rate%
504.52 -5.1% 479.03 perf-stat.overall.cpi
47358 ± 4% -56.6% 20561 ± 17% perf-stat.overall.cycles-between-cache-misses
0.00 +5.4% 0.00 perf-stat.overall.ipc
42.32 ± 7% +8.3 50.63 ± 6% perf-stat.overall.node-load-miss-rate%
2.384e+08 +4.2% 2.486e+08 perf-stat.ps.branch-instructions
13020509 ± 5% +133.4% 30395642 ± 17% perf-stat.ps.cache-misses
34948633 ± 2% +76.6% 61721520 ± 9% perf-stat.ps.cache-references
218.01 +7.5% 234.41 perf-stat.ps.cpu-migrations
3.285e+08 +3.6% 3.402e+08 perf-stat.ps.dTLB-loads
1021092 +2.9% 1050584 perf-stat.ps.dTLB-store-misses
1.845e+08 +3.7% 1.914e+08 perf-stat.ps.dTLB-stores
1.219e+09 +3.9% 1.267e+09 perf-stat.ps.instructions
4707 +14.8% 5406 perf-stat.ps.minor-faults
502656 ± 3% +63.9% 823962 ± 4% perf-stat.ps.node-load-misses
4707 +14.8% 5406 perf-stat.ps.page-faults
1.686e+12 -19.1% 1.363e+12 ± 2% perf-stat.total.instructions
1.824e+08 ± 2% -27.1% 1.33e+08 ± 4% sched_debug.cfs_rq:/.avg_vruntime.avg
1.869e+08 ± 2% -26.9% 1.366e+08 ± 4% sched_debug.cfs_rq:/.avg_vruntime.max
1.498e+08 ± 6% -27.4% 1.087e+08 ± 6% sched_debug.cfs_rq:/.avg_vruntime.min
3892383 ± 9% -20.8% 3081639 ± 7% sched_debug.cfs_rq:/.avg_vruntime.stddev
0.81 ± 5% -11.7% 0.72 ± 6% sched_debug.cfs_rq:/.h_nr_running.min
4130 ± 3% +33.5% 5516 ± 14% sched_debug.cfs_rq:/.load_avg.max
1.824e+08 ± 2% -27.1% 1.33e+08 ± 4% sched_debug.cfs_rq:/.min_vruntime.avg
1.869e+08 ± 2% -26.9% 1.366e+08 ± 4% sched_debug.cfs_rq:/.min_vruntime.max
1.498e+08 ± 6% -27.4% 1.087e+08 ± 6% sched_debug.cfs_rq:/.min_vruntime.min
3892382 ± 9% -20.8% 3081638 ± 7% sched_debug.cfs_rq:/.min_vruntime.stddev
0.81 ± 5% -11.7% 0.72 ± 6% sched_debug.cfs_rq:/.nr_running.min
25.69 ± 11% +33.1% 34.20 ± 20% sched_debug.cfs_rq:/.removed.util_avg.max
804.13 ± 6% -13.2% 697.93 ± 5% sched_debug.cfs_rq:/.runnable_avg.min
642.08 ± 7% -18.7% 522.15 ± 5% sched_debug.cfs_rq:/.util_avg.min
30.15 ± 67% +1148.5% 376.47 ± 4% sched_debug.cfs_rq:/.util_est_enqueued.avg
490.48 ± 19% +140.0% 1177 ± 10% sched_debug.cfs_rq:/.util_est_enqueued.max
77.81 ± 47% +298.2% 309.83 ± 6% sched_debug.cfs_rq:/.util_est_enqueued.stddev
840536 ± 3% -29.6% 592026 ± 6% sched_debug.cpu.avg_idle.min
516622 ± 5% -16.1% 433228 ± 3% sched_debug.cpu.avg_idle.stddev
713848 ± 2% -24.3% 540516 ± 4% sched_debug.cpu.clock.avg
715060 ± 2% -24.3% 541264 ± 4% sched_debug.cpu.clock.max
712575 ± 2% -24.3% 539699 ± 4% sched_debug.cpu.clock.min
714.85 ± 8% -38.1% 442.30 ± 10% sched_debug.cpu.clock.stddev
705718 ± 2% -24.3% 534516 ± 4% sched_debug.cpu.clock_task.avg
708387 ± 2% -24.3% 536010 ± 4% sched_debug.cpu.clock_task.max
686460 ± 2% -24.5% 518080 ± 4% sched_debug.cpu.clock_task.min
2041 ± 12% -32.5% 1377 ± 7% sched_debug.cpu.clock_task.stddev
23332 ± 3% -20.1% 18646 ± 5% sched_debug.cpu.curr->pid.avg
26909 -15.7% 22694 ± 2% sched_debug.cpu.curr->pid.max
16993 ± 13% -33.7% 11263 ± 21% sched_debug.cpu.curr->pid.min
1393930 ± 3% -14.1% 1197373 ± 3% sched_debug.cpu.max_idle_balance_cost.max
154458 ± 5% -12.8% 134638 ± 3% sched_debug.cpu.max_idle_balance_cost.stddev
0.00 ± 7% -37.3% 0.00 ± 10% sched_debug.cpu.next_balance.stddev
0.82 ± 6% -12.6% 0.72 ± 9% sched_debug.cpu.nr_running.min
7472 ± 2% -19.9% 5982 ± 3% sched_debug.cpu.nr_switches.avg
2879 ± 6% -15.1% 2445 ± 7% sched_debug.cpu.nr_switches.min
5925 ± 5% -14.3% 5080 ± 4% sched_debug.cpu.nr_switches.stddev
6.07 ± 9% +39.6% 8.48 ± 8% sched_debug.cpu.nr_uninterruptible.stddev
712557 ± 2% -24.3% 539685 ± 4% sched_debug.cpu_clk
711344 ± 2% -24.3% 538474 ± 4% sched_debug.ktime
0.15 ± 78% +111.4% 0.31 ± 33% sched_debug.rt_rq:.rt_time.avg
32.86 ± 78% +111.4% 69.47 ± 33% sched_debug.rt_rq:.rt_time.max
2.19 ± 78% +111.4% 4.63 ± 33% sched_debug.rt_rq:.rt_time.stddev
713438 ± 2% -24.2% 540572 ± 4% sched_debug.sched_clk
3.34 ± 33% -1.5 1.81 ± 18% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
1.45 ± 39% -1.1 0.36 ±101% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.45 ± 39% -1.1 0.36 ±101% perf-profile.calltrace.cycles-pp.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.37 ± 38% -1.0 0.34 ±101% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
3.04 ± 32% -1.4 1.60 ± 18% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
2.78 ± 29% -1.3 1.44 ± 17% perf-profile.children.cycles-pp.exit_to_user_mode_loop
2.37 ± 28% -1.2 1.16 ± 23% perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
1.95 ± 28% -1.0 0.97 ± 13% perf-profile.children.cycles-pp.task_mm_cid_work
2.17 ± 29% -1.0 1.19 ± 16% perf-profile.children.cycles-pp.task_work_run
0.69 ± 30% -0.3 0.38 ± 49% perf-profile.children.cycles-pp.khugepaged
0.67 ± 31% -0.3 0.37 ± 49% perf-profile.children.cycles-pp.khugepaged_scan_mm_slot
0.67 ± 31% -0.3 0.37 ± 49% perf-profile.children.cycles-pp.hpage_collapse_scan_pmd
0.38 ± 27% -0.1 0.26 ± 23% perf-profile.children.cycles-pp.security_file_permission
0.34 ± 27% -0.1 0.23 ± 23% perf-profile.children.cycles-pp.apparmor_file_permission
0.26 ± 48% -0.1 0.15 ± 19% perf-profile.children.cycles-pp.dup_task_struct
0.24 ± 35% -0.1 0.15 ± 17% perf-profile.children.cycles-pp.folio_batch_move_lru
0.11 ± 51% -0.1 0.04 ± 75% perf-profile.children.cycles-pp.__vmalloc_node_range
0.14 ± 33% -0.1 0.08 ± 14% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.10 ± 32% -0.0 0.06 ± 19% perf-profile.children.cycles-pp.move_page_tables
0.02 ±142% +0.1 0.10 ± 29% perf-profile.children.cycles-pp.task_tick_fair
0.02 ±223% +0.1 0.13 ± 56% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.04 ±113% +0.2 0.20 ± 48% perf-profile.children.cycles-pp.on_each_cpu_cond_mask
0.00 +0.2 0.16 ± 56% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
0.04 ±107% +0.2 0.25 ± 42% perf-profile.children.cycles-pp.scheduler_tick
0.18 ± 79% +0.2 0.38 ± 40% perf-profile.children.cycles-pp.__do_sys_wait4
0.18 ± 80% +0.2 0.38 ± 40% perf-profile.children.cycles-pp.kernel_wait4
0.17 ± 76% +0.2 0.38 ± 40% perf-profile.children.cycles-pp.do_wait
0.00 +0.3 0.26 ± 77% perf-profile.children.cycles-pp.intel_idle
0.06 ±104% +0.3 0.32 ± 44% perf-profile.children.cycles-pp.update_process_times
0.06 ±106% +0.3 0.33 ± 46% perf-profile.children.cycles-pp.tick_sched_handle
0.07 ± 81% +0.3 0.36 ± 48% perf-profile.children.cycles-pp.tick_nohz_highres_handler
0.18 ± 54% +0.4 0.62 ± 48% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.21 ± 50% +0.6 0.76 ± 47% perf-profile.children.cycles-pp.hrtimer_interrupt
0.22 ± 51% +0.6 0.79 ± 46% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.68 ± 44% +1.1 1.80 ± 60% perf-profile.children.cycles-pp.update_sg_lb_stats
1.46 ± 52% +1.2 2.62 ± 45% perf-profile.children.cycles-pp.__schedule
0.72 ± 46% +1.2 1.89 ± 60% perf-profile.children.cycles-pp.update_sd_lb_stats
0.72 ± 46% +1.2 1.89 ± 60% perf-profile.children.cycles-pp.find_busiest_group
0.03 ±141% +1.3 1.31 ± 75% perf-profile.children.cycles-pp.start_secondary
0.01 ±223% +1.3 1.36 ± 80% perf-profile.children.cycles-pp.cpuidle_enter
0.01 ±223% +1.3 1.36 ± 80% perf-profile.children.cycles-pp.cpuidle_enter_state
0.77 ± 44% +1.4 2.15 ± 62% perf-profile.children.cycles-pp.load_balance
0.20 ± 61% +1.4 1.64 ± 77% perf-profile.children.cycles-pp.pick_next_task_fair
0.02 ±142% +1.5 1.50 ± 80% perf-profile.children.cycles-pp.cpuidle_idle_call
0.04 ±146% +1.6 1.62 ± 80% perf-profile.children.cycles-pp.newidle_balance
0.03 ±141% +1.6 1.66 ± 79% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
0.03 ±141% +1.6 1.66 ± 79% perf-profile.children.cycles-pp.cpu_startup_entry
0.03 ±141% +1.6 1.66 ± 79% perf-profile.children.cycles-pp.do_idle
1.94 ± 28% -1.0 0.95 ± 13% perf-profile.self.cycles-pp.task_mm_cid_work
0.13 ± 31% -0.1 0.08 ± 17% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.00 +0.3 0.26 ± 77% perf-profile.self.cycles-pp.intel_idle
0.67 ± 44% +1.1 1.76 ± 60% perf-profile.self.cycles-pp.update_sg_lb_stats
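
A note on reading the tables above: %change is computed relative to the parent commit's value (left column), and %stddev is the relative standard deviation observed across repeated runs. For the headline metric:

    (42.10 - 78.26) / 78.26 * 100% = -46.2%    (numa01_THREAD_ALLOC runtime in seconds; lower is better)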




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki