Greetings,
FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec due to commit:
commit: f3344adf38bdb3107d40483dd9501215ad40edce ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: fxmark
on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
with following parameters:
disk: 1HDD
media: hdd
test: DWAL
fstype: btrfs
directio: bufferedio
cpufreq_governor: performance
ucode: 0x11
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml
=========================================================================================
compiler/cpufreq_governor/directio/disk/fstype/kconfig/media/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/bufferedio/1HDD/btrfs/x86_64-rhel-8.3/hdd/debian-10.4-x86_64-20200603.cgz/lkp-knm01/DWAL/fxmark/0x11
commit:
2e9bd48315 ("mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNT")
f3344adf38 ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
2e9bd483159939ed f3344adf38bdb3107d40483dd95
---------------- ---------------------------
%stddev %change %stddev
\ | \
8.36 ? 93% -99.0% 0.08 ?155% fxmark.hdd_btrfs_DWAL_54_bufferedio.iowait_sec
0.51 ? 93% -99.0% 0.01 ?155% fxmark.hdd_btrfs_DWAL_54_bufferedio.iowait_util
1.25 ? 4% -15.6% 1.06 fxmark.hdd_btrfs_DWAL_54_bufferedio.softirq_sec
0.08 ? 3% -15.4% 0.06 fxmark.hdd_btrfs_DWAL_54_bufferedio.softirq_util
7.08 -31.4% 4.85 fxmark.hdd_btrfs_DWAL_54_bufferedio.user_sec
0.43 -31.3% 0.30 fxmark.hdd_btrfs_DWAL_54_bufferedio.user_util
4718626 -51.2% 2300833 fxmark.hdd_btrfs_DWAL_54_bufferedio.works
156695 -51.1% 76693 fxmark.hdd_btrfs_DWAL_54_bufferedio.works/sec
0.30 ? 27% -92.9% 0.02 ?223% fxmark.hdd_btrfs_DWAL_63_bufferedio.iowait_sec
0.02 ? 27% -92.9% 0.00 ?223% fxmark.hdd_btrfs_DWAL_63_bufferedio.iowait_util
1.24 -19.4% 1.00 fxmark.hdd_btrfs_DWAL_63_bufferedio.softirq_sec
0.07 -19.4% 0.05 fxmark.hdd_btrfs_DWAL_63_bufferedio.softirq_util
6.94 ? 2% -29.8% 4.87 fxmark.hdd_btrfs_DWAL_63_bufferedio.user_sec
0.36 ? 2% -29.8% 0.26 fxmark.hdd_btrfs_DWAL_63_bufferedio.user_util
4210924 -52.4% 2003719 fxmark.hdd_btrfs_DWAL_63_bufferedio.works
140363 -52.4% 66789 fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
1361 -9.5% 1232 fxmark.time.elapsed_time
1361 -9.5% 1232 fxmark.time.elapsed_time.max
25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
343.68 -2.0% 336.76 fxmark.time.system_time
23.66 +2.5 26.12 mpstat.cpu.all.sys%
59226 -11.4% 52495 uptime.idle
3.967e+10 -18.7% 3.224e+10 cpuidle.C1.time
75583513 -20.6% 60007388 cpuidle.C1.usage
8999 ? 3% -17.9% 7389 ? 8% sched_debug.cpu.curr->pid.avg
6180 ? 5% -31.2% 4252 ? 9% sched_debug.cpu.curr->pid.min
0.04 ? 42% -51.9% 0.02 ? 20% perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
15.18 ? 73% -86.0% 2.12 ?128% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
135.95 ?123% -68.6% 42.65 ? 42% perf-sched.total_sch_delay.max.ms
58.64 -6.1% 55.04 iostat.cpu.idle
12.71 ? 2% +6.4% 13.52 ? 2% iostat.cpu.iowait
25.74 +9.7% 28.23 iostat.cpu.system
1.09 +9.1% 1.18 iostat.cpu.user
40785050 -10.2% 36631164 numa-numastat.node0.local_node
2545800 ? 2% -30.8% 1762639 ? 4% numa-numastat.node0.numa_foreign
40784981 -10.2% 36631126 numa-numastat.node0.numa_hit
2545800 ? 2% -30.8% 1762639 ? 4% numa-numastat.node1.numa_miss
2545801 ? 2% -30.8% 1762639 ? 4% numa-numastat.node1.other_node
58.33 -6.6% 54.50 vmstat.cpu.id
117701 -2.5% 114736 vmstat.io.bo
11919766 -10.4% 10683231 vmstat.memory.cache
7.00 +14.3% 8.00 vmstat.procs.r
70753 -7.3% 65591 vmstat.system.in
6877 -12.8% 5997 meminfo.Active
5228 -15.4% 4425 meminfo.Active(file)
11827996 -10.4% 10600714 meminfo.Cached
1868088 ? 6% -10.0% 1681283 ? 7% meminfo.DirectMap4k
11002944 -11.1% 9776954 meminfo.Inactive
10721808 -11.4% 9496681 meminfo.Inactive(file)
2044 ? 10% -5.3% 1937 ? 11% meminfo.Mlocked
5067083 -12.6% 4429136 meminfo.Writeback
34364 -7.8% 31683 perf-stat.i.cpu-clock
64.81 -3.5% 62.57 perf-stat.i.cpu-migrations
0.82 ? 4% +11.2% 0.91 ? 2% perf-stat.i.major-faults
0.42 +7.7% 0.46 perf-stat.i.metric.K/sec
34364 -7.8% 31683 perf-stat.i.task-clock
34581 -7.6% 31954 perf-stat.ps.cpu-clock
65.02 -3.4% 62.81 perf-stat.ps.cpu-migrations
0.82 ? 4% +11.3% 0.92 ? 2% perf-stat.ps.major-faults
34581 -7.6% 31954 perf-stat.ps.task-clock
2151 -9.7% 1943 ? 2% slabinfo.Acpi-State.active_objs
1591 -10.0% 1431 slabinfo.btrfs_delayed_node.active_objs
2647 ? 8% -17.2% 2191 ? 8% slabinfo.btrfs_delayed_tree_ref.active_objs
2792 ? 7% -15.7% 2354 ? 8% slabinfo.btrfs_delayed_tree_ref.num_objs
3441 ? 6% -24.7% 2591 ? 4% slabinfo.btrfs_extent_map.active_objs
3509 ? 6% -24.1% 2664 ? 4% slabinfo.btrfs_extent_map.num_objs
3576 ? 5% -11.6% 3162 ? 6% slabinfo.fsnotify_mark_connector.active_objs
2366 -10.8% 2109 slabinfo.khugepaged_mm_slot.active_objs
1967 -10.0% 1771 slabinfo.kmalloc-rcl-256.active_objs
2361 ? 3% -13.8% 2036 ? 9% slabinfo.mnt_cache.active_objs
2576 ? 3% -13.2% 2235 ? 8% slabinfo.mnt_cache.num_objs
3064 -13.9% 2638 ? 3% slabinfo.pool_workqueue.active_objs
6899 -13.0% 6005 ? 2% numa-meminfo.node0.Active
5248 -15.6% 4431 numa-meminfo.node0.Active(file)
10448455 -9.8% 9424769 numa-meminfo.node0.FilePages
10134571 -10.1% 9112146 numa-meminfo.node0.Inactive
9853444 -10.4% 8831873 numa-meminfo.node0.Inactive(file)
1172 ? 10% -5.4% 1109 ? 11% numa-meminfo.node0.Mlocked
4417620 -11.5% 3910520 numa-meminfo.node0.Writeback
45623 ? 11% -30.3% 31801 ? 11% numa-meminfo.node1.Dirty
1394970 -14.7% 1189612 ? 3% numa-meminfo.node1.FilePages
883781 ? 2% -23.2% 678451 ? 5% numa-meminfo.node1.Inactive
883781 ? 2% -23.2% 678451 ? 5% numa-meminfo.node1.Inactive(file)
1493112 -13.9% 1286160 ? 2% numa-meminfo.node1.MemUsed
657674 ? 3% -20.0% 526389 ? 4% numa-meminfo.node1.Writeback
1312 -15.4% 1110 numa-vmstat.node0.nr_active_file
20273962 -10.5% 18150513 numa-vmstat.node0.nr_dirtied
2614224 -9.8% 2358574 numa-vmstat.node0.nr_file_pages
2465467 -10.3% 2210356 numa-vmstat.node0.nr_inactive_file
292.67 ? 10% -5.5% 276.67 ? 11% numa-vmstat.node0.nr_mlock
1105637 -11.5% 978820 numa-vmstat.node0.nr_writeback
18953745 -10.6% 16954017 numa-vmstat.node0.nr_written
1312 -15.4% 1110 numa-vmstat.node0.nr_zone_active_file
2465467 -10.3% 2210355 numa-vmstat.node0.nr_zone_inactive_file
1293666 ? 3% -45.3% 707946 ? 7% numa-vmstat.node0.numa_foreign
1293694 ? 3% -45.3% 707978 ? 7% numa-vmstat.node1.nr_dirtied
11396 ? 11% -30.2% 7949 ? 11% numa-vmstat.node1.nr_dirty
349139 -14.7% 297821 ? 3% numa-vmstat.node1.nr_file_pages
221342 ? 2% -23.2% 170030 ? 5% numa-vmstat.node1.nr_inactive_file
164784 ? 3% -19.9% 131954 ? 4% numa-vmstat.node1.nr_writeback
1117506 ? 3% -49.2% 568071 ? 7% numa-vmstat.node1.nr_written
221342 ? 2% -23.2% 170030 ? 5% numa-vmstat.node1.nr_zone_inactive_file
176187 ? 3% -20.6% 139906 ? 4% numa-vmstat.node1.nr_zone_write_pending
1293707 ? 3% -45.3% 707981 ? 7% numa-vmstat.node1.numa_miss
1440348 ? 2% -40.7% 854565 ? 5% numa-vmstat.node1.numa_other
413.17 -5.0% 392.33 ? 5% proc-vmstat.nr_active_anon
1311 -15.6% 1107 proc-vmstat.nr_active_file
39929120 -11.7% 35246897 proc-vmstat.nr_dirtied
2964992 -10.4% 2657260 proc-vmstat.nr_file_pages
17305726 +1.8% 17614053 proc-vmstat.nr_free_pages
2688431 -11.4% 2381244 proc-vmstat.nr_inactive_file
511.83 ? 10% -5.3% 484.67 ? 11% proc-vmstat.nr_mlock
9266 -3.8% 8917 proc-vmstat.nr_shmem
26862 -3.4% 25962 proc-vmstat.nr_slab_reclaimable
56363 -1.7% 55420 proc-vmstat.nr_slab_unreclaimable
1271064 -12.6% 1110960 proc-vmstat.nr_writeback
39929120 -11.7% 35246897 proc-vmstat.nr_written
413.17 -5.0% 392.33 ? 5% proc-vmstat.nr_zone_active_anon
1311 -15.6% 1107 proc-vmstat.nr_zone_active_file
2688431 -11.4% 2381244 proc-vmstat.nr_zone_inactive_file
1499159 -10.7% 1338748 proc-vmstat.nr_zone_write_pending
2545800 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_foreign
40811337 -10.2% 36655412 proc-vmstat.numa_hit
40811336 -10.2% 36655411 proc-vmstat.numa_local
2545800 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_miss
2545802 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_other
23159 -8.7% 21145 proc-vmstat.pgactivate
43659177 -11.4% 38693777 proc-vmstat.pgalloc_normal
9495 -5.9% 8938 ? 2% proc-vmstat.pgdeactivate
4069589 -8.9% 3706066 proc-vmstat.pgfault
43646081 -11.4% 38679395 proc-vmstat.pgfree
1.599e+08 -11.7% 1.411e+08 proc-vmstat.pgpgout
317885 -7.1% 295295 proc-vmstat.pgreuse
30120100 -14.0% 25902040 proc-vmstat.pgrotated
11392384 -8.3% 10446848 proc-vmstat.unevictable_pgs_scanned
0.67 ? 13% -0.4 0.30 ?101% perf-profile.calltrace.cycles-pp.read_counters.process_interval.dispatch_events.__run_perf_stat.cmd_stat
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.cmd_stat.run_builtin.main.__libc_start_main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.__run_perf_stat.cmd_stat.run_builtin.main.__libc_start_main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.dispatch_events.__run_perf_stat.cmd_stat.run_builtin.main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.process_interval.dispatch_events.__run_perf_stat.cmd_stat.run_builtin
1.08 ? 7% +0.1 1.22 ? 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.13 ? 6% +0.2 1.28 ? 5% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
2.42 ? 17% -0.9 1.54 ? 29% perf-profile.children.cycles-pp.scsi_io_completion
2.42 ? 17% -0.9 1.54 ? 29% perf-profile.children.cycles-pp.scsi_end_request
2.43 ? 17% -0.9 1.55 ? 28% perf-profile.children.cycles-pp.blk_complete_reqs
2.37 ? 17% -0.9 1.50 ? 28% perf-profile.children.cycles-pp.end_bio_extent_writepage
2.37 ? 17% -0.9 1.50 ? 28% perf-profile.children.cycles-pp.btrfs_end_bio
2.38 ? 17% -0.9 1.51 ? 29% perf-profile.children.cycles-pp.blk_update_request
2.34 ? 20% -0.8 1.51 ? 25% perf-profile.children.cycles-pp.flush_smp_call_function_from_idle
2.32 ? 19% -0.8 1.50 ? 25% perf-profile.children.cycles-pp.do_softirq
2.06 ? 20% -0.8 1.31 ? 28% perf-profile.children.cycles-pp.end_page_writeback
1.09 ? 24% -0.5 0.60 ? 15% perf-profile.children.cycles-pp.test_clear_page_writeback
1.45 ? 9% -0.2 1.22 ? 6% perf-profile.children.cycles-pp.__schedule
0.67 ? 13% -0.1 0.54 ? 14% perf-profile.children.cycles-pp.read_counters
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.process_interval
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.cmd_stat
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.__run_perf_stat
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.dispatch_events
0.20 ? 30% -0.1 0.10 ? 35% perf-profile.children.cycles-pp.btrfs_dec_test_ordered_pending
0.47 ? 10% -0.1 0.38 ? 12% perf-profile.children.cycles-pp.sched_setaffinity
0.14 ? 22% -0.1 0.06 ? 52% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.10 ? 22% -0.1 0.03 ?103% perf-profile.children.cycles-pp.__fprop_inc_percpu_max
0.14 ? 20% -0.0 0.09 ? 18% perf-profile.children.cycles-pp.do_dentry_open
0.23 ? 6% -0.0 0.19 ? 15% perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity
0.13 ? 11% -0.0 0.09 ? 17% perf-profile.children.cycles-pp.rcu_irq_exit
0.01 ?223% +0.1 0.07 ? 21% perf-profile.children.cycles-pp.vm_area_dup
0.07 ? 14% +0.1 0.14 ? 39% perf-profile.children.cycles-pp.proc_reg_read_iter
0.35 ? 9% -0.1 0.27 ? 15% perf-profile.self.cycles-pp.tick_irq_enter
0.12 ? 27% -0.1 0.06 ? 51% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.10 ? 7% -0.0 0.07 ? 45% perf-profile.self.cycles-pp.rcu_irq_exit
0.14 ? 12% +0.0 0.18 ? 14% perf-profile.self.cycles-pp.local_touch_nmi
19169 ?103% -83.0% 3263 ? 79% interrupts.40:IR-PCI-MSI.4194308-edge.eth0-TxRx-3
3029 -24.4% 2291 ? 44% interrupts.9:IR-IO-APIC.9-fasteoi.acpi
2725748 -9.5% 2466464 interrupts.CPU0.LOC:Local_timer_interrupts
2845 -24.9% 2137 ? 44% interrupts.CPU1.9:IR-IO-APIC.9-fasteoi.acpi
2571309 -10.0% 2313644 interrupts.CPU1.LOC:Local_timer_interrupts
724.67 ? 37% -56.3% 316.33 ? 61% interrupts.CPU1.NMI:Non-maskable_interrupts
724.67 ? 37% -56.3% 316.33 ? 61% interrupts.CPU1.PMI:Performance_monitoring_interrupts
1825803 -14.2% 1566625 interrupts.CPU10.LOC:Local_timer_interrupts
1826093 -14.2% 1566654 interrupts.CPU11.LOC:Local_timer_interrupts
1826402 -14.2% 1566964 interrupts.CPU12.LOC:Local_timer_interrupts
1826229 -14.2% 1566001 interrupts.CPU13.LOC:Local_timer_interrupts
8918 ? 67% -81.7% 1633 ? 83% interrupts.CPU14.40:IR-PCI-MSI.4194308-edge.eth0-TxRx-3
1825901 -14.2% 1566646 interrupts.CPU14.LOC:Local_timer_interrupts
1826230 -14.2% 1566106 interrupts.CPU15.LOC:Local_timer_interrupts
499.67 ? 27% -23.5% 382.33 ? 15% interrupts.CPU15.RES:Rescheduling_interrupts
1826157 -14.2% 1566532 interrupts.CPU16.LOC:Local_timer_interrupts
223.50 ? 13% -27.2% 162.67 ? 11% interrupts.CPU16.TLB:TLB_shootdowns
1827166 -14.3% 1566330 interrupts.CPU17.LOC:Local_timer_interrupts
1543366 -16.9% 1282894 interrupts.CPU18.LOC:Local_timer_interrupts
457.67 ? 16% -24.0% 348.00 ? 11% interrupts.CPU18.RES:Rescheduling_interrupts
177.00 ? 5% -13.9% 152.33 ? 2% interrupts.CPU18.TLB:TLB_shootdowns
1543422 -16.9% 1283093 interrupts.CPU19.LOC:Local_timer_interrupts
2393481 -11.0% 2130761 interrupts.CPU2.LOC:Local_timer_interrupts
1546040 -17.0% 1283196 interrupts.CPU20.LOC:Local_timer_interrupts
166.50 ? 11% -23.7% 127.00 ? 8% interrupts.CPU20.TLB:TLB_shootdowns
1546005 -17.0% 1283065 interrupts.CPU21.LOC:Local_timer_interrupts
1543527 -16.9% 1282065 interrupts.CPU22.LOC:Local_timer_interrupts
166.67 ? 29% -33.7% 110.50 ? 5% interrupts.CPU22.NMI:Non-maskable_interrupts
166.67 ? 29% -33.7% 110.50 ? 5% interrupts.CPU22.PMI:Performance_monitoring_interrupts
174.00 ? 8% -20.0% 139.17 ? 12% interrupts.CPU22.TLB:TLB_shootdowns
1543652 -16.9% 1282473 interrupts.CPU23.LOC:Local_timer_interrupts
1544481 -16.8% 1284243 interrupts.CPU24.LOC:Local_timer_interrupts
1543634 -16.9% 1283131 interrupts.CPU25.LOC:Local_timer_interrupts
334.33 ? 17% -24.5% 252.33 ? 6% interrupts.CPU25.RES:Rescheduling_interrupts
1543478 -16.8% 1283754 interrupts.CPU26.LOC:Local_timer_interrupts
1270462 -20.9% 1004765 interrupts.CPU27.LOC:Local_timer_interrupts
1270759 -20.9% 1005025 interrupts.CPU28.LOC:Local_timer_interrupts
186.50 ? 36% -42.4% 107.33 ? 5% interrupts.CPU28.NMI:Non-maskable_interrupts
186.50 ? 36% -42.4% 107.33 ? 5% interrupts.CPU28.PMI:Performance_monitoring_interrupts
374.17 ? 25% -36.7% 236.83 ? 8% interrupts.CPU28.RES:Rescheduling_interrupts
127.50 ? 12% -26.4% 93.83 ? 9% interrupts.CPU28.TLB:TLB_shootdowns
1270411 -20.9% 1004873 interrupts.CPU29.LOC:Local_timer_interrupts
2391061 -10.9% 2131255 interrupts.CPU3.LOC:Local_timer_interrupts
1270552 -20.8% 1006109 interrupts.CPU30.LOC:Local_timer_interrupts
1270962 -20.8% 1006752 interrupts.CPU31.LOC:Local_timer_interrupts
1270802 -20.9% 1005129 interrupts.CPU32.LOC:Local_timer_interrupts
319.67 ? 15% -33.0% 214.17 ? 16% interrupts.CPU32.RES:Rescheduling_interrupts
128.83 ? 7% -26.9% 94.17 ? 10% interrupts.CPU32.TLB:TLB_shootdowns
1270949 -20.9% 1005088 interrupts.CPU33.LOC:Local_timer_interrupts
248.17 ? 12% -34.7% 162.17 ? 15% interrupts.CPU33.RES:Rescheduling_interrupts
1270982 -20.9% 1004757 interrupts.CPU34.LOC:Local_timer_interrupts
310.17 ? 16% -30.4% 216.00 ? 20% interrupts.CPU34.RES:Rescheduling_interrupts
119.17 ? 9% -38.7% 73.00 ? 15% interrupts.CPU34.TLB:TLB_shootdowns
1270915 -20.9% 1004952 interrupts.CPU35.LOC:Local_timer_interrupts
301.67 ? 25% -48.0% 156.83 ? 18% interrupts.CPU35.RES:Rescheduling_interrupts
1054340 -25.2% 789110 interrupts.CPU36.LOC:Local_timer_interrupts
289.83 ? 46% -48.9% 148.00 ? 25% interrupts.CPU36.RES:Rescheduling_interrupts
97.67 ? 17% -34.0% 64.50 ? 15% interrupts.CPU36.TLB:TLB_shootdowns
1054687 -25.2% 789300 interrupts.CPU37.LOC:Local_timer_interrupts
1054414 -25.2% 789161 interrupts.CPU38.LOC:Local_timer_interrupts
196.17 ? 15% -36.7% 124.17 ? 25% interrupts.CPU38.RES:Rescheduling_interrupts
107.00 ? 11% -34.0% 70.67 ? 13% interrupts.CPU38.TLB:TLB_shootdowns
1054753 -25.2% 789273 interrupts.CPU39.LOC:Local_timer_interrupts
140.00 ? 14% -41.0% 82.67 ? 32% interrupts.CPU39.RES:Rescheduling_interrupts
2108382 -12.4% 1847408 interrupts.CPU4.LOC:Local_timer_interrupts
1054713 -25.2% 788967 interrupts.CPU40.LOC:Local_timer_interrupts
158.67 ? 13% -48.9% 81.00 ? 11% interrupts.CPU40.RES:Rescheduling_interrupts
103.33 ? 10% -30.3% 72.00 ? 10% interrupts.CPU40.TLB:TLB_shootdowns
1054880 -25.2% 789300 interrupts.CPU41.LOC:Local_timer_interrupts
123.83 ? 10% -55.7% 54.83 ? 14% interrupts.CPU41.RES:Rescheduling_interrupts
1054994 -25.2% 789506 interrupts.CPU42.LOC:Local_timer_interrupts
166.83 ? 24% -54.7% 75.50 ? 26% interrupts.CPU42.RES:Rescheduling_interrupts
103.17 ? 7% -39.3% 62.67 ? 10% interrupts.CPU42.TLB:TLB_shootdowns
1055003 -25.2% 789503 interrupts.CPU43.LOC:Local_timer_interrupts
1055288 -25.2% 789506 interrupts.CPU44.LOC:Local_timer_interrupts
866255 -30.3% 603432 ? 2% interrupts.CPU45.LOC:Local_timer_interrupts
865836 -30.3% 603569 ? 2% interrupts.CPU46.LOC:Local_timer_interrupts
154.00 ? 55% -67.3% 50.33 ? 42% interrupts.CPU46.RES:Rescheduling_interrupts
866042 -30.3% 603627 ? 2% interrupts.CPU47.LOC:Local_timer_interrupts
83.83 ? 26% -64.4% 29.83 ? 60% interrupts.CPU47.RES:Rescheduling_interrupts
865870 -30.3% 603499 ? 2% interrupts.CPU48.LOC:Local_timer_interrupts
120.00 ? 29% -55.1% 53.83 ? 20% interrupts.CPU48.RES:Rescheduling_interrupts
866071 -30.3% 603546 ? 2% interrupts.CPU49.LOC:Local_timer_interrupts
2105988 -12.3% 1847508 interrupts.CPU5.LOC:Local_timer_interrupts
865920 -30.4% 602929 ? 2% interrupts.CPU50.LOC:Local_timer_interrupts
101.00 ? 7% -32.3% 68.33 ? 24% interrupts.CPU50.RES:Rescheduling_interrupts
865759 -30.3% 603586 ? 2% interrupts.CPU51.LOC:Local_timer_interrupts
866156 -30.3% 603941 ? 2% interrupts.CPU52.LOC:Local_timer_interrupts
866281 -30.3% 603837 ? 2% interrupts.CPU53.LOC:Local_timer_interrupts
67.33 ? 40% -58.4% 28.00 ? 22% interrupts.CPU53.RES:Rescheduling_interrupts
565709 -22.2% 439840 ? 3% interrupts.CPU54.LOC:Local_timer_interrupts
565883 -22.2% 440423 ? 3% interrupts.CPU55.LOC:Local_timer_interrupts
565905 -22.2% 440262 ? 3% interrupts.CPU56.LOC:Local_timer_interrupts
566181 -22.2% 440345 ? 3% interrupts.CPU57.LOC:Local_timer_interrupts
566142 -22.2% 440350 ? 3% interrupts.CPU58.LOC:Local_timer_interrupts
566296 -22.2% 440526 ? 3% interrupts.CPU59.LOC:Local_timer_interrupts
2106208 -12.3% 1846452 interrupts.CPU6.LOC:Local_timer_interrupts
566391 -22.2% 440854 ? 3% interrupts.CPU60.LOC:Local_timer_interrupts
566515 -22.2% 440885 ? 3% interrupts.CPU61.LOC:Local_timer_interrupts
566287 -22.2% 440707 ? 3% interrupts.CPU62.LOC:Local_timer_interrupts
2106005 -12.3% 1846338 interrupts.CPU7.LOC:Local_timer_interrupts
17.67 ? 22% +210.4% 54.83 ? 80% interrupts.CPU70.RES:Rescheduling_interrupts
2105251 -12.3% 1846309 interrupts.CPU8.LOC:Local_timer_interrupts
1826895 -14.3% 1566515 interrupts.CPU9.LOC:Local_timer_interrupts
94970645 -16.3% 79528620 interrupts.LOC:Local_timer_interrupts
34100 ? 3% -11.9% 30034 ? 7% interrupts.RES:Rescheduling_interrupts
9665 ? 2% -12.3% 8473 ? 4% interrupts.TLB:TLB_shootdowns
65771 -11.5% 58237 softirqs.BLOCK
90352 ? 8% -14.2% 77480 ? 4% softirqs.CPU0.RCU
134474 ? 2% -10.6% 120238 softirqs.CPU0.SCHED
75762 ? 8% -13.5% 65519 ? 5% softirqs.CPU1.RCU
128093 ? 3% -11.8% 112953 ? 2% softirqs.CPU1.SCHED
52807 ? 6% -16.8% 43921 ? 4% softirqs.CPU10.RCU
51475 ? 6% -16.3% 43064 ? 4% softirqs.CPU11.RCU
88448 ? 3% -13.9% 76149 ? 2% softirqs.CPU11.SCHED
52162 ? 6% -15.0% 44336 ? 6% softirqs.CPU12.RCU
91086 ? 3% -15.5% 76968 softirqs.CPU12.SCHED
51960 ? 6% -16.9% 43179 ? 4% softirqs.CPU13.RCU
89130 ? 4% -14.5% 76170 ? 2% softirqs.CPU13.SCHED
92616 -19.3% 74779 ? 3% softirqs.CPU14.SCHED
93149 -19.4% 75048 ? 3% softirqs.CPU15.SCHED
47550 ? 5% -19.6% 38228 ? 3% softirqs.CPU16.RCU
90706 ? 2% -16.2% 75982 softirqs.CPU16.SCHED
45657 ? 5% -19.7% 36662 ? 6% softirqs.CPU17.RCU
92509 ? 2% -18.4% 75524 ? 3% softirqs.CPU17.SCHED
40345 ? 5% -18.9% 32722 ? 4% softirqs.CPU18.RCU
74826 ? 4% -20.3% 59656 ? 2% softirqs.CPU18.SCHED
40038 ? 5% -19.3% 32309 ? 4% softirqs.CPU19.RCU
74933 ? 2% -20.5% 59544 ? 3% softirqs.CPU19.SCHED
71513 ? 10% -14.2% 61325 ? 3% softirqs.CPU2.RCU
125281 -12.1% 110115 ? 2% softirqs.CPU2.SCHED
40913 ? 5% -20.3% 32627 ? 7% softirqs.CPU20.RCU
75012 ? 3% -21.4% 58983 ? 3% softirqs.CPU20.SCHED
40330 ? 6% -19.7% 32390 ? 5% softirqs.CPU21.RCU
77061 -21.9% 60193 softirqs.CPU21.SCHED
41108 ? 6% -20.4% 32733 ? 5% softirqs.CPU22.RCU
73758 ? 8% -22.0% 57534 ? 5% softirqs.CPU22.SCHED
40504 ? 6% -21.7% 31719 ? 3% softirqs.CPU23.RCU
75619 ? 2% -21.4% 59449 ? 3% softirqs.CPU23.SCHED
40737 ? 5% -21.1% 32124 ? 4% softirqs.CPU24.RCU
75078 ? 2% -20.2% 59906 ? 3% softirqs.CPU24.SCHED
42275 ? 6% -24.2% 32065 ? 6% softirqs.CPU25.RCU
75249 ? 2% -21.1% 59349 ? 3% softirqs.CPU25.SCHED
39777 ? 4% -18.1% 32579 ? 4% softirqs.CPU26.RCU
58509 ? 8% -22.8% 45183 ? 5% softirqs.CPU26.SCHED
32999 ? 5% -22.6% 25541 ? 5% softirqs.CPU27.RCU
62443 -26.7% 45782 ? 2% softirqs.CPU27.SCHED
33337 ? 4% -23.0% 25662 ? 4% softirqs.CPU28.RCU
61437 ? 4% -28.8% 43752 ? 5% softirqs.CPU28.SCHED
33225 ? 5% -23.7% 25339 ? 4% softirqs.CPU29.RCU
62395 -26.4% 45927 ? 2% softirqs.CPU29.SCHED
69894 ? 9% -16.4% 58424 ? 2% softirqs.CPU3.RCU
127016 -12.5% 111078 softirqs.CPU3.SCHED
32342 ? 6% -21.0% 25549 ? 5% softirqs.CPU30.RCU
60908 ? 4% -27.3% 44268 ? 6% softirqs.CPU30.SCHED
33191 ? 9% -23.5% 25381 ? 4% softirqs.CPU31.RCU
63086 -26.9% 46135 ? 2% softirqs.CPU31.SCHED
31248 ? 4% -26.2% 23075 ? 3% softirqs.CPU32.RCU
61965 -31.4% 42520 ? 6% softirqs.CPU32.SCHED
31398 ? 7% -26.9% 22941 ? 3% softirqs.CPU33.RCU
61620 ? 3% -26.5% 45277 ? 4% softirqs.CPU33.SCHED
32230 ? 8% -28.5% 23030 ? 5% softirqs.CPU34.RCU
58406 ? 7% -24.6% 44064 ? 6% softirqs.CPU34.SCHED
31962 ? 6% -28.5% 22842 ? 4% softirqs.CPU35.RCU
62155 -28.3% 44555 ? 5% softirqs.CPU35.SCHED
26867 ? 5% -27.8% 19399 ? 3% softirqs.CPU36.RCU
50147 ? 5% -32.1% 34040 ? 7% softirqs.CPU36.SCHED
27216 ? 4% -29.5% 19192 ? 4% softirqs.CPU37.RCU
51279 ? 4% -30.1% 35834 ? 2% softirqs.CPU37.SCHED
27426 ? 4% -29.8% 19262 ? 3% softirqs.CPU38.RCU
49670 ? 6% -32.1% 33706 ? 8% softirqs.CPU38.SCHED
27227 ? 5% -29.3% 19242 ? 2% softirqs.CPU39.RCU
52107 -32.2% 35338 ? 3% softirqs.CPU39.SCHED
62096 ? 9% -17.9% 50995 ? 3% softirqs.CPU4.RCU
112254 -16.9% 93337 ? 3% softirqs.CPU4.SCHED
27153 ? 4% -28.9% 19299 ? 3% softirqs.CPU40.RCU
50083 ? 5% -35.2% 32432 ? 10% softirqs.CPU40.SCHED
26850 ? 4% -30.1% 18777 ? 3% softirqs.CPU41.RCU
50450 ? 6% -30.3% 35173 ? 3% softirqs.CPU41.SCHED
27442 ? 4% -28.1% 19734 ? 4% softirqs.CPU42.RCU
48383 ? 8% -34.3% 31774 ? 6% softirqs.CPU42.SCHED
28193 ? 7% -31.1% 19425 ? 3% softirqs.CPU43.RCU
50892 ? 4% -32.2% 34529 ? 4% softirqs.CPU43.SCHED
27629 ? 5% -30.3% 19254 ? 3% softirqs.CPU44.RCU
40315 ? 11% -40.3% 24059 ? 18% softirqs.CPU44.SCHED
23882 ? 6% -34.8% 15564 ? 2% softirqs.CPU45.RCU
39335 ? 9% -30.6% 27280 ? 4% softirqs.CPU45.SCHED
23035 ? 6% -34.9% 14990 ? 3% softirqs.CPU46.RCU
38299 ? 11% -33.7% 25373 ? 11% softirqs.CPU46.SCHED
23186 ? 5% -34.9% 15102 ? 3% softirqs.CPU47.RCU
42818 ? 4% -40.9% 25322 ? 12% softirqs.CPU47.SCHED
20780 ? 5% -29.1% 14728 ? 15% softirqs.CPU48.RCU
37847 ? 14% -34.7% 24721 ? 13% softirqs.CPU48.SCHED
21694 ? 10% -35.1% 14076 ? 5% softirqs.CPU49.RCU
40590 ? 11% -39.4% 24611 ? 15% softirqs.CPU49.SCHED
62904 ? 9% -18.2% 51485 ? 6% softirqs.CPU5.RCU
111131 -16.1% 93231 ? 2% softirqs.CPU5.SCHED
20330 ? 5% -28.5% 14529 ? 5% softirqs.CPU50.RCU
41153 ? 6% -45.4% 22484 ? 16% softirqs.CPU50.SCHED
20608 ? 8% -31.3% 14161 ? 2% softirqs.CPU51.RCU
41691 ? 8% -37.5% 26063 ? 9% softirqs.CPU51.SCHED
20743 ? 5% -28.0% 14944 ? 6% softirqs.CPU52.RCU
20310 ? 4% -30.5% 14124 ? 3% softirqs.CPU53.RCU
41950 ? 5% -35.3% 27133 ? 4% softirqs.CPU53.SCHED
14945 ? 2% -22.1% 11637 ? 13% softirqs.CPU54.RCU
14535 ? 5% -20.9% 11500 ? 6% softirqs.CPU55.RCU
28273 ? 2% -36.8% 17868 ? 12% softirqs.CPU55.SCHED
15457 ? 6% -23.6% 11808 ? 8% softirqs.CPU56.RCU
26458 ? 10% -32.9% 17749 ? 16% softirqs.CPU56.SCHED
14839 ? 6% -19.9% 11883 ? 9% softirqs.CPU57.RCU
25944 ? 10% -23.4% 19871 ? 7% softirqs.CPU57.SCHED
15548 ? 5% -25.1% 11640 ? 6% softirqs.CPU58.RCU
27354 ? 9% -34.2% 18001 ? 17% softirqs.CPU58.SCHED
14830 ? 4% -21.2% 11685 ? 8% softirqs.CPU59.RCU
28093 ? 2% -31.7% 19191 ? 8% softirqs.CPU59.SCHED
62449 ? 8% -16.9% 51888 ? 2% softirqs.CPU6.RCU
108906 ? 2% -13.5% 94210 ? 2% softirqs.CPU6.SCHED
26747 ? 11% -29.0% 18988 ? 12% softirqs.CPU60.SCHED
14620 ? 5% -22.3% 11357 ? 5% softirqs.CPU61.RCU
27088 ? 8% -26.3% 19975 ? 7% softirqs.CPU61.SCHED
14599 ? 5% -21.9% 11399 ? 6% softirqs.CPU62.RCU
60751 ? 9% -15.9% 51077 ? 4% softirqs.CPU7.RCU
111493 -16.1% 93573 ? 2% softirqs.CPU7.SCHED
62457 ? 9% -18.8% 50696 ? 4% softirqs.CPU8.RCU
92965 ? 6% -13.9% 80040 softirqs.CPU8.SCHED
52370 ? 6% -16.6% 43660 ? 4% softirqs.CPU9.RCU
91668 ? 2% -18.4% 74818 ? 4% softirqs.CPU9.SCHED
2795478 ? 4% -17.3% 2312988 ? 2% softirqs.RCU
4761845 -19.4% 3840288 ? 2% softirqs.SCHED
fxmark.hdd_btrfs_DWAL_54_bufferedio.works
5e+06 +-----------------------------------------------------------------+
4.5e+06 |-+ +.+++.++.++.++. .++. ++.+ .+ ++.+++.+ |
|+.++.+ +++.++.++.+++.++ + + :+ |
4e+06 |-+ + |
3.5e+06 |-+ |
| |
3e+06 |-+ |
2.5e+06 |-+ OO O OO O OOO OO OO OOO O |
2e+06 |O+OO OO OOO OO O OOO O OO OO OOO OO OO O|
| |
1.5e+06 |-+ |
1e+06 |-+ |
| |
500000 |-+ |
0 +-----------------------------------------------------------------+
fxmark.hdd_btrfs_DWAL_54_bufferedio.works_sec
160000 +------------------------------------------------------------------+
|+.++.+ ++.++.++.++.+++.+ + + : : |
140000 |-+ : |
120000 |-+ + |
| |
100000 |-+ |
| |
80000 |O+OO OO OO OO OOO OO OO OO OO OO OOO OO OO OO OO O O OO OO OO OO O|
| |
60000 |-+ |
40000 |-+ |
| |
20000 |-+ |
| |
0 +------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
That's quite the huge regression.
But:
> due to commit: f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
That's _literally_ just changing a dynamically allocated per-cpu array
of "long[]" to an array of "s32[]" and in the process shrinking it
from 304 bytes to 152 bytes.
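For reference, the change boils down to roughly the following (a sketch, not the literal diff; the batched_lruvec_stat name matches the struct quoted later in this thread, and the 38-entry item count is only inferred from the 304-byte/152-byte figures above):

	/* Sketch only: NR_VM_NODE_STAT_ITEMS assumed to be 38, which is what
	 * the 304-byte (38 * 8) and 152-byte (38 * 4) figures imply. */

	/* Before: the dynamically allocated per-cpu batch used full longs. */
	struct lruvec_stat {
		long count[NR_VM_NODE_STAT_ITEMS];	/* 304 bytes on x86-64 */
	};

	/* After: the batch only ever holds a small bounded delta before it
	 * is flushed, so s32 is wide enough and the allocation shrinks. */
	struct batched_lruvec_stat {
		s32 count[NR_VM_NODE_STAT_ITEMS];	/* 152 bytes */
	};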
> in testcase: fxmark
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
I think this must be some really random memory layout issue that
causes some false sharing or similar.
And it's not even that some fundamental data structure gets a
different layout, it literally is just either:
 (a) the (now smaller) array is allocated from a different chunk,
and that then causes random cache effects with something else
(b) the (old, and bigger) array was more spread out, and as a result
had different fields in different cachelines and less false sharing
Normally I'd say that (b) is the obvious case, except for the fact
that this is a __percpu array.
So in the common case there shouldn't be any cache contention _within_
the array itself. Any cache contention should be with something else
very hot that the change now randomly makes it land in the same cache way
or whatever.
Afaik, only the flushing of the vmstats batches does access the percpu
arrays from other CPUs, so (b) is not _entirely_ impossible if
memcg_flush_percpu_vmstats() were to be very very very hot.
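For illustration, a hypothetical sketch of what such a flush looks like (the names are illustrative, not the exact mm/memcontrol.c code); the point is that this is the one path that reads other CPUs' per-cpu batch arrays:

	static void flush_lruvec_percpu_stats(struct mem_cgroup_per_node *pn,
					      long *sum)
	{
		int cpu, i;

		for_each_online_cpu(cpu) {
			/* another CPU's per-cpu batch array */
			struct batched_lruvec_stat *batch =
				per_cpu_ptr(pn->lruvec_stat_cpu, cpu);

			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
				sum[i] += batch->count[i];
		}
	}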
But the profile data doesn't show anything remotely like that.
In fact, the actual report seems to show things improving, ie things
like elapsed time going down:
> 1361 -9.5% 1232 fxmark.time.elapsed_time
> 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> 343.68 -2.0% 336.76 fxmark.time.system_time
> 23.66 +2.5 26.12 mpstat.cpu.all.sys%
but I don't know what the benchmark actually does, so maybe it just
repeats things until it hits a certain confidence interval, and thus
"elapsed time" is immaterial.
Honestly, normally if I were to get a report about "52% regression"
for a commit that is supposed to optimize something, I'd just revert
the commit as a case of "ok, that optimization clearly didn't work".
But there is absolutely no sign that this commit is actually the
culprit, or explanation for why that should be, and what could be
going on.
So I'm going to treat this as a "bisection failure, possibly due to
random code or data layout changes". Particularly since this seems to
be a 4-socket Xeon Phi machine, which I think is likely a very very
fragile system if there is some odd cache layout issue.
If somebody can actually figure out what happened there, that would be
good, but for now it goes into my "archived as a random outlier"
folder.
Linus
On Mon 15-03-21 13:42:50, Linus Torvalds wrote:
> If somebody can actually figure out what happened there, that would be
> good, but for now it goes into my "archived as a random outlier"
> folder.
This is not something new. We have seen reports like that in the past
already. In many cases there was no apparent reason except for potential
code alignment: e.g. http://lkml.kernel.org/r/20200409135029.GA2072@xsang-OptiPlex-9020
There were some other memcg-related reports which didn't really show any
increase in cache misses or similar. I cannot find those in my notes
right now. There were others like https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
with analysis http://lkml.kernel.org/r/[email protected]
which actually pointed to something legit.
For this particular report I do not see any real relation to the patch
either.
--
Michal Hocko
SUSE Labs
Hi Linus,
On Mon, Mar 15, 2021 at 01:42:50PM -0700, Linus Torvalds wrote:
> On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
> <[email protected]> wrote:
> >
> > FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
>
> That's quite the huge regression.
>
> But:
>
> > due to commit: f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
>
> That's _literally_ just changing a dynamically allocated per-cpu array
> of "long[]" to an array of "s32[]" and in the process shrinking it
> from 304 bytes to 152 bytes.
>
> > in testcase: fxmark
> > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
>
> I think this must be some really random memory layout issue that
> causes some false sharing or similar.
>
> And it's not even that some fundamental data structure gets a
> different layout, it literally is just either:
>
>  (a) the (now smaller) array is allocated from a different chunk,
> and that then causes random cache effects with something else
>
> (b) the (old, and bigger) array was more spread out, and as a result
> had different fields in different cachelines and less false sharing
>
> Normally I'd say that (b) is the obvious case, except for the fact
> that this is a __percpu array.
>
> So in the common case there shouldn't be any cache contention _within_
> the array itself. Any cache contention should be with something else
> very hot that the change now randomly makes it land in the same cache way
> or whatever.
>
> Afaik, only the flushing of the vmstats batches does access the percpu
> arrays from other CPUs, so (b) is not _entirely_ impossible if
> memcg_flush_percpu_vmstats() were to be very very very hot.
>
> But the profile data doesn't show anything remotely like that.
>
> In fact, the actual report seems to show things improving, ie things
> like elapsed time going down:
>
> > 1361 -9.5% 1232 fxmark.time.elapsed_time
> > 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> > 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> > 343.68 -2.0% 336.76 fxmark.time.system_time
> > 23.66 +2.5 26.12 mpstat.cpu.all.sys%
>
> but I don't know what the benchmark actually does, so maybe it just
> repeats things until it hits a certain confidence interval, and thus
> "elapsed time" is immaterial.
I just checked the benchmark; it seems to benchmark the filesystem's
scalability by doing file/inode operations with different task numbers
(from 1 to nr_cpus).
The benchmark has preparation and cleanup steps before/after every
run, and the roughly 100 fewer seconds of 'fxmark.time.elapsed_time' you
found come from the older kernel spending about 100 seconds more in the
cleanup step after the run (a point worth checking)
> Honestly, normally if I were to get a report about "52% regression"
> for a commit that is supposed to optimize something, I'd just revert
> the commit as a case of "ok, that optimization clearly didn't work".
>
> But there is absolutely no sign that this commit is actually the
> culprit, or explanation for why that should be, and what could be
> going on.
>
> So I'm going to treat this as a "bisection failure, possibly due to
> random code or data layout changes". Particularly since this seems to
> be a 4-socket Xeon Phi machine, which I think is likely a very very
> fragile system if there is some odd cache layout issue.
Oliver retested it and made it run 12 times in total, and the data
is consistent. We tried some other tests:
* run other sub-cases of this 'fxmark', which show no regression
* change 'btrfs' to 'ext4' for this case: no regression
* test on a Cascade Lake platform: no regression
So the bisection seems to be stable, though it can't be explained yet.
We checked the System.map of the 2 kernels and didn't find an obvious
code/data alignment change, which is expected, as the commit changes
a data structure that is dynamically allocated.
Anyway, we will keep checking this and report back when there is
new data.
> If somebody can actually figure out what happened there, that would be
> good, but for now it goes into my "archived as a random outlier"
> folder.
Agreed. We shouldn't take action before this change is root-caused.
Thanks,
Feng
> Linus
Hi Linus,
Some updates on this: we found the regression is related to the
percpu data change and BTRFS, though the exact relation is unknown yet.
Some details below.
+ Michal, who helped by providing useful links for checking it.
+ Josef Bacik, as this is BTRFS related.
On Fri, Mar 19, 2021 at 11:21:44AM +0800, Feng Tang wrote:
> Hi Linus,
>
> On Mon, Mar 15, 2021 at 01:42:50PM -0700, Linus Torvalds wrote:
> > On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
> > <[email protected]> wrote:
> > > in testcase: fxmark
> > > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> >
> > I think this must be some really random memory layout issue that
> > causes some false sharing or similar.
> >
> > And it's not even that some fundamental data structure gets a
> > different layout, it literally is just either:
> >
> >  (a) the (now smaller) array is allocated from a different chunk,
> > and that then causes random cache effects with something else
> >
> > (b) the (old, and bigger) array was more spread out, and as a result
> > had different fields in different cachelines and less false sharing
> >
> > Normally I'd say that (b) is the obvious case, except for the fact
> > that this is a __percpu array.
> >
> > So in the common case there shouldn't be any cache contention _within_
> > the array itself. Any cache contention should be with something else
> > very hot that the change now randomly makes it land in the same cache way
> > or whatever.
> >
> > Afaik, only the flushing of the vmstats batches does access the percpu
> > arrays from other CPUs, so (b) is not _entirely_ impossible if
> > memcg_flush_percpu_vmstats() were to be very very very hot.
> >
> > But the profile data doesn't show anything remotely like that.
> >
> > In fact, the actual report seems to show things improving, ie things
> > like elapsed time going down:
> >
> > > 1361 -9.5% 1232 fxmark.time.elapsed_time
> > > 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> > > 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> > > 343.68 -2.0% 336.76 fxmark.time.system_time
> > > 23.66 +2.5 26.12 mpstat.cpu.all.sys%
> >
> > but I don't know what the benchmark actually does, so maybe it just
> > repeats things until it hits a certain confidence interval, and thus
> > "elapsed time" is immaterial.
>
> I just checked the benchmark; it seems to benchmark the filesystem's
> scalability by doing file/inode operations with different task numbers
> (from 1 to nr_cpus).
>
> The benchmark has preparation and cleanup steps before/after every
> run, and the roughly 100 fewer seconds of 'fxmark.time.elapsed_time' you
> found come from the older kernel spending about 100 seconds more in the
> cleanup step after the run (a point worth checking)
Yes, the longer running time is because the cleanup ('umount' specifically)
of the older kernel takes longer, as it completes more FS operations.
Also, the perf-profile info in the original report was not accurate: the
test tried 72/63/54/45/36/27/18/9/1 tasks, and the perf data recorded may
not reflect the "63 tasks" subcase.
> > Honestly, normally if I were to get a report about "52% regression"
> > for a commit that is supposed to optimize something, I'd just revert
> > the commit as a case of "ok, that optimization clearly didn't work".
> >
> > But there is absolutely no sign that this commit is actually the
> > culprit, or explanation for why that should be, and what could be
> > going on.
> >
> > So I'm going to treat this as a "bisection failure, possibly due to
> > random code or data layout changes". Particularly since this seems to
> > be a 4-socket Xeon Phi machine, which I think is likely a very very
> > fragile system if there is some odd cache layout issue.
>
> Oliver retested it and made it run 12 times in total, and the data
> is consistent. We tried some other tests:
> * run other sub-cases of this 'fxmark', which show no regression
> * change 'btrfs' to 'ext4' for this case: no regression
> * test on a Cascade Lake platform: no regression
>
> So the bisection seems to be stable, though it can't be explained yet.
>
> We checked the System.map of the 2 kernels and didn't find an obvious
> code/data alignment change, which is expected, as the commit changes
> a data structure that is dynamically allocated.
We found that with the commit, some percpu-related ops do show some change,
as shown in perf:
old kernel
----------
1.06% 0.69% [kernel.kallsyms] [k] __percpu_counter_sum - -
1.06% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
89.85% 88.17% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
45.27% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
44.51% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
new kernel
----------
1.33% 1.14% [kernel.kallsyms] [k] __percpu_counter_sum - -
1.33% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
95.95% 95.31% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
48.56% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
47.33% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
__percpu_counter_sum is usually costly on platforms with many CPUs, and
it does rise quite a bit here. It is called in fs/btrfs/space-info.c:
need_preemptive_reclaim
ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
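For context, a simplified sketch of the difference between the two percpu_counter helpers (not the exact lib/percpu_counter.c code): the _sum variant walks every online CPU's local delta under the counter's spinlock, so its cost grows with the CPU count (288 here), while the _read variant just returns the possibly slightly stale central value:

	s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
	{
		s64 ret;
		int cpu;

		raw_spin_lock(&fbc->lock);	/* serializes concurrent summers */
		ret = fbc->count;
		for_each_online_cpu(cpu)	/* O(nr_cpus) walk of local deltas */
			ret += *per_cpu_ptr(fbc->counters, cpu);
		raw_spin_unlock(&fbc->lock);
		return ret < 0 ? 0 : ret;
	}

	s64 percpu_counter_read_positive(struct percpu_counter *fbc)
	{
		s64 ret = fbc->count;		/* O(1), may lag by the per-cpu deltas */
		return ret < 0 ? 0 : ret;
	}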
And we did 2 experiments:
1. Change the percpu_counter_sum_positive() calls inside need_preemptive_reclaim()
to percpu_counter_read_positive(), which skips looping over online CPUs to get
the sum; the regression is gone, and the result is even better than before.
2. Add some padding to restore the original size of the percpu struct; the
regression is also gone.
struct batched_lruvec_stat {
s32 count[NR_VM_NODE_STAT_ITEMS];
+ s32 pad[NR_VM_NODE_STAT_ITEMS];
};
So we think the regression is related to Muchun's patch and BTRFS.
Another thing I tried is changing the number of test tasks from '63' to
'65', and the regression is also gone, which is something mysterious!
This is only reproduced on the Xeon Phi, which lacks some perf events, so
tools like perf c2c can't be run to pin-point specific data or text.
Thanks,
Feng
On Thu, Mar 25, 2021 at 02:51:42PM +0800, Feng Tang wrote:
> > > Honestly, normally if I were to get a report about "52% regression"
> > > for a commit that is supposed to optimize something, I'd just revert
> > > the commit as a case of "ok, that optimization clearly didn't work".
> > >
> > > But there is absolutely no sign that this commit is actually the
> > > culprit, or explanation for why that should be, and what could be
> > > going on.
> > >
> > > So I'm going to treat this as a "bisection failure, possibly due to
> > > random code or data layout changes". Particularly since this seems to
> > > be a 4-socket Xeon Phi machine, which I think is likely a very very
> > > fragile system if there is some odd cache layout issue.
> >
> > Oliver retested it and made it run 12 times in total, and the data
> > is consistent. We tried some other tests:
> > * run other sub-cases of this 'fxmark', which show no regression
> > * change 'btrfs' to 'ext4' for this case: no regression
> > * test on a Cascade Lake platform: no regression
> >
> > So the bisection seems to be stable, though it can't be explained yet.
> >
> > We checked the System.map of the 2 kernels and didn't find an obvious
> > code/data alignment change, which is expected, as the commit changes
> > a data structure that is dynamically allocated.
>
> We found that with the commit, some percpu-related ops do show some change,
> as shown in perf:
>
> old kernel
> ----------
> 1.06% 0.69% [kernel.kallsyms] [k] __percpu_counter_sum - -
> 1.06% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
>
> 89.85% 88.17% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
> 45.27% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
> 44.51% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
>
>
> new kernel
> ----------
> 1.33% 1.14% [kernel.kallsyms] [k] __percpu_counter_sum - -
> 1.33% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
>
> 95.95% 95.31% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
> 48.56% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
> 47.33% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
>
> __percpu_counter_sum is usually costly on platforms with many CPUs, and
> it does rise quite a bit here. It is called in fs/btrfs/space-info.c:
> need_preemptive_reclaim
> ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
>
> And we did 2 experiments:
> 1. Change the percpu_counter_sum_positive() calls inside need_preemptive_reclaim()
> to percpu_counter_read_positive(), which skips looping over online CPUs to get
> the sum; the regression is gone, and the result is even better than before.
Interestingly, we just got a mail from Oliver about an 81.3% btrfs
performance improvement:
https://lore.kernel.org/lkml/20210325055609.GA13061@xsang-OptiPlex-9020/
which comes from Josef's patch doing the same conversion of the percpu
counter reading functions.
I guess with the patch, this regression will be gone, and several other
old regressions will be gone too (Thanks Josef):
https://lore.kernel.org/lkml/20201104061657.GB15746@xsang-OptiPlex-9020/
https://lore.kernel.org/lkml/20210305083757.GF31481@xsang-OptiPlex-9020/
Thanks,
Feng