Greetings,
FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec due to commit:
commit: f3344adf38bdb3107d40483dd9501215ad40edce ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: fxmark
on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
with following parameters:
disk: 1HDD
media: hdd
test: DWAL
fstype: btrfs
directio: bufferedio
cpufreq_governor: performance
ucode: 0x11
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml
=========================================================================================
compiler/cpufreq_governor/directio/disk/fstype/kconfig/media/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/bufferedio/1HDD/btrfs/x86_64-rhel-8.3/hdd/debian-10.4-x86_64-20200603.cgz/lkp-knm01/DWAL/fxmark/0x11
commit:
2e9bd48315 ("mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNT")
f3344adf38 ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
2e9bd483159939ed f3344adf38bdb3107d40483dd95
---------------- ---------------------------
%stddev %change %stddev
\ | \
8.36 ? 93% -99.0% 0.08 ?155% fxmark.hdd_btrfs_DWAL_54_bufferedio.iowait_sec
0.51 ? 93% -99.0% 0.01 ?155% fxmark.hdd_btrfs_DWAL_54_bufferedio.iowait_util
1.25 ? 4% -15.6% 1.06 fxmark.hdd_btrfs_DWAL_54_bufferedio.softirq_sec
0.08 ? 3% -15.4% 0.06 fxmark.hdd_btrfs_DWAL_54_bufferedio.softirq_util
7.08 -31.4% 4.85 fxmark.hdd_btrfs_DWAL_54_bufferedio.user_sec
0.43 -31.3% 0.30 fxmark.hdd_btrfs_DWAL_54_bufferedio.user_util
4718626 -51.2% 2300833 fxmark.hdd_btrfs_DWAL_54_bufferedio.works
156695 -51.1% 76693 fxmark.hdd_btrfs_DWAL_54_bufferedio.works/sec
0.30 ? 27% -92.9% 0.02 ?223% fxmark.hdd_btrfs_DWAL_63_bufferedio.iowait_sec
0.02 ? 27% -92.9% 0.00 ?223% fxmark.hdd_btrfs_DWAL_63_bufferedio.iowait_util
1.24 -19.4% 1.00 fxmark.hdd_btrfs_DWAL_63_bufferedio.softirq_sec
0.07 -19.4% 0.05 fxmark.hdd_btrfs_DWAL_63_bufferedio.softirq_util
6.94 ? 2% -29.8% 4.87 fxmark.hdd_btrfs_DWAL_63_bufferedio.user_sec
0.36 ? 2% -29.8% 0.26 fxmark.hdd_btrfs_DWAL_63_bufferedio.user_util
4210924 -52.4% 2003719 fxmark.hdd_btrfs_DWAL_63_bufferedio.works
140363 -52.4% 66789 fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
1361 -9.5% 1232 fxmark.time.elapsed_time
1361 -9.5% 1232 fxmark.time.elapsed_time.max
25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
343.68 -2.0% 336.76 fxmark.time.system_time
23.66 +2.5 26.12 mpstat.cpu.all.sys%
59226 -11.4% 52495 uptime.idle
3.967e+10 -18.7% 3.224e+10 cpuidle.C1.time
75583513 -20.6% 60007388 cpuidle.C1.usage
8999 ? 3% -17.9% 7389 ? 8% sched_debug.cpu.curr->pid.avg
6180 ? 5% -31.2% 4252 ? 9% sched_debug.cpu.curr->pid.min
0.04 ? 42% -51.9% 0.02 ? 20% perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
15.18 ? 73% -86.0% 2.12 ?128% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_kthread.kthread.ret_from_fork
135.95 ?123% -68.6% 42.65 ? 42% perf-sched.total_sch_delay.max.ms
58.64 -6.1% 55.04 iostat.cpu.idle
12.71 ? 2% +6.4% 13.52 ? 2% iostat.cpu.iowait
25.74 +9.7% 28.23 iostat.cpu.system
1.09 +9.1% 1.18 iostat.cpu.user
40785050 -10.2% 36631164 numa-numastat.node0.local_node
2545800 ? 2% -30.8% 1762639 ? 4% numa-numastat.node0.numa_foreign
40784981 -10.2% 36631126 numa-numastat.node0.numa_hit
2545800 ? 2% -30.8% 1762639 ? 4% numa-numastat.node1.numa_miss
2545801 ? 2% -30.8% 1762639 ? 4% numa-numastat.node1.other_node
58.33 -6.6% 54.50 vmstat.cpu.id
117701 -2.5% 114736 vmstat.io.bo
11919766 -10.4% 10683231 vmstat.memory.cache
7.00 +14.3% 8.00 vmstat.procs.r
70753 -7.3% 65591 vmstat.system.in
6877 -12.8% 5997 meminfo.Active
5228 -15.4% 4425 meminfo.Active(file)
11827996 -10.4% 10600714 meminfo.Cached
1868088 ? 6% -10.0% 1681283 ? 7% meminfo.DirectMap4k
11002944 -11.1% 9776954 meminfo.Inactive
10721808 -11.4% 9496681 meminfo.Inactive(file)
2044 ? 10% -5.3% 1937 ? 11% meminfo.Mlocked
5067083 -12.6% 4429136 meminfo.Writeback
34364 -7.8% 31683 perf-stat.i.cpu-clock
64.81 -3.5% 62.57 perf-stat.i.cpu-migrations
0.82 ? 4% +11.2% 0.91 ? 2% perf-stat.i.major-faults
0.42 +7.7% 0.46 perf-stat.i.metric.K/sec
34364 -7.8% 31683 perf-stat.i.task-clock
34581 -7.6% 31954 perf-stat.ps.cpu-clock
65.02 -3.4% 62.81 perf-stat.ps.cpu-migrations
0.82 ? 4% +11.3% 0.92 ? 2% perf-stat.ps.major-faults
34581 -7.6% 31954 perf-stat.ps.task-clock
2151 -9.7% 1943 ? 2% slabinfo.Acpi-State.active_objs
1591 -10.0% 1431 slabinfo.btrfs_delayed_node.active_objs
2647 ? 8% -17.2% 2191 ? 8% slabinfo.btrfs_delayed_tree_ref.active_objs
2792 ? 7% -15.7% 2354 ? 8% slabinfo.btrfs_delayed_tree_ref.num_objs
3441 ? 6% -24.7% 2591 ? 4% slabinfo.btrfs_extent_map.active_objs
3509 ? 6% -24.1% 2664 ? 4% slabinfo.btrfs_extent_map.num_objs
3576 ? 5% -11.6% 3162 ? 6% slabinfo.fsnotify_mark_connector.active_objs
2366 -10.8% 2109 slabinfo.khugepaged_mm_slot.active_objs
1967 -10.0% 1771 slabinfo.kmalloc-rcl-256.active_objs
2361 ? 3% -13.8% 2036 ? 9% slabinfo.mnt_cache.active_objs
2576 ? 3% -13.2% 2235 ? 8% slabinfo.mnt_cache.num_objs
3064 -13.9% 2638 ? 3% slabinfo.pool_workqueue.active_objs
6899 -13.0% 6005 ? 2% numa-meminfo.node0.Active
5248 -15.6% 4431 numa-meminfo.node0.Active(file)
10448455 -9.8% 9424769 numa-meminfo.node0.FilePages
10134571 -10.1% 9112146 numa-meminfo.node0.Inactive
9853444 -10.4% 8831873 numa-meminfo.node0.Inactive(file)
1172 ? 10% -5.4% 1109 ? 11% numa-meminfo.node0.Mlocked
4417620 -11.5% 3910520 numa-meminfo.node0.Writeback
45623 ? 11% -30.3% 31801 ? 11% numa-meminfo.node1.Dirty
1394970 -14.7% 1189612 ? 3% numa-meminfo.node1.FilePages
883781 ? 2% -23.2% 678451 ? 5% numa-meminfo.node1.Inactive
883781 ? 2% -23.2% 678451 ? 5% numa-meminfo.node1.Inactive(file)
1493112 -13.9% 1286160 ? 2% numa-meminfo.node1.MemUsed
657674 ? 3% -20.0% 526389 ? 4% numa-meminfo.node1.Writeback
1312 -15.4% 1110 numa-vmstat.node0.nr_active_file
20273962 -10.5% 18150513 numa-vmstat.node0.nr_dirtied
2614224 -9.8% 2358574 numa-vmstat.node0.nr_file_pages
2465467 -10.3% 2210356 numa-vmstat.node0.nr_inactive_file
292.67 ? 10% -5.5% 276.67 ? 11% numa-vmstat.node0.nr_mlock
1105637 -11.5% 978820 numa-vmstat.node0.nr_writeback
18953745 -10.6% 16954017 numa-vmstat.node0.nr_written
1312 -15.4% 1110 numa-vmstat.node0.nr_zone_active_file
2465467 -10.3% 2210355 numa-vmstat.node0.nr_zone_inactive_file
1293666 ? 3% -45.3% 707946 ? 7% numa-vmstat.node0.numa_foreign
1293694 ? 3% -45.3% 707978 ? 7% numa-vmstat.node1.nr_dirtied
11396 ? 11% -30.2% 7949 ? 11% numa-vmstat.node1.nr_dirty
349139 -14.7% 297821 ? 3% numa-vmstat.node1.nr_file_pages
221342 ? 2% -23.2% 170030 ? 5% numa-vmstat.node1.nr_inactive_file
164784 ? 3% -19.9% 131954 ? 4% numa-vmstat.node1.nr_writeback
1117506 ? 3% -49.2% 568071 ? 7% numa-vmstat.node1.nr_written
221342 ? 2% -23.2% 170030 ? 5% numa-vmstat.node1.nr_zone_inactive_file
176187 ? 3% -20.6% 139906 ? 4% numa-vmstat.node1.nr_zone_write_pending
1293707 ? 3% -45.3% 707981 ? 7% numa-vmstat.node1.numa_miss
1440348 ? 2% -40.7% 854565 ? 5% numa-vmstat.node1.numa_other
413.17 -5.0% 392.33 ? 5% proc-vmstat.nr_active_anon
1311 -15.6% 1107 proc-vmstat.nr_active_file
39929120 -11.7% 35246897 proc-vmstat.nr_dirtied
2964992 -10.4% 2657260 proc-vmstat.nr_file_pages
17305726 +1.8% 17614053 proc-vmstat.nr_free_pages
2688431 -11.4% 2381244 proc-vmstat.nr_inactive_file
511.83 ? 10% -5.3% 484.67 ? 11% proc-vmstat.nr_mlock
9266 -3.8% 8917 proc-vmstat.nr_shmem
26862 -3.4% 25962 proc-vmstat.nr_slab_reclaimable
56363 -1.7% 55420 proc-vmstat.nr_slab_unreclaimable
1271064 -12.6% 1110960 proc-vmstat.nr_writeback
39929120 -11.7% 35246897 proc-vmstat.nr_written
413.17 -5.0% 392.33 ? 5% proc-vmstat.nr_zone_active_anon
1311 -15.6% 1107 proc-vmstat.nr_zone_active_file
2688431 -11.4% 2381244 proc-vmstat.nr_zone_inactive_file
1499159 -10.7% 1338748 proc-vmstat.nr_zone_write_pending
2545800 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_foreign
40811337 -10.2% 36655412 proc-vmstat.numa_hit
40811336 -10.2% 36655411 proc-vmstat.numa_local
2545800 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_miss
2545802 ? 2% -30.8% 1762639 ? 4% proc-vmstat.numa_other
23159 -8.7% 21145 proc-vmstat.pgactivate
43659177 -11.4% 38693777 proc-vmstat.pgalloc_normal
9495 -5.9% 8938 ? 2% proc-vmstat.pgdeactivate
4069589 -8.9% 3706066 proc-vmstat.pgfault
43646081 -11.4% 38679395 proc-vmstat.pgfree
1.599e+08 -11.7% 1.411e+08 proc-vmstat.pgpgout
317885 -7.1% 295295 proc-vmstat.pgreuse
30120100 -14.0% 25902040 proc-vmstat.pgrotated
11392384 -8.3% 10446848 proc-vmstat.unevictable_pgs_scanned
0.67 ? 13% -0.4 0.30 ?101% perf-profile.calltrace.cycles-pp.read_counters.process_interval.dispatch_events.__run_perf_stat.cmd_stat
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.cmd_stat.run_builtin.main.__libc_start_main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.__run_perf_stat.cmd_stat.run_builtin.main.__libc_start_main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.dispatch_events.__run_perf_stat.cmd_stat.run_builtin.main
0.67 ? 13% -0.4 0.30 ?100% perf-profile.calltrace.cycles-pp.process_interval.dispatch_events.__run_perf_stat.cmd_stat.run_builtin
1.08 ? 7% +0.1 1.22 ? 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.13 ? 6% +0.2 1.28 ? 5% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
2.42 ? 17% -0.9 1.54 ? 29% perf-profile.children.cycles-pp.scsi_io_completion
2.42 ? 17% -0.9 1.54 ? 29% perf-profile.children.cycles-pp.scsi_end_request
2.43 ? 17% -0.9 1.55 ? 28% perf-profile.children.cycles-pp.blk_complete_reqs
2.37 ? 17% -0.9 1.50 ? 28% perf-profile.children.cycles-pp.end_bio_extent_writepage
2.37 ? 17% -0.9 1.50 ? 28% perf-profile.children.cycles-pp.btrfs_end_bio
2.38 ? 17% -0.9 1.51 ? 29% perf-profile.children.cycles-pp.blk_update_request
2.34 ? 20% -0.8 1.51 ? 25% perf-profile.children.cycles-pp.flush_smp_call_function_from_idle
2.32 ? 19% -0.8 1.50 ? 25% perf-profile.children.cycles-pp.do_softirq
2.06 ? 20% -0.8 1.31 ? 28% perf-profile.children.cycles-pp.end_page_writeback
1.09 ? 24% -0.5 0.60 ? 15% perf-profile.children.cycles-pp.test_clear_page_writeback
1.45 ? 9% -0.2 1.22 ? 6% perf-profile.children.cycles-pp.__schedule
0.67 ? 13% -0.1 0.54 ? 14% perf-profile.children.cycles-pp.read_counters
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.process_interval
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.cmd_stat
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.__run_perf_stat
0.67 ? 13% -0.1 0.55 ? 13% perf-profile.children.cycles-pp.dispatch_events
0.20 ? 30% -0.1 0.10 ? 35% perf-profile.children.cycles-pp.btrfs_dec_test_ordered_pending
0.47 ? 10% -0.1 0.38 ? 12% perf-profile.children.cycles-pp.sched_setaffinity
0.14 ? 22% -0.1 0.06 ? 52% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.10 ? 22% -0.1 0.03 ?103% perf-profile.children.cycles-pp.__fprop_inc_percpu_max
0.14 ? 20% -0.0 0.09 ? 18% perf-profile.children.cycles-pp.do_dentry_open
0.23 ? 6% -0.0 0.19 ? 15% perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity
0.13 ? 11% -0.0 0.09 ? 17% perf-profile.children.cycles-pp.rcu_irq_exit
0.01 ?223% +0.1 0.07 ? 21% perf-profile.children.cycles-pp.vm_area_dup
0.07 ? 14% +0.1 0.14 ? 39% perf-profile.children.cycles-pp.proc_reg_read_iter
0.35 ? 9% -0.1 0.27 ? 15% perf-profile.self.cycles-pp.tick_irq_enter
0.12 ? 27% -0.1 0.06 ? 51% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.10 ? 7% -0.0 0.07 ? 45% perf-profile.self.cycles-pp.rcu_irq_exit
0.14 ? 12% +0.0 0.18 ? 14% perf-profile.self.cycles-pp.local_touch_nmi
19169 ?103% -83.0% 3263 ? 79% interrupts.40:IR-PCI-MSI.4194308-edge.eth0-TxRx-3
3029 -24.4% 2291 ? 44% interrupts.9:IR-IO-APIC.9-fasteoi.acpi
2725748 -9.5% 2466464 interrupts.CPU0.LOC:Local_timer_interrupts
2845 -24.9% 2137 ? 44% interrupts.CPU1.9:IR-IO-APIC.9-fasteoi.acpi
2571309 -10.0% 2313644 interrupts.CPU1.LOC:Local_timer_interrupts
724.67 ? 37% -56.3% 316.33 ? 61% interrupts.CPU1.NMI:Non-maskable_interrupts
724.67 ? 37% -56.3% 316.33 ? 61% interrupts.CPU1.PMI:Performance_monitoring_interrupts
1825803 -14.2% 1566625 interrupts.CPU10.LOC:Local_timer_interrupts
1826093 -14.2% 1566654 interrupts.CPU11.LOC:Local_timer_interrupts
1826402 -14.2% 1566964 interrupts.CPU12.LOC:Local_timer_interrupts
1826229 -14.2% 1566001 interrupts.CPU13.LOC:Local_timer_interrupts
8918 ? 67% -81.7% 1633 ? 83% interrupts.CPU14.40:IR-PCI-MSI.4194308-edge.eth0-TxRx-3
1825901 -14.2% 1566646 interrupts.CPU14.LOC:Local_timer_interrupts
1826230 -14.2% 1566106 interrupts.CPU15.LOC:Local_timer_interrupts
499.67 ? 27% -23.5% 382.33 ? 15% interrupts.CPU15.RES:Rescheduling_interrupts
1826157 -14.2% 1566532 interrupts.CPU16.LOC:Local_timer_interrupts
223.50 ? 13% -27.2% 162.67 ? 11% interrupts.CPU16.TLB:TLB_shootdowns
1827166 -14.3% 1566330 interrupts.CPU17.LOC:Local_timer_interrupts
1543366 -16.9% 1282894 interrupts.CPU18.LOC:Local_timer_interrupts
457.67 ? 16% -24.0% 348.00 ? 11% interrupts.CPU18.RES:Rescheduling_interrupts
177.00 ? 5% -13.9% 152.33 ? 2% interrupts.CPU18.TLB:TLB_shootdowns
1543422 -16.9% 1283093 interrupts.CPU19.LOC:Local_timer_interrupts
2393481 -11.0% 2130761 interrupts.CPU2.LOC:Local_timer_interrupts
1546040 -17.0% 1283196 interrupts.CPU20.LOC:Local_timer_interrupts
166.50 ? 11% -23.7% 127.00 ? 8% interrupts.CPU20.TLB:TLB_shootdowns
1546005 -17.0% 1283065 interrupts.CPU21.LOC:Local_timer_interrupts
1543527 -16.9% 1282065 interrupts.CPU22.LOC:Local_timer_interrupts
166.67 ? 29% -33.7% 110.50 ? 5% interrupts.CPU22.NMI:Non-maskable_interrupts
166.67 ? 29% -33.7% 110.50 ? 5% interrupts.CPU22.PMI:Performance_monitoring_interrupts
174.00 ? 8% -20.0% 139.17 ? 12% interrupts.CPU22.TLB:TLB_shootdowns
1543652 -16.9% 1282473 interrupts.CPU23.LOC:Local_timer_interrupts
1544481 -16.8% 1284243 interrupts.CPU24.LOC:Local_timer_interrupts
1543634 -16.9% 1283131 interrupts.CPU25.LOC:Local_timer_interrupts
334.33 ? 17% -24.5% 252.33 ? 6% interrupts.CPU25.RES:Rescheduling_interrupts
1543478 -16.8% 1283754 interrupts.CPU26.LOC:Local_timer_interrupts
1270462 -20.9% 1004765 interrupts.CPU27.LOC:Local_timer_interrupts
1270759 -20.9% 1005025 interrupts.CPU28.LOC:Local_timer_interrupts
186.50 ? 36% -42.4% 107.33 ? 5% interrupts.CPU28.NMI:Non-maskable_interrupts
186.50 ? 36% -42.4% 107.33 ? 5% interrupts.CPU28.PMI:Performance_monitoring_interrupts
374.17 ? 25% -36.7% 236.83 ? 8% interrupts.CPU28.RES:Rescheduling_interrupts
127.50 ? 12% -26.4% 93.83 ? 9% interrupts.CPU28.TLB:TLB_shootdowns
1270411 -20.9% 1004873 interrupts.CPU29.LOC:Local_timer_interrupts
2391061 -10.9% 2131255 interrupts.CPU3.LOC:Local_timer_interrupts
1270552 -20.8% 1006109 interrupts.CPU30.LOC:Local_timer_interrupts
1270962 -20.8% 1006752 interrupts.CPU31.LOC:Local_timer_interrupts
1270802 -20.9% 1005129 interrupts.CPU32.LOC:Local_timer_interrupts
319.67 ? 15% -33.0% 214.17 ? 16% interrupts.CPU32.RES:Rescheduling_interrupts
128.83 ? 7% -26.9% 94.17 ? 10% interrupts.CPU32.TLB:TLB_shootdowns
1270949 -20.9% 1005088 interrupts.CPU33.LOC:Local_timer_interrupts
248.17 ? 12% -34.7% 162.17 ? 15% interrupts.CPU33.RES:Rescheduling_interrupts
1270982 -20.9% 1004757 interrupts.CPU34.LOC:Local_timer_interrupts
310.17 ? 16% -30.4% 216.00 ? 20% interrupts.CPU34.RES:Rescheduling_interrupts
119.17 ? 9% -38.7% 73.00 ? 15% interrupts.CPU34.TLB:TLB_shootdowns
1270915 -20.9% 1004952 interrupts.CPU35.LOC:Local_timer_interrupts
301.67 ? 25% -48.0% 156.83 ? 18% interrupts.CPU35.RES:Rescheduling_interrupts
1054340 -25.2% 789110 interrupts.CPU36.LOC:Local_timer_interrupts
289.83 ? 46% -48.9% 148.00 ? 25% interrupts.CPU36.RES:Rescheduling_interrupts
97.67 ? 17% -34.0% 64.50 ? 15% interrupts.CPU36.TLB:TLB_shootdowns
1054687 -25.2% 789300 interrupts.CPU37.LOC:Local_timer_interrupts
1054414 -25.2% 789161 interrupts.CPU38.LOC:Local_timer_interrupts
196.17 ? 15% -36.7% 124.17 ? 25% interrupts.CPU38.RES:Rescheduling_interrupts
107.00 ? 11% -34.0% 70.67 ? 13% interrupts.CPU38.TLB:TLB_shootdowns
1054753 -25.2% 789273 interrupts.CPU39.LOC:Local_timer_interrupts
140.00 ? 14% -41.0% 82.67 ? 32% interrupts.CPU39.RES:Rescheduling_interrupts
2108382 -12.4% 1847408 interrupts.CPU4.LOC:Local_timer_interrupts
1054713 -25.2% 788967 interrupts.CPU40.LOC:Local_timer_interrupts
158.67 ? 13% -48.9% 81.00 ? 11% interrupts.CPU40.RES:Rescheduling_interrupts
103.33 ? 10% -30.3% 72.00 ? 10% interrupts.CPU40.TLB:TLB_shootdowns
1054880 -25.2% 789300 interrupts.CPU41.LOC:Local_timer_interrupts
123.83 ? 10% -55.7% 54.83 ? 14% interrupts.CPU41.RES:Rescheduling_interrupts
1054994 -25.2% 789506 interrupts.CPU42.LOC:Local_timer_interrupts
166.83 ? 24% -54.7% 75.50 ? 26% interrupts.CPU42.RES:Rescheduling_interrupts
103.17 ? 7% -39.3% 62.67 ? 10% interrupts.CPU42.TLB:TLB_shootdowns
1055003 -25.2% 789503 interrupts.CPU43.LOC:Local_timer_interrupts
1055288 -25.2% 789506 interrupts.CPU44.LOC:Local_timer_interrupts
866255 -30.3% 603432 ? 2% interrupts.CPU45.LOC:Local_timer_interrupts
865836 -30.3% 603569 ? 2% interrupts.CPU46.LOC:Local_timer_interrupts
154.00 ? 55% -67.3% 50.33 ? 42% interrupts.CPU46.RES:Rescheduling_interrupts
866042 -30.3% 603627 ? 2% interrupts.CPU47.LOC:Local_timer_interrupts
83.83 ? 26% -64.4% 29.83 ? 60% interrupts.CPU47.RES:Rescheduling_interrupts
865870 -30.3% 603499 ? 2% interrupts.CPU48.LOC:Local_timer_interrupts
120.00 ? 29% -55.1% 53.83 ? 20% interrupts.CPU48.RES:Rescheduling_interrupts
866071 -30.3% 603546 ? 2% interrupts.CPU49.LOC:Local_timer_interrupts
2105988 -12.3% 1847508 interrupts.CPU5.LOC:Local_timer_interrupts
865920 -30.4% 602929 ? 2% interrupts.CPU50.LOC:Local_timer_interrupts
101.00 ? 7% -32.3% 68.33 ? 24% interrupts.CPU50.RES:Rescheduling_interrupts
865759 -30.3% 603586 ? 2% interrupts.CPU51.LOC:Local_timer_interrupts
866156 -30.3% 603941 ? 2% interrupts.CPU52.LOC:Local_timer_interrupts
866281 -30.3% 603837 ? 2% interrupts.CPU53.LOC:Local_timer_interrupts
67.33 ? 40% -58.4% 28.00 ? 22% interrupts.CPU53.RES:Rescheduling_interrupts
565709 -22.2% 439840 ? 3% interrupts.CPU54.LOC:Local_timer_interrupts
565883 -22.2% 440423 ? 3% interrupts.CPU55.LOC:Local_timer_interrupts
565905 -22.2% 440262 ? 3% interrupts.CPU56.LOC:Local_timer_interrupts
566181 -22.2% 440345 ? 3% interrupts.CPU57.LOC:Local_timer_interrupts
566142 -22.2% 440350 ? 3% interrupts.CPU58.LOC:Local_timer_interrupts
566296 -22.2% 440526 ? 3% interrupts.CPU59.LOC:Local_timer_interrupts
2106208 -12.3% 1846452 interrupts.CPU6.LOC:Local_timer_interrupts
566391 -22.2% 440854 ? 3% interrupts.CPU60.LOC:Local_timer_interrupts
566515 -22.2% 440885 ? 3% interrupts.CPU61.LOC:Local_timer_interrupts
566287 -22.2% 440707 ? 3% interrupts.CPU62.LOC:Local_timer_interrupts
2106005 -12.3% 1846338 interrupts.CPU7.LOC:Local_timer_interrupts
17.67 ? 22% +210.4% 54.83 ? 80% interrupts.CPU70.RES:Rescheduling_interrupts
2105251 -12.3% 1846309 interrupts.CPU8.LOC:Local_timer_interrupts
1826895 -14.3% 1566515 interrupts.CPU9.LOC:Local_timer_interrupts
94970645 -16.3% 79528620 interrupts.LOC:Local_timer_interrupts
34100 ? 3% -11.9% 30034 ? 7% interrupts.RES:Rescheduling_interrupts
9665 ? 2% -12.3% 8473 ? 4% interrupts.TLB:TLB_shootdowns
65771 -11.5% 58237 softirqs.BLOCK
90352 ? 8% -14.2% 77480 ? 4% softirqs.CPU0.RCU
134474 ? 2% -10.6% 120238 softirqs.CPU0.SCHED
75762 ? 8% -13.5% 65519 ? 5% softirqs.CPU1.RCU
128093 ? 3% -11.8% 112953 ? 2% softirqs.CPU1.SCHED
52807 ? 6% -16.8% 43921 ? 4% softirqs.CPU10.RCU
51475 ? 6% -16.3% 43064 ? 4% softirqs.CPU11.RCU
88448 ? 3% -13.9% 76149 ? 2% softirqs.CPU11.SCHED
52162 ? 6% -15.0% 44336 ? 6% softirqs.CPU12.RCU
91086 ? 3% -15.5% 76968 softirqs.CPU12.SCHED
51960 ? 6% -16.9% 43179 ? 4% softirqs.CPU13.RCU
89130 ? 4% -14.5% 76170 ? 2% softirqs.CPU13.SCHED
92616 -19.3% 74779 ? 3% softirqs.CPU14.SCHED
93149 -19.4% 75048 ? 3% softirqs.CPU15.SCHED
47550 ? 5% -19.6% 38228 ? 3% softirqs.CPU16.RCU
90706 ? 2% -16.2% 75982 softirqs.CPU16.SCHED
45657 ? 5% -19.7% 36662 ? 6% softirqs.CPU17.RCU
92509 ? 2% -18.4% 75524 ? 3% softirqs.CPU17.SCHED
40345 ? 5% -18.9% 32722 ? 4% softirqs.CPU18.RCU
74826 ? 4% -20.3% 59656 ? 2% softirqs.CPU18.SCHED
40038 ? 5% -19.3% 32309 ? 4% softirqs.CPU19.RCU
74933 ? 2% -20.5% 59544 ? 3% softirqs.CPU19.SCHED
71513 ? 10% -14.2% 61325 ? 3% softirqs.CPU2.RCU
125281 -12.1% 110115 ? 2% softirqs.CPU2.SCHED
40913 ? 5% -20.3% 32627 ? 7% softirqs.CPU20.RCU
75012 ? 3% -21.4% 58983 ? 3% softirqs.CPU20.SCHED
40330 ? 6% -19.7% 32390 ? 5% softirqs.CPU21.RCU
77061 -21.9% 60193 softirqs.CPU21.SCHED
41108 ? 6% -20.4% 32733 ? 5% softirqs.CPU22.RCU
73758 ? 8% -22.0% 57534 ? 5% softirqs.CPU22.SCHED
40504 ? 6% -21.7% 31719 ? 3% softirqs.CPU23.RCU
75619 ? 2% -21.4% 59449 ? 3% softirqs.CPU23.SCHED
40737 ? 5% -21.1% 32124 ? 4% softirqs.CPU24.RCU
75078 ? 2% -20.2% 59906 ? 3% softirqs.CPU24.SCHED
42275 ? 6% -24.2% 32065 ? 6% softirqs.CPU25.RCU
75249 ? 2% -21.1% 59349 ? 3% softirqs.CPU25.SCHED
39777 ? 4% -18.1% 32579 ? 4% softirqs.CPU26.RCU
58509 ? 8% -22.8% 45183 ? 5% softirqs.CPU26.SCHED
32999 ? 5% -22.6% 25541 ? 5% softirqs.CPU27.RCU
62443 -26.7% 45782 ? 2% softirqs.CPU27.SCHED
33337 ? 4% -23.0% 25662 ? 4% softirqs.CPU28.RCU
61437 ? 4% -28.8% 43752 ? 5% softirqs.CPU28.SCHED
33225 ? 5% -23.7% 25339 ? 4% softirqs.CPU29.RCU
62395 -26.4% 45927 ? 2% softirqs.CPU29.SCHED
69894 ? 9% -16.4% 58424 ? 2% softirqs.CPU3.RCU
127016 -12.5% 111078 softirqs.CPU3.SCHED
32342 ? 6% -21.0% 25549 ? 5% softirqs.CPU30.RCU
60908 ? 4% -27.3% 44268 ? 6% softirqs.CPU30.SCHED
33191 ? 9% -23.5% 25381 ? 4% softirqs.CPU31.RCU
63086 -26.9% 46135 ? 2% softirqs.CPU31.SCHED
31248 ? 4% -26.2% 23075 ? 3% softirqs.CPU32.RCU
61965 -31.4% 42520 ? 6% softirqs.CPU32.SCHED
31398 ? 7% -26.9% 22941 ? 3% softirqs.CPU33.RCU
61620 ? 3% -26.5% 45277 ? 4% softirqs.CPU33.SCHED
32230 ? 8% -28.5% 23030 ? 5% softirqs.CPU34.RCU
58406 ? 7% -24.6% 44064 ? 6% softirqs.CPU34.SCHED
31962 ? 6% -28.5% 22842 ? 4% softirqs.CPU35.RCU
62155 -28.3% 44555 ? 5% softirqs.CPU35.SCHED
26867 ? 5% -27.8% 19399 ? 3% softirqs.CPU36.RCU
50147 ? 5% -32.1% 34040 ? 7% softirqs.CPU36.SCHED
27216 ? 4% -29.5% 19192 ? 4% softirqs.CPU37.RCU
51279 ? 4% -30.1% 35834 ? 2% softirqs.CPU37.SCHED
27426 ? 4% -29.8% 19262 ? 3% softirqs.CPU38.RCU
49670 ? 6% -32.1% 33706 ? 8% softirqs.CPU38.SCHED
27227 ? 5% -29.3% 19242 ? 2% softirqs.CPU39.RCU
52107 -32.2% 35338 ? 3% softirqs.CPU39.SCHED
62096 ? 9% -17.9% 50995 ? 3% softirqs.CPU4.RCU
112254 -16.9% 93337 ? 3% softirqs.CPU4.SCHED
27153 ? 4% -28.9% 19299 ? 3% softirqs.CPU40.RCU
50083 ? 5% -35.2% 32432 ? 10% softirqs.CPU40.SCHED
26850 ? 4% -30.1% 18777 ? 3% softirqs.CPU41.RCU
50450 ? 6% -30.3% 35173 ? 3% softirqs.CPU41.SCHED
27442 ? 4% -28.1% 19734 ? 4% softirqs.CPU42.RCU
48383 ? 8% -34.3% 31774 ? 6% softirqs.CPU42.SCHED
28193 ? 7% -31.1% 19425 ? 3% softirqs.CPU43.RCU
50892 ? 4% -32.2% 34529 ? 4% softirqs.CPU43.SCHED
27629 ? 5% -30.3% 19254 ? 3% softirqs.CPU44.RCU
40315 ? 11% -40.3% 24059 ? 18% softirqs.CPU44.SCHED
23882 ? 6% -34.8% 15564 ? 2% softirqs.CPU45.RCU
39335 ? 9% -30.6% 27280 ? 4% softirqs.CPU45.SCHED
23035 ? 6% -34.9% 14990 ? 3% softirqs.CPU46.RCU
38299 ? 11% -33.7% 25373 ? 11% softirqs.CPU46.SCHED
23186 ? 5% -34.9% 15102 ? 3% softirqs.CPU47.RCU
42818 ? 4% -40.9% 25322 ? 12% softirqs.CPU47.SCHED
20780 ? 5% -29.1% 14728 ? 15% softirqs.CPU48.RCU
37847 ? 14% -34.7% 24721 ? 13% softirqs.CPU48.SCHED
21694 ? 10% -35.1% 14076 ? 5% softirqs.CPU49.RCU
40590 ? 11% -39.4% 24611 ? 15% softirqs.CPU49.SCHED
62904 ? 9% -18.2% 51485 ? 6% softirqs.CPU5.RCU
111131 -16.1% 93231 ? 2% softirqs.CPU5.SCHED
20330 ? 5% -28.5% 14529 ? 5% softirqs.CPU50.RCU
41153 ? 6% -45.4% 22484 ? 16% softirqs.CPU50.SCHED
20608 ? 8% -31.3% 14161 ? 2% softirqs.CPU51.RCU
41691 ? 8% -37.5% 26063 ? 9% softirqs.CPU51.SCHED
20743 ? 5% -28.0% 14944 ? 6% softirqs.CPU52.RCU
20310 ? 4% -30.5% 14124 ? 3% softirqs.CPU53.RCU
41950 ? 5% -35.3% 27133 ? 4% softirqs.CPU53.SCHED
14945 ? 2% -22.1% 11637 ? 13% softirqs.CPU54.RCU
14535 ? 5% -20.9% 11500 ? 6% softirqs.CPU55.RCU
28273 ? 2% -36.8% 17868 ? 12% softirqs.CPU55.SCHED
15457 ? 6% -23.6% 11808 ? 8% softirqs.CPU56.RCU
26458 ? 10% -32.9% 17749 ? 16% softirqs.CPU56.SCHED
14839 ? 6% -19.9% 11883 ? 9% softirqs.CPU57.RCU
25944 ? 10% -23.4% 19871 ? 7% softirqs.CPU57.SCHED
15548 ? 5% -25.1% 11640 ? 6% softirqs.CPU58.RCU
27354 ? 9% -34.2% 18001 ? 17% softirqs.CPU58.SCHED
14830 ? 4% -21.2% 11685 ? 8% softirqs.CPU59.RCU
28093 ? 2% -31.7% 19191 ? 8% softirqs.CPU59.SCHED
62449 ? 8% -16.9% 51888 ? 2% softirqs.CPU6.RCU
108906 ? 2% -13.5% 94210 ? 2% softirqs.CPU6.SCHED
26747 ? 11% -29.0% 18988 ? 12% softirqs.CPU60.SCHED
14620 ? 5% -22.3% 11357 ? 5% softirqs.CPU61.RCU
27088 ? 8% -26.3% 19975 ? 7% softirqs.CPU61.SCHED
14599 ? 5% -21.9% 11399 ? 6% softirqs.CPU62.RCU
60751 ? 9% -15.9% 51077 ? 4% softirqs.CPU7.RCU
111493 -16.1% 93573 ? 2% softirqs.CPU7.SCHED
62457 ? 9% -18.8% 50696 ? 4% softirqs.CPU8.RCU
92965 ? 6% -13.9% 80040 softirqs.CPU8.SCHED
52370 ? 6% -16.6% 43660 ? 4% softirqs.CPU9.RCU
91668 ? 2% -18.4% 74818 ? 4% softirqs.CPU9.SCHED
2795478 ? 4% -17.3% 2312988 ? 2% softirqs.RCU
4761845 -19.4% 3840288 ? 2% softirqs.SCHED
fxmark.hdd_btrfs_DWAL_54_bufferedio.works
5e+06 +-----------------------------------------------------------------+
4.5e+06 |-+ +.+++.++.++.++. .++. ++.+ .+ ++.+++.+ |
|+.++.+ +++.++.++.+++.++ + + :+ |
4e+06 |-+ + |
3.5e+06 |-+ |
| |
3e+06 |-+ |
2.5e+06 |-+ OO O OO O OOO OO OO OOO O |
2e+06 |O+OO OO OOO OO O OOO O OO OO OOO OO OO O|
| |
1.5e+06 |-+ |
1e+06 |-+ |
| |
500000 |-+ |
0 +-----------------------------------------------------------------+
fxmark.hdd_btrfs_DWAL_54_bufferedio.works_sec
160000 +------------------------------------------------------------------+
|+.++.+ ++.++.++.++.+++.+ + + : : |
140000 |-+ : |
120000 |-+ + |
| |
100000 |-+ |
| |
80000 |O+OO OO OO OO OOO OO OO OO OO OO OOO OO OO OO OO O O OO OO OO OO O|
| |
60000 |-+ |
40000 |-+ |
| |
20000 |-+ |
| |
0 +------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
That's quite the huge regression.
But:
> due to commit: f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
That's _literally_ just changing a dynamically allocated per-cpu array
of "long[]" to an array of "s32[]" and in the process shrinking it
from 304 bytes to 152 bytes.
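For reference, the change boils down to roughly the following (a sketch, not the literal diff; the batched_lruvec_stat name matches the struct quoted later in this thread, and the 38-entry item count is only inferred from the 304-byte/152-byte figures above):

	/* Sketch only: NR_VM_NODE_STAT_ITEMS assumed to be 38, which is what
	 * the 304-byte (38 * 8) and 152-byte (38 * 4) figures imply. */

	/* Before: the dynamically allocated per-cpu batch used full longs. */
	struct lruvec_stat {
		long count[NR_VM_NODE_STAT_ITEMS];	/* 304 bytes on x86-64 */
	};

	/* After: the batch only ever holds a small bounded delta before it
	 * is flushed, so s32 is wide enough and the allocation shrinks. */
	struct batched_lruvec_stat {
		s32 count[NR_VM_NODE_STAT_ITEMS];	/* 152 bytes */
	};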
> in testcase: fxmark
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
I think this must be some really random memory layout issue that
causes some false sharing or similar.
And it's not even that some fundamental data structure gets a
different layout, it literally is just either:
 (a) the (now smaller) array is allocated from a different chunk,
and that then causes random cache effects with something else
(b) the (old, and bigger) array was more spread out, and as a result
had different fields in different cachelines and less false sharing
Normally I'd say that (b) is the obvious case, except for the fact
that this is a __percpu array.
So in the common case there shouldn't be any cache contention _within_
the array itself. Any cache contention should be with something else
very hot that the change now randomly makes it land in the same cache way
or whatever.
Afaik, only the flushing of the vmstats batches does access the percpu
arrays from other CPUs, so (b) is not _entirely_ impossible if
memcg_flush_percpu_vmstats() were to be very very very hot.
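For illustration, a hypothetical sketch of what such a flush looks like (the names are illustrative, not the exact mm/memcontrol.c code); the point is that this is the one path that reads other CPUs' per-cpu batch arrays:

	static void flush_lruvec_percpu_stats(struct mem_cgroup_per_node *pn,
					      long *sum)
	{
		int cpu, i;

		for_each_online_cpu(cpu) {
			/* another CPU's per-cpu batch array */
			struct batched_lruvec_stat *batch =
				per_cpu_ptr(pn->lruvec_stat_cpu, cpu);

			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
				sum[i] += batch->count[i];
		}
	}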
But the profile data doesn't show anything remotely like that.
In fact, the actual report seems to show things improving, ie things
like elapsed time going down:
> 1361 -9.5% 1232 fxmark.time.elapsed_time
> 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> 343.68 -2.0% 336.76 fxmark.time.system_time
> 23.66 +2.5 26.12 mpstat.cpu.all.sys%
but I don't know what the benchmark actually does, so maybe it just
repeats things until it hits a certain confidence interval, and thus
"elapsed time" is immaterial.
Honestly, normally if I were to get a report about "52% regression"
for a commit that is supposed to optimize something, I'd just revert
the commit as a case of "ok, that optimization clearly didn't work".
But there is absolutely no sign that this commit is actually the
culprit, or explanation for why that should be, and what could be
going on.
So I'm going to treat this as a "bisection failure, possibly due to
random code or data layout changes". Particularly since this seems to
be a 4-socket Xeon Phi machine, which I think is likely a very very
fragile system if there is some odd cache layout issue.
If somebody can actually figure out what happened there, that would be
good, but for now it goes into my "archived as a random outlier"
folder.
Linus
On Mon 15-03-21 13:42:50, Linus Torvalds wrote:
> If somebody can actually figure out what happened there, that would be
> good, but for now it goes into my "archived as a random outlier"
> folder.
This is not something new. We have seen reports like that in the past
already. In many cases there was no apparent reason except for potential
code alignment: e.g. http://lkml.kernel.org/r/20200409135029.GA2072@xsang-OptiPlex-9020
There were some other memcg-related reports which didn't really show any
increase in cache misses or similar. I cannot find those in my notes
right now. There were others like https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
with analysis http://lkml.kernel.org/r/[email protected]
which actually pointed to something legit.
For this particular report I do not see any real relation to the patch
either.
--
Michal Hocko
SUSE Labs
Hi Linus,
On Mon, Mar 15, 2021 at 01:42:50PM -0700, Linus Torvalds wrote:
> On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
> <[email protected]> wrote:
> >
> > FYI, we noticed a -52.4% regression of fxmark.hdd_btrfs_DWAL_63_bufferedio.works/sec
>
> That's quite the huge regression.
>
> But:
>
> > due to commit: f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage")
>
> That's _literally_ just changing a dynamically allocated per-cpu array
> of "long[]" to an array of "s32[]" and in the process shrinking it
> from 304 bytes to 152 bytes.
>
> > in testcase: fxmark
> > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
>
> I think this must be some really random memory layout issue that
> causes some false sharing or similar.
>
> And it's not even that some fundamental data structure gets a
> different layout, it literally is just either:
>
>  (a) the (now smaller) array is allocated from a different chunk,
> and that then causes random cache effects with something else
>
> (b) the (old, and bigger) array was more spread out, and as a result
> had different fields in different cachelines and less false sharing
>
> Normally I'd say that (b) is the obvious case, except for the fact
> that this is a __percpu array.
>
> So in the common case there shouldn't be any cache contention _within_
> the array itself. Any cache contention should be with something else
> very hot that the change now randomly makes it land in the same cache way
> or whatever.
>
> Afaik, only the flushing of the vmstats batches does access the percpu
> arrays from other CPUs, so (b) is not _entirely_ impossible if
> memcg_flush_percpu_vmstats() were to be very very very hot.
>
> But the profile data doesn't show anything remotely like that.
>
> In fact, the actual report seems to show things improving, ie things
> like elapsed time going down:
>
> > 1361 -9.5% 1232 fxmark.time.elapsed_time
> > 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> > 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> > 343.68 -2.0% 336.76 fxmark.time.system_time
> > 23.66 +2.5 26.12 mpstat.cpu.all.sys%
>
> but I don't know what the benchmark actually does, so maybe it just
> repeats things until it hits a certain confidence interval, and thus
> "elapsed time" is immaterial.
I just checked the benchmark; it seems to benchmark the filesystem's
scalability by doing file/inode operations with different task numbers
(from 1 to nr_cpus).
The benchmark has preparation and cleanup steps before/after every
run, and the roughly 100 fewer seconds of 'fxmark.time.elapsed_time' you
found come from the older kernel spending about 100 seconds more in the
cleanup step after the run (a point worth checking)
> Honestly, normally if I were to get a report about "52% regression"
> for a commit that is supposed to optimize something, I'd just revert
> the commit as a case of "ok, that optimization clearly didn't work".
>
> But there is absolutely no sign that this commit is actually the
> culprit, or explanation for why that should be, and what could be
> going on.
>
> So I'm going to treat this as a "bisection failure, possibly due to
> random code or data layout changes". Particularly since this seems to
> be a 4-socket Xeon Phi machine, which I think is likely a very very
> fragile system if there is some odd cache layout issue.
Oliver retested it and made it run 12 times in total, and the data
is consistent. We tried some other tests:
* run other sub-cases of this 'fxmark', which show no regression
* change 'btrfs' to 'ext4' for this case: no regression
* test on a Cascade Lake platform: no regression
So the bisection seems to be stable, though it can't be explained yet.
We checked the System.map of the 2 kernels and didn't find an obvious
code/data alignment change, which is expected, as the commit changes
a data structure that is dynamically allocated.
Anyway, we will keep checking this and report back when there is
new data.
> If somebody can actually figure out what happened there, that would be
> good, but for now it goes into my "archived as a random outlier"
> folder.
Agreed. We shouldn't take action before this change is root-caused.
Thanks,
Feng
> Linus
Hi Linus,
Some updates on this: we found the regression is related to the
percpu data change and BTRFS, though the exact relation is unknown yet.
Some details below.
+ Michal, who helped by providing useful links for checking it.
+ Josef Bacik, as this is BTRFS related.
On Fri, Mar 19, 2021 at 11:21:44AM +0800, Feng Tang wrote:
> Hi Linus,
>
> On Mon, Mar 15, 2021 at 01:42:50PM -0700, Linus Torvalds wrote:
> > On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
> > <[email protected]> wrote:
> > > in testcase: fxmark
> > > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> >
> > I think this must be some really random memory layout issue that
> > causes some false sharing or similar.
> >
> > And it's not even that some fundamental data structure gets a
> > different layout, it literally is just either:
> >
> >  (a) the (now smaller) array is allocated from a different chunk,
> > and that then causes random cache effects with something else
> >
> > (b) the (old, and bigger) array was more spread out, and as a result
> > had different fields in different cachelines and less false sharing
> >
> > Normally I'd say that (b) is the obvious case, except for the fact
> > that this is a __percpu array.
> >
> > So in the common case there shouldn't be any cache contention _within_
> > the array itself. Any cache contention should be with something else
> > very hot that the change now randomly makes it land in the same cache way
> > or whatever.
> >
> > Afaik, only the flushing of the vmstats batches does access the percpu
> > arrays from other CPUs, so (b) is not _entirely_ impossible if
> > memcg_flush_percpu_vmstats() were to be very very very hot.
> >
> > But the profile data doesn't show anything remotely like that.
> >
> > In fact, the actual report seems to show things improving, ie things
> > like elapsed time going down:
> >
> > > 1361 -9.5% 1232 fxmark.time.elapsed_time
> > > 1361 -9.5% 1232 fxmark.time.elapsed_time.max
> > > 25.67 +9.1% 28.00 fxmark.time.percent_of_cpu_this_job_got
> > > 343.68 -2.0% 336.76 fxmark.time.system_time
> > > 23.66 +2.5 26.12 mpstat.cpu.all.sys%
> >
> > but I don't know what the benchmark actually does, so maybe it just
> > repeats things until it hits a certain confidence interval, and thus
> > "elapsed time" is immaterial.
>
> I just checked the benchmark; it seems to benchmark the filesystem's
> scalability by doing file/inode operations with different task numbers
> (from 1 to nr_cpus).
>
> The benchmark has preparation and cleanup steps before/after every
> run, and the roughly 100 fewer seconds of 'fxmark.time.elapsed_time' you
> found come from the older kernel spending about 100 seconds more in the
> cleanup step after the run (a point worth checking)
Yes, the longer running time is because the cleanup ('umount' specifically)
of the older kernel takes longer, as it completes more FS operations.
Also, the perf-profile info in the original report was not accurate: the
test tried 72/63/54/45/36/27/18/9/1 tasks, and the perf data recorded may
not reflect the "63 tasks" subcase.
> > Honestly, normally if I were to get a report about "52% regression"
> > for a commit that is supposed to optimize something, I'd just revert
> > the commit as a case of "ok, that optimization clearly didn't work".
> >
> > But there is absolutely no sign that this commit is actually the
> > culprit, or explanation for why that should be, and what could be
> > going on.
> >
> > So I'm going to treat this as a "bisection failure, possibly due to
> > random code or data layout changes". Particularly since this seems to
> > be a 4-socket Xeon Phi machine, which I think is likely a very very
> > fragile system if there is some odd cache layout issue.
>
> Oliver retested it and made it run 12 times in total, and the data
> is consistent. We tried some other tests:
> * run other sub-cases of this 'fxmark', which show no regression
> * change 'btrfs' to 'ext4' for this case: no regression
> * test on a Cascade Lake platform: no regression
>
> So the bisection seems to be stable, though it can't be explained yet.
>
> We checked the System.map of the 2 kernels and didn't find an obvious
> code/data alignment change, which is expected, as the commit changes
> a data structure that is dynamically allocated.
We found that with the commit, some percpu-related ops do show some change,
as shown in perf:
old kernel
----------
1.06% 0.69% [kernel.kallsyms] [k] __percpu_counter_sum - -
1.06% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
89.85% 88.17% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
45.27% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
44.51% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
new kernel
----------
1.33% 1.14% [kernel.kallsyms] [k] __percpu_counter_sum - -
1.33% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
95.95% 95.31% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
48.56% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
47.33% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
__percpu_counter_sum is usually costly on platforms with many CPUs, and
it does rise quite a bit here. It is called in fs/btrfs/space-info.c:
need_preemptive_reclaim
ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
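For context, a simplified sketch of the difference between the two percpu_counter helpers (not the exact lib/percpu_counter.c code): the _sum variant walks every online CPU's local delta under the counter's spinlock, so its cost grows with the CPU count (288 here), while the _read variant just returns the possibly slightly stale central value:

	s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
	{
		s64 ret;
		int cpu;

		raw_spin_lock(&fbc->lock);	/* serializes concurrent summers */
		ret = fbc->count;
		for_each_online_cpu(cpu)	/* O(nr_cpus) walk of local deltas */
			ret += *per_cpu_ptr(fbc->counters, cpu);
		raw_spin_unlock(&fbc->lock);
		return ret < 0 ? 0 : ret;
	}

	s64 percpu_counter_read_positive(struct percpu_counter *fbc)
	{
		s64 ret = fbc->count;		/* O(1), may lag by the per-cpu deltas */
		return ret < 0 ? 0 : ret;
	}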
And we did 2 experiments:
1. Change the percpu_counter_sum_positive() calls inside need_preemptive_reclaim()
to percpu_counter_read_positive(), which skips looping over online CPUs to get
the sum; the regression is gone, and the result is even better than before.
2. Add some padding to restore the original size of the percpu struct; the
regression is also gone.
struct batched_lruvec_stat {
s32 count[NR_VM_NODE_STAT_ITEMS];
+ s32 pad[NR_VM_NODE_STAT_ITEMS];
};
So we think the regression is related to Muchun's patch and BTRFS.
Another thing I tried is changing the number of test tasks from '63' to
'65', and the regression is also gone, which is something mysterious!
This is only reproduced on the Xeon Phi, which lacks some perf events, so
tools like perf c2c can't be run to pin-point specific data or text.
Thanks,
Feng
On Thu, Mar 25, 2021 at 02:51:42PM +0800, Feng Tang wrote:
> > > Honestly, normally if I were to get a report about "52% regression"
> > > for a commit that is supposed to optimize something, I'd just revert
> > > the commit as a case of "ok, that optimization clearly didn't work".
> > >
> > > But there is absolutely no sign that this commit is actually the
> > > culprit, or explanation for why that should be, and what could be
> > > going on.
> > >
> > > So I'm going to treat this as a "bisection failure, possibly due to
> > > random code or data layout changes". Particularly since this seems to
> > > be a 4-socket Xeon Phi machine, which I think is likely a very very
> > > fragile system if there is some odd cache layout issue.
> >
> > Oliver retested it and made it run 12 times in total, and the data
> > is consistent. We tried some other tests:
> > * run other sub-cases of this 'fxmark', which show no regression
> > * change 'btrfs' to 'ext4' for this case: no regression
> > * test on a Cascade Lake platform: no regression
> >
> > So the bisection seems to be stable, though it can't be explained yet.
> >
> > We checked the System.map of the 2 kernels and didn't find an obvious
> > code/data alignment change, which is expected, as the commit changes
> > a data structure that is dynamically allocated.
>
> We found that with the commit, some percpu-related ops do show some change,
> as shown in perf:
>
> old kernel
> ----------
> 1.06% 0.69% [kernel.kallsyms] [k] __percpu_counter_sum - -
> 1.06% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
>
> 89.85% 88.17% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
> 45.27% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
> 44.51% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
>
>
> new kernel
> ----------
> 1.33% 1.14% [kernel.kallsyms] [k] __percpu_counter_sum - -
> 1.33% __percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
>
> 95.95% 95.31% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath - -
> 48.56% native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
> 47.33% native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
>
> __percpu_counter_sum is usually costly on platforms with many CPUs, and
> it does rise quite a bit here. It is called in fs/btrfs/space-info.c:
> need_preemptive_reclaim
> ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
>
> And we did 2 experiments:
> 1. Change the percpu_counter_sum_positive() calls inside need_preemptive_reclaim()
> to percpu_counter_read_positive(), which skips looping over online CPUs to get
> the sum; the regression is gone, and the result is even better than before.
Interestingly, we just got a mail from Oliver about an 81.3% btrfs
performance improvement:
https://lore.kernel.org/lkml/20210325055609.GA13061@xsang-OptiPlex-9020/
which comes from Josef's patch doing the same conversion of the percpu
counter reading functions.
I guess with the patch, this regression will be gone, and several other
old regressions will be gone too (Thanks Josef):
https://lore.kernel.org/lkml/20201104061657.GB15746@xsang-OptiPlex-9020/
https://lore.kernel.org/lkml/20210305083757.GF31481@xsang-OptiPlex-9020/
Thanks,
Feng