Hello,
kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
commit: df671b17195cd6526e029c70d04dfb72561082d7 ("[PATCH 1/2] lib/find: Make functions safe on changing bitmaps")
url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 1c8b86a3799f7e5be903c3f49fcdaee29fd385b5
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps
testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:
nr_task: 50%
mode: thread
test: tlb_flush3
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231025/[email protected]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/tlb_flush3/will-it-scale
commit:
1c8b86a379 ("Merge tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip")
df671b1719 ("lib/find: Make functions safe on changing bitmaps")
1c8b86a3799f7e5b df671b17195cd6526e029c70d04
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.14 ± 19% +36.9% 0.19 ± 17% perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
2.26e+08 +3.6% 2.343e+08 proc-vmstat.pgfault
0.04 +25.0% 0.05 turbostat.IPC
32666 -15.5% 27605 ± 2% turbostat.POLL
7856 +2.2% 8025 vmstat.system.cs
6331931 +2.3% 6478704 vmstat.system.in
700119 +3.7% 725931 will-it-scale.52.threads
13463 +3.7% 13959 will-it-scale.per_thread_ops
700119 +3.7% 725931 will-it-scale.workload
8.36 -7.3% 7.74 perf-stat.i.MPKI
4.591e+09 +3.4% 4.747e+09 perf-stat.i.branch-instructions
1.832e+08 +2.8% 1.883e+08 perf-stat.i.branch-misses
26.70 -0.3 26.40 perf-stat.i.cache-miss-rate%
7852 +2.2% 8021 perf-stat.i.context-switches
6.43 -7.2% 5.97 perf-stat.i.cpi
769.61 +1.8% 783.29 perf-stat.i.cpu-migrations
6.39e+09 +3.4% 6.606e+09 perf-stat.i.dTLB-loads
2.94e+09 +3.2% 3.035e+09 perf-stat.i.dTLB-stores
78.29 -0.9 77.44 perf-stat.i.iTLB-load-miss-rate%
18959450 +3.5% 19621273 perf-stat.i.iTLB-load-misses
5254435 +8.7% 5713444 perf-stat.i.iTLB-loads
2.236e+10 +7.7% 2.408e+10 perf-stat.i.instructions
1181 +4.0% 1228 perf-stat.i.instructions-per-iTLB-miss
0.16 +7.7% 0.17 perf-stat.i.ipc
0.02 ± 36% -49.6% 0.01 ± 53% perf-stat.i.major-faults
485.08 +3.0% 499.67 perf-stat.i.metric.K/sec
141.71 +3.2% 146.25 perf-stat.i.metric.M/sec
747997 +3.7% 775416 perf-stat.i.minor-faults
3127957 -13.9% 2693728 perf-stat.i.node-loads
26089697 +3.4% 26965335 perf-stat.i.node-store-misses
767569 +3.7% 796095 perf-stat.i.node-stores
747997 +3.7% 775416 perf-stat.i.page-faults
8.35 -7.3% 7.74 perf-stat.overall.MPKI
26.70 -0.3 26.40 perf-stat.overall.cache-miss-rate%
6.43 -7.1% 5.97 perf-stat.overall.cpi
78.30 -0.9 77.45 perf-stat.overall.iTLB-load-miss-rate%
1179 +4.0% 1226 perf-stat.overall.instructions-per-iTLB-miss
0.16 +7.7% 0.17 perf-stat.overall.ipc
9644584 +3.8% 10011125 perf-stat.overall.path-length
4.575e+09 +3.4% 4.731e+09 perf-stat.ps.branch-instructions
1.825e+08 +2.8% 1.876e+08 perf-stat.ps.branch-misses
7825 +2.2% 7995 perf-stat.ps.context-switches
767.16 +1.8% 780.76 perf-stat.ps.cpu-migrations
6.368e+09 +3.4% 6.583e+09 perf-stat.ps.dTLB-loads
2.93e+09 +3.2% 3.025e+09 perf-stat.ps.dTLB-stores
18896725 +3.5% 19555325 perf-stat.ps.iTLB-load-misses
5236456 +8.7% 5693636 perf-stat.ps.iTLB-loads
2.229e+10 +7.6% 2.399e+10 perf-stat.ps.instructions
745423 +3.7% 772705 perf-stat.ps.minor-faults
3117663 -13.9% 2684861 perf-stat.ps.node-loads
26002765 +3.4% 26875267 perf-stat.ps.node-store-misses
764789 +3.7% 793098 perf-stat.ps.node-stores
745423 +3.7% 772705 perf-stat.ps.page-faults
6.752e+12 +7.6% 7.267e+12 perf-stat.total.instructions
19.21 -1.0 18.18 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
17.00 -0.9 16.09 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
65.30 -0.6 64.69 perf-profile.calltrace.cycles-pp.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64
65.34 -0.6 64.75 perf-profile.calltrace.cycles-pp.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
65.98 -0.5 65.45 perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
65.96 -0.5 65.42 perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
9.72 ± 2% -0.5 9.20 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
66.33 -0.5 65.81 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
66.46 -0.5 65.95 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
31.88 -0.4 31.43 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single
67.72 -0.4 67.28 perf-profile.calltrace.cycles-pp.__madvise
32.15 -0.4 31.73 perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior
32.60 -0.4 32.21 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise
32.93 -0.3 32.58 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
31.07 -0.3 30.74 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single
31.58 -0.3 31.28 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior
31.61 -0.3 31.30 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise
31.80 -0.3 31.51 perf-profile.calltrace.cycles-pp.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
8.34 -0.1 8.22 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask
8.06 -0.1 7.95 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond
7.98 -0.1 7.87 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch
0.59 ± 3% +0.1 0.65 ± 2% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.testcase
1.46 +0.1 1.53 perf-profile.calltrace.cycles-pp.filemap_map_pages.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault
1.48 +0.1 1.55 perf-profile.calltrace.cycles-pp.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
1.53 +0.1 1.62 perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
2.92 +0.1 3.02 perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
1.26 ± 2% +0.1 1.36 perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
1.84 +0.1 1.96 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
7.87 +0.1 8.00 perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
2.03 ± 2% +0.1 2.17 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
2.90 +0.2 3.06 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
2.62 ± 3% +0.2 2.80 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
2.58 ± 3% +0.2 2.76 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
2.95 ± 3% +0.2 3.14 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
2.75 +0.2 2.94 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
4.96 +0.3 5.29 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
4.92 +0.3 5.25 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
5.13 +0.3 5.46 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
5.08 +0.4 5.44 perf-profile.calltrace.cycles-pp.testcase
37.25 -2.0 35.24 perf-profile.children.cycles-pp.llist_add_batch
62.82 -0.8 62.04 perf-profile.children.cycles-pp.on_each_cpu_cond_mask
62.82 -0.8 62.04 perf-profile.children.cycles-pp.smp_call_function_many_cond
63.70 -0.7 62.98 perf-profile.children.cycles-pp.flush_tlb_mm_range
65.30 -0.6 64.70 perf-profile.children.cycles-pp.zap_page_range_single
65.34 -0.6 64.75 perf-profile.children.cycles-pp.madvise_vma_behavior
65.98 -0.5 65.45 perf-profile.children.cycles-pp.__x64_sys_madvise
65.96 -0.5 65.43 perf-profile.children.cycles-pp.do_madvise
66.52 -0.5 66.01 perf-profile.children.cycles-pp.do_syscall_64
66.65 -0.5 66.16 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
67.79 -0.4 67.36 perf-profile.children.cycles-pp.__madvise
32.94 -0.3 32.60 perf-profile.children.cycles-pp.tlb_finish_mmu
31.74 -0.3 31.43 perf-profile.children.cycles-pp.zap_pte_range
31.76 -0.3 31.46 perf-profile.children.cycles-pp.zap_pmd_range
31.95 -0.3 31.66 perf-profile.children.cycles-pp.unmap_page_range
0.42 ± 2% +0.0 0.46 perf-profile.children.cycles-pp.error_entry
0.20 ± 3% +0.0 0.24 ± 5% perf-profile.children.cycles-pp.up_read
0.69 +0.0 0.74 perf-profile.children.cycles-pp.native_flush_tlb_local
1.47 +0.1 1.55 perf-profile.children.cycles-pp.filemap_map_pages
1.48 +0.1 1.56 perf-profile.children.cycles-pp.do_read_fault
1.54 +0.1 1.62 perf-profile.children.cycles-pp.do_fault
2.75 +0.1 2.86 perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys
1.85 +0.1 1.98 perf-profile.children.cycles-pp.__handle_mm_fault
2.04 ± 2% +0.1 2.18 perf-profile.children.cycles-pp.handle_mm_fault
2.63 ± 3% +0.2 2.81 perf-profile.children.cycles-pp.exc_page_fault
2.62 ± 3% +0.2 2.80 perf-profile.children.cycles-pp.do_user_addr_fault
3.24 ± 3% +0.2 3.44 perf-profile.children.cycles-pp.asm_exc_page_fault
3.83 +0.2 4.04 perf-profile.children.cycles-pp.flush_tlb_func
0.69 ± 2% +0.2 0.92 perf-profile.children.cycles-pp._find_next_bit
9.92 +0.3 10.23 perf-profile.children.cycles-pp.llist_reverse_order
5.45 +0.4 5.81 perf-profile.children.cycles-pp.testcase
18.42 +0.5 18.96 perf-profile.children.cycles-pp.asm_sysvec_call_function
16.24 +0.5 16.78 perf-profile.children.cycles-pp.__flush_smp_call_function_queue
15.78 +0.5 16.32 perf-profile.children.cycles-pp.__sysvec_call_function
16.36 +0.5 16.90 perf-profile.children.cycles-pp.sysvec_call_function
27.92 -1.9 26.04 perf-profile.self.cycles-pp.llist_add_batch
0.16 ± 2% +0.0 0.18 ± 4% perf-profile.self.cycles-pp.up_read
0.42 ± 2% +0.0 0.45 perf-profile.self.cycles-pp.error_entry
0.21 ± 4% +0.0 0.24 ± 5% perf-profile.self.cycles-pp.down_read
0.26 ± 2% +0.0 0.29 ± 3% perf-profile.self.cycles-pp.tlb_finish_mmu
2.01 +0.0 2.05 perf-profile.self.cycles-pp.default_send_IPI_mask_sequence_phys
0.68 +0.0 0.73 perf-profile.self.cycles-pp.native_flush_tlb_local
3.10 +0.2 3.26 perf-profile.self.cycles-pp.flush_tlb_func
0.50 ± 2% +0.2 0.68 perf-profile.self.cycles-pp._find_next_bit
9.92 +0.3 10.22 perf-profile.self.cycles-pp.llist_reverse_order
16.10 +0.5 16.64 perf-profile.self.cycles-pp.smp_call_function_many_cond
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 25/10/2023 09.18, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
So with that, can we please just finally say "yeah, let's make the
generic bitmap library functions correct and usable in more cases"
instead of worrying about random micro-benchmarks that just show
you-win-some-you-lose-some.
Yes, users will have to treat results from the find routines carefully
if their bitmap may be concurrently modified. They do. Nobody wins if
those users are forced to implement their own bitmap routines for their
lockless algorithms.
Rasmus
On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> On 25/10/2023 09.18, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>
> So with that, can we please just finally say "yeah, let's make the
> generic bitmap library functions correct
They are all correct already.
> and usable in more cases"
See below.
> instead of worrying about random micro-benchmarks that just show
> you-win-some-you-lose-some.
That I agree with. I don't worry about either a +2% or a -3% benchmark, and
I don't think that they alone can either justify or rule out such a radical
change as making all find_bit() functions volatile, and shutting down
a newborn KCSAN.
Keeping that in mind, my best guess is that the test from Jan and Mirsad
that shows +2% was run against stable bitmaps, and what the robot measured
is most likely heavily concurrent access to some bitmap in the kernel.
I haven't looked at the sources of either test, but that at least makes
some sense: if GCC correctly optimizes code against properly described
memory, this is exactly what we would expect.
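To make it concrete, the change being benchmarked is essentially about how
each bitmap word is fetched inside the find loop. A simplified sketch of
find_next_bit() (not the actual FIND_NEXT_BIT() macro code from
lib/find_bit.c, just the shape of it):

/*
 * Simplified sketch only; the real code lives in the FIND_NEXT_BIT()
 * macro. Assumes the usual kernel helpers (__ffs(), min(), BITS_PER_LONG).
 */
static unsigned long sketch_find_next_bit(const unsigned long *addr,
					   unsigned long nbits,
					   unsigned long start)
{
	unsigned long idx, val;

	for (idx = start / BITS_PER_LONG; idx * BITS_PER_LONG < nbits; idx++) {
		/*
		 * Plain load: the compiler may fold, reorder or re-read it.
		 * The patch under discussion turns this into
		 * READ_ONCE(addr[idx]), so each word is fetched exactly once
		 * even if the bitmap is modified concurrently.
		 */
		val = addr[idx];
		if (idx == start / BITS_PER_LONG)
			val &= ~0UL << (start % BITS_PER_LONG);
		if (val)
			return min(idx * BITS_PER_LONG + __ffs(val), nbits);
	}
	return nbits;
}

With the plain load the compiler is free to cache or re-fetch the word; with
READ_ONCE() the result is at least derived from a single, well-defined read
of each word, which is what makes it usable on a changing bitmap.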
> Yes, users will have to treat results from the find routines carefully
> if their bitmap may be concurrently modified. They do. Nobody wins if
> those users are forced to implement their own bitmap routines for their
> lockless algorithms.
Again, I agree with this point, and I'm trying to address exactly this.
I'm working on a series that introduces lockless find_bit() functions
based on the existing FIND_BIT() engine. It's not ready yet, but I hope
to submit it for the next merge window.
https://github.com/norov/linux/commits/find_and_bit
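The core of it is a find-then-atomically-claim retry loop, roughly like
this (a sketch of the concept only, not the code in that branch):

/*
 * Sketch only: find a clear bit and atomically claim it, retrying if
 * another CPU claims it first. Names and exact form in the series differ.
 */
static inline unsigned long sketch_find_and_set_bit(unsigned long *addr,
						     unsigned long nbits)
{
	unsigned long bit;

	do {
		bit = find_first_zero_bit(addr, nbits);
		if (bit >= nbits)
			return nbits;		/* no free bit */
	} while (test_and_set_bit(bit, addr));	/* lost the race, retry */

	return bit;
}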
Now that we've got a test that presumably runs faster when the find_bit()
functions are all switched to be volatile, it would be great to get into
the details and understand:
- which find_bit() function or functions give that performance gain;
- on which bitmap(s);
- whether the reason is concurrent memory access (my guess is yes), and if so,
- whether we can refactor the code to use the lockless find_and_bit()
  functions mentioned above;
- if not, how else we can address this.
If you or someone else has time to dig deeper into this, I'll be
really thankful.
Thanks,
Yury
On Thu 26-10-23 20:51:22, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> > Yes, users will have to treat results from the find routines carefully
> > if their bitmap may be concurrently modified. They do. Nobody wins if
> > those users are forced to implement their own bitmap routines for their
> > lockless algorithms.
>
> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit
I agree that the find_and_{set|clear}() bit functions are useful and we'll
be able to remove some boilerplate code with them. But also note that you
will need to duplicate practically all of the bitmap API to provide similar
"atomic" functionality - e.g. the sbitmap conversion you have in your
branch has a bug in that it drops the 'lock' memory ordering from the bitmap
manipulation. So you need something like find_and_set_bit_lock() (which you
already have) and find_and_set_bit_wrap_lock() (which you don't have yet).
If you are to convert bitmap code in filesystems (some of which is lockless
as well), you will need to add little and big endian variants of volatile
bitmap functions. Finally there are users like lib/xarray.c which don't
want to set/clear the found bit (we just want to quickly find a good guess for
a set bit in the bitmap and we then verify in another structure whether the
guess was right or not). So we'll need the volatile variant of plain
find_first_bit(), find_next_bit() as well. Also when we have variants of
bitmap functions that are safe wrt parallel changes and those that are not,
the function names should probably indicate which is which.
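To illustrate just the first point, the lock-ordered flavour would
presumably have to be something like the following sketch (again, not code
from your branch), so that the winner of the race gets the acquire ordering
callers such as sbitmap rely on when the bit protects other data:

/*
 * Sketch only: like find_and_set_bit(), but a successful
 * test_and_set_bit_lock() provides acquire ordering.
 */
static inline unsigned long sketch_find_and_set_bit_lock(unsigned long *addr,
							  unsigned long nbits)
{
	unsigned long bit;

	do {
		bit = find_first_zero_bit(addr, nbits);
		if (bit >= nbits)
			return nbits;
	} while (test_and_set_bit_lock(bit, addr));

	return bit;
}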
So as much as I agree that your solution is theoretically the cleanest, I
personally don't think the cost in terms of code duplication and code churn
is really worth it.
> Now that we've got a test that presumably works faster if find_bit()
> functions are all switched to be volatile, it would be great if we get
> into details and understand:
> - what find_bit function or functions gives that gain in performance;
> - on what bitmap(s);
> - is the reason in concurrent memory access (guess yes), and if so,
> - can we refactor the code to use lockless find_and_bit() functions
> mentioned above;
> - if not, how else can we address this.
Frankly, I don't think there's any substantial reason why the volatile or
non-volatile code should be faster. The guys from the compiler team looked at
the x86 disassembly and said both variants should be the same speed based on
instruction costs. What could be causing these small performance differences
is that the resulting code is laid out slightly differently (and the
volatile bitmap functions end up being somewhat larger) and that somehow
interferes with instruction caching or CPU-internal out-of-order execution.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR