2023-10-25 07:19:58

by Oliver Sang

Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps



Hello,

kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:


commit: df671b17195cd6526e029c70d04dfb72561082d7 ("[PATCH 1/2] lib/find: Make functions safe on changing bitmaps")
url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 1c8b86a3799f7e5be903c3f49fcdaee29fd385b5
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:

nr_task: 50%
mode: thread
test: tlb_flush3
cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231025/[email protected]

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/tlb_flush3/will-it-scale

commit:
1c8b86a379 ("Merge tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip")
df671b1719 ("lib/find: Make functions safe on changing bitmaps")

1c8b86a3799f7e5b df671b17195cd6526e029c70d04
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.14 ? 19% +36.9% 0.19 ? 17% perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
2.26e+08 +3.6% 2.343e+08 proc-vmstat.pgfault
0.04 +25.0% 0.05 turbostat.IPC
32666 -15.5% 27605 ? 2% turbostat.POLL
7856 +2.2% 8025 vmstat.system.cs
6331931 +2.3% 6478704 vmstat.system.in
700119 +3.7% 725931 will-it-scale.52.threads
13463 +3.7% 13959 will-it-scale.per_thread_ops
700119 +3.7% 725931 will-it-scale.workload
8.36 -7.3% 7.74 perf-stat.i.MPKI
4.591e+09 +3.4% 4.747e+09 perf-stat.i.branch-instructions
1.832e+08 +2.8% 1.883e+08 perf-stat.i.branch-misses
26.70 -0.3 26.40 perf-stat.i.cache-miss-rate%
7852 +2.2% 8021 perf-stat.i.context-switches
6.43 -7.2% 5.97 perf-stat.i.cpi
769.61 +1.8% 783.29 perf-stat.i.cpu-migrations
6.39e+09 +3.4% 6.606e+09 perf-stat.i.dTLB-loads
2.94e+09 +3.2% 3.035e+09 perf-stat.i.dTLB-stores
78.29 -0.9 77.44 perf-stat.i.iTLB-load-miss-rate%
18959450 +3.5% 19621273 perf-stat.i.iTLB-load-misses
5254435 +8.7% 5713444 perf-stat.i.iTLB-loads
2.236e+10 +7.7% 2.408e+10 perf-stat.i.instructions
1181 +4.0% 1228 perf-stat.i.instructions-per-iTLB-miss
0.16 +7.7% 0.17 perf-stat.i.ipc
0.02 ? 36% -49.6% 0.01 ? 53% perf-stat.i.major-faults
485.08 +3.0% 499.67 perf-stat.i.metric.K/sec
141.71 +3.2% 146.25 perf-stat.i.metric.M/sec
747997 +3.7% 775416 perf-stat.i.minor-faults
3127957 -13.9% 2693728 perf-stat.i.node-loads
26089697 +3.4% 26965335 perf-stat.i.node-store-misses
767569 +3.7% 796095 perf-stat.i.node-stores
747997 +3.7% 775416 perf-stat.i.page-faults
8.35 -7.3% 7.74 perf-stat.overall.MPKI
26.70 -0.3 26.40 perf-stat.overall.cache-miss-rate%
6.43 -7.1% 5.97 perf-stat.overall.cpi
78.30 -0.9 77.45 perf-stat.overall.iTLB-load-miss-rate%
1179 +4.0% 1226 perf-stat.overall.instructions-per-iTLB-miss
0.16 +7.7% 0.17 perf-stat.overall.ipc
9644584 +3.8% 10011125 perf-stat.overall.path-length
4.575e+09 +3.4% 4.731e+09 perf-stat.ps.branch-instructions
1.825e+08 +2.8% 1.876e+08 perf-stat.ps.branch-misses
7825 +2.2% 7995 perf-stat.ps.context-switches
767.16 +1.8% 780.76 perf-stat.ps.cpu-migrations
6.368e+09 +3.4% 6.583e+09 perf-stat.ps.dTLB-loads
2.93e+09 +3.2% 3.025e+09 perf-stat.ps.dTLB-stores
18896725 +3.5% 19555325 perf-stat.ps.iTLB-load-misses
5236456 +8.7% 5693636 perf-stat.ps.iTLB-loads
2.229e+10 +7.6% 2.399e+10 perf-stat.ps.instructions
745423 +3.7% 772705 perf-stat.ps.minor-faults
3117663 -13.9% 2684861 perf-stat.ps.node-loads
26002765 +3.4% 26875267 perf-stat.ps.node-store-misses
764789 +3.7% 793098 perf-stat.ps.node-stores
745423 +3.7% 772705 perf-stat.ps.page-faults
6.752e+12 +7.6% 7.267e+12 perf-stat.total.instructions
19.21 -1.0 18.18 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
17.00 -0.9 16.09 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
65.30 -0.6 64.69 perf-profile.calltrace.cycles-pp.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64
65.34 -0.6 64.75 perf-profile.calltrace.cycles-pp.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
65.98 -0.5 65.45 perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
65.96 -0.5 65.42 perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
9.72 ? 2% -0.5 9.20 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
66.33 -0.5 65.81 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
66.46 -0.5 65.95 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
31.88 -0.4 31.43 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single
67.72 -0.4 67.28 perf-profile.calltrace.cycles-pp.__madvise
32.15 -0.4 31.73 perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior
32.60 -0.4 32.21 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise
32.93 -0.3 32.58 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
31.07 -0.3 30.74 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single
31.58 -0.3 31.28 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior
31.61 -0.3 31.30 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise
31.80 -0.3 31.51 perf-profile.calltrace.cycles-pp.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
8.34 -0.1 8.22 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask
8.06 -0.1 7.95 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond
7.98 -0.1 7.87 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch
0.59 ? 3% +0.1 0.65 ? 2% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.testcase
1.46 +0.1 1.53 perf-profile.calltrace.cycles-pp.filemap_map_pages.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault
1.48 +0.1 1.55 perf-profile.calltrace.cycles-pp.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
1.53 +0.1 1.62 perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
2.92 +0.1 3.02 perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
1.26 ? 2% +0.1 1.36 perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
1.84 +0.1 1.96 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
7.87 +0.1 8.00 perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
2.03 ? 2% +0.1 2.17 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
2.90 +0.2 3.06 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
2.62 ? 3% +0.2 2.80 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
2.58 ? 3% +0.2 2.76 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
2.95 ? 3% +0.2 3.14 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
2.75 +0.2 2.94 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
4.96 +0.3 5.29 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
4.92 +0.3 5.25 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
5.13 +0.3 5.46 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
5.08 +0.4 5.44 perf-profile.calltrace.cycles-pp.testcase
37.25 -2.0 35.24 perf-profile.children.cycles-pp.llist_add_batch
62.82 -0.8 62.04 perf-profile.children.cycles-pp.on_each_cpu_cond_mask
62.82 -0.8 62.04 perf-profile.children.cycles-pp.smp_call_function_many_cond
63.70 -0.7 62.98 perf-profile.children.cycles-pp.flush_tlb_mm_range
65.30 -0.6 64.70 perf-profile.children.cycles-pp.zap_page_range_single
65.34 -0.6 64.75 perf-profile.children.cycles-pp.madvise_vma_behavior
65.98 -0.5 65.45 perf-profile.children.cycles-pp.__x64_sys_madvise
65.96 -0.5 65.43 perf-profile.children.cycles-pp.do_madvise
66.52 -0.5 66.01 perf-profile.children.cycles-pp.do_syscall_64
66.65 -0.5 66.16 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
67.79 -0.4 67.36 perf-profile.children.cycles-pp.__madvise
32.94 -0.3 32.60 perf-profile.children.cycles-pp.tlb_finish_mmu
31.74 -0.3 31.43 perf-profile.children.cycles-pp.zap_pte_range
31.76 -0.3 31.46 perf-profile.children.cycles-pp.zap_pmd_range
31.95 -0.3 31.66 perf-profile.children.cycles-pp.unmap_page_range
0.42 ? 2% +0.0 0.46 perf-profile.children.cycles-pp.error_entry
0.20 ? 3% +0.0 0.24 ? 5% perf-profile.children.cycles-pp.up_read
0.69 +0.0 0.74 perf-profile.children.cycles-pp.native_flush_tlb_local
1.47 +0.1 1.55 perf-profile.children.cycles-pp.filemap_map_pages
1.48 +0.1 1.56 perf-profile.children.cycles-pp.do_read_fault
1.54 +0.1 1.62 perf-profile.children.cycles-pp.do_fault
2.75 +0.1 2.86 perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys
1.85 +0.1 1.98 perf-profile.children.cycles-pp.__handle_mm_fault
2.04 ? 2% +0.1 2.18 perf-profile.children.cycles-pp.handle_mm_fault
2.63 ? 3% +0.2 2.81 perf-profile.children.cycles-pp.exc_page_fault
2.62 ? 3% +0.2 2.80 perf-profile.children.cycles-pp.do_user_addr_fault
3.24 ? 3% +0.2 3.44 perf-profile.children.cycles-pp.asm_exc_page_fault
3.83 +0.2 4.04 perf-profile.children.cycles-pp.flush_tlb_func
0.69 ? 2% +0.2 0.92 perf-profile.children.cycles-pp._find_next_bit
9.92 +0.3 10.23 perf-profile.children.cycles-pp.llist_reverse_order
5.45 +0.4 5.81 perf-profile.children.cycles-pp.testcase
18.42 +0.5 18.96 perf-profile.children.cycles-pp.asm_sysvec_call_function
16.24 +0.5 16.78 perf-profile.children.cycles-pp.__flush_smp_call_function_queue
15.78 +0.5 16.32 perf-profile.children.cycles-pp.__sysvec_call_function
16.36 +0.5 16.90 perf-profile.children.cycles-pp.sysvec_call_function
27.92 -1.9 26.04 perf-profile.self.cycles-pp.llist_add_batch
0.16 ? 2% +0.0 0.18 ? 4% perf-profile.self.cycles-pp.up_read
0.42 ? 2% +0.0 0.45 perf-profile.self.cycles-pp.error_entry
0.21 ? 4% +0.0 0.24 ? 5% perf-profile.self.cycles-pp.down_read
0.26 ? 2% +0.0 0.29 ? 3% perf-profile.self.cycles-pp.tlb_finish_mmu
2.01 +0.0 2.05 perf-profile.self.cycles-pp.default_send_IPI_mask_sequence_phys
0.68 +0.0 0.73 perf-profile.self.cycles-pp.native_flush_tlb_local
3.10 +0.2 3.26 perf-profile.self.cycles-pp.flush_tlb_func
0.50 ? 2% +0.2 0.68 perf-profile.self.cycles-pp._find_next_bit
9.92 +0.3 10.22 perf-profile.self.cycles-pp.llist_reverse_order
16.10 +0.5 16.64 perf-profile.self.cycles-pp.smp_call_function_many_cond




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2023-10-25 08:18:16

by Rasmus Villemoes

Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

On 25/10/2023 09.18, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:

So with that, can we please just finally say "yeah, let's make the
generic bitmap library functions correct and usable in more cases"
instead of worrying about random micro-benchmarks that just show
you-win-some-you-lose-some?

Yes, users will have to treat results from the find routines carefully
if their bitmap may be concurrently modified. They do. Nobody wins if
those users are forced to implement their own bitmap routines for their
lockless algorithms.

Rasmus

2023-10-27 03:51:43

by Yury Norov

Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> On 25/10/2023 09.18, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>
> So with that, can we please just finally say "yeah, let's make the
> generic bitmap library functions correct

They are all correct already.

> and usable in more cases"

See below.

> instead of worrying about random micro-benchmarks that just show
> you-win-some-you-lose-some?

That I agree with. I don't worry about either a +2% or a -3% benchmark,
and I don't think they alone can either justify or rule out such a
radical change as making all find_bit functions volatile and silencing
the newborn KCSAN for these accesses.

Keeping that in mind, my best guess is that Jan's and Mirsad's test
that shows +2% was run against stable bitmaps, while what the robot
measured most likely involves heavily concurrent access to some bitmap
in the kernel.

I haven't looked at either test's sources, but that at least makes some
sense: if GCC correctly optimizes code against properly described
memory, this is exactly what we would expect.
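
For reference, the change under discussion roughly boils down to doing
every bitmap word read in the find loops through READ_ONCE(), i.e. a
volatile access. A simplified sketch of the idea (hypothetical helper
name, not the exact patch):

	/*
	 * Sketch only: roughly what the FIND_FIRST_BIT() path does, with
	 * the word load wrapped in READ_ONCE() so the compiler can neither
	 * re-read nor tear it if the bitmap changes under us.
	 */
	static unsigned long sketch_find_first_bit(const unsigned long *addr,
						   unsigned long size)
	{
		unsigned long idx, val;

		for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
			val = READ_ONCE(addr[idx]);	/* was: val = addr[idx]; */
			if (val)
				return min(idx * BITS_PER_LONG + __ffs(val), size);
		}

		return size;
	}

The flip side is that READ_ONCE() also takes away some of the compiler's
freedom to optimize the loop, which is the potential cost being weighed
here.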

> Yes, users will have to treat results from the find routines carefully
> if their bitmap may be concurrently modified. They do. Nobody wins if
> those users are forced to implement their own bitmap routines for their
> lockless algorithms.

Again, I agree with this point, and I'm trying to address exactly this.

I'm working on a series that introduces lockless find_bit functions
based on the existing FIND_BIT() engine. It's not ready yet, but I hope
I'll submit it in the next merge window.

https://github.com/norov/linux/commits/find_and_bit
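
The core of it is the obvious find-then-claim retry loop. A rough
sketch of the pattern (made-up name, not the code in the branch):

	/*
	 * Find a clear bit and claim it atomically; if somebody else sets
	 * the bit between the find and the test_and_set, just retry.
	 */
	static unsigned long sketch_find_and_set_bit(unsigned long *addr,
						     unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_first_zero_bit(addr, nbits);
			if (bit >= nbits)
				return nbits;	/* no clear bits left */
		} while (test_and_set_bit(bit, addr));

		return bit;
	}

With helpers like that, lockless users don't have to open-code the
find/test_and_set loop themselves.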

Now that we've got a test that presumably runs faster when the find_bit()
functions are all switched to be volatile, it would be great to dig into
the details and understand:
- which find_bit function or functions give that gain in performance;
- on which bitmap(s);
- whether the reason is concurrent memory access (my guess is yes), and if so,
- whether we can refactor the code to use the lockless find_and_bit()
  functions mentioned above;
- if not, how else we can address this.

If you or someone else has a spare time slot to dig deeper into that,
I'll be really thankful.

Thanks,
Yury

2023-10-27 09:57:16

by Jan Kara

Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

On Thu 26-10-23 20:51:22, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> > Yes, users will have to treat results from the find routines carefully
> > if their bitmap may be concurrently modified. They do. Nobody wins if
> > those users are forced to implement their own bitmap routines for their
> > lockless algorithms.
>
> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on the existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit

I agree that the find_and_{set|clear}() bit functions are useful and we'll
be able to remove some boilerplate code with them. But also note that you
will need to duplicate practically all of the bitmap API to provide similar
"atomic" functionality - e.g. the sbitmap conversion you have in your
branch has a bug in that it drops the 'lock' memory ordering from the
bitmap manipulation. So you need something like find_and_set_bit_lock()
(which you already have) and find_and_set_bit_wrap_lock() (which you don't
have yet). If you are to convert bitmap code in filesystems (some of which
is lockless as well), you will need to add little- and big-endian variants
of the volatile bitmap functions. Finally, there are users like
lib/xarray.c which don't want to set/clear the found bit (they just want
to quickly find a good guess for a set bit in the bitmap and then verify
in another structure whether the guess was right or not). So we'll need
volatile variants of plain find_first_bit() and find_next_bit() as well.
Also, once we have variants of bitmap functions that are safe wrt parallel
changes and variants that are not, the function names should probably
indicate which are which.
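
Just to illustrate the kind of duplication I mean, here is a sketch
(made-up name, not code from your branch) of what the sbitmap-style
variant has to look like; the only real difference from the plain helper
is that the bit is claimed with acquire ordering:

	/*
	 * Sketch: like a plain find-and-set helper, but claim the bit with
	 * test_and_set_bit_lock() so the winner gets the acquire ordering
	 * that sbitmap relies on today.
	 */
	static unsigned long sketch_find_and_set_bit_lock(unsigned long *addr,
							  unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_first_zero_bit(addr, nbits);
			if (bit >= nbits)
				return nbits;	/* bitmap is full */
		} while (test_and_set_bit_lock(bit, addr));

		return bit;
	}

Multiply that by the wrap, endian, and find-only flavors and the API
surface grows quickly.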

So as much as I agree that your solution is theoretically the cleanest, I
personally don't think the cost in terms of code duplication and code
churn is really worth it.

> Now that we've got a test that presumably runs faster when the find_bit()
> functions are all switched to be volatile, it would be great to dig into
> the details and understand:
> - which find_bit function or functions give that gain in performance;
> - on which bitmap(s);
> - whether the reason is concurrent memory access (my guess is yes), and if so,
> - whether we can refactor the code to use the lockless find_and_bit()
>   functions mentioned above;
> - if not, how else we can address this.

Frankly, I don't think there's any substantial reason why either the
volatile or the non-volatile code should be faster. The guys from the
compiler team looked at the x86 disassembly and said both variants should
be the same speed based on instruction costs. What could be causing these
small performance differences is that the resulting code is laid out
slightly differently (the volatile bitmap functions end up being somewhat
larger) and that somehow interferes with instruction caching or
CPU-internal out-of-order execution.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR