2022-07-24 21:18:15

by Linus Torvalds

Subject: Linux 5.19-rc8

As already mentioned last week, this release is one of those "extra
week of rc" ones, and here we are, with release candidate #8.

There's nothing really surprising in here - a few smaller fixups for
the retbleed mess as expected, and the usual random one-liners
elsewhere.

The diffstat shows mainly some documentation updates and a couple of
drivers with bigger fixes (eg the i915 GuC firmware thing), and the
networking sysctl data-race annotations.

So it all just makes me go "yeah, I'm happy to have done another rc,
but there is nothing particularly interesting here". Which is all
fine. Shortlog appended for the curious among you.

We'll let this simmer for another week, and please do give it another
round of testing to make this last week count, ok?

Linus

---

Aaron Lewis (1):
KVM: x86: Protect the unused bits in MSR exiting flags

Adam Borowski (1):
certs: make system keyring depend on x509 parser

Alexandru Elisei (1):
ASoC: rockchip: i2s: Fix NULL pointer dereference when pinctrl
is not found

Andy Shevchenko (1):
gpiolib: cdev: Fix kernel doc for struct line

Ben Dooks (1):
riscv: add as-options for modules with assembly compontents

Ben Hutchings (1):
x86/speculation: Make all RETbleed mitigations 64-bit only

Biao Huang (3):
stmmac: dwmac-mediatek: fix clock issue
net: stmmac: fix pm runtime issue in stmmac_dvr_remove()
net: stmmac: fix unbalanced ptp clock issue in suspend/resume flow

Biju Das (1):
spi: spi-rspi: Fix PIO fallback on RZ platforms

Christian König (1):
drm/ttm: fix locking in vmap/vunmap TTM GEM helpers

Dan Carpenter (1):
md/raid5: missing error code in setup_conf()

Daniele Ceraolo Spurio (1):
drm/i915/guc: support v69 in parallel to v70

Dawid Lukwinski (1):
i40e: Fix erroneous adapter reinitialization during recovery process

Dmitry Osipenko (1):
drm/scheduler: Don't kill jobs in interrupt context

Dylan Yudaken (2):
io_uring: fix free of unallocated buffer list
io_uring: do not recycle buffer in READV

Eric Snowberg (1):
lockdown: Fix kexec lockdown bypass with ima policy

Flavio Suligoi (1):
i2c: imx: fix typo in comment

Gavin Shan (1):
KVM: selftests: Fix target thread to be migrated in rseq_test

Haibo Chen (3):
gpio: pca953x: only use single read/write for No AI mode
gpio: pca953x: use the correct range when do regmap sync
gpio: pca953x: use the correct register address when regcache
sync during init

Hangyu Hua (1):
xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in
xfrm_bundle_lookup()

Hayes Wang (1):
r8152: fix a WOL issue

Herve Codina (1):
clk: lan966x: Fix the lan966x clock gate register address

Horatiu Vultur (7):
pinctrl: ocelot: Fix pincfg for lan966x
pinctrl: ocelot: Fix pincfg
net: lan966x: Fix taking rtnl_lock while holding spin_lock
net: lan966x: Fix usage of lan966x->mac_lock when entry is added
net: lan966x: Fix usage of lan966x->mac_lock when entry is removed
net: lan966x: Fix usage of lan966x->mac_lock inside
lan966x_mac_irq_handler
net: lan966x: Fix usage of lan966x->mac_lock when used by FDB

Hristo Venev (1):
be2net: Fix buffer overflow in be_get_module_eeprom

Ido Schimmel (1):
mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication

Jacky Bai (1):
MAINTAINERS: Update freescale pin controllers maintainer

Josh Poimboeuf (1):
lkdtm: Disable return thunks in rodata.c

Junxiao Chang (1):
net: stmmac: fix dma queue left shift overflow issue

Juri Lelli (1):
sched/deadline: Fix BUG_ON condition for deboosted tasks

Justin Stitt (1):
net: ipv4: fix clang -Wformat warnings

Kan Liang (1):
perf/x86/intel/lbr: Fix unchecked MSR access error on HSW

Kees Cook (1):
x86/alternative: Report missing return thunk details

Kent Gibson (1):
selftests: gpio: fix include path to kernel headers for out of tree builds

Khalid Masum (1):
scripts/gdb: Fix gdb 'lx-symbols' command

Krzysztof Kozlowski (1):
riscv: dts: align gpio-key node names with dtschema

Kuniyuki Iwashima (46):
ip: Fix data-races around sysctl_ip_default_ttl.
ip: Fix data-races around sysctl_ip_no_pmtu_disc.
ip: Fix data-races around sysctl_ip_fwd_use_pmtu.
ip: Fix data-races around sysctl_ip_fwd_update_priority.
ip: Fix data-races around sysctl_ip_nonlocal_bind.
ip: Fix a data-race around sysctl_ip_autobind_reuse.
ip: Fix a data-race around sysctl_fwmark_reflect.
tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
tcp: Fix data-races around sysctl_tcp_l3mdev_accept.
tcp: Fix data-races around sysctl_tcp_mtu_probing.
tcp: Fix data-races around sysctl_tcp_base_mss.
tcp: Fix data-races around sysctl_tcp_min_snd_mss.
tcp: Fix a data-race around sysctl_tcp_mtu_probe_floor.
tcp: Fix a data-race around sysctl_tcp_probe_threshold.
tcp: Fix a data-race around sysctl_tcp_probe_interval.
tcp/udp: Make early_demux back namespacified.
igmp: Fix data-races around sysctl_igmp_llm_reports.
igmp: Fix a data-race around sysctl_igmp_max_memberships.
igmp: Fix data-races around sysctl_igmp_max_msf.
igmp: Fix data-races around sysctl_igmp_qrv.
tcp: Fix data-races around keepalive sysctl knobs.
tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.
tcp: Fix data-races around sysctl_tcp_syncookies.
tcp: Fix data-races around sysctl_tcp_migrate_req.
tcp: Fix data-races around sysctl_tcp_reordering.
tcp: Fix data-races around some timeout sysctl knobs.
tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
tcp: Fix a data-race around sysctl_tcp_tw_reuse.
tcp: Fix data-races around sysctl_max_syn_backlog.
tcp: Fix data-races around sysctl_tcp_fastopen.
tcp: Fix data-races around sysctl_tcp_fastopen_blackhole_timeout.
ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh.
ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
ip: Fix data-races around sysctl_ip_prot_sock.
udp: Fix a data-race around sysctl_udp_l3mdev_accept.
tcp: Fix data-races around sysctl knobs related to SYN option.
tcp: Fix a data-race around sysctl_tcp_early_retrans.
tcp: Fix data-races around sysctl_tcp_recovery.
tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
tcp: Fix a data-race around sysctl_tcp_stdurg.
tcp: Fix a data-race around sysctl_tcp_rfc1337.
tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.
tcp: Fix data-races around sysctl_tcp_max_reordering.

Lennert Buytenhek (1):
igc: Reinstate IGC_REMOVED logic and implement it properly

Li Zhengyu (2):
RISCV: kexec: Fix build error without CONFIG_MODULES
RISC-V: kexec: Fix build error without CONFIG_KEXEC

Liang He (3):
net: dsa: microchip: ksz_common: Fix refcount leak bug
drm/imx/dcss: Add missing of_node_put() in fail path
can: rcar_canfd: Add missing of_node_put() in rcar_canfd_probe()

Linus Torvalds (4):
watchqueue: make sure to serialize 'wqueue->defunct' properly
watch-queue: remove spurious double semicolon
mmu_gather: fix the CONFIG_MMU_GATHER_NO_RANGE case
Linux 5.19-rc8

Lorenzo Bianconi (1):
net: ethernet: mtk_ppe: fix possible NULL pointer dereference in
mtk_flow_get_wdma_info

Luben Tuikov (1):
drm/amdgpu: Protect the amdgpu_bo_list list with a mutex v2

Maksym Glubokiy (1):
net: prestera: acl: use proper mask for port selector

Marc Kleine-Budde (2):
can: mcp251xfd: fix detection of mcp251863
spi: bcm2835: bcm2835_spi_handle_err(): fix NULL pointer deref
for non DMA transfers

Mario Limonciello (2):
pinctrl: Don't allow PINCTRL_AMD to be a module
ACPI: CPPC: Don't require flexible address space if
X86_FEATURE_CPPC is supported

Mark Brown (1):
ASoC: rockchip-i2s: Undo BCLK pinctrl changes

Matthew Brost (1):
drm/i915/guc: Support programming the EU priority in the GuC descriptor

Mustafa Ismail (2):
RDMA/irdma: Do not advertise 1GB page size for x722
RDMA/irdma: Fix sleep from invalid context BUG

Neeraj Upadhyay (1):
srcu: Make expedited RCU grace periods block even less frequently

Nícolas F. R. A. Prado (1):
drm/panel-edp: Fix variable typo when saving hpd absent delay from DT

Oleksij Rempel (2):
net: dsa: sja1105: silent spi_device_id warnings
net: dsa: vitesse-vsc73xx: silent spi_device_id warnings

Oliver Upton (1):
KVM: stats: Fix value for KVM_STATS_UNIT_MAX for boolean stats

Oz Shlomo (1):
net/sched: cls_api: Fix flow action initialization

Paolo Bonzini (1):
tools headers UAPI: Sync linux/kvm.h with the kernel sources

Paul E. McKenney (1):
srcu: Block less aggressively for expedited grace periods

Pawan Gupta (1):
x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS parts

Peter Zijlstra (5):
x86/amd: Use IBPB for firmware calls
mmu_gather: Remove per arch tlb_{start,end}_vma()
csky/tlb: Remove tlb_flush() define
mmu_gather: Let there be one tlb_{start,end}_vma() implementation
mmu_gather: Force tlb-flush VM_PFNMAP vmas

Piotr Skajewski (1):
ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero

Przemyslaw Patynowski (4):
iavf: Fix VLAN_V2 addition/rejection
iavf: Disallow changing rx/tx-frames and rx/tx-frames-irq
iavf: Fix handling of dummy receive descriptors
iavf: Fix missing state logs

Robert Hancock (1):
i2c: cadence: Change large transfer count reset logic to be unconditional

Sai Krishna Potthuri (1):
spi: spi-cadence: Fix SPI NO Slave Select macro definition

Sascha Hauer (1):
mtd: rawnand: gpmi: Set WAIT_FOR_READY timeout based on
program/erase times

Sasha Neftin (2):
e1000e: Enable GPT clock before sending message to CSME
Revert "e1000e: Fix possible HW unit hang after an s0ix exit"

Srinivas Neeli (1):
gpio: gpio-xilinx: Fix integer overflow

Stylon Wang (1):
drm/amd/display: Fix new dmub notification enabling in DM

Taehee Yoo (8):
amt: use workqueue for gateway side message handling
amt: remove unnecessary locks
amt: use READ_ONCE() in amt module
amt: add missing regeneration nonce logic in request logic
amt: drop unexpected advertisement message
amt: drop unexpected query message
amt: drop unexpected multicast data
amt: do not use amt->nr_tunnels outside of lock

Tariq Toukan (1):
net/tls: Fix race in TLS device down flow

Tom Lendacky (1):
virt: sev-guest: Pass the appropriate argument type to iounmap()

Tom Rix (1):
net: ethernet: mtk_eth_soc: fix off by one check of ARRAY_SIZE

Tony Lindgren (1):
mmc: sdhci-omap: Fix a lockdep warning for PM runtime init

Vadim Pasternak (1):
i2c: mlxcpld: Fix register setting for 400KHz frequency

Vladimir Oltean (19):
docs: net: dsa: update probing documentation
docs: net: dsa: document the shutdown behavior
docs: net: dsa: rename tag_protocol to get_tag_protocol
docs: net: dsa: add more info about the other arguments to
get_tag_protocol
docs: net: dsa: document change_tag_protocol
docs: net: dsa: document the teardown method
docs: net: dsa: document port_setup and port_teardown
docs: net: dsa: document port_fast_age
docs: net: dsa: remove port_bridge_tx_fwd_offload
docs: net: dsa: remove port_vlan_dump
docs: net: dsa: delete port_mdb_dump
docs: net: dsa: add a section for address databases
docs: net: dsa: re-explain what port_fdb_dump actually does
docs: net: dsa: delete misinformation about -EOPNOTSUPP for FDB/MDB/VLAN
docs: net: dsa: mention that VLANs are now refcounted on shared ports
pinctrl: armada-37xx: make irq_lock a raw spinlock to avoid
invalid wait context
pinctrl: armada-37xx: use raw spinlocks for regmap to avoid
invalid wait context
net: dsa: fix dsa_port_vlan_filtering when global
net: dsa: fix NULL pointer dereference in dsa_port_reset_vlan_filtering

William Dean (2):
pinctrl: ralink: Check for null return of devm_kcalloc
pinctrl: sunplus: Add check for kcalloc

Wong Vee Khee (2):
net: stmmac: switch to use interrupt for hw crosstimestamping
net: stmmac: remove redunctant disable xPCS EEE call

Xin Long (1):
Documentation: fix udp_wmem_min in ip-sysctl.rst

xinhui pan (1):
drm/amdgpu: Remove one duplicated ef removal


2022-07-25 16:24:11

by Guenter Roeck

Subject: Re: Linux 5.19-rc8

On Sun, Jul 24, 2022 at 01:42:18PM -0700, Linus Torvalds wrote:
> As already mentioned last week, this release is one of those "extra
> week of rc" ones, and here we are, with release candidate #8.
>
> There's nothing really surprising in here - a few smaller fixups for
> the retbleed mess as expected, and the usual random one-liners
> elsewhere.
>
> The diffstat shows mainly some documentation updates and a couple of
> drivers with bigger fixes (eg the i915 GuC firmware thing), and the
> networking sysctl data-race annotations.
>
> So it all just makes me go "yeah, I'm happy to have done another rc,
> but there is nothing particularly interesting here". Which is all
> fine. Shortlog appended for the curious among you.
>
> We'll let this simmer for another week, and please do give it another
> round of testing to make this last week count, ok?
>

Build results:
total: 150 pass: 150 fail: 0
Qemu test results:
total: 489 pass: 489 fail: 0

I still randomly see

BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48

That is not a new problem; I see it with some arm boot tests
whenever KFENCE is enabled. I've attached a trace with decoded
symbols below. Maybe someone with more experience than me can
figure out whether this is an arm-specific problem, a test problem,
or a qemu problem.

Guenter

---
test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 11667
==================================================================
BUG: KFENCE: out-of-bounds read in _find_next_bit_le (arch/arm/lib/findbit.S:88)

Out-of-bounds read at 0xef59e000 (4096B right of kfence-#93):
_find_next_bit_le (arch/arm/lib/findbit.S:88)
kfence-#93: 0xef59d000-0xef59dfff, size=4096, cache=kmalloc-4k
allocated by task 1 on cpu 1 at 18.432911s:
test_bitmap_printlist (./include/linux/slab.h:600 lib/test_bitmap.c:452)
test_bitmap_init (lib/test_bitmap.c:883 lib/test_bitmap.c:889)
do_one_initcall (./include/linux/jump_label.h:261 ./include/linux/jump_label.h:271 ./include/trace/events/initcall.h:48 init/main.c:1296)
kernel_init_freeable (init/main.c:1367 init/main.c:1384 init/main.c:1403 init/main.c:1610)
kernel_init (init/main.c:1501)
ret_from_fork (arch/arm/kernel/entry-common.S:149)
0x0
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-rc8 #1
Hardware name: Samsung Exynos (Flattened Device Tree)
PC is at _find_next_bit_le (arch/arm/lib/findbit.S:88)
LR is at bitmap_list_string.constprop.0 (lib/vsprintf.c:1246)
pc : lr : psr: 20000113
sp : f082dc70 ip : 00000001 fp : 00000001
r10: 00000000 r9 : 0000002d r8 : ef59d000
r7 : c0e55514 r6 : c2215000 r5 : 00008000 r4 : 00008000
r3 : 845cac12 r2 : 00008001 r1 : 00008000 r0 : ef59d000
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 4000406a DAC: 00000051
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-rc8 #1
Hardware name: Samsung Exynos (Flattened Device Tree)
unwind_backtrace from show_stack (arch/arm/kernel/traps.c:255)
show_stack from dump_stack_lvl (lib/dump_stack.c:107)
dump_stack_lvl from kfence_report_error (mm/kfence/report.c:262)
kfence_report_error from kfence_handle_page_fault (mm/kfence/core.c:1151)
kfence_handle_page_fault from __do_kernel_fault.part.0 (arch/arm/mm/fault.c:143)
__do_kernel_fault.part.0 from do_page_fault (arch/arm/mm/fault.c:380)
do_page_fault from do_DataAbort (arch/arm/mm/fault.c:539)
do_DataAbort from __dabt_svc (arch/arm/kernel/entry-armv.S:214)
Exception stack(0xf082dc20 to 0xf082dc68)
dc20: ef59d000 00008000 00008001 845cac12 00008000 00008000 c2215000 c0e55514
dc40: ef59d000 0000002d 00000000 00000001 00000001 f082dc70 c0715930 c06ff18c
dc60: 20000113 ffffffff
__dabt_svc from _find_next_bit_le (arch/arm/lib/findbit.S:88)
==================================================================
test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
', Time: 15696250

2022-07-25 18:32:02

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <[email protected]> wrote:
>
> BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48

Ok, I was hoping somebody more ARMy would look at this, particularly
since there is no call trace beyond the actual fault.

So it shows that it happens in _find_next_bit_le(), but not who called it.

It does show "who allocated the page", and I can see the message that
is printed afterwards, so it comes from that

static void __init test_bitmap_printlist(void)

function, so I guess we know the call chain:

test_bitmap_printlist ->
bitmap_print_to_pagebuf ->
scnprintf "%*pbl\n" ->
pointer ->
bitmap_list_string ->
for_each_set_bitrange

and I think I see what's wrong in there. That thing does

(b) = find_next_bit((addr), (size), (e) + 1), \
(e) = find_next_zero_bit((addr), (size), (b) + 1))

for the end of the range, and looking at the oops, the instruction
that oopses is

ldrb r3, [r0, r2, lsr #3]

where 'r2' is the bit position, and 'r0' is the start of the bitmap.

And:

> r10: 00000000 r9 : 0000002d r8 : ef59d000
> r7 : c0e55514 r6 : c2215000 r5 : 00008000 r4 : 00008000
> r3 : 845cac12 r2 : 00008001 r1 : 00008000 r0 : ef59d000

Lookie here: r1 contains the size, and r2 is past the end of the size.
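
To spell out what that ldrb then reads, here is an illustrative C
equivalent using only the numbers already in the report above (the helper
name is made up):

unsigned char load_bit_byte(const unsigned char *addr, unsigned long start)
{
    /*
     * ldrb r3, [r0, r2, lsr #3]: load the byte that holds bit 'start'.
     * With r0 = 0xef59d000 (kfence-#93, 0xef59d000-0xef59dfff, 4096 bytes)
     * and r2 = 0x8001, that is 0xef59d000 + (0x8001 >> 3) = 0xef59e000,
     * the first byte past the allocation - the reported out-of-bounds read.
     */
    return addr[start >> 3];
}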

So pick your poison: either the bug is in

(a) the bitmap region iterators shouldn't even ask for past-the-end results

I've added Dennis Zhou who did that first
bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
genericize percpu bitmap region iterators"), and Yury Norov who
renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
("bitmap: unify find_bit operations").

or

(b) the ARM find_next_bit() implementation, which doesn't check
whether the position is past the end

I've added Russell King (ARM stuff) and Catalin Marinas who
touched that last many many years ago in 8b592783a2e8 ("Thumb-2:
Implement the unified arch/arm/lib functions")

I think it's arguably a little bit of both, but mostly (b).

Note how the generic find_next_bit() (and _find_next_bit()) does

if (unlikely(start >= nbits))
return nbits;

but the arm version of it does not.
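
To make the failure mode concrete, here is a rough user-space sketch (toy
helper, made-up names - not the kernel code): with every bit of a
0x8000-bit bitmap set, as in the '0-32767' test case, the quoted iterator
body ends up calling find_next_bit() with start == 0x8001; the
generic-style "start >= nbits" guard in the toy helper is what turns that
into a harmless clamp instead of the byte load the ARM assembly performs:

#include <stdio.h>
#include <string.h>

#define NBITS   0x8000UL
#define BPL     (8 * sizeof(unsigned long))

static unsigned long toy_find_next_bit(const unsigned long *addr,
                                       unsigned long nbits,
                                       unsigned long start)
{
    if (start >= nbits)     /* generic code bails out here; the ARM asm loads a byte */
        return nbits;
    for (; start < nbits; start++)
        if (addr[start / BPL] & (1UL << (start % BPL)))
            return start;
    return nbits;
}

int main(void)
{
    unsigned long bitmap[NBITS / BPL];
    unsigned long b, e;

    memset(bitmap, 0xff, sizeof(bitmap));   /* every bit set: '0-32767' */

    /* find_next_zero_bit() finds nothing, so e lands on NBITS ... */
    e = NBITS;
    /* ... and the iterator's next step asks for bit NBITS + 1 == 0x8001,
     * the exact r2/r1 pair in the register dump above. */
    b = toy_find_next_bit(bitmap, NBITS, e + 1);

    printf("start=%#lx, nbits=%#lx -> clamped to %#lx\n", e + 1, NBITS, b);
    return 0;
}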

I think the fix might be something like this:

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index b5e8b9ae4c7d..b36ca301892e 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
ENTRY(_find_next_bit_le)
teq r1, #0
beq 3b
+ cmp r2, r1
+ bhs 3b
ands ip, r2, #7
beq 1b @ If new byte, goto old routine
ARM( ldrb r3, [r0, r2, lsr #3] )

but my ARM asm is so broken that the above is just really random noise
that may or may not build - much less work.

I'll leave it to Russell &co to have a tested and working patch.

Hmm?

Linus

2022-07-25 19:11:29

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 10:55 AM Linus Torvalds
<[email protected]> wrote:
>
> I think the fix might be something like this:

Hmm. Maybe the fix is to just not have the arm architecture-specific
version at all.

The generic code handles the "small constant size bitmap that fits in
a word" case better than the ARM special case code does.

And the generic code handles the "scan large bitmap" case better than
the ARM code does too, in that it does things a word at a time, while
the ARM special case code does things one byte at a time.

The ARM code does have a few things going for it:

(a) it's simple

(b) it has separate routines for the little-endian case

Now, (a) is probably not too strong an argument, because it's arguably
*too* simple, and buggy as a result. And having looked a bit more,
it's not just _find_next_bit_le() that has this bug, it's all the
"next" versions (zero-bit and big-endian).

But (b) is actively better than what the generic "find bit" code has.
The generic code is kind of disgusting in this area, with code like

if (le)
tmp = swab(tmp);

in lib/find_bit.c and this is nasty for two reasons:

(1) on little-endian, the "le" flag is mis-named: it's always zero,
and it never should swab, but the code was taken from some big-endian
generic case

(2) even on big-endian, that "le" flag is basically a compile-time
constant, but the header file shenanigans have hidden that fact, so it
ends up being a "pass a constant to a function that then has to test
it dynamically" situation

So the generic code is in many ways better than the ARM special case
code, but it has a couple of unfortunate warts too. At least those
unfortunate warts aren't outright *bugs*, they are just ugly.

Linus

2022-07-25 19:44:12

by Yury Norov

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <[email protected]> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
>
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.
>
> So it shows that it happens in _find_next_bit_le(), but not who called it.
>
> It does show "who allocated the page", and I can see the message that
> is printed afterwards, so it comes from that
>
> static void __init test_bitmap_printlist(void)
>
> function, so I guess we know the call chain:
>
> test_bitmap_printlist ->
> bitmap_print_to_pagebuf ->
> scnprintf "%*pbl\n" ->
> pointer ->
> bitmap_list_string ->
> for_each_set_bitrange
>
> and I think I see what's wrong in there. That thing does
>
> (b) = find_next_bit((addr), (size), (e) + 1), \
> (e) = find_next_zero_bit((addr), (size), (b) + 1))
>
> for the end of the range, and looking at the oops, the instruction
> that oopses is
>
> ldrb r3, [r0, r2, lsr #3]
>
> where 'r2' is the bit position, and 'r0' is the start of the bitmap.
>
> And:
>
> > r10: 00000000 r9 : 0000002d r8 : ef59d000
> > r7 : c0e55514 r6 : c2215000 r5 : 00008000 r4 : 00008000
> > r3 : 845cac12 r2 : 00008001 r1 : 00008000 r0 : ef59d000
>
> Lookie here: r1 contains the size, and r2 is past the end of the size.
>
> So pick your poison: either the bug is in
>
> (a) the bitmap region iterators shouldn't even ask for past-the-end results
>
> I've added Dennis Zhou who did that first
> bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
> genericize percpu bitmap region iterators"), and Yury Norov who
> renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
> ("bitmap: unify find_bit operations").
>
> or
>
> (b) the ARM find_next_bit() implementation, which doesn't check
> whether the position is past the end
>
> I've added Russell King (ARM stuff) and Catalin Marinas who
> touched that last many many years ago in 8b592783a2e8 ("Thumb-2:
> Implement the unified arch/arm/lib functions")
>
> I think it's arguably a little bit of both, but mostly (b).
>
> Note how the generic find_next_bit() (and _find_next_bit()) does
>
> if (unlikely(start >= nbits))
> return nbits;
>
> but the arm version of it does not.

If nbits == 0, a function of this sort shouldn't dereference memory at all.
Consider for example:
void *memchr(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    while (n-- != 0) {
        if ((unsigned char)c == *p++) {
            return (void *)(p - 1);
        }
    }
    return NULL;
}

In the case of find_next_bit(), we likewise shouldn't dereference memory if
start is out of bounds. That's why there's a start >= nbits check at the
very beginning. (We can't pack everything into a nice-looking loop like
memchr does, because we need to mask the 1st word at the beginning.)

> I think the fix might be something like this:
>
> diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> index b5e8b9ae4c7d..b36ca301892e 100644
> --- a/arch/arm/lib/findbit.S
> +++ b/arch/arm/lib/findbit.S
> @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> ENTRY(_find_next_bit_le)
> teq r1, #0
> beq 3b
> + cmp r2, r1
> + bhs 3b
> ands ip, r2, #7
> beq 1b @ If new byte, goto old routine
> ARM( ldrb r3, [r0, r2, lsr #3] )

Looking at the ARM implementation... For sure, it's harder to maintain
because it's asm. It hasn't been revisited for a long time, and I'm not even
sure it's faster than the generic code, because it reads memory per-byte
(ldrb), while bitmaps are optimized for per-word operations (ldr).

It doesn't implement new functions from the API like find_next_and_bit(),
so ARM takes the generic code for them, and nobody complains.

I've been looking at this code for quite a while. Now it's starting to cause
trouble. Maybe it's time to switch ARM to the generic bitmap API entirely?

Thanks,
Yury

> but my ARM asm is so broken that the above is just really random noise
> that may or may not build - much less work.
>
> I'll leave it to Russell &co to have a tested and working patch.
>
> Hmm?
>
> Linus

2022-07-25 20:43:10

by Geert Uytterhoeven

Subject: Build regressions/improvements in v5.19-rc8

Below is the list of build error/warning regressions/improvements in
v5.19-rc8[1] compared to v5.18[2].

Summarized:
- build errors: +3/-4
- build warnings: +11/-11

JFYI, when comparing v5.19-rc8[1] to v5.19-rc7[3], the summaries are:
- build errors: +1/-5
- build warnings: +0/-0

Note that there may be false regressions, as some logs are incomplete.
Still, they're build errors/warnings.

Happy fixing! ;-)

Thanks to the linux-next team for providing the build service.

[1] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/e0dccc3b76fb35bb257b4118367a883073d7390e/ (all 135 configs)
[2] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/4b0986a3613c92f4ec1bdc7f60ec66fea135991f/ (131 out of 135 configs)
[3] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/ff6992735ade75aae3e35d16b17da1008d753d28/ (all 135 configs)


*** ERRORS ***

3 error regressions:
+ /kisskb/src/include/ufs/ufshci.h: error: initializer element is not constant: => 245:36
+ error: relocation truncated to fit: R_SPARC_WDISP22 against `.init.text': => (.head.text+0x5100), (.head.text+0x5040)
+ error: relocation truncated to fit: R_SPARC_WDISP22 against symbol `leon_smp_cpu_startup' defined in .text section in arch/sparc/kernel/trampoline_32.o: => (.init.text+0xa4)

4 error improvements:
- /kisskb/src/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c: error: case label does not reduce to an integer constant: 4917:4 =>
- /kisskb/src/drivers/scsi/aacraid/commsup.c: error: case label does not reduce to an integer constant: 1983:2 =>
- error: arch/sparc/kernel/head_32.o: relocation truncated to fit: R_SPARC_WDISP22 against `.init.text': (.head.text+0x5100), (.head.text+0x5040) =>
- error: arch/sparc/kernel/head_32.o: relocation truncated to fit: R_SPARC_WDISP22 against symbol `leon_smp_cpu_startup' defined in .text section in arch/sparc/kernel/trampoline_32.o: (.init.text+0xa4) =>


*** WARNINGS ***

11 warning regressions:
+ .config: warning: override: ARCH_RV32I changes choice state: => 3864
+ arch/m68k/configs/multi_defconfig: warning: symbol value 'm' invalid for ZPOOL: => 61
+ arch/m68k/configs/sun3_defconfig: warning: symbol value 'm' invalid for ZPOOL: => 37
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47b0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47c8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47e0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47f8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4810): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4828): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4840): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: => N/A
+ modpost: WARNING: modpost: vmlinux.o(.text.unlikely+0x52bc): Section mismatch in reference from the function __trace_event_discard_commit() to the variable .init.data:initcall_level_names: => N/A

11 warning improvements:
- .config: warning: override: reassigning to symbol GCC_PLUGIN_RANDSTRUCT: 12253, 12475 =>
- /kisskb/src/drivers/scsi/mpt3sas/mpt3sas_base.c: warning: array subscript 'Mpi2SasIOUnitPage1_t {aka struct _MPI2_CONFIG_PAGE_SASIOUNIT_1}[0]' is partly outside array bounds of 'unsigned char[20]' [-Warray-bounds]: 5400:40, 5396:40, 5403:43 =>
- /kisskb/src/drivers/tty/serial/sh-sci.c: warning: unused variable 'sport' [-Wunused-variable]: 2651:26 =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4790): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47a8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47c0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47d8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47f0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4808): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4820): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A =>
- modpost: WARNING: modpost: vmlinux.o(.text.unlikely+0x45d4): Section mismatch in reference from the function __trace_event_discard_commit() to the variable .init.data:initcall_level_names: N/A =>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-07-25 20:48:25

by Yury Norov

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 11:49:09AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 10:55 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think the fix might be something like this:
>
> Hmm. Maybe the fix is to just not have the arm architecture-specific
> version at all.

I agree (see my other email in the thread). If no objections from ARM
people, I can drop it.

> The generic code handles the "small constant size bitmap that fits in
> a word" case better than the ARM special case code does.
>
> And the generic code handles the "scan large bitmap" case better than
> the ARM code does too, in that it does things a word at a time, while
> the ARM special case code does things one byte at a time.
>
> The ARM code does have a few things going for it:
>
> (a) it's simple
>
> (b) it has separate routines for the little-endian case
>
> Now, (a) is probably not too strong an argument, because it's arguably
> *too* simple, and buggy as a result. And having looked a bit more,
> it's not just _find_next_bit_le() that has this bug, it's all the
> "next" versions (zero-bit and big-endian).
>
> But (b) is actively better than what the generic "find bit" code has.
> The generic code is kind of disgusting in this area, with code like
>
> if (le)
> tmp = swab(tmp);

The patch that adds this is: b78c57135d470 ("lib/find_bit.c: join
_find_next_bit{_le}"), so I did that on purpose.

> in lib/find_bit.c and this is nasty for two reasons:
>
> (1) on little-endian, the "le" flag is mis-named: it's always zero,
> and it never should swab, but the code was taken from some big-endian
> generic case

Yes, the "le" is a bad name, and I think it should be changed. Are you OK
with "need_swab"?

> (2) even on big-endian, that "le" flag is basically a compile-time
> constant, but the header file shenanigans have hidden that fact, so it
> ends up being a "pass a constant to a function that then has to test
> it dynamically" situation

I don't think it's measurable; at least find_bit_benchmark didn't get worse.
Even if find_next_bit() is invoked many times in a loop, we can expect
that the branch predictor would optimize the difference away.

> So the generic code is in many ways better than the ARM special case
> code, but it has a couple of unfortunate warts too. At least those
> unfortunate warts aren't outright *bugs*, they are just ugly.

Here we have 2 ugly options - having pairs of almost identical
functions, or passing dummy variables. I decided that copy-pasting is
worse than abusing branch predictor.

Thanks,
Yury

2022-07-25 21:15:56

by Geert Uytterhoeven

Subject: Re: Build regressions/improvements in v5.19-rc8

On Mon, 25 Jul 2022, Geert Uytterhoeven wrote:
> JFYI, when comparing v5.19-rc8[1] to v5.19-rc7[3], the summaries are:
> - build errors: +1/-5

+ /kisskb/src/include/ufs/ufshci.h: error: initializer element is not constant: => 245:36

v5.19-rc8/powerpc-gcc5/ppc64_book3e_allmodconfig
v5.19-rc8/powerpc-gcc5/powerpc-allmodconfig
v5.19-rc8/powerpc-gcc5/ppc64le_allmodconfig

Seen before

> [1] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/e0dccc3b76fb35bb257b4118367a883073d7390e/ (all 135 configs)
> [3] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/ff6992735ade75aae3e35d16b17da1008d753d28/ (all 135 configs)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-07-25 21:16:55

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 1:35 PM Yury Norov <[email protected]> wrote:
>
> Here we have 2 ugly options - having pairs of almost identical
> functions, or passing dummy variables. I decided that copy-pasting is
> worse than abusing branch predictor.

The thing is, we have solutions for this - just use a single inline
function, and make it *see* the constant.

That way the compiler can DTRT, either generating two copies (with the
constant elided) or depending on the branch predictor.

That's particularly true in a case like this, when there is only *one*
case for the normal situation (ie little-endian). Because let's face
it, big-endian is completely dead and legacy model.
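
For illustration, a rough sketch of that pattern (made-up names, not the
actual lib/find_bit.c interface): the shared code is a single always-inline
helper, and each exported entry point passes a literal constant for the
swab decision, so every copy is compiled with the branch already resolved:

#include <stdint.h>

static inline __attribute__((always_inline))
uint64_t fetch_word(const uint64_t *addr, unsigned long idx, int swab_it)
{
    uint64_t w = addr[idx];

    if (swab_it)                    /* a constant after inlining */
        w = __builtin_bswap64(w);   /* stand-in for the kernel's swab() */
    return w;
}

/* Native bit order: the branch above is compiled out entirely. */
uint64_t fetch_word_native(const uint64_t *addr, unsigned long idx)
{
    return fetch_word(addr, idx, 0);
}

/*
 * Little-endian bit order on a big-endian build: the swab stays, but it is
 * unconditional in the generated code - no run-time flag test.  A
 * little-endian build would simply pass 0 here as well.
 */
uint64_t fetch_word_le(const uint64_t *addr, unsigned long idx)
{
    return fetch_word(addr, idx, 1);
}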

Linus

2022-07-26 09:22:34

by Russell King (Oracle)

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <[email protected]> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
>
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.

First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
the report wasn't Cc'd to me - I can't find anything in my mailbox about
it.

> I think the fix might be something like this:
>
> diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> index b5e8b9ae4c7d..b36ca301892e 100644
> --- a/arch/arm/lib/findbit.S
> +++ b/arch/arm/lib/findbit.S
> @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> ENTRY(_find_next_bit_le)
> teq r1, #0
> beq 3b
> + cmp r2, r1
> + bhs 3b
> ands ip, r2, #7
> beq 1b @ If new byte, goto old routine
> ARM( ldrb r3, [r0, r2, lsr #3] )
>
> but my ARM asm is so broken that the above is just really random noise
> that may or may not build - much less work.
>
> I'll leave it to Russell &co to have a tested and working patch.

I think it needs a bit more than that, but as you point out in later
emails, the compiler may do a better job for this.

One of the reasons for using byte loads was to avoid problems in the
early days of Linux where these took void pointers and thus could be
misaligned - and using word accesses would have resulted in much
pain. However, that was changed to unsigned long pointers back in
2017, so in theory that should no longer be a concern.

I don't remember why we used void pointers there originally - that's
something which dates back to the 1990s.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-26 15:40:03

by Yury Norov

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> > On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <[email protected]> wrote:
> > >
> > > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
> >
> > Ok, I was hoping somebody more ARMy would look at this, particularly
> > since there is no call trace beyond the actual fault.
>
> First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> the report wasn't Cc'd to me - I can't find anything in my mailbox about
> it.
>
> > I think the fix might be something like this:
> >
> > diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> > index b5e8b9ae4c7d..b36ca301892e 100644
> > --- a/arch/arm/lib/findbit.S
> > +++ b/arch/arm/lib/findbit.S
> > @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> > ENTRY(_find_next_bit_le)
> > teq r1, #0
> > beq 3b
> > + cmp r2, r1
> > + bhs 3b
> > ands ip, r2, #7
> > beq 1b @ If new byte, goto old routine
> > ARM( ldrb r3, [r0, r2, lsr #3] )
> >
> > but my ARM asm is so broken that the above is just really random noise
> > that may or may not build - much less work.
> >
> > I'll leave it to Russell &co to have a tested and working patch.
>
> I think it needs a bit more than that, but as you point out in later
> emails, the compiler may do a better job for this.
>
> One of the reasons for using byte loads was to avoid problems in the
> early days of Linux where these took void pointers and thus could be
> misaligned - and using word accesses would have resulted in much
> pain. However, that was changed to unsigned long pointers back in
> 2017, so in theory that should no longer be a concern.
>
> I don't remember why we used void pointers there originally - that's
> something which dates back to the 1990s.

OK, then I'll send a patch that switches it to the generic code.

Thanks,
Yury

2022-07-26 15:58:55

by Yury Norov

Subject: Re: Linux 5.19-rc8

On Mon, Jul 25, 2022 at 01:40:58PM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 1:35 PM Yury Norov <[email protected]> wrote:
> >
> > Here we have 2 ugly options - having pairs of almost identical
> > functions, or passing dummy variables. I decided that copy-pasting is
> > worse than abusing branch predictor.
>
> The thing is, we have solutions for this - just use a single inline
> function, and make it *see* the constant.
>
> That way the compiler can DTRT, either generating two copies (with the
> constant elided) or depending on the branch predictor.
>
> That's particularly true in a case like this, when there is only *one*
> case for the normal situation (ie little-endian). Because let's face
> it, big-endian is completely dead and legacy model.

OK, I'll do it for 5.20. Thanks for the hint.

2022-07-26 17:42:48

by Dennis Zhou

Subject: Re: Linux 5.19-rc8

Hello,

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <[email protected]> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
>
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.
>
> So it shows that it happens in _find_next_bit_le(), but not who called it.
>
> It does show "who allocated the page", and I can see the message that
> is printed afterwards, so it comes from that
>
> static void __init test_bitmap_printlist(void)
>
> function, so I guess we know the call chain:
>
> test_bitmap_printlist ->
> bitmap_print_to_pagebuf ->
> scnprintf "%*pbl\n" ->
> pointer ->
> bitmap_list_string ->
> for_each_set_bitrange
>
> and I think I see what's wrong in there. That thing does
>
> (b) = find_next_bit((addr), (size), (e) + 1), \
> (e) = find_next_zero_bit((addr), (size), (b) + 1))
>
> for the end of the range, and looking at the oops, the instruction
> that oopses is
>
> ldrb r3, [r0, r2, lsr #3]
>
> where 'r2' is the bit position, and 'r0' is the start of the bitmap.
>
> And:
>
> > r10: 00000000 r9 : 0000002d r8 : ef59d000
> > r7 : c0e55514 r6 : c2215000 r5 : 00008000 r4 : 00008000
> > r3 : 845cac12 r2 : 00008001 r1 : 00008000 r0 : ef59d000
>
> Lookie here: r1 contains the size, and r2 is past the end of the size.
>
> So pick your poison: either the bug is in
>
> (a) the bitmap region iterators shouldn't even ask for past-the-end results
>
> I've added Dennis Zhou who did that first
> bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
> genericize percpu bitmap region iterators"), and Yury Norov who
> renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
> ("bitmap: unify find_bit operations").
>

It seems like this is mostly taken care of by migrating arm to use the
generic implementations, but I just want to cover our bases here.

Are we okay with adding the contract find_*_bit() operations must handle
asking for past size properly? FWIW, we'd have to modify most of the
iterators in find.h.
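
For reference, a rough sketch of the two shapes behind that question
(simplified macros with made-up sketch_* names - not the real
include/linux/find.h definitions):

unsigned long sketch_find_next_bit(const unsigned long *addr,
                                   unsigned long nbits, unsigned long start);
unsigned long sketch_find_next_zero_bit(const unsigned long *addr,
                                        unsigned long nbits, unsigned long start);

/*
 * Option 1: make "start may point past size" part of the find_*_bit()
 * contract, and keep the iterator bodies simple (the shape quoted earlier
 * in the thread).
 */
#define sketch_for_each_set_bitrange(b, e, addr, size)                    \
    for ((b) = sketch_find_next_bit((addr), (size), 0),                   \
         (e) = sketch_find_next_zero_bit((addr), (size), (b) + 1);        \
         (b) < (size);                                                    \
         (b) = sketch_find_next_bit((addr), (size), (e) + 1),             \
         (e) = sketch_find_next_zero_bit((addr), (size), (b) + 1))

/*
 * Option 2: no such contract - then each iterator has to clamp the start
 * argument itself, which is the "modify most of the iterators in find.h"
 * alternative.
 */
static inline unsigned long sketch_clamp(unsigned long x, unsigned long hi)
{
    return x < hi ? x : hi;
}

#define sketch_for_each_set_bitrange_clamped(b, e, addr, size)            \
    for ((b) = sketch_find_next_bit((addr), (size), 0),                   \
         (e) = sketch_find_next_zero_bit((addr), (size),                  \
                                         sketch_clamp((b) + 1, (size)));  \
         (b) < (size);                                                    \
         (b) = sketch_find_next_bit((addr), (size),                       \
                                    sketch_clamp((e) + 1, (size))),       \
         (e) = sketch_find_next_zero_bit((addr), (size),                  \
                                         sketch_clamp((b) + 1, (size))))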

Thanks,
Dennis

2022-07-26 18:11:31

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <[email protected]> wrote:
>
> Are we okay with adding the contract find_*_bit() operations must handle
> asking for past size properly? FWIW, we'd have to modify most of the
> iterators in find.h.

So I think we're ok with it, if only it makes the macros simpler.

I also think we should probably look at the m68k case, because while
that one seems to not have the bug that the arm case had, if we remove
the arm case the m68k code is now the only non-generic case remaining.

And it just makes me go "maybe we should get rid of the whole
'override the generic code' thing entirely?"

I don't think that inlining the first word (like the m68k code does)
is worth it, but it *is* possible that the architecture-specific
functions generate better code for some common cases, so I think this
is a "needs looking at the generated code" and not just a blind
removal.

Linus

2022-07-26 18:27:26

by Yury Norov

Subject: Re: Linux 5.19-rc8

+ Geert Uytterhoeven <[email protected]>
+ [email protected]

On Tue, Jul 26, 2022 at 10:51:01AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <[email protected]> wrote:
> >
> > Are we okay with adding the contract find_*_bit() operations must handle
> > asking for past size properly? FWIW, we'd have to modify most of the
> > iterators in find.h.
>
> So I think we're ok with it, if only it makes the macros simpler.
>
> I also think we should probably look at the m68k case, because while
> that one seems to not have the bug that the arm case had, if we remove
> the arm case the m68k code is now the only non-generic case remaining.
>
> And it just makes me go "maybe we should get rid of the whole
> 'override the generic code' thing entirely?"
>
> I don't think that inlining the first word (like the m68k code does)
> is worth it, but it *is* possible that the architecture-specific
> functions generate better code for some common cases,

We have find_bit_benchmark to check how it works in practice. Would
be great if someone with access to the hardware can share numbers.

> so I think this
> is a "needs looking at the generated code" and not just a blind
> removal.
>
> Linus

2022-07-26 18:52:44

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <[email protected]> wrote:
>
> We have find_bit_benchmark to check how it works in practice. Would
> be great if someone with access to the hardware can share numbers.

Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

These are helper functions that are probably seldom super-hot in
cache, and as with so many of these things, I suspect the cool-I$
numbers are the ones that matter most in real life.

When some filesystem ends up searching for a free block or similar, it
will probably have done other things beforehand, which means that the
L1 I$ has long been flushed, and the branch history is quite possibly
entirely gone too.

The same is quite possibly true for the bitmap itself in D$ too.

That said, looking at the x86 code generation (not only because I have
the build environment, but because I can actually read and make sense of the
asm), the only thing that looks bad is the conditional bswap.

So the code gen looks fairly good.

Linus

2022-07-26 19:56:56

by Russell King (Oracle)

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 11:36:21AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <[email protected]> wrote:
> >
> > We have find_bit_benchmark to check how it works in practice. Would
> > be great if someone with access to the hardware can share numbers.
>
> Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

Yes, that's what I was thinking - I've never seen it crop up in any of
the perf traces I've seen.

Nevertheless, here's some numbers from a single run of the
find_bit_benchmark module, kernel built with:
arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110

Current native implementation:

[ 46.184565]
Start testing find_bit() with random-filled bitmap
[ 46.195127] find_next_bit: 2440833 ns, 163112 iterations
[ 46.204226] find_next_zero_bit: 2372128 ns, 164569 iterations
[ 46.213152] find_last_bit: 2199779 ns, 163112 iterations
[ 46.299398] find_first_bit: 79526013 ns, 16234 iterations
[ 46.684026] find_first_and_bit: 377912990 ns, 32617 iterations
[ 46.692020] find_next_and_bit: 1269071 ns, 73562 iterations
[ 46.698745]
Start testing find_bit() with sparse bitmap
[ 46.705711] find_next_bit: 118652 ns, 656 iterations
[ 46.716621] find_next_zero_bit: 4183472 ns, 327025 iterations
[ 46.723395] find_last_bit: 50448 ns, 656 iterations
[ 46.762308] find_first_bit: 32190802 ns, 656 iterations
[ 46.769093] find_first_and_bit: 52129 ns, 1 iterations
[ 46.775882] find_next_and_bit: 62522 ns, 1 iterations

Generic implementation:

[ 25.149238]
Start testing find_bit() with random-filled bitmap
[ 25.160002] find_next_bit: 2640943 ns, 163537 iterations
[ 25.169567] find_next_zero_bit: 2838485 ns, 164144 iterations
[ 25.178595] find_last_bit: 2302372 ns, 163538 iterations
[ 25.204016] find_first_bit: 18697630 ns, 16373 iterations
[ 25.602571] find_first_and_bit: 391841480 ns, 32555 iterations
[ 25.610563] find_next_and_bit: 1260306 ns, 73587 iterations
[ 25.617295]
Start testing find_bit() with sparse bitmap
[ 25.624222] find_next_bit: 70289 ns, 656 iterations
[ 25.636478] find_next_zero_bit: 5527050 ns, 327025 iterations
[ 25.643253] find_last_bit: 52147 ns, 656 iterations
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations
[ 25.664087] find_first_and_bit: 48518 ns, 1 iterations
[ 25.670871] find_next_and_bit: 59750 ns, 1 iterations

Overall, I would say it's pretty similar (some generic perform
marginally better, some native perform marginally better) with the
exception of find_first_bit() being much better with the generic
implementation, but find_next_zero_bit() being noticeably worse.

So, pretty much nothing of any relevance between them, which may
come as a surprise given the byte vs word access differences between
the two implementations.

I suspect the reason behind that may be that the native
implementation code is smaller than the generic implementation,
outweighing the effects of by-byte rather than by-word access. I would
also suspect that, because of the smaller implementation, the native
version performs better in an I$-cool situation than the generic. Lastly,
I would suspect that if we fixed the bug in the native version and converted
it to use word loads, it would probably be better than the generic
version. I don't have anything to base that on other than gut feeling at
the moment, but I can make the changes to the native implementation and
see what effect that has, possibly tomorrow.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-26 20:36:16

by Linus Torvalds

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<[email protected]> wrote:
>
> Overall, I would say it's pretty similar (some generic perform
> marginally better, some native perform marginally better) with the
> exception of find_first_bit() being much better with the generic
> implementation, but find_next_zero_bit() being noticeably worse.

The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.

That explains why the generic code does so much better than your byte-wise asm.

In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.

I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.

And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.
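
For illustration, a rough sketch of that structure (simplified, made-up
names, little-endian only, no swab or bit-invert handling - not the real
lib/find_bit.c code): find_next_bit() deals with the first, possibly
partial, word itself and hands the remaining full words to find_first_bit():

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long sketch_find_first_bit(const unsigned long *addr,
                                           unsigned long nbits)
{
    unsigned long idx;

    for (idx = 0; idx * BITS_PER_LONG < nbits; idx++) {
        if (addr[idx]) {
            unsigned long bit = idx * BITS_PER_LONG +
                                (unsigned long)__builtin_ctzl(addr[idx]);
            return bit < nbits ? bit : nbits;
        }
    }
    return nbits;
}

static unsigned long sketch_find_next_bit(const unsigned long *addr,
                                          unsigned long nbits,
                                          unsigned long start)
{
    unsigned long word, offset;

    if (start >= nbits)
        return nbits;

    /* Handle the first, possibly partial, word here ... */
    word = addr[start / BITS_PER_LONG] & (~0UL << (start % BITS_PER_LONG));
    if (word) {
        unsigned long bit = (start & ~(BITS_PER_LONG - 1UL)) +
                            (unsigned long)__builtin_ctzl(word);
        return bit < nbits ? bit : nbits;
    }

    /* ... then let find_first_bit() scan the remaining full words. */
    offset = (start / BITS_PER_LONG + 1) * BITS_PER_LONG;
    if (offset >= nbits)
        return nbits;
    return offset + sketch_find_first_bit(addr + offset / BITS_PER_LONG,
                                          nbits - offset);
}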

Linus

2022-07-27 00:19:12

by Russell King (Oracle)

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> <[email protected]> wrote:
> >
> > Overall, I would say it's pretty similar (some generic perform
> > marginally better, some native perform marginally better) with the
> > exception of find_first_bit() being much better with the generic
> > implementation, but find_next_zero_bit() being noticeably worse.
>
> The generic _find_first_bit() code is actually sane and simple. It
> loops over words until it finds a non-zero one, and then does trivial
> calculations on that last word.
>
> That explains why the generic code does so much better than your byte-wise asm.
>
> In contrast, the generic _find_next_bit() I find almost offensively
> silly - which in turn explains why your byte-wide asm does better.
>
> I think the generic _find_next_bit() should actually do what the m68k
> find_next_bit code does: handle the first special word itself, and
> then just call find_first_bit() on the rest of it.
>
> And it should *not* try to handle the dynamic "bswap and/or bit sense
> invert" thing at all. That should be just four different (trivial)
> cases for the first word.

Here's the results for the native version converted to use word loads:

[ 37.319937]
Start testing find_bit() with random-filled bitmap
[ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
[ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
[ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
[ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
[ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
[ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
[ 37.752143]
Start testing find_bit() with sparse bitmap
[ 37.759032] find_next_bit: 41256 ns, 655 iterations
[ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
[ 37.776675] find_last_bit: 48742 ns, 655 iterations
[ 37.790961] find_first_bit: 7562371 ns, 655 iterations
[ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
[ 37.804527] find_next_and_bit: 59924 ns, 1 iterations

which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations)

find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-27 02:11:38

by Yury Norov

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<[email protected]> wrote:
>
> On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > <[email protected]> wrote:
> > >
> > > Overall, I would say it's pretty similar (some generic perform
> > > marginally better, some native perform marginally better) with the
> > > exception of find_first_bit() being much better with the generic
> > > implementation, but find_next_zero_bit() being noticeably worse.
> >
> > The generic _find_first_bit() code is actually sane and simple. It
> > loops over words until it finds a non-zero one, and then does trivial
> > calculations on that last word.
> >
> > That explains why the generic code does so much better than your byte-wise asm.
> >
> > In contrast, the generic _find_next_bit() I find almost offensively
> > silly - which in turn explains why your byte-wide asm does better.
> >
> > I think the generic _find_next_bit() should actually do what the m68k
> > find_next_bit code does: handle the first special word itself, and
> > then just call find_first_bit() on the rest of it.
> >
> > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > invert" thing at all. That should be just four different (trivial)
> > cases for the first word.
>
> Here's the results for the native version converted to use word loads:
>
> [ 37.319937]
> Start testing find_bit() with random-filled bitmap
> [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> [ 37.752143]
> Start testing find_bit() with sparse bitmap
> [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
>
> which is generally faster than the generic version, with the exception
> of the sparse find_first_bit (generic was:
> [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
>
> find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> the generic code.

Look at find_{first,next}_and_bit results. Those two have no arch version
and in both cases use generic code. In theory they should be equally fast
before and after, but your testing says that generic case is slower even
for them, and the difference is comparable with real arch functions numbers.
It makes me feel like:
- there's something unrelated, like governor/throttling that affect results;
- the numbers are identical, taking the dispersion into account.

If the difference really concerns you, I'd suggest running the test
several times
to measure confidence intervals.

Thanks,
Yury

2022-07-27 07:49:13

by Russell King (Oracle)

Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> <[email protected]> wrote:
> >
> > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > <[email protected]> wrote:
> > > >
> > > > Overall, I would say it's pretty similar (some generic perform
> > > > marginally better, some native perform marginally better) with the
> > > > exception of find_first_bit() being much better with the generic
> > > > implementation, but find_next_zero_bit() being noticably worse.
> > >
> > > The generic _find_first_bit() code is actually sane and simple. It
> > > loops over words until it finds a non-zero one, and then does trivial
> > > calculations on that last word.
> > >
> > > That explains why the generic code does so much better than your byte-wise asm.
> > >
> > > In contrast, the generic _find_next_bit() I find almost offensively
> > > silly - which in turn explains why your byte-wide asm does better.
> > >
> > > I think the generic _find_next_bit() should actually do what the m68k
> > > find_next_bit code does: handle the first special word itself, and
> > > then just call find_first_bit() on the rest of it.
> > >
> > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > invert" thing at all. That should be just four different (trivial)
> > > cases for the first word.
> >
> > Here's the results for the native version converted to use word loads:
> >
> > [ 37.319937]
> > Start testing find_bit() with random-filled bitmap
> > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > [ 37.752143]
> > Start testing find_bit() with sparse bitmap
> > [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
> >
> > which is generally faster than the generic version, with the exception
> > of the sparse find_first_bit (generic was:
> > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
> >
> > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > the generic code.
>
> Look at find_{first,next}_and_bit results. Those two have no arch version
> and in both cases use generic code. In theory they should be equally fast
> before and after, but your testing says that generic case is slower even
> for them, and the difference is comparable with real arch functions numbers.
> It makes me feel like:
> - there's something unrelated, like governor/throttling that affect results;
> - the numbers are identical, taking the dispersion into account.
>
> If the difference really concerns you, I'd suggest running the test
> several times
> to measure confidence intervals.

Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.

Here's the second run:

[26234.429389]
Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit: 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
[26234.457612] find_last_bit: 2223742 ns, 164278 iterations
[26234.482056] find_first_bit: 17720726 ns, 16384 iterations
[26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
[26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
[26234.874107]
Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit: 46142 ns, 656 iterations
[26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
[26234.898672] find_last_bit: 49727 ns, 656 iterations
[26234.912504] find_first_bit: 7107862 ns, 656 iterations
[26234.919290] find_first_and_bit: 52092 ns, 1 iterations
[26234.926076] find_next_and_bit: 60856 ns, 1 iterations

And a third run:

[26459.679524]
Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit: 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
[26459.707738] find_last_bit: 2213638 ns, 163311 iterations
[26459.732224] find_first_bit: 17764152 ns, 16429 iterations
[26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
[26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
[26460.148545]
Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit: 40753 ns, 653 iterations
[26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
[26460.173078] find_last_bit: 50017 ns, 653 iterations
[26460.187007] find_first_bit: 7205325 ns, 653 iterations
[26460.193790] find_first_and_bit: 49358 ns, 1 iterations
[26460.200577] find_next_and_bit: 62332 ns, 1 iterations

My gut feeling is that yes, there is some variance, but not of an
order significant enough to allow us to say "there's no difference".

find_next_bit results for random are: 2222703, 2206687, 2199418,
which is an average of 2209603 and a variance of around 0.5%.
The difference between this and the single generic figure I have
is on the order of 20%.

I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
The difference between this and the single generic figure I have is
on the order of 5%. Not so large, but still quite a big difference
compared to the variance.

find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
7291853. Variance is higher at about 4%. Difference between this and
the generic figure is 0.5%, so this one is not significantly
different.

The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. Similar is true of
find_next_bit for the sparse bitmap.

So, I think the results are significant in most cases and variance
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.
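
For reference, a small stand-alone C sketch of the arithmetic above, using
the three find_next_bit (random) figures quoted in this thread; "spread"
here is the maximum relative deviation from the mean, one plausible reading
of the informal "variance" percentages:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* find_next_bit, random-filled bitmap, three runs (ns) */
	const double runs[] = { 2222703, 2206687, 2199418 };
	const int n = sizeof(runs) / sizeof(runs[0]);
	double mean = 0.0, spread = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += runs[i];
	mean /= n;

	for (i = 0; i < n; i++)
		spread = fmax(spread, fabs(runs[i] - mean) / mean);

	/* prints roughly: mean 2209602.7 ns, spread 0.6% */
	printf("mean %.1f ns, spread %.1f%%\n", mean, spread * 100.0);
	return 0;
}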

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-27 08:05:03

by David Laight

[permalink] [raw]
Subject: RE: Linux 5.19-rc8

From: Yury Norov
> Sent: 27 July 2022 02:34
...
> If the difference really concerns you, I'd suggest running the test
> several times to measure confidence intervals.

Or, do what I've been doing: get an accurate cycle count
for a single call and repeat about 10 times.
The first value is typically slow (loading the I-cache); the
rest are typically identical.
They can often even be matched to the expected value!
The fastest value is the one to quote!

On x86 you have to use the perf cycle counter; the TSC
is just useless.
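
For reference, one way to read that cycle counter from user space is the
perf_event_open() interface; below is a minimal, generic sketch (it is not
the measurement harness described above, and it counts user-space cycles
only):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t cycles;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;

	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* ... code under test goes here ... */

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
		return 1;

	printf("cycles: %llu\n", (unsigned long long)cycles);
	close(fd);
	return 0;
}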

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-07-28 19:08:14

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: Linux 5.19-rc8

On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> the report wasn't Cc'd to me - I can't find anything in my mailbox about
> it.
>
> > I think the fix might be something like this:
> >
> > diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> > index b5e8b9ae4c7d..b36ca301892e 100644
> > --- a/arch/arm/lib/findbit.S
> > +++ b/arch/arm/lib/findbit.S
> > @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> > ENTRY(_find_next_bit_le)
> > teq r1, #0
> > beq 3b
> > + cmp r2, r1
> > + bhs 3b
> > ands ip, r2, #7
> > beq 1b @ If new byte, goto old routine
> > ARM( ldrb r3, [r0, r2, lsr #3] )
> >
> > but my ARM asm is so broken that the above is just really random noise
> > that may or may not build - much less work.
> >
> > I'll leave it to Russell &co to have a tested and working patch.
>
> I think it needs a bit more than that, but as you point out in later
> emails, the compiler may do a better job for this.

Okay, I've moved my patch that fixes this (without adding a single line
of code!) to my fixes branch, which I'll ask you to pull in the next
couple of days.

Each of the _find_next_* functions had:

teq r1, #0
beq 3b

at the beginning to catch the case where size == 0. This is now:

cmp r2, r1
bhs 3b

which is the C equivalent of:

if (offset >= size)
goto 3b;

where both are unsigned, and nicely covers the case where size == 0 as
before (since if size is 0, the condition is always true irrespective
of the value of offset.)

We can sort out the question of whether to keep this code later, but
since it has been spotted as an issue, I think it's important to get it
fixed.
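
A tiny self-contained C illustration of why that single unsigned comparison
also covers size == 0 (the helper name here is made up for the example, it
is not kernel code):

#include <assert.h>

/* Hypothetical stand-in for the assembler's new bounds check. */
static int past_end(unsigned long offset, unsigned long size)
{
	return offset >= size;	/* unsigned: always true when size == 0 */
}

int main(void)
{
	/* size == 0: any offset is "past the end", like the old teq/beq */
	assert(past_end(0, 0));
	assert(past_end(5, 0));
	/* normal cases */
	assert(!past_end(3, 32));
	assert(past_end(32, 32));
	return 0;
}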

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-29 00:18:28

by Guenter Roeck

[permalink] [raw]
Subject: Re: Linux 5.19-rc8

On Thu, Jul 28, 2022 at 07:28:22PM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> > First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> > the report wasn't Cc'd to me - I can't find anything in my mailbox about
> > it.
> >
> > > I think the fix might be something like this:
> > >
> > > diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> > > index b5e8b9ae4c7d..b36ca301892e 100644
> > > --- a/arch/arm/lib/findbit.S
> > > +++ b/arch/arm/lib/findbit.S
> > > @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> > > ENTRY(_find_next_bit_le)
> > > teq r1, #0
> > > beq 3b
> > > + cmp r2, r1
> > > + bhs 3b
> > > ands ip, r2, #7
> > > beq 1b @ If new byte, goto old routine
> > > ARM( ldrb r3, [r0, r2, lsr #3] )
> > >
> > > but my ARM asm is so broken that the above is just really random noise
> > > that may or may not build - much less work.
> > >
> > > I'll leave it to Russell &co to have a tested and working patch.
> >
> > I think it needs a bit more than that, but as you point out in later
> > emails, the compiler may do a better job for this.
>
> Okay, I've moved my patch that fixes this (without adding a single line
> of code!) to my fixes branch, which I'll ask you to pull in the next
> couple of days.
>
I downloaded your patch and ran it through my testbed.
With it applied, the problem is no longer seen.
Feel free to add

Tested-by: Guenter Roeck <[email protected]>

Thanks,
Guenter

> Each of the _find_next_* functions had:
>
> teq r1, #0
> beq 3b
>
> at the beginning to catch the case where size == 0. This is now:
>
> cmp r2, r1
> bhs 3b
>
> which is the C equivalent of:
>
> if (offset >= size)
> goto 3b;
>
> where both are unsigned, and nicely covers the case where size == 0 as
> before (since if size is 0, the condition is always true irrespective
> of the value of offset.)
>
> We can sort out the question of keeping this code or not later, but I
> think as this has been spotted as an issue, it's important to get it
> fixed.
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-07-30 22:34:05

by Yury Norov

[permalink] [raw]
Subject: Re: Linux 5.19-rc8

On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > <[email protected]> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > <[email protected]> wrote:
> > > > >
> > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > marginally better, some native perform marginally better) with the
> > > > > exception of find_first_bit() being much better with the generic
> > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > >
> > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > loops over words until it finds a non-zero one, and then does trivial
> > > > calculations on that last word.
> > > >
> > > > That explains why the generic code does so much better than your byte-wise asm.
> > > >
> > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > silly - which in turn explains why your byte-wide asm does better.
> > > >
> > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > find_next_bit code does: handle the first special word itself, and
> > > > then just call find_first_bit() on the rest of it.
> > > >
> > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > invert" thing at all. That should be just four different (trivial)
> > > > cases for the first word.
> > >
> > > Here's the results for the native version converted to use word loads:
> > >
> > > [ 37.319937]
> > > Start testing find_bit() with random-filled bitmap
> > > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > > [ 37.752143]
> > > Start testing find_bit() with sparse bitmap
> > > [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> > > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > > [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> > > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> > > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
> > >
> > > which is generally faster than the generic version, with the exception
> > > of the sparse find_first_bit (generic was:
> > > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
> > >
> > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > the generic code.
> >
> > Look at find_{first,next}_and_bit results. Those two have no arch version
> > and in both cases use generic code. In theory they should be equally fast
> > before and after, but your testing says that generic case is slower even
> > for them, and the difference is comparable with real arch functions numbers.
> > It makes me feel like:
> > - there's something unrelated, like governor/throttling that affect results;
> > - the numbers are identical, taking the dispersion into account.
> >
> > If the difference really concerns you, I'd suggest running the test
> > several times
> > to measure confidence intervals.
>
> Given that the benchmark is run against random bitmaps and with
> interrupts enabled, there is going to be noise in the results.
>
> Here's the second run:
>
> [26234.429389]
> Start testing find_bit() with random-filled bitmap
> [26234.439722] find_next_bit: 2206687 ns, 164277 iterations
> [26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
> [26234.457612] find_last_bit: 2223742 ns, 164278 iterations
> [26234.482056] find_first_bit: 17720726 ns, 16384 iterations
> [26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
> [26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
> [26234.874107]
> Start testing find_bit() with sparse bitmap
> [26234.881014] find_next_bit: 46142 ns, 656 iterations
> [26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
> [26234.898672] find_last_bit: 49727 ns, 656 iterations
> [26234.912504] find_first_bit: 7107862 ns, 656 iterations
> [26234.919290] find_first_and_bit: 52092 ns, 1 iterations
> [26234.926076] find_next_and_bit: 60856 ns, 1 iterations
>
> And a third run:
>
> [26459.679524]
> Start testing find_bit() with random-filled bitmap
> [26459.689871] find_next_bit: 2199418 ns, 163311 iterations
> [26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
> [26459.707738] find_last_bit: 2213638 ns, 163311 iterations
> [26459.732224] find_first_bit: 17764152 ns, 16429 iterations
> [26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
> [26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
> [26460.148545]
> Start testing find_bit() with sparse bitmap
> [26460.155433] find_next_bit: 40753 ns, 653 iterations
> [26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
> [26460.173078] find_last_bit: 50017 ns, 653 iterations
> [26460.187007] find_first_bit: 7205325 ns, 653 iterations
> [26460.193790] find_first_and_bit: 49358 ns, 1 iterations
> [26460.200577] find_next_and_bit: 62332 ns, 1 iterations
>
> My gut feeling is that yes, there is some variance, but not on an
> order that is significant that would allow us to say "there's no
> difference".
>
> find_next_bit results for random are: 2222703, 2206687, 2199418,
> which is an average of 2209603 and a variance of around 0.5%.
> The difference between this and the single generic figure I have
> is on the order of 20%.
>
> I'll do the same with find_first_bit for random: 17722203, 17720726,
> and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> The difference between this and the single generic figure I have is
> on the order of 5%. Not so large, but still quite a big difference
> compared to the variance.
>
> find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> 7291853. Variance is higher at about 4%. Difference between this and
> the generic figure is 0.5%, so this one is not significantly
> different.
>
> The best result looks to be find_next_zero_bit for the sparse bitmap
> case. The generic code measures 5.5ms, the native code is sitting
> around 4.1ms. That's a difference of around 34%, and by just looking
> at the range in the figures above we can see this is a significant
> result without needing to do the calculations. Similar is true of
> find_next_bit for the sparse bitmap.
>
> So, I think the results are significant in most cases and variance
> doesn't account for the differences. The only one which isn't is
> find_first_bit for the sparse case.

Hi Russell,

+ Alexey Klimov <[email protected]>

This is my testing of native vs generic find_bit operations on the A15
and A7.

The raw numbers were collected by Alexey Klimov on an Odroid-XU3. All CPU
frequencies were fixed at 1000MHz. (Thanks a lot!)

For each native/generic @ A15/A7 configuration, the find_bit_benchmark
was run 5 times, and the results are summarized below:

A15 Native Generic Difference
Dense ns ns % sigmas
find_next_bit: 3746929 3231641 14 8.3
find_next_zero_bit: 3935354 3202608 19 10.4
find_last_bit: 3134713 3129717 0 0.1
find_first_bit: 85626542 20498669 76 172.4
find_first_and_bit: 409252997 414820417 -1 -0.2
find_next_and_bit: 1678706 1654420 1 0.4

Sparse
find_next_bit: 143208 77924 46 29.4
find_next_zero_bit: 6893375 6059177 12 14.3
find_last_bit: 67174 68616 -2 -1.0
find_first_bit: 33689256 8151493 76 47.8
find_first_and_bit: 124758 156974 -26 -1.3
find_next_and_bit: 53391 56716 -6 -0.2

A7 Native Generic Difference
Dense ns ns % sigmas
find_next_bit: 4207627 5532764 -31 -14.9
find_next_zero_bit: 4259961 5236880 -23 -10.0
find_last_bit: 4281386 4201025 2 1.5
find_first_bit: 236913620 50970424 78 163.3
find_first_and_bit: 728069762 750580781 -3 -0.7
find_next_and_bit: 2696263 2766077 -3 -0.9

Sparse
find_next_bit: 327241 143776 56 40.7
find_next_zero_bit: 6987249 10235989 -46 -21.8
find_last_bit: 97758 94725 3 0.6
find_first_bit: 94628040 21051964 78 41.8
find_first_and_bit: 248133 241267 3 0.3
find_next_and_bit: 136475 154000 -13 -0.5

The last column is the difference between native and generic code
performance normalized to a standard deviation:
(mean(native) - mean(generic)) / max(std(native), std(generic))
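
A minimal C sketch of that metric; the per-run arrays below are
placeholders rather than the measured data, and a sample (n-1) standard
deviation is assumed:

#include <math.h>
#include <stdio.h>

#define RUNS 5

static double mean(const double *v, int n)
{
	double s = 0.0;
	int i;

	for (i = 0; i < n; i++)
		s += v[i];
	return s / n;
}

static double stddev(const double *v, int n)
{
	double m = mean(v, n), s = 0.0;
	int i;

	for (i = 0; i < n; i++)
		s += (v[i] - m) * (v[i] - m);
	return sqrt(s / (n - 1));	/* sample standard deviation */
}

int main(void)
{
	/* placeholder run times in ns -- substitute the real 5 runs */
	double native[RUNS]  = { 100, 101, 99, 100, 100 };
	double generic[RUNS] = { 90, 91, 89, 90, 90 };
	double sn = stddev(native, RUNS);
	double sg = stddev(generic, RUNS);
	double sigma = sn > sg ? sn : sg;

	printf("difference: %.1f sigmas\n",
	       (mean(native, RUNS) - mean(generic, RUNS)) / sigma);
	return 0;
}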

The results look consistent to me because the 'and' subtests, which are
always generic, differ by less than one sigma.

On A15 the generic code is a clear winner. On A7 the results are
inconsistent, although significant. It may be worth retesting on A7.

Regarding the choice between native and generic code - I would prefer the
generic version even if it were slightly slower, because it is better
tested and maintained. And since the test results are at least on par,
to me it's a no-brainer.

It would be really interesting to compare the performance of your LDRB->LDR
patch with the generic code using the same procedure.

Thanks,
Yury

2022-08-01 16:31:43

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: Linux 5.19-rc8

Oh FFS.

I see you decided off your own bat to remove the ARM version of the
find_bit functions, with NO agreement from the arch maintainer. This
is not on.


On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote:
> On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> > On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > > <[email protected]> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > > marginally better, some native perform marginally better) with the
> > > > > > exception of find_first_bit() being much better with the generic
> > > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > > >
> > > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > > loops over words until it finds a non-zero one, and then does trivial
> > > > > calculations on that last word.
> > > > >
> > > > > That explains why the generic code does so much better than your byte-wise asm.
> > > > >
> > > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > > silly - which in turn explains why your byte-wide asm does better.
> > > > >
> > > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > > find_next_bit code does: handle the first special word itself, and
> > > > > then just call find_first_bit() on the rest of it.
> > > > >
> > > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > > invert" thing at all. That should be just four different (trivial)
> > > > > cases for the first word.
> > > >
> > > > Here's the results for the native version converted to use word loads:
> > > >
> > > > [ 37.319937]
> > > > Start testing find_bit() with random-filled bitmap
> > > > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > > > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > > > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > > > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > > > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > > > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > > > [ 37.752143]
> > > > Start testing find_bit() with sparse bitmap
> > > > [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> > > > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > > > [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> > > > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> > > > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > > > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
> > > >
> > > > which is generally faster than the generic version, with the exception
> > > > of the sparse find_first_bit (generic was:
> > > > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
> > > >
> > > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > > the generic code.
> > >
> > > Look at find_{first,next}_and_bit results. Those two have no arch version
> > > and in both cases use generic code. In theory they should be equally fast
> > > before and after, but your testing says that generic case is slower even
> > > for them, and the difference is comparable with real arch functions numbers.
> > > It makes me feel like:
> > > - there's something unrelated, like governor/throttling that affect results;
> > > - the numbers are identical, taking the dispersion into account.
> > >
> > > If the difference really concerns you, I'd suggest running the test
> > > several times
> > > to measure confidence intervals.
> >
> > Given that the benchmark is run against random bitmaps and with
> > interrupts enabled, there is going to be noise in the results.
> >
> > Here's the second run:
> >
> > [26234.429389]
> > Start testing find_bit() with random-filled bitmap
> > [26234.439722] find_next_bit: 2206687 ns, 164277 iterations
> > [26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
> > [26234.457612] find_last_bit: 2223742 ns, 164278 iterations
> > [26234.482056] find_first_bit: 17720726 ns, 16384 iterations
> > [26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
> > [26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
> > [26234.874107]
> > Start testing find_bit() with sparse bitmap
> > [26234.881014] find_next_bit: 46142 ns, 656 iterations
> > [26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
> > [26234.898672] find_last_bit: 49727 ns, 656 iterations
> > [26234.912504] find_first_bit: 7107862 ns, 656 iterations
> > [26234.919290] find_first_and_bit: 52092 ns, 1 iterations
> > [26234.926076] find_next_and_bit: 60856 ns, 1 iterations
> >
> > And a third run:
> >
> > [26459.679524]
> > Start testing find_bit() with random-filled bitmap
> > [26459.689871] find_next_bit: 2199418 ns, 163311 iterations
> > [26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
> > [26459.707738] find_last_bit: 2213638 ns, 163311 iterations
> > [26459.732224] find_first_bit: 17764152 ns, 16429 iterations
> > [26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
> > [26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
> > [26460.148545]
> > Start testing find_bit() with sparse bitmap
> > [26460.155433] find_next_bit: 40753 ns, 653 iterations
> > [26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
> > [26460.173078] find_last_bit: 50017 ns, 653 iterations
> > [26460.187007] find_first_bit: 7205325 ns, 653 iterations
> > [26460.193790] find_first_and_bit: 49358 ns, 1 iterations
> > [26460.200577] find_next_and_bit: 62332 ns, 1 iterations
> >
> > My gut feeling is that yes, there is some variance, but not on an
> > order that is significant that would allow us to say "there's no
> > difference".
> >
> > find_next_bit results for random are: 2222703, 2206687, 2199418,
> > which is an average of 2209603 and a variance of around 0.5%.
> > The difference between this and the single generic figure I have
> > is on the order of 20%.
> >
> > I'll do the same with find_first_bit for random: 17722203, 17720726,
> > and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> > The difference between this and the single generic figure I have is
> > on the order of 5%. Not so large, but still quite a big difference
> > compared to the variance.
> >
> > find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> > 7291853. Variance is higher at about 4%. Difference between this and
> > the generic figure is 0.5%, so this one is not significantly
> > different.
> >
> > The best result looks to be find_next_zero_bit for the sparse bitmap
> > case. The generic code measures 5.5ms, the native code is sitting
> > around 4.1ms. That's a difference of around 34%, and by just looking
> > at the range in the figures above we can see this is a significant
> > result without needing to do the calculations. Similar is true of
> > find_next_bit for the sparse bitmap.
> >
> > So, I think the results are significant in most cases and variance
> > doesn't account for the differences. The only one which isn't is
> > find_first_bit for the sparse case.
>
> Hi Russel,
>
> + Alexey Klimov <[email protected]>
>
> This is my testings for native vs generic find_bit operations on a15
> and 17.
>
> The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu
> frequencies were fixed at 1000Mhz. (Thanks a lot!)
>
> For each native/generic @ a15/a7 configuration, the find_bit_benchmark
> was run 5 times, and the results are summarized below:
>
> A15 Native Generic Difference
> Dense ns ns % sigmas
> find_next_bit: 3746929 3231641 14 8.3
> find_next_zero_bit: 3935354 3202608 19 10.4
> find_last_bit: 3134713 3129717 0 0.1
> find_first_bit: 85626542 20498669 76 172.4
> find_first_and_bit: 409252997 414820417 -1 -0.2
> find_next_and_bit: 1678706 1654420 1 0.4
>
> Sparse
> find_next_bit: 143208 77924 46 29.4
> find_next_zero_bit: 6893375 6059177 12 14.3
> find_last_bit: 67174 68616 -2 -1.0
> find_first_bit: 33689256 8151493 76 47.8
> find_first_and_bit: 124758 156974 -26 -1.3
> find_next_and_bit: 53391 56716 -6 -0.2
>
> A7 Native Generic Difference
> Dense ns ns % sigmas
> find_next_bit: 4207627 5532764 -31 -14.9
> find_next_zero_bit: 4259961 5236880 -23 -10.0
> find_last_bit: 4281386 4201025 2 1.5
> find_first_bit: 236913620 50970424 78 163.3
> find_first_and_bit: 728069762 750580781 -3 -0.7
> find_next_and_bit: 2696263 2766077 -3 -0.9
>
> Sparse
> find_next_bit: 327241 143776 56 40.7
> find_next_zero_bit: 6987249 10235989 -46 -21.8
> find_last_bit: 97758 94725 3 0.6
> find_first_bit: 94628040 21051964 78 41.8
> find_first_and_bit: 248133 241267 3 0.3
> find_next_and_bit: 136475 154000 -13 -0.5
>
> The last column is the difference between native and generic code
> performance normalized to a standard deviation:
> (mean(native) - mean(generic)) / max(std(native), std(generic))
>
> The results look consistent to me because 'and' subtests that are always
> generic differ by less than one sigma.
>
> On A15 generic code is a clear winner. On A7 results are inconsistent
> although significant. Maybe it's worth to retest on A7.
>
> Regarding the choice between native and generic core - I would prefer
> generic version even if it's slightly slower because it is tested and
> maintained better. And because the results of the test are at least on
> par, to me it's a no-brainer.
>
> Would be really interesting to compare performance of your LDRB->LDR
> patch with the generic code using the same procedure.
>
> Thanks,
> Yury
>

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-08-01 16:33:38

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: Linux 5.19-rc8

On Mon, Aug 01, 2022 at 04:48:21PM +0100, Russell King (Oracle) wrote:
> Oh FFS.
>
> I see you decided off your own back to remove the ARM version of the
> find_bit functions, with NO agreement from the arch maintainer. This
> is not on.

Sorry, my mistake - I'm getting confused by git over conflicts that
aren't making much sense.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!