2022-10-20 03:50:34

by Yury Norov

Subject: [RFC PATCH 0/2] Switch ARM to generic find_bit() API

Hi Russell, all,

I'd like to respin a patch that switches ARM to generic find_bit()
functions.

Generic code works on par with arch or better, according to my
testing [1], and with recent improvements merged in v6.1, it should
be even faster.

ARM already uses many generic find_bit() functions - those that it
doesn't implement. So we are talking about migrating a subset of the
API; most of the find_bit() family has only a generic implementation
on ARM.

The only concern about this migration is that the ARM code supports
byte-aligned bitmap addresses, while the generic code is optimized for
word-aligned bitmaps.

In practice, I've never seen unaligned bitmaps. But to check that on
ARM, I added a run-time check for bitmap alignment. I gave it a run on
several architectures and found nothing.

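For reference, the run-time check boils down to a single alignment
test. Below is a stand-alone userspace sketch of it (IS_ALIGNED() is
copied out of the kernel's include/linux/align.h only so the example
builds on its own):

#include <stdint.h>
#include <stdio.h>

/* Local copy of the kernel's IS_ALIGNED(), for illustration only. */
#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)

int main(void)
{
	unsigned long words[2] = { 0, 0 };
	unsigned long *aligned = words;
	/* The kind of pointer the check is meant to catch: a char *
	 * cast to unsigned long * with no alignment guarantee. */
	unsigned long *unaligned = (unsigned long *)((char *)words + 1);

	printf("aligned: %d, unaligned: %d\n",
	       IS_ALIGNED((uintptr_t)aligned, sizeof(unsigned long)),
	       IS_ALIGNED((uintptr_t)unaligned, sizeof(unsigned long)));
	return 0;
}
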
Can you please check that on your hardware and compare the performance
of the generic vs arch code? If everything is OK, I suggest switching
ARM to generic find_bit() completely.

Thanks,
Yury

[1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Yury Norov (2):
bitmap: add sanity check function for find_bit()
arm: drop arch implementation for find_bit() functions

 arch/arm/include/asm/bitops.h |  63 -----------
 arch/arm/kernel/armksyms.c    |  11 --
 arch/arm/lib/Makefile         |   2 +-
 arch/arm/lib/findbit.S        | 193 ----------------------------------
 include/linux/find.h          |  35 ++++++
 lib/Kconfig.debug             |   7 ++
 6 files changed, 43 insertions(+), 268 deletions(-)
delete mode 100644 arch/arm/lib/findbit.S

--
2.34.1


2022-10-20 04:39:03

by Yury Norov

Subject: [PATCH 1/2] bitmap: add sanity check function for find_bit()

find_bit() requires a pointer aligned to the size of unsigned long.
However, some subsystems (fs, for example) cast char* variables to
unsigned long* before passing them to find_bit(). Many architectures
allow unaligned pointers at the cost of performance degradation.

This patch adds a runtime check that the pointers are aligned.

Signed-off-by: Yury Norov <[email protected]>
---
 include/linux/find.h | 35 +++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug    |  7 +++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..2d8f5419d787 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -7,6 +7,7 @@
#endif

#include <linux/bitops.h>
+#include <linux/bug.h>

unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
unsigned long start);
@@ -35,6 +36,14 @@ unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
long size, unsigned long offset);
#endif

+static __always_inline
+void check_find_bit(const unsigned long *addr)
+{
+#ifdef CONFIG_DEBUG_BITMAP
+ WARN_ON_ONCE(!IS_ALIGNED((unsigned long)addr, sizeof(unsigned long)));
+#endif
+}
+
#ifndef find_next_bit
/**
* find_next_bit - find the next set bit in a memory region
@@ -49,6 +58,8 @@ static inline
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val;

@@ -79,6 +90,9 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size,
unsigned long offset)
{
+ check_find_bit(addr1);
+ check_find_bit(addr2);
+
if (small_const_nbits(size)) {
unsigned long val;

@@ -138,6 +152,8 @@ static inline
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val;

@@ -164,6 +180,8 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
static inline
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val = *addr & GENMASK(size - 1, 0);

@@ -270,6 +288,9 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2,
unsigned long size)
{
+ check_find_bit(addr1);
+ check_find_bit(addr2);
+
if (small_const_nbits(size)) {
unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);

@@ -292,6 +313,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
static inline
unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val = *addr | ~GENMASK(size - 1, 0);

@@ -313,6 +336,8 @@ unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
static inline
unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val = *addr & GENMASK(size - 1, 0);

@@ -417,18 +442,24 @@ extern unsigned long find_next_clump8(unsigned long *clump,
static inline unsigned long find_next_zero_bit_le(const void *addr,
unsigned long size, unsigned long offset)
{
+ check_find_bit(addr);
+
return find_next_zero_bit(addr, size, offset);
}

static inline unsigned long find_next_bit_le(const void *addr,
unsigned long size, unsigned long offset)
{
+ check_find_bit(addr);
+
return find_next_bit(addr, size, offset);
}

static inline unsigned long find_first_zero_bit_le(const void *addr,
unsigned long size)
{
+ check_find_bit(addr);
+
return find_first_zero_bit(addr, size);
}

@@ -439,6 +470,8 @@ static inline
unsigned long find_next_zero_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val = *(const unsigned long *)addr;

@@ -472,6 +505,8 @@ static inline
unsigned long find_next_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
+ check_find_bit(addr);
+
if (small_const_nbits(size)) {
unsigned long val = *(const unsigned long *)addr;

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3fc7abffc7aa..1c7dcd33fc2a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -543,6 +543,13 @@ endmenu # "Compiler options"

menu "Generic Kernel Debugging Instruments"

+config DEBUG_BITMAP
+ bool "Debug bitmaps"
+ help
+ Say Y here if you want bitmap function parameters to be checked
+ at runtime. Enable CONFIG_DEBUG_BITMAP only for debugging, because
+ it may affect performance.
+
config MAGIC_SYSRQ
bool "Magic SysRq key"
depends on !UML
--
2.34.1

2022-10-20 04:39:34

by Yury Norov

Subject: [PATCH 2/2] arm: drop arch implementation for find_bit() functions

Generic code works on par with arch or better, according to my
testing, and with recent improvements merged in v6.1, it should
be even faster.

ARM already uses many generic find_bit() functions - those that it
doesn't implement. So we are talking about migrating a subset of the
API; most of the find_bit() family has only a generic implementation
on ARM.

This patch switches ARM to generic find_bit() entirely.

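For context, here is a simplified sketch of the shape of the generic
word-at-a-time code that ARM switches to (not the exact lib/find_bit.c,
which is macro-generated in v6.1):

unsigned long _find_next_bit(const unsigned long *addr,
			     unsigned long nbits, unsigned long start)
{
	unsigned long val;

	if (start >= nbits)
		return nbits;

	/* First word: mask off the bits below 'start'. */
	val = addr[start / BITS_PER_LONG] & (~0UL << (start % BITS_PER_LONG));

	/* Then scan one whole word per iteration. */
	while (!val) {
		start = (start | (BITS_PER_LONG - 1)) + 1;
		if (start >= nbits)
			return nbits;
		val = addr[start / BITS_PER_LONG];
	}

	start = start - (start % BITS_PER_LONG) + __ffs(val);
	return start < nbits ? start : nbits;
}
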
CC: Guenter Roeck <[email protected]>
CC: Dennis Zhou <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: [email protected]
Reported-by: Guenter Roeck <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
 arch/arm/include/asm/bitops.h |  63 -----------
 arch/arm/kernel/armksyms.c    |  11 --
 arch/arm/lib/Makefile         |   2 +-
 arch/arm/lib/findbit.S        | 193 ----------------------------------
 4 files changed, 1 insertion(+), 268 deletions(-)
delete mode 100644 arch/arm/lib/findbit.S

diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
index 714440fa2fc6..23a7cde3422e 100644
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -157,24 +157,6 @@ extern int _test_and_set_bit(int nr, volatile unsigned long * p);
extern int _test_and_clear_bit(int nr, volatile unsigned long * p);
extern int _test_and_change_bit(int nr, volatile unsigned long * p);

-/*
- * Little endian assembly bitops. nr = 0 -> byte 0 bit 0.
- */
-unsigned long _find_first_zero_bit_le(const unsigned long *p, unsigned long size);
-unsigned long _find_next_zero_bit_le(const unsigned long *p,
- unsigned long size, unsigned long offset);
-unsigned long _find_first_bit_le(const unsigned long *p, unsigned long size);
-unsigned long _find_next_bit_le(const unsigned long *p, unsigned long size, unsigned long offset);
-
-/*
- * Big endian assembly bitops. nr = 0 -> byte 3 bit 0.
- */
-unsigned long _find_first_zero_bit_be(const unsigned long *p, unsigned long size);
-unsigned long _find_next_zero_bit_be(const unsigned long *p,
- unsigned long size, unsigned long offset);
-unsigned long _find_first_bit_be(const unsigned long *p, unsigned long size);
-unsigned long _find_next_bit_be(const unsigned long *p, unsigned long size, unsigned long offset);
-
#ifndef CONFIG_SMP
/*
* The __* form of bitops are non-atomic and may be reordered.
@@ -195,26 +177,6 @@ unsigned long _find_next_bit_be(const unsigned long *p, unsigned long size, unsi
#define test_and_clear_bit(nr,p) ATOMIC_BITOP(test_and_clear_bit,nr,p)
#define test_and_change_bit(nr,p) ATOMIC_BITOP(test_and_change_bit,nr,p)

-#ifndef __ARMEB__
-/*
- * These are the little endian, atomic definitions.
- */
-#define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz)
-#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off)
-#define find_first_bit(p,sz) _find_first_bit_le(p,sz)
-#define find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off)
-
-#else
-/*
- * These are the big endian, atomic definitions.
- */
-#define find_first_zero_bit(p,sz) _find_first_zero_bit_be(p,sz)
-#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off)
-#define find_first_bit(p,sz) _find_first_bit_be(p,sz)
-#define find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off)
-
-#endif
-
#if __LINUX_ARM_ARCH__ < 5

#include <asm-generic/bitops/__fls.h>
@@ -237,35 +199,10 @@ unsigned long _find_next_bit_be(const unsigned long *p, unsigned long size, unsi
#endif

#include <asm-generic/bitops/ffz.h>
-
#include <asm-generic/bitops/fls64.h>
-
#include <asm-generic/bitops/sched.h>
#include <asm-generic/bitops/hweight.h>
#include <asm-generic/bitops/lock.h>
-
-#ifdef __ARMEB__
-
-static inline int find_first_zero_bit_le(const void *p, unsigned size)
-{
- return _find_first_zero_bit_le(p, size);
-}
-#define find_first_zero_bit_le find_first_zero_bit_le
-
-static inline int find_next_zero_bit_le(const void *p, int size, int offset)
-{
- return _find_next_zero_bit_le(p, size, offset);
-}
-#define find_next_zero_bit_le find_next_zero_bit_le
-
-static inline int find_next_bit_le(const void *p, int size, int offset)
-{
- return _find_next_bit_le(p, size, offset);
-}
-#define find_next_bit_le find_next_bit_le
-
-#endif
-
#include <asm-generic/bitops/le.h>

/*
diff --git a/arch/arm/kernel/armksyms.c b/arch/arm/kernel/armksyms.c
index 82e96ac83684..10130987d388 100644
--- a/arch/arm/kernel/armksyms.c
+++ b/arch/arm/kernel/armksyms.c
@@ -150,17 +150,6 @@ EXPORT_SYMBOL(_clear_bit);
EXPORT_SYMBOL(_test_and_clear_bit);
EXPORT_SYMBOL(_change_bit);
EXPORT_SYMBOL(_test_and_change_bit);
-EXPORT_SYMBOL(_find_first_zero_bit_le);
-EXPORT_SYMBOL(_find_next_zero_bit_le);
-EXPORT_SYMBOL(_find_first_bit_le);
-EXPORT_SYMBOL(_find_next_bit_le);
-
-#ifdef __ARMEB__
-EXPORT_SYMBOL(_find_first_zero_bit_be);
-EXPORT_SYMBOL(_find_next_zero_bit_be);
-EXPORT_SYMBOL(_find_first_bit_be);
-EXPORT_SYMBOL(_find_next_bit_be);
-#endif

#ifdef CONFIG_FUNCTION_TRACER
EXPORT_SYMBOL(__gnu_mcount_nc);
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 6d2ba454f25b..8b152b1a3014 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -7,7 +7,7 @@

lib-y := changebit.o csumipv6.o csumpartial.o \
csumpartialcopy.o csumpartialcopyuser.o clearbit.o \
- delay.o delay-loop.o findbit.o memchr.o memcpy.o \
+ delay.o delay-loop.o memchr.o memcpy.o \
memmove.o memset.o setbit.o \
strchr.o strrchr.o \
testchangebit.o testclearbit.o testsetbit.o \
diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
deleted file mode 100644
index 7fd3600db8ef..000000000000
--- a/arch/arm/lib/findbit.S
+++ /dev/null
@@ -1,193 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * linux/arch/arm/lib/findbit.S
- *
- * Copyright (C) 1995-2000 Russell King
- *
- * 16th March 2001 - John Ripley <[email protected]>
- * Fixed so that "size" is an exclusive not an inclusive quantity.
- * All users of these functions expect exclusive sizes, and may
- * also call with zero size.
- * Reworked by rmk.
- */
-#include <linux/linkage.h>
-#include <asm/assembler.h>
- .text
-
-/*
- * Purpose : Find a 'zero' bit
- * Prototype: int find_first_zero_bit(void *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_zero_bit_le)
- teq r1, #0
- beq 3f
- mov r2, #0
-1:
- ARM( ldrb r3, [r0, r2, lsr #3] )
- THUMB( lsr r3, r2, #3 )
- THUMB( ldrb r3, [r0, r3] )
- eors r3, r3, #0xff @ invert bits
- bne .L_found @ any now set - found zero bit
- add r2, r2, #8 @ next bit pointer
-2: cmp r2, r1 @ any more?
- blo 1b
-3: mov r0, r1 @ no free bits
- ret lr
-ENDPROC(_find_first_zero_bit_le)
-
-/*
- * Purpose : Find next 'zero' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_zero_bit_le)
- cmp r2, r1
- bhs 3b
- ands ip, r2, #7
- beq 1b @ If new byte, goto old routine
- ARM( ldrb r3, [r0, r2, lsr #3] )
- THUMB( lsr r3, r2, #3 )
- THUMB( ldrb r3, [r0, r3] )
- eor r3, r3, #0xff @ now looking for a 1 bit
- movs r3, r3, lsr ip @ shift off unused bits
- bne .L_found
- orr r2, r2, #7 @ if zero, then no bits here
- add r2, r2, #1 @ align bit pointer
- b 2b @ loop for next bit
-ENDPROC(_find_next_zero_bit_le)
-
-/*
- * Purpose : Find a 'one' bit
- * Prototype: int find_first_bit(const unsigned long *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_bit_le)
- teq r1, #0
- beq 3f
- mov r2, #0
-1:
- ARM( ldrb r3, [r0, r2, lsr #3] )
- THUMB( lsr r3, r2, #3 )
- THUMB( ldrb r3, [r0, r3] )
- movs r3, r3
- bne .L_found @ any now set - found zero bit
- add r2, r2, #8 @ next bit pointer
-2: cmp r2, r1 @ any more?
- blo 1b
-3: mov r0, r1 @ no free bits
- ret lr
-ENDPROC(_find_first_bit_le)
-
-/*
- * Purpose : Find next 'one' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_bit_le)
- cmp r2, r1
- bhs 3b
- ands ip, r2, #7
- beq 1b @ If new byte, goto old routine
- ARM( ldrb r3, [r0, r2, lsr #3] )
- THUMB( lsr r3, r2, #3 )
- THUMB( ldrb r3, [r0, r3] )
- movs r3, r3, lsr ip @ shift off unused bits
- bne .L_found
- orr r2, r2, #7 @ if zero, then no bits here
- add r2, r2, #1 @ align bit pointer
- b 2b @ loop for next bit
-ENDPROC(_find_next_bit_le)
-
-#ifdef __ARMEB__
-
-ENTRY(_find_first_zero_bit_be)
- teq r1, #0
- beq 3f
- mov r2, #0
-1: eor r3, r2, #0x18 @ big endian byte ordering
- ARM( ldrb r3, [r0, r3, lsr #3] )
- THUMB( lsr r3, #3 )
- THUMB( ldrb r3, [r0, r3] )
- eors r3, r3, #0xff @ invert bits
- bne .L_found @ any now set - found zero bit
- add r2, r2, #8 @ next bit pointer
-2: cmp r2, r1 @ any more?
- blo 1b
-3: mov r0, r1 @ no free bits
- ret lr
-ENDPROC(_find_first_zero_bit_be)
-
-ENTRY(_find_next_zero_bit_be)
- cmp r2, r1
- bhs 3b
- ands ip, r2, #7
- beq 1b @ If new byte, goto old routine
- eor r3, r2, #0x18 @ big endian byte ordering
- ARM( ldrb r3, [r0, r3, lsr #3] )
- THUMB( lsr r3, #3 )
- THUMB( ldrb r3, [r0, r3] )
- eor r3, r3, #0xff @ now looking for a 1 bit
- movs r3, r3, lsr ip @ shift off unused bits
- bne .L_found
- orr r2, r2, #7 @ if zero, then no bits here
- add r2, r2, #1 @ align bit pointer
- b 2b @ loop for next bit
-ENDPROC(_find_next_zero_bit_be)
-
-ENTRY(_find_first_bit_be)
- teq r1, #0
- beq 3f
- mov r2, #0
-1: eor r3, r2, #0x18 @ big endian byte ordering
- ARM( ldrb r3, [r0, r3, lsr #3] )
- THUMB( lsr r3, #3 )
- THUMB( ldrb r3, [r0, r3] )
- movs r3, r3
- bne .L_found @ any now set - found zero bit
- add r2, r2, #8 @ next bit pointer
-2: cmp r2, r1 @ any more?
- blo 1b
-3: mov r0, r1 @ no free bits
- ret lr
-ENDPROC(_find_first_bit_be)
-
-ENTRY(_find_next_bit_be)
- cmp r2, r1
- bhs 3b
- ands ip, r2, #7
- beq 1b @ If new byte, goto old routine
- eor r3, r2, #0x18 @ big endian byte ordering
- ARM( ldrb r3, [r0, r3, lsr #3] )
- THUMB( lsr r3, #3 )
- THUMB( ldrb r3, [r0, r3] )
- movs r3, r3, lsr ip @ shift off unused bits
- bne .L_found
- orr r2, r2, #7 @ if zero, then no bits here
- add r2, r2, #1 @ align bit pointer
- b 2b @ loop for next bit
-ENDPROC(_find_next_bit_be)
-
-#endif
-
-/*
- * One or more bits in the LSB of r3 are assumed to be set.
- */
-.L_found:
-#if __LINUX_ARM_ARCH__ >= 5
- rsb r0, r3, #0
- and r3, r3, r0
- clz r3, r3
- rsb r3, r3, #31
- add r0, r2, r3
-#else
- tst r3, #0x0f
- addeq r2, r2, #4
- movne r3, r3, lsl #4
- tst r3, #0x30
- addeq r2, r2, #2
- movne r3, r3, lsl #2
- tst r3, #0x40
- addeq r2, r2, #1
- mov r0, r2
-#endif
- cmp r1, r0 @ Clamp to maxbit
- movlo r0, r1
- ret lr
-
--
2.34.1

2022-10-20 17:00:17

by Russell King (Oracle)

Subject: Re: [RFC PATCH 0/2] Switch ARM to generic find_bit() API

On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> Hi Russell, all,
>
> I'd like to respin a patch that switches ARM to generic find_bit()
> functions.
>
> Generic code works on par with arch or better, according to my
> testing [1], and with recent improvements merged in v6.1, it should
> be even faster.
>
> ARM already uses many generic find_bit() functions - those that it
> doesn't implement. So we are talking about migrating a subset of the
> API; most of the find_bit() family has only a generic implementation
> on ARM.
>
> The only concern about this migration is that the ARM code supports
> byte-aligned bitmap addresses, while the generic code is optimized for
> word-aligned bitmaps.
>
> In practice, I've never seen unaligned bitmaps. But to check that on
> ARM, I added a run-time check for bitmap alignment. I gave it a run on
> several architectures and found nothing.
>
> Can you please check that on your hardware and compare the performance
> of the generic vs arch code? If everything is OK, I suggest switching
> ARM to generic find_bit() completely.
>
> Thanks,
> Yury
>
> [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

I _really_ don't want to play around with this stuff right now... 6.0
appears to have a regression on arm32 early on during boot:

[ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0

Booting 5.19 with the same filesystem works without issue and without
even a fsck, but booting 6.0 always results in some problem that
prevents it booting.

Debugging this is not easy, because there also seems to be something
up with the bloody serial console - sometimes I get nothing, other
times I get nothing more than:

[ 2.929502] EXT4-fs error (de

and then the output stops. Is the console no longer synchronous? If it
isn't, that's a huge mistake which can be seen right here with the
partial message output... so I also need to work out how to make the
console output synchronous again.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-10-20 20:23:24

by Yury Norov

Subject: Re: [RFC PATCH 0/2] Switch ARM to generic find_bit() API

On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> > Hi Russell, all,
> >
> > I'd like to respin a patch that switches ARM to generic find_bit()
> > functions.
> >
> > Generic code works on par with arch or better, according to my
> > testing [1], and with recent improvements merged in v6.1, it should
> > be even faster.
> >
> > ARM already uses many generic find_bit() functions - those that it
> > doesn't implement. So we are talking about migrating a subset of the
> > API; most of the find_bit() family has only a generic implementation
> > on ARM.
> >
> > The only concern about this migration is that the ARM code supports
> > byte-aligned bitmap addresses, while the generic code is optimized for
> > word-aligned bitmaps.
> >
> > In practice, I've never seen unaligned bitmaps. But to check that on
> > ARM, I added a run-time check for bitmap alignment. I gave it a run on
> > several architectures and found nothing.
> >
> > Can you please check that on your hardware and compare the performance
> > of the generic vs arch code? If everything is OK, I suggest switching
> > ARM to generic find_bit() completely.
> >
> > Thanks,
> > Yury
> >
> > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/
>
> I _really_ don't want to play around with this stuff right now... 6.0
> appears to have a regression on arm32 early on during boot:
>
> [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0
>
> Booting 5.19 with the same filesystem works without issue and without
> even a fsck, but booting 6.0 always results in some problem that
> prevents it booting.
>
> Debugging this is not easy, because there also seems to be something
> up with the bloody serial console - sometimes I get nothing, other
> times I get nothing more than:
>
> [ 2.929502] EXT4-fs error (de
>
> and then the output stops. Is the console no longer synchronous? If it
> isn't, that's a huge mistake which can be seen right here with the
> partial message output... so I also need to work out how to make the
> console output synchronous again.

Got it.

If you think that the EXT4 problems are due to unaligned bitmaps, you can
take the 1st patch from this series to check.

Thanks,
Yury

2022-10-20 23:20:45

by Russell King (Oracle)

Subject: Re: [RFC PATCH 0/2] Switch ARM to generic find_bit() API

On Thu, Oct 20, 2022 at 01:02:07PM -0700, Yury Norov wrote:
> On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> > > Hi Russell, all,
> > >
> > > I'd like to respin a patch that switches ARM to generic find_bit()
> > > functions.
> > >
> > > Generic code works on par with arch or better, according to my
> > > testing [1], and with recent improvements merged in v6.1, it should
> > > be even faster.
> > >
> > > ARM already uses many generic find_bit() functions - those that it
> > > doesn't implement. So we are talking about migrating a subset of the
> > > API; most of the find_bit() family has only a generic implementation
> > > on ARM.
> > >
> > > The only concern about this migration is that the ARM code supports
> > > byte-aligned bitmap addresses, while the generic code is optimized for
> > > word-aligned bitmaps.
> > >
> > > In practice, I've never seen unaligned bitmaps. But to check that on
> > > ARM, I added a run-time check for bitmap alignment. I gave it a run on
> > > several architectures and found nothing.
> > >
> > > Can you please check that on your hardware and compare the performance
> > > of the generic vs arch code? If everything is OK, I suggest switching
> > > ARM to generic find_bit() completely.
> > >
> > > Thanks,
> > > Yury
> > >
> > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/
> >
> > I _really_ don't want to play around with this stuff right now... 6.0
> > appears to have a regression on arm32 early on during boot:
> >
> > [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0
> >
> > Booting 5.19 with the same filesystem works without issue and without
> > even a fsck, but booting 6.0 always results in some problem that
> > prevents it booting.
> >
> > Debugging this is not easy, because there also seems to be something
> > up with the bloody serial console - sometimes I get nothing, other
> > times I get nothing more than:
> >
> > [ 2.929502] EXT4-fs error (de
> >
> > and then the output stops. Is the console no longer synchronous? If it
> > isn't, that's a huge mistake which can be seen right here with the
> > partial message output... so I also need to work out how to make the
> > console output synchronous again.
>
> Got it.
>
> If you think that the EXT4 problems are due to unaligned bitmaps, you can
> take the 1st patch from this series to check.

Got to the bottom of it, it wasn't the bit array functions, it was DMA
API issues.

Okay, I've now tested the generic ops vs my updated optimised ops,
and my version still comes out faster (based on three runs). The
random-filled bitmaps show less difference, but the sparse bitmaps show a
much better win for my optimised code over the generic code where
they exist:

arm: [ 694.614773] find_next_bit: 40078 ns, 656 iterations
gen: [ 88.611973] find_next_bit: 69943 ns, 655 iterations
arm: [ 694.625436] find_next_zero_bit: 3939309 ns, 327025 iterations
gen: [ 88.624227] find_next_zero_bit: 5529553 ns, 327026 iterations
arm: [ 694.646236] find_first_bit: 7301363 ns, 656 iterations
gen: [ 88.645313] find_first_bit: 7589120 ns, 655 iterations

These figures appear to be pretty consistent across the three
runs.

For completeness, here are the random-filled results:

arm: [ 694.109190] find_next_bit: 2242618 ns, 163949 iterations
gen: [ 88.167340] find_next_bit: 2632859 ns, 163743 iterations
arm: [ 694.117968] find_next_zero_bit: 2049129 ns, 163732 iterations
gen: [ 88.176912] find_next_zero_bit: 2844221 ns, 163938 iterations
arm: [ 694.151421] find_first_bit: 17778911 ns, 16448 iterations
gen: [ 88.211167] find_first_bit: 18596622 ns, 16401 iterations

So, I don't see much reason to switch to the generic ops for
these, not when we have such a significant speedup on the
find_next_* functions for sparse-filled results.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-10-23 23:13:00

by Linus Torvalds

Subject: Re: [PATCH 1/2] bitmap: add sanity check function for find_bit()

On Wed, Oct 19, 2022 at 8:24 PM Yury Norov <[email protected]> wrote:
>
> This patch adds a runtime check that the pointers are aligned.

No. Don't add pointless things like this. It only adds code, with no advantage.

The bitmap ops all operate on 'unsigned long', and if a bitmap isn't
aligned, we'll take a fault on the architectures that don't do
unaligned accesses natively.

And the find-bit functions simply aren't special enough to have this
kind of random testing, when the *basic* bitmap functions like
"set_bit()" and friends all do the accesses without any alignment
checks.

The fact that filesystem code often uses bitmap functions with a cast
from 'char *' is immaterial. Those things are already aligned
(typically they are a whole disk block). They just weren't an array of
'unsigned long'.

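A sketch of that pattern (a hypothetical helper for illustration, not a
quote of any real filesystem code):

static unsigned long first_free_in_block(struct super_block *sb,
					 sector_t block)
{
	struct buffer_head *bh = sb_bread(sb, block);
	unsigned long bit;

	if (!bh)
		return 0;	/* error handling elided in this sketch */

	/* bh->b_data is a block-sized, block-aligned buffer; only the
	 * pointer type needs the cast. */
	bit = find_next_zero_bit((unsigned long *)bh->b_data,
				 sb->s_blocksize * 8, 0);
	brelse(bh);
	return bit;
}
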
Linus

2022-10-25 17:24:17

by Yury Norov

Subject: Re: [PATCH 1/2] bitmap: add sanity check function for find_bit()

On Sun, Oct 23, 2022 at 03:19:24PM -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 8:24 PM Yury Norov <[email protected]> wrote:
> >
> > This patch adds a runtime check that the pointers are aligned.
>
> No. Don't add pointless things like this. It only adds code, with no advantage.

Sure. Patch #1 is mostly for Russell to address his concern about
unaligned bitmaps on ARM32. And it looks like it found nothing.

> The bitmap ops all operate on 'unsigned long', and if a bitmap isn't
> aligned, we'll take a fault on the architectures that don't do
> unaligned accesses natively.

ARMv6 may or may not support unaligned access depending on the SCTLR.U
bit. This is what Russell was concerned about in the other email.
As far as I understand, Linux enables that feature.

ARMv7 deprecates that bit and supports unaligned dereference
unconditionally, with a few exceptions such as exclusive access.

https://developer.arm.com/documentation/ddi0406/b/Appendices/ARMv6-Differences/Application-level-memory-support/Alignment?lang=en

Thanks,
Yury

> And the find-bit functions simply aren't special enough to have this
> kind of random testing, when the *basic* bitmap functions like
> "set_bit()" and friends all do the accesses without any alignment
> checks.
>
> The fact that filesystem code often uses bitmap functions with a cast
> from 'char *' is immaterial. Those things are already aligned
> (typically they are a whole disk block). They just weren't an array of
> 'unsigned long'.
>
> Linus


2022-10-25 18:55:19

by Russell King (Oracle)

Subject: Re: [PATCH 1/2] bitmap: add sanity check function for find_bit()

On Tue, Oct 25, 2022 at 10:11:51AM -0700, Yury Norov wrote:
> ARMv6 may or may not support unaligned access depending on the SCTLR.U
> bit. This is what Russell was concerned about in the other email.
> As far as I understand, Linux enables that feature.

However, we still support ARMv5 and ARMv4, both of which _trap_ every
unaligned access, which will make a findbit call with an unaligned
pointer using word loads painfully expensive. This is the main reason
we haven't used word loads in the findbit ops.

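For illustration, the byte-at-a-time strategy of the old findbit.S looks
roughly like this in C (a sketch, not the actual code): no load is ever
wider than one byte, so pointer alignment cannot matter and nothing traps.

static unsigned long find_first_zero_bit_bytewise(const unsigned char *p,
						  unsigned long nbits)
{
	unsigned long i, bit;

	for (i = 0; i < nbits; i += 8) {
		/* Invert the byte so we can search for a set bit. */
		unsigned char b = (unsigned char)~p[i / 8];

		if (b) {
			bit = i + __builtin_ctz(b);
			return bit < nbits ? bit : nbits; /* clamp to maxbit */
		}
	}
	return nbits;	/* no zero bit found */
}
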
As mentioned, I have patches that do change that (and convert the
thing to use assembly macros to make updates much easier.)

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-10-25 18:58:02

by Linus Torvalds

Subject: Re: [PATCH 1/2] bitmap: add sanity check function for find_bit()

On Tue, Oct 25, 2022 at 11:26 AM Russell King (Oracle)
<[email protected]> wrote:
>
> However, we still support ARMv5 and ARMv4, both of which _trap_ every
> unaligned access, which will make a findbit call with an unaligned
> pointer using word loads painfully expensive. This is the main reason
> we haven't used word loads in the findbit ops.

The findbit ops really shouldn't be a special case, and bitmaps can
never be unaligned.

Just look at what 'test_bit()' does: the non-constant non-instrumented
version ends up as generic_test_bit(), which uses a "const volatile
unsigned long *" access to do the bitmap load.

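That implementation, lightly paraphrased from
include/asm-generic/bitops/generic-non-atomic.h:

static __always_inline bool
generic_test_bit(unsigned long nr, const volatile unsigned long *addr)
{
	/* A whole-word load: on strict-alignment CPUs an unaligned
	 * addr faults right here. */
	return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1)));
}
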
So there is absolutely no way that bitmaps can ever be unaligned,
because that would trap.

And test_bit() is a lot more fundamental than one of the "find bits" functions.

Have we had bugs in this area before? Sure. People have used "unsigned
int" for flags and misused the bitmap ops on it, and it has worked on
x86.

But then it fails *miserably* on big-endian machines and on machines
that require more alignment (and even on x86 we have KASAN failures
etc these days and obviously without casts it will warn), so we've
hopefully fixed all those cases up long long ago.

So I really think it's pointless to worry about alignment for
"find_bit()" and friends, when much more fundamental bitop functions
don't worry about it.

Linus

2022-11-14 12:20:41

by Russell King (Oracle)

Subject: Re: [PATCH 1/2] bitmap: add sanity check function for find_bit()

On Tue, Oct 25, 2022 at 11:38:31AM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 11:26 AM Russell King (Oracle)
> <[email protected]> wrote:
> >
> > However, we still support ARMv5 and ARMv4, both of which _trap_ every
> > unaligned access, which will make a findbit call with an unaligned
> > pointer using word loads painfully expensive. This is the main reason
> > we haven't used word loads in the findbit ops.
>
> The findbit ops really shouldn't be a special case, and bitmaps can
> never be unaligned.
>
> Just look at what 'test_bit()' does: the non-constant non-instrumented
> version ends up as generic_test_bit(), which uses a "const volatile
> unsigned long *" access to do the bitmap load.
>
> So there is absolutely no way that bitmaps can ever be unaligned,
> because that would trap.
>
> And test_bit() is a lot more fundamental than one of the "find bits" functions.
>
> Have we had bugs in this area before? Sure. People have used "unsigned
> int" for flags and mised the bitmap ops on it, and it has worked on
> x86.
>
> But then it fails *miserably* on big-endian machines and on machines
> that require more alignment (and even on x86 we have KASAN failures
> etc these days and obviously without casts it will warn), so we've
> hopefully fixed all those cases up long long ago.
>
> So I really think it's pointless to worry about alignment for
> "find_bit()" and friends, when much more fundamental bitop functions
> don't worry about it.

Yes, which is what my series does by converting to use word operations
and not caring anymore whether the pointer is aligned or not.

My reply was more a correction of the apparent "we don't have to worry
about unaligned accesses because version 6 of the architecture has a
feature that means we don't have to worry" which I regard as broken
thinking, broken as long as we continue to support previous versions
of the architecture.

I'm planning to queue up my series of five patches today, so it should
be in tonight's linux-next.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!