In a recent discussion, it was noticed that the find_next_bit() functions may
be improved by adding wrappers around a common __find_next_bit() in the .c file.
As suggested by Linus, I tried the meta-programming trick with the
EXPRESSION macro, which is passed from the wrappers into the find_bit()
helpers:
#define BIT_FIND_BODY(addr, size, start, EXPRESSION)		\
	BIT_FIND_SETUP(addr, size, start)			\
	BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
	BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)		\
	return size;						\
found:	BIT_WORD_SWAB(val);					\
	return min((idx)*BITS_PER_LONG + __ffs(val), size)
unsigned long _find_next_and_bit(const unsigned long *addr1,
				 const unsigned long *addr2,
				 unsigned long size,
				 unsigned long start)
{ BIT_FIND_BODY(addr1, size, start, addr1[idx] & addr2[idx]); }
I appreciate the potential of the EXPRESSION approach, but I don't like
that the resulting macro is constructed from pieces, because that makes it
harder to understand what happens behind the ifdeffery. Checkpatch isn't
happy either, because the final macro contains a 'return' statement, and I
agree that it's better to avoid that.

I spun the idea one more time, trying to make the FIND helpers more or
less standard-looking macros.
This new approach saves 10-11K of Image size, and is 15% faster in the
performance benchmark. See the 3rd patch for some statistics.
v1: https://lore.kernel.org/all/[email protected]/T/
v2: https://lore.kernel.org/lkml/YwaXvphVpy5A7fSs@yury-laptop/t/
v3:
- add a MUNGE parameter to FIND_{FIRST,NEXT}_BIT and keep all machinery
in lib/find_bit.c;
- rename EXPRESSION to FETCH, and add comments;
- sync tools.
Yury Norov (4):
lib/find_bit: introduce FIND_FIRST_BIT() macro
lib/find_bit: create find_first_zero_bit_le()
lib/find_bit: optimize find_next_bit() functions
tools: sync find_bit() implementation
include/linux/find.h | 46 +++++++---
lib/find_bit.c | 178 ++++++++++++++++++++++---------------
tools/include/linux/find.h | 61 +++----------
tools/lib/find_bit.c | 149 ++++++++++++++-----------------
4 files changed, 220 insertions(+), 214 deletions(-)
--
2.34.1
Over the past couple of years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement the and-, zero- and
le- flavors. The parameters are passed at compile time, but the current
design prevents the compiler from optimizing out the conditionals.

As the find_next_bit() API grows, I expect that more parameters will be
added. The current design would require more conditional code in
_find_next_bit(), which would bloat the helper even more and make it
barely readable.
This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimizations become possible.
The common logic is moved to the new macro, and all flavors may be
generated by providing a FETCH macro parameter, like in this example:
#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...

find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
{
	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
			     /* nop */, size, start);
}
The FETCH may be of any complexity, as long as it only refers to the
bitmap(s) and the iterator idx.

MUNGE is here to support _le code generation for big-endian builds; it
may be empty.
I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of this series. The results for kvm/x86_64 are:
                     v6.0-rc2      Optimized     Difference   Z-score
Random dense bitmap        ns             ns           ns   %
find_next_bit:         787735         670546       117189  14.9    3.97
find_next_zero_bit:    777492         664208       113284  14.6   10.51
find_last_bit:         830925         687573       143352  17.3    2.35
find_first_bit:       3874366        3306635       567731  14.7    1.84
find_first_and_bit:  40677125       37739887      2937238   7.2    1.36
find_next_and_bit:     347865         304456        43409  12.5    1.35

Random sparse bitmap
find_next_bit:          19816          14021         5795  29.2    6.10
find_next_zero_bit:   1318901        1223794        95107   7.2    1.41
find_last_bit:          14573          13514         1059   7.3    6.92
find_first_bit:       1313321        1249024        64297   4.9    1.53
find_first_and_bit:      8921           8098          823   9.2    4.56
find_next_and_bit:       9796           7176         2620  26.7    5.39
Where the statistics are significant (z-score > 3), the improvement
is ~15%.
According to bloat-o-meter, the Image size is 10-11K less:
x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 23 ++++++---
lib/find_bit.c | 119 +++++++++++++++++++++++++------------------
2 files changed, 85 insertions(+), 57 deletions(-)
diff --git a/include/linux/find.h b/include/linux/find.h
index a1861d0ba533..d3b360ba428f 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -8,9 +8,12 @@
#include <linux/bitops.h>
-extern unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+ unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -19,6 +22,10 @@ extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long siz
#ifdef __BIG_ENDIAN
unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset);
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset);
#endif
#ifndef find_next_bit
@@ -45,7 +52,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}
- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, size, offset);
}
#endif
@@ -75,7 +82,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}
- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_and_bit(addr1, addr2, size, offset);
}
#endif
@@ -103,7 +110,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_zero_bit(addr, size, offset);
}
#endif
@@ -251,7 +258,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
return val == ~0UL ? size : ffz(val);
}
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_zero_bit_le(addr, size, offset);
}
#endif
@@ -284,7 +291,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
return val ? __ffs(val) : size;
}
- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit_le(addr, size, offset);
}
#endif
diff --git a/lib/find_bit.c b/lib/find_bit.c
index f4d9b9684811..ba119f69f5ff 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -40,57 +40,33 @@
sz; \
})
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
- !defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
/*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
- * - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
- * - The optional "addr2", which is anded with "addr1" if present.
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
*/
-unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
-{
- unsigned long tmp, mask;
-
- if (unlikely(start >= nbits))
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
-
- /* Handle 1st word. */
- mask = BITMAP_FIRST_WORD_MASK(start);
- if (le)
- mask = swab(mask);
-
- tmp &= mask;
-
- start = round_down(start, BITS_PER_LONG);
-
- while (!tmp) {
- start += BITS_PER_LONG;
- if (start >= nbits)
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
- }
-
- if (le)
- tmp = swab(tmp);
-
- return min(start + __ffs(tmp), nbits);
-}
-EXPORT_SYMBOL(_find_next_bit);
-#endif
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
+({ \
+ unsigned long mask, idx, tmp, sz = (size), __start = (start); \
+ \
+ if (unlikely(__start >= sz)) \
+ goto out; \
+ \
+ mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
+ idx = __start / BITS_PER_LONG; \
+ \
+ for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
+ if (idx > sz / BITS_PER_LONG) \
+ goto out; \
+ idx++; \
+ } \
+ \
+ sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
+out: \
+ sz; \
+})
#ifndef find_first_bit
/*
@@ -127,6 +103,32 @@ unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size
EXPORT_SYMBOL(_find_first_zero_bit);
#endif
+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+ return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_bit);
+#endif
+
+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start)
+{
+ return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_bit);
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start)
+{
+ return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_zero_bit);
+#endif
+
#ifndef find_last_bit
unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
{
@@ -173,4 +175,23 @@ unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long s
EXPORT_SYMBOL(_find_first_zero_bit_le);
#endif
+#ifndef find_next_zero_bit_le
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
+}
+EXPORT_SYMBOL(_find_next_zero_bit_le);
+#endif
+
+#ifndef find_next_bit_le
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ return FIND_NEXT_BIT(addr[idx], swab, size, offset);
+}
+EXPORT_SYMBOL(_find_next_bit_le);
+
+#endif
+
#endif /* __BIG_ENDIAN */
--
2.34.1
On 27/08/22 10:58, Yury Norov wrote:
> +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
> +({ \
> + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> + \
> + if (unlikely(__start >= sz)) \
> + goto out; \
> + \
> + mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
> + idx = __start / BITS_PER_LONG; \
> + \
> + for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
> + if (idx > sz / BITS_PER_LONG) \
Does that want to be
if (idx + 1 >= sz / BITS_PER_LONG)
?
Consider this as used in _find_next_bit() for an all-zero 128-bit-wide
bitmap (two ULLs): provided the memory contiguous to the bitmap is also
zero, this will only stop at idx=3, so that's fetching two ULLs too
far.
> + goto out; \
> + idx++; \
> + } \
> + \
> + sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
> +out: \
> + sz; \
> +})
On Wed, Sep 07, 2022 at 05:27:08PM +0100, Valentin Schneider wrote:
> On 27/08/22 10:58, Yury Norov wrote:
> > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
> > +({ \
> > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> > + \
> > + if (unlikely(__start >= sz)) \
> > + goto out; \
> > + \
> > + mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
> > + idx = __start / BITS_PER_LONG; \
> > + \
> > + for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
> > + if (idx > sz / BITS_PER_LONG) \
>
> Does that want to be
Yes, I already fixed this.
> if (idx + 1 >= sz / BITS_PER_LONG)
>
> ?
>
> Consider this as used in _find_next_bit() for an all-zero 128-bit wide
> bitmap (two ULL's), providing the memory contiguous to the bitmap is also
> zero then this will only stop at idx=3, so that's fetching two ULLs too
> far.
>
> > + goto out; \
> > + idx++; \
> > + } \
> > + \
> > + sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
> > +out: \
> > + sz; \
> > +})
Hi Yury,
Yury Norov <[email protected]> writes:
> On Wed, Sep 07, 2022 at 05:27:08PM +0100, Valentin Schneider wrote:
>> On 27/08/22 10:58, Yury Norov wrote:
>> > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
>> > +({ \
>> > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
>> > + \
>> > + if (unlikely(__start >= sz)) \
>> > + goto out; \
>> > + \
>> > + mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
>> > + idx = __start / BITS_PER_LONG; \
>> > + \
>> > + for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
>> > + if (idx > sz / BITS_PER_LONG) \
>>
>> Does that want to be
>
> Yes, I already fixed this.
>
>> if (idx + 1 >= sz / BITS_PER_LONG)
>>
>> ?
Did you push that already? We're still seeing crashes in CI, and the
'idx + 1' doesn't seem to be in next-20220908. Adding it makes the
out-of-bounds access go away, but the kernel will crash later in the
block-mq code:
[ 0.760803] NET: Registered PF_PACKET protocol family
[    0.760808] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 0.760938] Key type dns_resolver registered
[ 0.760967] cio: Channel measurement facility initialized using format extended (mode autodetected)
[ 0.761789] dasd-eckd 0.0.3850: A channel path to the device has become operational
[ 0.762577] dasd-eckd 0.0.3850: A channel path to the device has become operational
[ 0.772623] dasd-eckd 0.0.3850: New DASD 3390/0C (CU 3990/01) with 30051 cylinders, 15 heads, 224 sectors
[ 0.806105] dasd-eckd 0.0.3850: DASD with 4 KB/block, 21636720 KB total size, 48 KB/track, compatible disk layout
[ 0.806131] Unable to handle kernel pointer dereference in virtual kernel address space
[ 0.806132] Failing address: fffffffffffff000 TEID: fffffffffffff803
[ 0.806134] Fault in home space mode while using kernel ASCE.
[ 0.806137] AS:0000000001f14007 R3:0000000081360007 S:0000000000000020
[ 0.806154] Oops: 0038 ilc:2 [#1] SMP
[ 0.806333] Modules linked in:
[ 0.806337] CPU: 0 PID: 42 Comm: kworker/0:3 Not tainted 6.0.0-rc4-next-20220908-dirty #463
[ 0.806339] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[ 0.806340] Workqueue: events do_kick_device
[ 0.806344] Krnl PSW : 0704c00180000000 0000000000811ee4 (blk_rq_merge_ok+0x1c/0x120)
[ 0.806362] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 0.806368] Krnl GPRS: 8000000000000001 00000000018461e0 ffffffffffffffb8 0000000080ac6200
[ 0.806379] 00000000ffffbfff 00000000fee80000 0000000001fbe4a8 00000000822cbda0
[ 0.806387] 0000000000000008 0000000080ac6200 000003ff7f842108 ffffffffffffffb8
[ 0.806390] 00000000822aa100 00000380003b78d0 00000380003b7590 00000380003b7550
[ 0.806413] Krnl Code: 0000000000811ed4: b90400ef lgr %r14,%r15
[ 0.806413] 0000000000811ed8: e3f0ffc0ff71 lay %r15,-64(%r15)
[ 0.806413] #0000000000811ede: e3e0f0980024 stg %r14,152(%r15)
[ 0.806413] >0000000000811ee4: 58102018 l %r1,24(%r2)
[ 0.806413] 0000000000811ee8: 4210f0a7 stc %r1,167(%r15)
[ 0.806413] 0000000000811eec: ec4138bf0055 risbg %r4,%r1,56,191,0
[ 0.806413] 0000000000811ef2: b90400b2 lgr %r11,%r2
[ 0.806413] 0000000000811ef6: b90400a3 lgr %r10,%r3
[ 0.806427] Call Trace:
[ 0.806431] [<0000000000811ee4>] blk_rq_merge_ok+0x1c/0x120
[ 0.806470] [<0000000000d07a5a>] blk_bio_list_merge+0x7a/0xd0
[ 0.806474] [<0000000000820456>] blk_mq_sched_bio_merge+0xde/0x150
[ 0.806478] [<0000000000815110>] blk_mq_get_new_requests+0x88/0x190
[ 0.806481] [<000000000081a346>] blk_mq_submit_bio+0x286/0x438
[ 0.806484] [<00000000008090d2>] __submit_bio+0x12a/0x1d8
[ 0.806498] [<0000000000809738>] submit_bio_noacct_nocheck+0xa0/0xf8
[ 0.806502] [<000000000047e02e>] submit_bh_wbc+0x16e/0x1b0
[ 0.806505] [<0000000000481384>] block_read_full_folio+0x2ec/0x400
[ 0.806508] [<0000000000348518>] filemap_read_folio+0x38/0xc8
[ 0.806512] [<000000000034998e>] do_read_cache_folio+0x13e/0x1e8
[ 0.806516] [<0000000000349a60>] read_cache_folio+0x28/0x38
[ 0.806520] [<0000000000826d56>] read_part_sector+0x5e/0xe8
[ 0.806524] [<0000000000828ed4>] read_lba+0xb4/0x170
[ 0.806528] [<00000000008294d6>] find_valid_gpt.constprop.0+0xde/0x568
[ 0.806532] [<0000000000829c92>] efi_partition+0x332/0x398
[ 0.806559] [<0000000000826586>] check_partition+0x12e/0x248
[ 0.806567] [<0000000000826a1c>] bdev_disk_changed.part.0+0xcc/0x378
[ 0.806578] [<0000000000ca8016>] dasd_scan_partitions+0x76/0x120
[ 0.806582] [<0000000000ca1018>] dasd_change_state+0x310/0x3a0
[ 0.806584] [<0000000000ca10e8>] do_kick_device+0x40/0xc8
[ 0.806587] [<0000000000174ae0>] process_one_work+0x200/0x458
[ 0.806591] [<000000000017526e>] worker_thread+0x66/0x480
[ 0.806594] [<000000000017cf98>] kthread+0x108/0x110
[ 0.806596] [<0000000000103354>] __ret_from_fork+0x3c/0x58
[ 0.806600] [<0000000000d1eb0a>] ret_from_fork+0xa/0x40
[ 0.806604] Last Breaking-Event-Address:
[ 0.806606] [<0000000000d07a54>] blk_bio_list_merge+0x74/0xd0
[ 0.806610] Kernel panic - not syncing: Fatal exception: panic_on_oops
Reverting the 'lib/find_bit: optimize find_next_bit() functions' patch
'fixes' these issues.
Regards
Sven
On Fri, Sep 09, 2022 at 02:24:31PM +0200, Sven Schnelle wrote:
> Hi Yury,
>
> Yury Norov <[email protected]> writes:
>
> > On Wed, Sep 07, 2022 at 05:27:08PM +0100, Valentin Schneider wrote:
> >> On 27/08/22 10:58, Yury Norov wrote:
> >> > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
> >> > +({ \
> >> > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> >> > + \
> >> > + if (unlikely(__start >= sz)) \
> >> > + goto out; \
> >> > + \
> >> > + mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
> >> > + idx = __start / BITS_PER_LONG; \
> >> > + \
> >> > + for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
> >> > + if (idx > sz / BITS_PER_LONG) \
> >>
> >> Does that want to be
> >
> > Yes, I already fixed this.
> >
> >> if (idx + 1 >= sz / BITS_PER_LONG)
> >>
> >> ?
>
> Did you push that already? We're still seeing crashes in CI, and the
> 'idx + 1' doesn't seem to be in next-20220908. Adding it makes the
> out-of-bounds access go away, but the kernel will crash later in the
> block-mq code:
Hi Sven,

I removed the whole series and will resend it with appropriate fixes
over the weekend. Hopefully it will disappear in next-20220909 or -10.
Thanks,
Yury
On Fri, Sep 09, 2022 at 07:47:10AM -0700, Yury Norov wrote:
> On Fri, Sep 09, 2022 at 02:24:31PM +0200, Sven Schnelle wrote:
...
> I removed the whole series and will resend it with appropriate fixes
> over the weekend. Hopefully it will disappear in next-20220909 or -10.
Usually there are no Linux Next releases on weekends, so it will be
the 20220912 one at best.
--
With Best Regards,
Andy Shevchenko