2019-08-21 08:14:13

by Denis Efremov

[permalink] [raw]
Subject: [PATCH] lib/memweight.c: optimize by inlining bitmap_weight()

This patch inlines bitmap_weight() call. Thus, removing the BUG_ON,
and 'longs to bits -> bits to longs' conversion by directly calling
hweight_long().

./scripts/bloat-o-meter lib/memweight.o.old lib/memweight.o.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
Function old new delta
memweight 162 152 -10

Co-developed-by: Erdem Tumurov <[email protected]>
Co-developed-by: Vladimir Shelekhov <[email protected]>
Signed-off-by: Erdem Tumurov <[email protected]>
Signed-off-by: Vladimir Shelekhov <[email protected]>
Signed-off-by: Denis Efremov <[email protected]>
---
lib/memweight.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/memweight.c b/lib/memweight.c
index 94dd72ccaa7f..f050b2b4c5e2 100644
--- a/lib/memweight.c
+++ b/lib/memweight.c
@@ -20,11 +20,13 @@ size_t memweight(const void *ptr, size_t bytes)

longs = bytes / sizeof(long);
if (longs) {
- BUG_ON(longs >= INT_MAX / BITS_PER_LONG);
- ret += bitmap_weight((unsigned long *)bitmap,
- longs * BITS_PER_LONG);
+ const unsigned long *bitmap_long =
+ (const unsigned long *)bitmap;
+
bytes -= longs * sizeof(long);
- bitmap += longs * sizeof(long);
+ for (; longs > 0; longs--, bitmap_long++)
+ ret += hweight_long(*bitmap_long);
+ bitmap = (const unsigned char *)bitmap_long;
}
/*
* The reason that this last loop is distinct from the preceding
--
2.21.0


2019-08-22 05:11:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] lib/memweight.c: optimize by inlining bitmap_weight()

On Wed, 21 Aug 2019 10:42:00 +0300 Denis Efremov <[email protected]> wrote:

> This patch inlines bitmap_weight() call.

It is better to say the patch "open codes" the bitmap_weight() call.

> Thus, removing the BUG_ON,

Why is that OK to do?

I expect all the code size improvements are from doing this?

> and 'longs to bits -> bits to longs' conversion by directly calling
> hweight_long().
>
> ./scripts/bloat-o-meter lib/memweight.o.old lib/memweight.o.new
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
> Function old new delta
> memweight 162 152 -10
>

2019-08-22 09:34:07

by Denis Efremov

[permalink] [raw]
Subject: Re: [PATCH] lib/memweight.c: optimize by inlining bitmap_weight()



On 22.08.2019 04:25, Andrew Morton wrote:
> On Wed, 21 Aug 2019 10:42:00 +0300 Denis Efremov <[email protected]> wrote:
>
>> This patch inlines bitmap_weight() call.
>
> It is better to say the patch "open codes" the bitmap_weight() call.
>
>> Thus, removing the BUG_ON,
>
> Why is that OK to do?

BUG_ON was necessary here to check that bitmap_weight will return a correct value,
i.e. the computed weight will fit the int type:
static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits);

BUG_ON was added in the memweight v2
https://lore.kernel.org/lkml/[email protected]/
Jan Kara wrote:
>> +
>> + for (longs = bytes / sizeof(long); longs > 0; ) {
>> + size_t bits = min_t(size_t, INT_MAX & ~(BITS_PER_LONG - 1),
> + longs * BITS_PER_LONG);
> I find it highly unlikely that someone would have such a large bitmap
> (256 MB or more on 32-bit). Also the condition as you wrote it can just
> overflow so it won't have the desired effect. Just do
> BUG_ON(longs >= ULONG_MAX / BITS_PER_LONG);
> and remove the loop completely. If someone comes with such a huge bitmap,
> the code can be modified easily (after really closely inspecting whether
> such a huge bitmap is really well justified).
>> +
>> + w += bitmap_weight(bitmap.ptr, bits);
>> + bytes -= bits / BITS_PER_BYTE;
>> + bitmap.address += bits / BITS_PER_BYTE;
>> + longs -= bits / BITS_PER_LONG;

Akinobu Mita wrote:
> The bits argument of bitmap_weight() is int type. So this should be
>
> BUG_ON(longs >= INT_MAX / BITS_PER_LONG);

We don't need this check, since we removed the bitmap_weight call and
control the computation directly with size_t everywhere.

We could add BUG_ON(bytes >= SIZE_MAX / BITS_PER_BYTE);
at the very beginning of the function to check that the array is not
very big (>2000PiB), but it seems excessive.

>
> I expect all the code size improvements are from doing this?

Yes, but I thought it's good to show that the total size is not
increasing because of the manual "inlining".

>
>> and 'longs to bits -> bits to longs' conversion by directly calling
>> hweight_long().
>>
>> ./scripts/bloat-o-meter lib/memweight.o.old lib/memweight.o.new
>> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
>> Function old new delta
>> memweight 162 152 -10
>>
>

Regards,
Denis

2019-08-24 10:03:06

by Denis Efremov

[permalink] [raw]
Subject: [PATCH v2] lib/memweight.c: open codes bitmap_weight()

This patch open codes the bitmap_weight() call. The direct
invocation of hweight_long() allows to remove the BUG_ON and
excessive "longs to bits, bits to longs" conversion.

BUG_ON was required to check that bitmap_weight() will return
a correct value, i.e. the computed weight will fit the int type
of the return value. With this patch memweight() controls the
computation directly with size_t type everywhere. Thus, the BUG_ON
becomes unnecessary.

Total size reduced:
./scripts/bloat-o-meter lib/memweight.o.old lib/memweight.o.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
Function old new delta
memweight 162 152 -10

Co-developed-by: Erdem Tumurov <[email protected]>
Signed-off-by: Erdem Tumurov <[email protected]>
Co-developed-by: Vladimir Shelekhov <[email protected]>
Signed-off-by: Vladimir Shelekhov <[email protected]>
Signed-off-by: Denis Efremov <[email protected]>
---
lib/memweight.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/memweight.c b/lib/memweight.c
index 94dd72ccaa7f..f050b2b4c5e2 100644
--- a/lib/memweight.c
+++ b/lib/memweight.c
@@ -20,11 +20,13 @@ size_t memweight(const void *ptr, size_t bytes)

longs = bytes / sizeof(long);
if (longs) {
- BUG_ON(longs >= INT_MAX / BITS_PER_LONG);
- ret += bitmap_weight((unsigned long *)bitmap,
- longs * BITS_PER_LONG);
+ const unsigned long *bitmap_long =
+ (const unsigned long *)bitmap;
+
bytes -= longs * sizeof(long);
- bitmap += longs * sizeof(long);
+ for (; longs > 0; longs--, bitmap_long++)
+ ret += hweight_long(*bitmap_long);
+ bitmap = (const unsigned char *)bitmap_long;
}
/*
* The reason that this last loop is distinct from the preceding
--
2.21.0

2019-08-25 06:14:02

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2] lib/memweight.c: open codes bitmap_weight()

On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
> This patch open codes the bitmap_weight() call. The direct
> invocation of hweight_long() allows to remove the BUG_ON and
> excessive "longs to bits, bits to longs" conversion.

Honestly, that's not the problem with this function. Take a look
at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
set of problems with popcnt.

> BUG_ON was required to check that bitmap_weight() will return
> a correct value, i.e. the computed weight will fit the int type
> of the return value.

What? No. Look at the _arguments_ of bitmap_weight():

static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)

> With this patch memweight() controls the
> computation directly with size_t type everywhere. Thus, the BUG_ON
> becomes unnecessary.

Why are you bothering? How are you allocating half a gigabyte of memory?
Why are you calling memweight() on half a gigabyte of memory?

> if (longs) {
> - BUG_ON(longs >= INT_MAX / BITS_PER_LONG);
> - ret += bitmap_weight((unsigned long *)bitmap,
> - longs * BITS_PER_LONG);
> + const unsigned long *bitmap_long =
> + (const unsigned long *)bitmap;
> +
> bytes -= longs * sizeof(long);
> - bitmap += longs * sizeof(long);
> + for (; longs > 0; longs--, bitmap_long++)
> + ret += hweight_long(*bitmap_long);
> + bitmap = (const unsigned char *)bitmap_long;
> }

If you really must change anything, I'd rather see this turned into a
loop:

while (longs) {
unsigned int nbits;

if (longs >= INT_MAX / BITS_PER_LONG)
nbits = INT_MAX + 1;
else
nbits = longs * BITS_PER_LONG;

ret += bitmap_weight((unsigned long *)bitmap, sz);
bytes -= nbits / 8;
bitmap += nbits / 8;
longs -= nbits / BITS_PER_LONG;
}

then we only have to use Dan Luu's optimisation in bitmap_weight()
and not in memweight() as well.

Also, why does the trailer do this:

for (; bytes > 0; bytes--, bitmap++)
ret += hweight8(*bitmap);

instead of calling hweight_long on *bitmap & mask?

2019-08-25 11:42:26

by Denis Efremov

[permalink] [raw]
Subject: Re: [PATCH v2] lib/memweight.c: open codes bitmap_weight()



On 25.08.2019 09:11, Matthew Wilcox wrote:
> On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
>> This patch open codes the bitmap_weight() call. The direct
>> invocation of hweight_long() allows to remove the BUG_ON and
>> excessive "longs to bits, bits to longs" conversion.
>
> Honestly, that's not the problem with this function. Take a look
> at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
> set of problems with popcnt.
>
>> BUG_ON was required to check that bitmap_weight() will return
>> a correct value, i.e. the computed weight will fit the int type
>> of the return value.
>
> What? No. Look at the _arguments_ of bitmap_weight():
>
> static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)

I'm not sure why it is INT_MAX then? I would expect in case we care only about arguments
something like:

BUG_ON(longs >= UINT_MAX / BITS_PER_LONG);

>
>> With this patch memweight() controls the
>> computation directly with size_t type everywhere. Thus, the BUG_ON
>> becomes unnecessary.
>
> Why are you bothering? How are you allocating half a gigabyte of memory?
> Why are you calling memweight() on half a gigabyte of memory?
>

No, we don't use such big arrays. However, it's possible to remove BUG_ON and make
the code more "straight". Why do we need to "artificially" limit this function
to arrays of a particular size if we can relatively simple omit this restriction?

>
> If you really must change anything, I'd rather see this turned into a
> loop:
>
> while (longs) {
> unsigned int nbits;
>
> if (longs >= INT_MAX / BITS_PER_LONG)
> nbits = INT_MAX + 1;
> else
> nbits = longs * BITS_PER_LONG;
>
> ret += bitmap_weight((unsigned long *)bitmap, sz);
> bytes -= nbits / 8;
> bitmap += nbits / 8;
> longs -= nbits / BITS_PER_LONG;
> }
>
> then we only have to use Dan Luu's optimisation in bitmap_weight()
> and not in memweight() as well.

I don't know how the implementation of this optimization will look like in it's
final shape, because of different hardware/compiler issues. It looks there are
a number of different ways to do it https://arxiv.org/pdf/1611.07612.pdf,
http://0x80.pl/articles/sse-popcount.html.

However, if it will be based on popcnt instruction I would expect that
hweight_long will also contain this intrinsics. Since version 4.9.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011#c13 GCC knows of the
false-dependency in popcnt and generates code to handle it
(e.g. xor https://godbolt.org/z/Q7AW_d) Thus, I would expect that it's
possible to use popcnt intrinsics in hweight_long that would be natively
optimized in all loops like "for (...) { res += hweight_long() }" without
requiring manual unrolling like in builtin_popcnt_unrolled_errata_manual
example of Dan Luu's optimization.

>
> Also, why does the trailer do this:
>
> for (; bytes > 0; bytes--, bitmap++)
> ret += hweight8(*bitmap);
>
> instead of calling hweight_long on *bitmap & mask?
>

Do you mean something like this?

longs = bytes;
bytes = do_div(longs, sizeof(long));
bitmap_long = (const unsigned long *)bitmap;
if (longs) {
for (; longs > 0; longs--, bitmap_long++)
ret += hweight_long(*bitmap_long);
}
if (bytes) {
ret += hweight_long(*bitmap_long &
((0x1 << bytes * BITS_PER_BYTE) - 1));
}

The *bitmap_long will lead to buffer overflow here.

Thanks,
Denis

2019-08-26 19:06:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2] lib/memweight.c: open codes bitmap_weight()

On Sun, Aug 25, 2019 at 02:39:47PM +0300, Denis Efremov wrote:
> On 25.08.2019 09:11, Matthew Wilcox wrote:
> > On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
> >> This patch open codes the bitmap_weight() call. The direct
> >> invocation of hweight_long() allows to remove the BUG_ON and
> >> excessive "longs to bits, bits to longs" conversion.
> >
> > Honestly, that's not the problem with this function. Take a look
> > at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
> > set of problems with popcnt.
> >
> >> BUG_ON was required to check that bitmap_weight() will return
> >> a correct value, i.e. the computed weight will fit the int type
> >> of the return value.
> >
> > What? No. Look at the _arguments_ of bitmap_weight():
> >
> > static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)
>
> I'm not sure why it is INT_MAX then? I would expect in case we care only about arguments
> something like:
>
> BUG_ON(longs >= UINT_MAX / BITS_PER_LONG);

People aren't always terribly consistent with INT_MAX vs UINT_MAX.
Also, bitmap_weight() should arguably return an unisnged int (it can't
legitimately return a negative value).

> >> With this patch memweight() controls the
> >> computation directly with size_t type everywhere. Thus, the BUG_ON
> >> becomes unnecessary.
> >
> > Why are you bothering? How are you allocating half a gigabyte of memory?
> > Why are you calling memweight() on half a gigabyte of memory?
> >
>
> No, we don't use such big arrays. However, it's possible to remove BUG_ON and make
> the code more "straight". Why do we need to "artificially" limit this function
> to arrays of a particular size if we can relatively simple omit this restriction?

You're not making a great case for changing the implementation of
memweight() here ...

> I don't know how the implementation of this optimization will look like in it's
> final shape, because of different hardware/compiler issues. It looks there are
> a number of different ways to do it https://arxiv.org/pdf/1611.07612.pdf,
> http://0x80.pl/articles/sse-popcount.html.

The problem with using XMM registers is that they have to be saved/restored.
Not to mention the thermal issues caused by heavy usage of AVX instructions.

> However, if it will be based on popcnt instruction I would expect that
> hweight_long will also contain this intrinsics. Since version 4.9.2
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011#c13 GCC knows of the
> false-dependency in popcnt and generates code to handle it

Ah! Glad to see GCC knows about this problem and has worked around it.

> (e.g. xor https://godbolt.org/z/Q7AW_d) Thus, I would expect that it's
> possible to use popcnt intrinsics in hweight_long that would be natively
> optimized in all loops like "for (...) { res += hweight_long() }" without
> requiring manual unrolling like in builtin_popcnt_unrolled_errata_manual
> example of Dan Luu's optimization.

That might be expecting rather more from our compiler than is reasonable ...

> >
> > Also, why does the trailer do this:
> >
> > for (; bytes > 0; bytes--, bitmap++)
> > ret += hweight8(*bitmap);
> >
> > instead of calling hweight_long on *bitmap & mask?
> >
>
> Do you mean something like this?
>
> longs = bytes;
> bytes = do_div(longs, sizeof(long));
> bitmap_long = (const unsigned long *)bitmap;
> if (longs) {
> for (; longs > 0; longs--, bitmap_long++)
> ret += hweight_long(*bitmap_long);
> }
> if (bytes) {
> ret += hweight_long(*bitmap_long &
> ((0x1 << bytes * BITS_PER_BYTE) - 1));
> }
>
> The *bitmap_long will lead to buffer overflow here.

No it won't. The CPU will access more bytes than the `bytes' argument
would seem to imply -- but it's going to have fetched that entire
cacheline anyway. It might confuse a very strict bounds checking library,
but usually those just check you're not accessing outside your object,
which is going to be a multiple of 'sizeof(long)' anyway.

If we do something like this, we'll need to use an 'inverse' of that mask
on big-endian machines. ie something more like:

if (bytes) {
unsigned long mask;
if (_BIG_ENDIAN)
mask = ~0UL >> (bytes * 8);
else
mask = ~0UL << (bytes * 8);
ret += hweight_long(*bitmap_long & ~mask);
}

Also we need a memweight() test to be sure we didn't get that wrong.

2019-09-13 13:46:44

by Denis Efremov

[permalink] [raw]
Subject: Re: [PATCH v2] lib/memweight.c: open codes bitmap_weight()

Sorry, no question, pointer alignment of course.

Denis Efremov писал 2019-09-13 14:48:
> Hi,
>
> Sorry for reviving this conversation, but it looks to me like
> this function could be reduced to a single bitmap_weight call:
>
> static inline size_t memweight(const void *ptr, size_t bytes)
> {
> BUG_ON(bytes >= UINT_MAX / BITS_PER_BYTE);
> return bitmap_weight(ptr, bytes * BITS_PER_BYTE);
> }
>
> Comparing to the current implementation
> https://elixir.bootlin.com/linux/latest/source/lib/memweight.c#L11
> this results in a signification simplification.
>
> __bitmap_weight already count last bits with hweight_long as we
> discussed earlier.
>
> int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
> {
> ...
> if (bits % BITS_PER_LONG)
> w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
> ...
> }
>
> and __arch_hweight* functions use popcnt instruction.
>
> I've briefly tested the equivalence of 2 implementations on x86_64 with
> fuzzing here:
> https://gist.github.com/evdenis/95a8b9b8041e09368b31c3a9510491a5
>
> What do you think making this function static inline and moving it
> to include/linux/string.h? I could prepare a patch for it and add some
> tests for
> memweight and bitmap_weight. Or maybe I miss something again?
>
> Best regards,
> Denis

2019-09-13 14:21:35

by Denis Efremov

[permalink] [raw]
Subject: Re: [PATCH v2] lib/memweight.c: open codes bitmap_weight()

Hi,

Sorry for reviving this conversation, but it looks to me like
this function could be reduced to a single bitmap_weight call:

static inline size_t memweight(const void *ptr, size_t bytes)
{
BUG_ON(bytes >= UINT_MAX / BITS_PER_BYTE);
return bitmap_weight(ptr, bytes * BITS_PER_BYTE);
}

Comparing to the current implementation
https://elixir.bootlin.com/linux/latest/source/lib/memweight.c#L11
this results in a signification simplification.

__bitmap_weight already count last bits with hweight_long as we
discussed earlier.

int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
{
...
if (bits % BITS_PER_LONG)
w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
...
}

and __arch_hweight* functions use popcnt instruction.

I've briefly tested the equivalence of 2 implementations on x86_64 with
fuzzing here: https://gist.github.com/evdenis/95a8b9b8041e09368b31c3a9510491a5

What do you think making this function static inline and moving it
to include/linux/string.h? I could prepare a patch for it and add some tests for
memweight and bitmap_weight. Or maybe I miss something again?

Best regards,
Denis