2024-05-02 14:14:27

by Hsin-Yu.Chen

[permalink] [raw]
Subject: [PATCH 2/2] string: improve strlen performance

Port `strlen` in gcc, which enhance performance over 10 times

Please refer to these following articles
1. [Determine if a word has a byte less than n]
(https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord)
2. [Determine if a word has a zero byte]
(https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord)

Signed-off-by: Hsin-Yu.Chen <[email protected]>
---
lib/string.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 6891d15ce991..31e8642422af 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -398,11 +398,80 @@ EXPORT_SYMBOL(strnchr);
#ifndef __HAVE_ARCH_STRLEN
size_t strlen(const char *s)
{
- const char *sc;
+ const char *char_ptr;
+ const unsigned long *longword_ptr;
+ unsigned long longword, himagic, lomagic;

- for (sc = s; *sc != '\0'; ++sc)
- /* nothing */;
- return sc - s;
+ /* Handle the first few characters by reading one character at a time.
+ * Do this until CHAR_PTR is aligned on a longword boundary.
+ */
+ for (char_ptr = s; ((unsigned long) char_ptr
+ & (sizeof(longword) - 1)) != 0;
+ ++char_ptr)
+ if (*char_ptr == '\0')
+ return char_ptr - s;
+
+ /* All these elucidatory comments refer to 4-byte longwords,
+ * but the theory applies equally well to 8-byte longwords.
+ */
+ longword_ptr = (unsigned long *) char_ptr;
+
+ /* Bits 31, 24, 16, and 8 of this number are zero.
+ * Call these bits the "holes."
+ * Note that there is a hole just to the left of
+ * each byte, with an extra at the end:
+ * bits: 01111110 11111110 11111110 11111111
+ * bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
+ * The 1-bits make sure that carries propagate to the next 0-bit.
+ * The 0-bits provide holes for carries to fall into.
+ */
+ himagic = 0x80808080L;
+ lomagic = 0x01010101L;
+
+ if (sizeof(longword) > 4) {
+ /* 64-bit version of the magic. */
+ /* Do the shift in two steps to avoid a warning if long has 32 bits.
+ */
+ himagic = ((himagic << 16) << 16) | himagic;
+ lomagic = ((lomagic << 16) << 16) | lomagic;
+ }
+
+ if (sizeof(longword) > 8)
+ abort();
+
+ /* Instead of the traditional loop which tests each character,
+ * we will test a longword at a time. The tricky part is testing
+ * if *any of the four* bytes in the longword in question are zero.
+ */
+ for (;;) {
+ longword = *longword_ptr++;
+ if (((longword - lomagic) & ~longword & himagic) != 0) {
+
+ /* Which of the bytes was the zero?
+ * If none of them were, it was a misfire; continue the search.
+ */
+ const char *cp = (const char *) (longword_ptr - 1);
+
+ if (cp[0] == 0)
+ return cp - s;
+ else if (cp[1] == 0)
+ return cp - s + 1;
+ else if (cp[2] == 0)
+ return cp - s + 2;
+ else if (cp[3] == 0)
+ return cp - s + 3;
+ if (sizeof(longword) > 4) {
+ if (cp[4] == 0)
+ return cp - s + 4;
+ else if (cp[5] == 0)
+ return cp - s + 5;
+ else if (cp[6] == 0)
+ return cp - s + 6;
+ else if (cp[7] == 0)
+ return cp - s + 7;
+ }
+ }
+ }
}
EXPORT_SYMBOL(strlen);
#endif
--
2.38.1



2024-05-02 15:00:49

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 2/2] string: improve strlen performance

On Thu, May 2, 2024 at 5:14 PM Hsin-Yu.Chen <[email protected]> wrote:
>
> Port `strlen` in gcc,

This is the Linux kernel project. What does this mean?
Also note we refer to the function as strlen() (and no [back]quotes).

> which enhance performance over 10 times

>
> Please refer to these following articles
> 1. [Determine if a word has a byte less than n]
> (https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord)
> 2. [Determine if a word has a zero byte]
> (https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord)

Make these proper Link: tags

Link: URL#1 [1]
Link: URL#2 [2]

..

> - const char *sc;
> + const char *char_ptr;

No need to change the var name for this.

> + const unsigned long *longword_ptr;
> + unsigned long longword, himagic, lomagic;

Keep it in reversed xmas tree order.

..

> + /* Handle the first few characters by reading one character at a time.
> + * Do this until CHAR_PTR is aligned on a longword boundary.
> + */

/*
* This is the wrong comment style. You may
* use this example.
*/

..

> + for (char_ptr = s; ((unsigned long) char_ptr
> + & (sizeof(longword) - 1)) != 0;
> + ++char_ptr)

This is too verbose (too many unneeded symbols) and why pre-increment?
What is special about it?

..

> + /* All these elucidatory comments refer to 4-byte longwords,
> + * but the theory applies equally well to 8-byte longwords.
> + */

Use proper style.

> + longword_ptr = (unsigned long *) char_ptr;

No space after casting and why do you need it?

..

> + /* Bits 31, 24, 16, and 8 of this number are zero.
> + * Call these bits the "holes."
> + * Note that there is a hole just to the left of
> + * each byte, with an extra at the end:
> + * bits: 01111110 11111110 11111110 11111111
> + * bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> + * The 1-bits make sure that carries propagate to the next 0-bit.
> + * The 0-bits provide holes for carries to fall into.
> + */

Use proper style.

..

> + /* 64-bit version of the magic. */
> + /* Do the shift in two steps to avoid a warning if long has 32 bits.
> + */

Ditto.

..

> + if (sizeof(longword) > 8)
> + abort();

Huh?!

..

> + /* Instead of the traditional loop which tests each character,
> + * we will test a longword at a time. The tricky part is testing
> + * if *any of the four* bytes in the longword in question are zero.
> + */

Proper style, please.

..

> + for (;;) {
> + longword = *longword_ptr++;
> + if (((longword - lomagic) & ~longword & himagic) != 0) {
> +
> + /* Which of the bytes was the zero?
> + * If none of them were, it was a misfire; continue the search.
> + */
> + const char *cp = (const char *) (longword_ptr - 1);

> + if (cp[0] == 0)
> + return cp - s;
> + else if (cp[1] == 0)
> + return cp - s + 1;
> + else if (cp[2] == 0)
> + return cp - s + 2;
> + else if (cp[3] == 0)
> + return cp - s + 3;

> + if (sizeof(longword) > 4) {

if (... <= 4)
continue;

> + if (cp[4] == 0)
> + return cp - s + 4;
> + else if (cp[5] == 0)
> + return cp - s + 5;
> + else if (cp[6] == 0)
> + return cp - s + 6;
> + else if (cp[7] == 0)
> + return cp - s + 7;

A lot of redundant 'else':s.

> + }
> + }
> + }

--
With Best Regards,
Andy Shevchenko

2024-05-02 15:11:45

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 2/2] string: improve strlen performance

On Thu, May 2, 2024 at 5:59 PM Andy Shevchenko
<[email protected]> wrote:
> On Thu, May 2, 2024 at 5:14 PM Hsin-Yu.Chen <[email protected]> wrote:

And on top of that, check what this code will do on the architectures
that do not support unaligned access. If everything is fine, mention
this in the commit message. Btw, your commit message needs
elaboration, e.g., pointing to the test case (which is absent in this
patch, I assume it's already in the kernel?) and step-by-step
instructions on how you got the mentioned results with details of the
hardware you used for that.

--
With Best Regards,
Andy Shevchenko

2024-05-02 15:14:26

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/2] string: improve strlen performance

On Thu, May 02, 2024 at 06:03:04PM +0300, Andy Shevchenko wrote:
> On Thu, May 2, 2024 at 5:59 PM Andy Shevchenko
> <[email protected]> wrote:
> > On Thu, May 2, 2024 at 5:14 PM Hsin-Yu.Chen <[email protected]> wrote:
>
> And on top of that, check what this code will do on the architectures
> that do not support unaligned access. If everything is fine, mention
> this in the commit message. Btw, your commit message needs
> elaboration, e.g., pointing to the test case (which is absent in this
> patch, I assume it's already in the kernel?) and step-by-step
> instructions on how you got the mentioned results with details of the
> hardware you used for that.

I might be worth looking at the implementation of strscpy(), which is
doing similar multi-byte steps and handles unaligned access.

--
Kees Cook

2024-05-02 15:47:50

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 2/2] string: improve strlen performance

On Thu, May 02, 2024 at 08:10:32AM -0700, Kees Cook wrote:
> On Thu, May 02, 2024 at 06:03:04PM +0300, Andy Shevchenko wrote:
> > On Thu, May 2, 2024 at 5:59 PM Andy Shevchenko
> > <[email protected]> wrote:
> > > On Thu, May 2, 2024 at 5:14 PM Hsin-Yu.Chen <[email protected]> wrote:
> >
> > And on top of that, check what this code will do on the architectures
> > that do not support unaligned access. If everything is fine, mention
> > this in the commit message. Btw, your commit message needs
> > elaboration, e.g., pointing to the test case (which is absent in this
> > patch, I assume it's already in the kernel?) and step-by-step
> > instructions on how you got the mentioned results with details of the
> > hardware you used for that.
>
> I might be worth looking at the implementation of strscpy(), which is
> doing similar multi-byte steps and handles unaligned access.

Right, we have a word-at-a-time.h or alike for this things.

--
With Best Regards,
Andy Shevchenko



2024-05-05 13:39:41

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 2/2] string: improve strlen performance

From: Kees Cook
> Sent: 02 May 2024 16:11
>
> On Thu, May 02, 2024 at 06:03:04PM +0300, Andy Shevchenko wrote:
> > On Thu, May 2, 2024 at 5:59 PM Andy Shevchenko
> > <[email protected]> wrote:
> > > On Thu, May 2, 2024 at 5:14 PM Hsin-Yu.Chen <[email protected]> wrote:
> >
> > And on top of that, check what this code will do on the architectures
> > that do not support unaligned access. If everything is fine, mention
> > this in the commit message. Btw, your commit message needs
> > elaboration, e.g., pointing to the test case (which is absent in this
> > patch, I assume it's already in the kernel?) and step-by-step
> > instructions on how you got the mentioned results with details of the
> > hardware you used for that.
>
> I might be worth looking at the implementation of strscpy(), which is
> doing similar multi-byte steps and handles unaligned access.

And x86 really doesn't care about unaligned accesses (for normal registers).
But it is important to not accidentally run off the end of a page.

There is also the whole question of the typical string length.
For short strings you've already lost by the time you've aligned
the address.

On 32bit the whole bit-twiddling may not be worth it at all.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)