2022-07-10 14:32:02

by Yu-Jen Chang

[permalink] [raw]
Subject: [PATCH 2/2] lib/string.c: Optimize memchr()

The original version of memchr() is implemented with a byte-wise
comparison technique, which does not make full use of the 64-bit or
32-bit registers in the CPU. We use word-wide comparison so that 8
characters can be compared at the same time. This code is based on
David Laight's implementation.

We created two files to measure the performance. The first file
contains on average 10 characters ahead of the target character.
The second file contains at least 1000 characters ahead of the
target character. Our implementation of memchr() is slightly
better in the first test and nearly 4x faster than the original
implementation in the second test.

Signed-off-by: Yu-Jen Chang <[email protected]>
Signed-off-by: Ching-Chun (Jim) Huang <[email protected]>
---
lib/string.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 80469e6c3..8ca965431 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
 #ifndef __HAVE_ARCH_MEMCHR
 /**
  * memchr - Find a character in an area of memory.
- * @s: The memory area
+ * @p: The memory area
  * @c: The byte to search for
- * @n: The size of the area.
+ * @length: The size of the area.
  *
  * returns the address of the first occurrence of @c, or %NULL
  * if @c is not found
  */
-void *memchr(const void *s, int c, size_t n)
+void *memchr(const void *p, int c, unsigned long length)
 {
-	const unsigned char *p = s;
-	while (n-- != 0) {
-		if ((unsigned char)c == *p++) {
-			return (void *)(p - 1);
+	u64 mask, val;
+	const void *end = p + length;
+
+	c &= 0xff;
+	if (p <= end - 8) {
+		mask = c;
+		MEMCHR_MASK_GEN(mask);
+
+		for (; p <= end - 8; p += 8) {
+			val = *(u64 *)p ^ mask;
+			if ((val + 0xfefefefefefefeffu) &
+			    (~val & 0x8080808080808080u))
+				break;
 		}
 	}
+
+	for (; p < end; p++)
+		if (*(unsigned char *)p == c)
+			return (void *)p;
+
 	return NULL;
 }
 EXPORT_SYMBOL(memchr);
--
2.25.1


2022-07-10 16:07:54

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On Sun, 2022-07-10 at 22:28 +0800, Yu-Jen Chang wrote:
> The original version of memchr() is implemented with the byte-wise
> comparing technique, which does not fully use 64-bits or 32-bits
> registers in CPU. We use word-wide comparing so that 8 characters
> can be compared at the same time on CPU. This code is base on
> David Laight's implementation.
>
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.

It seems you did not test this with 32bit compilers as
there are 64 bit constants without ull

> diff --git a/lib/string.c b/lib/string.c
[]
> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> #ifndef __HAVE_ARCH_MEMCHR
> /**
> * memchr - Find a character in an area of memory.
> - * @s: The memory area
> + * @p: The memory area
> * @c: The byte to search for
> - * @n: The size of the area.
> + * @length: The size of the area.
> *
> * returns the address of the first occurrence of @c, or %NULL
> * if @c is not found
> */
> -void *memchr(const void *s, int c, size_t n)
> +void *memchr(const void *p, int c, unsigned long length)
> {
> - const unsigned char *p = s;
> - while (n-- != 0) {
> - if ((unsigned char)c == *p++) {
> - return (void *)(p - 1);
> + u64 mask, val;
> + const void *end = p + length;
> +
> + c &= 0xff;
> + if (p <= end - 8) {
> + mask = c;
> + MEMCHR_MASK_GEN(mask);
> +
> + for (; p <= end - 8; p += 8) {
> + val = *(u64 *)p ^ mask;
> + if ((val + 0xfefefefefefefeffu) &
> + (~val & 0x8080808080808080u))

here.

2022-07-10 17:38:56

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

Hi Yu-Jen,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.19-rc5 next-20220708]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Yu-Jen-Chang/Optimize-memchr/20220710-223118
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b1c428b6c3684ee8ddf4137d68b3e8d51d2a700f
config: hexagon-randconfig-r041-20220710 (https://download.01.org/0day-ci/archive/20220711/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 6ce63e267aab79ca87bf63453d34dd3909ab978d)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/307ee542c2bd8378b2225910f3f9b982df7a7669
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Yu-Jen-Chang/Optimize-memchr/20220710-223118
git checkout 307ee542c2bd8378b2225910f3f9b982df7a7669
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> lib/string.c:902:7: error: conflicting types for 'memchr'
void *memchr(const void *p, int c, unsigned long length)
^
include/linux/string.h:162:15: note: previous declaration is here
extern void * memchr(const void *,int,__kernel_size_t);
^
1 error generated.


vim +/memchr +902 lib/string.c

891
892 #ifndef __HAVE_ARCH_MEMCHR
893 /**
894 * memchr - Find a character in an area of memory.
895 * @p: The memory area
896 * @c: The byte to search for
897 * @length: The size of the area.
898 *
899 * returns the address of the first occurrence of @c, or %NULL
900 * if @c is not found
901 */
> 902 void *memchr(const void *p, int c, unsigned long length)
903 {
904 u64 mask, val;
905 const void *end = p + length;
906
907 c &= 0xff;
908 if (p <= end - 8) {
909 mask = c;
910 MEMCHR_MASK_GEN(mask);
911
912 for (; p <= end - 8; p += 8) {
913 val = *(u64 *)p ^ mask;
914 if ((val + 0xfefefefefefefeffu) &
915 (~val & 0x8080808080808080u))
916 break;
917 }
918 }
919
920 for (; p < end; p++)
921 if (*(unsigned char *)p == c)
922 return (void *)p;
923
924 return NULL;
925 }
926 EXPORT_SYMBOL(memchr);
927 #endif
928

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-07-10 20:15:43

by Andrey Semashev

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On 7/10/22 17:28, Yu-Jen Chang wrote:
> The original version of memchr() is implemented with the byte-wise
> comparing technique, which does not fully use 64-bits or 32-bits
> registers in CPU. We use word-wide comparing so that 8 characters
> can be compared at the same time on CPU. This code is base on
> David Laight's implementation.
>
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.
>
> Signed-off-by: Yu-Jen Chang <[email protected]>
> Signed-off-by: Ching-Chun (Jim) Huang <[email protected]>
> ---
> lib/string.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/lib/string.c b/lib/string.c
> index 80469e6c3..8ca965431 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> #ifndef __HAVE_ARCH_MEMCHR
> /**
> * memchr - Find a character in an area of memory.
> - * @s: The memory area
> + * @p: The memory area
> * @c: The byte to search for
> - * @n: The size of the area.
> + * @length: The size of the area.
> *
> * returns the address of the first occurrence of @c, or %NULL
> * if @c is not found
> */
> -void *memchr(const void *s, int c, size_t n)
> +void *memchr(const void *p, int c, unsigned long length)
> {
> - const unsigned char *p = s;
> - while (n-- != 0) {
> - if ((unsigned char)c == *p++) {
> - return (void *)(p - 1);
> + u64 mask, val;
> + const void *end = p + length;
> +
> + c &= 0xff;
> + if (p <= end - 8) {
> + mask = c;
> + MEMCHR_MASK_GEN(mask);
> +
> + for (; p <= end - 8; p += 8) {
> + val = *(u64 *)p ^ mask;

What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
targets support (efficient) unaligned loads, do they?

> + if ((val + 0xfefefefefefefeffu) &
> + (~val & 0x8080808080808080u))
> + break;
> }
> }
> +
> + for (; p < end; p++)
> + if (*(unsigned char *)p == c)
> + return (void *)p;
> +
> return NULL;
> }
> EXPORT_SYMBOL(memchr);

2022-07-11 15:24:32

by Yu-Jen Chang

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

Andrey Semashev <[email protected]> 於 2022年7月11日 週一 凌晨4:01寫道:
>
> On 7/10/22 17:28, Yu-Jen Chang wrote:
> > The original version of memchr() is implemented with the byte-wise
> > comparing technique, which does not fully use 64-bits or 32-bits
> > registers in CPU. We use word-wide comparing so that 8 characters
> > can be compared at the same time on CPU. This code is base on
> > David Laight's implementation.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
> >
> > Signed-off-by: Yu-Jen Chang <[email protected]>
> > Signed-off-by: Ching-Chun (Jim) Huang <[email protected]>
> > ---
> > lib/string.c | 28 +++++++++++++++++++++-------
> > 1 file changed, 21 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/string.c b/lib/string.c
> > index 80469e6c3..8ca965431 100644
> > --- a/lib/string.c
> > +++ b/lib/string.c
> > @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> > #ifndef __HAVE_ARCH_MEMCHR
> > /**
> > * memchr - Find a character in an area of memory.
> > - * @s: The memory area
> > + * @p: The memory area
> > * @c: The byte to search for
> > - * @n: The size of the area.
> > + * @length: The size of the area.
> > *
> > * returns the address of the first occurrence of @c, or %NULL
> > * if @c is not found
> > */
> > -void *memchr(const void *s, int c, size_t n)
> > +void *memchr(const void *p, int c, unsigned long length)
> > {
> > - const unsigned char *p = s;
> > - while (n-- != 0) {
> > - if ((unsigned char)c == *p++) {
> > - return (void *)(p - 1);
> > + u64 mask, val;
> > + const void *end = p + length;
> > +
> > + c &= 0xff;
> > + if (p <= end - 8) {
> > + mask = c;
> > + MEMCHR_MASK_GEN(mask);
> > +
> > + for (; p <= end - 8; p += 8) {
> > + val = *(u64 *)p ^ mask;
>
> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> targets support (efficient) unaligned loads, do they?

I think it works if p is not aligned to 8 or 4 bytes.

Let's say the string is 10 bytes. The for loop here will search the first
8 bytes. If the target character is in the last 2 bytes, the second for
loop will find it. It also works like this on a 32-bit machine.

>
> > + if ((val + 0xfefefefefefefeffu) &
> > + (~val & 0x8080808080808080u))
> > + break;
> > }
> > }
> > +
> > + for (; p < end; p++)
> > + if (*(unsigned char *)p == c)
> > + return (void *)p;
> > +
> > return NULL;
> > }
> > EXPORT_SYMBOL(memchr);
>

2022-07-11 15:27:21

by Andrey Semashev

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On 7/11/22 17:52, Yu-Jen Chang wrote:
> Andrey Semashev <[email protected]> 於 2022年7月11日 週一 凌晨4:01寫道:
>>
>> On 7/10/22 17:28, Yu-Jen Chang wrote:
>>> The original version of memchr() is implemented with the byte-wise
>>> comparing technique, which does not fully use 64-bits or 32-bits
>>> registers in CPU. We use word-wide comparing so that 8 characters
>>> can be compared at the same time on CPU. This code is base on
>>> David Laight's implementation.
>>>
>>> We create two files to measure the performance. The first file
>>> contains on average 10 characters ahead the target character.
>>> The second file contains at least 1000 characters ahead the
>>> target character. Our implementation of “memchr()” is slightly
>>> better in the first test and nearly 4x faster than the orginal
>>> implementation in the second test.
>>>
>>> Signed-off-by: Yu-Jen Chang <[email protected]>
>>> Signed-off-by: Ching-Chun (Jim) Huang <[email protected]>
>>> ---
>>> lib/string.c | 28 +++++++++++++++++++++-------
>>> 1 file changed, 21 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/lib/string.c b/lib/string.c
>>> index 80469e6c3..8ca965431 100644
>>> --- a/lib/string.c
>>> +++ b/lib/string.c
>>> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
>>> #ifndef __HAVE_ARCH_MEMCHR
>>> /**
>>> * memchr - Find a character in an area of memory.
>>> - * @s: The memory area
>>> + * @p: The memory area
>>> * @c: The byte to search for
>>> - * @n: The size of the area.
>>> + * @length: The size of the area.
>>> *
>>> * returns the address of the first occurrence of @c, or %NULL
>>> * if @c is not found
>>> */
>>> -void *memchr(const void *s, int c, size_t n)
>>> +void *memchr(const void *p, int c, unsigned long length)

I didn't comment on this initially, but is the change of length type
intentional? Why?

>>> {
>>> - const unsigned char *p = s;
>>> - while (n-- != 0) {
>>> - if ((unsigned char)c == *p++) {
>>> - return (void *)(p - 1);
>>> + u64 mask, val;
>>> + const void *end = p + length;
>>> +
>>> + c &= 0xff;
>>> + if (p <= end - 8) {
>>> + mask = c;
>>> + MEMCHR_MASK_GEN(mask);
>>> +
>>> + for (; p <= end - 8; p += 8) {
>>> + val = *(u64 *)p ^ mask;
>>
>> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
>> targets support (efficient) unaligned loads, do they?
>
> I think it works if p is not aligned to 8 or 4 bytes.
>
> Let's say the string is 10 bytes. The for loop here will search the first
> 8 bytes. If the target character is in the last 2 bytes, the second for
> loop will find it. It also work like this on 32-bit machine.

I think you're missing the point. Loads at unaligned addresses may not
be allowed by hardware using conventional load instructions or may be
inefficient. Given that this memchr implementation is used as a fallback
when no hardware-specific version is available, you should be
conservative wrt. hardware capabilities and behavior. You should
probably have a pre-alignment loop.
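
For illustration, a minimal sketch of such a pre-alignment loop,
assuming it is placed just before the word-at-a-time loop of the
proposed memchr() (variable names follow that patch):

	/*
	 * Sketch only: step byte-by-byte until p is aligned to
	 * sizeof(long), checking each byte along the way, so that
	 * the word-at-a-time loop only ever issues aligned loads.
	 */
	while ((unsigned long)p & (sizeof(long) - 1)) {
		if (p >= end)
			return NULL;
		if (*(unsigned char *)p == c)
			return (void *)p;
		p++;
	}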

>>
>>> + if ((val + 0xfefefefefefefeffu) &
>>> + (~val & 0x8080808080808080u))
>>> + break;
>>> }
>>> }
>>> +
>>> + for (; p < end; p++)
>>> + if (*(unsigned char *)p == c)
>>> + return (void *)p;
>>> +
>>> return NULL;
>>> }
>>> EXPORT_SYMBOL(memchr);
>>

2022-07-11 15:28:23

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On Mon, Jul 11, 2022 at 5:00 PM Andrey Semashev
<[email protected]> wrote:
> On 7/11/22 17:52, Yu-Jen Chang wrote:

...

> I think you're missing the point. Loads at unaligned addresses may not
> be allowed by hardware using conventional load instructions or may be
> inefficient. Given that this memchr implementation is used as a fallback
> when no hardware-specific version is available, you should be
> conservative wrt. hardware capabilities and behavior. You should
> probably have a pre-alignment loop.

Exactly!
The initial code is broken, NAK.

P.S. At least you may look into strscpy() implementation to get a clue.

--
With Best Regards,
Andy Shevchenko

2022-07-11 15:46:31

by Yu-Jen Chang

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

Joe Perches <[email protected]> 於 2022年7月10日 週日 晚上11:16寫道:
>
> On Sun, 2022-07-10 at 22:28 +0800, Yu-Jen Chang wrote:
> > The original version of memchr() is implemented with the byte-wise
> > comparing technique, which does not fully use 64-bits or 32-bits
> > registers in CPU. We use word-wide comparing so that 8 characters
> > can be compared at the same time on CPU. This code is base on
> > David Laight's implementation.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
>
> It seems you did not test this with 32bit compilers as
> there are 64 bit constants without ull

Yeah, it is better to add ull at the end of the constants. I tested with
a 32-bit compiler and it works. The compiler will use two or more
instructions to reach the same result.

See https://godbolt.org/z/svbj18foP
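
For reference, this is what the check looks like with explicit ull
suffixes on the constants, which is the only change being asked for
here:

	val = *(u64 *)p ^ mask;
	/* ull so a 32-bit compiler still treats these as 64-bit values */
	if ((val + 0xfefefefefefefeffull) &
	    (~val & 0x8080808080808080ull))
		break;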

>
> > diff --git a/lib/string.c b/lib/string.c
> []
> > @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> > #ifndef __HAVE_ARCH_MEMCHR
> > /**
> > * memchr - Find a character in an area of memory.
> > - * @s: The memory area
> > + * @p: The memory area
> > * @c: The byte to search for
> > - * @n: The size of the area.
> > + * @length: The size of the area.
> > *
> > * returns the address of the first occurrence of @c, or %NULL
> > * if @c is not found
> > */
> > -void *memchr(const void *s, int c, size_t n)
> > +void *memchr(const void *p, int c, unsigned long length)
> > {
> > - const unsigned char *p = s;
> > - while (n-- != 0) {
> > - if ((unsigned char)c == *p++) {
> > - return (void *)(p - 1);
> > + u64 mask, val;
> > + const void *end = p + length;
> > +
> > + c &= 0xff;
> > + if (p <= end - 8) {
> > + mask = c;
> > + MEMCHR_MASK_GEN(mask);
> > +
> > + for (; p <= end - 8; p += 8) {
> > + val = *(u64 *)p ^ mask;
> > + if ((val + 0xfefefefefefefeffu) &
> > + (~val & 0x8080808080808080u))
>
> here.
>

2022-07-12 15:01:58

by Yu-Jen Chang

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

Andrey Semashev <[email protected]> 於 2022年7月11日 週一 晚上11:00寫道:
>
> On 7/11/22 17:52, Yu-Jen Chang wrote:
> > Andrey Semashev <[email protected]> 於 2022年7月11日 週一 凌晨4:01寫道:
> >>
> >> On 7/10/22 17:28, Yu-Jen Chang wrote:
> >>> The original version of memchr() is implemented with the byte-wise
> >>> comparing technique, which does not fully use 64-bits or 32-bits
> >>> registers in CPU. We use word-wide comparing so that 8 characters
> >>> can be compared at the same time on CPU. This code is base on
> >>> David Laight's implementation.
> >>>
> >>> We create two files to measure the performance. The first file
> >>> contains on average 10 characters ahead the target character.
> >>> The second file contains at least 1000 characters ahead the
> >>> target character. Our implementation of “memchr()” is slightly
> >>> better in the first test and nearly 4x faster than the orginal
> >>> implementation in the second test.
> >>>
> >>> Signed-off-by: Yu-Jen Chang <[email protected]>
> >>> Signed-off-by: Ching-Chun (Jim) Huang <[email protected]>
> >>> ---
> >>> lib/string.c | 28 +++++++++++++++++++++-------
> >>> 1 file changed, 21 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/lib/string.c b/lib/string.c
> >>> index 80469e6c3..8ca965431 100644
> >>> --- a/lib/string.c
> >>> +++ b/lib/string.c
> >>> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> >>> #ifndef __HAVE_ARCH_MEMCHR
> >>> /**
> >>> * memchr - Find a character in an area of memory.
> >>> - * @s: The memory area
> >>> + * @p: The memory area
> >>> * @c: The byte to search for
> >>> - * @n: The size of the area.
> >>> + * @length: The size of the area.
> >>> *
> >>> * returns the address of the first occurrence of @c, or %NULL
> >>> * if @c is not found
> >>> */
> >>> -void *memchr(const void *s, int c, size_t n)
> >>> +void *memchr(const void *p, int c, unsigned long length)
>
> I didn't comment on this initially, but is the change of length type
> intentional? Why?

It is my mistake. I will change the type back to size_t.

>
> >>> {
> >>> - const unsigned char *p = s;
> >>> - while (n-- != 0) {
> >>> - if ((unsigned char)c == *p++) {
> >>> - return (void *)(p - 1);
> >>> + u64 mask, val;
> >>> + const void *end = p + length;
> >>> +
> >>> + c &= 0xff;
> >>> + if (p <= end - 8) {
> >>> + mask = c;
> >>> + MEMCHR_MASK_GEN(mask);
> >>> +
> >>> + for (; p <= end - 8; p += 8) {
> >>> + val = *(u64 *)p ^ mask;
> >>
> >> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> >> targets support (efficient) unaligned loads, do they?
> >
> > I think it works if p is not aligned to 8 or 4 bytes.
> >
> > Let's say the string is 10 bytes. The for loop here will search the first
> > 8 bytes. If the target character is in the last 2 bytes, the second for
> > loop will find it. It also work like this on 32-bit machine.
>
> I think you're missing the point. Loads at unaligned addresses may not
> be allowed by hardware using conventional load instructions or may be
> inefficient. Given that this memchr implementation is used as a fallback
> when no hardware-specific version is available, you should be
> conservative wrt. hardware capabilities and behavior. You should
> probably have a pre-alignment loop.

Got it. I added a pre-alignment loop. It aligns the address to 8 or 4 bytes.

void *memchr(const void *p, int c, size_t length)
{
	u64 mask, val;
	const void *end = p + length;

	c &= 0xff;
	while ((long)p & (sizeof(long) - 1)) {
		if (p >= end)
			return NULL;
		if (*(unsigned char *)p == c)
			return (void *)p;
		p++;
	}
	if (p <= end - 8) {
		mask = c;
		MEMCHR_MASK_GEN(mask);

		for (; p <= end - 8; p += 8) {
			val = *(u64 *)p ^ mask;
			if ((val + 0xfefefefefefefeffull)
			    & (~val & 0x8080808080808080ull))
				break;
		}
	}

	for (; p < end; p++)
		if (*(unsigned char *)p == c)
			return (void *)p;

	return NULL;
}

>
> >>
> >>> + if ((val + 0xfefefefefefefeffu) &
> >>> + (~val & 0x8080808080808080u))
> >>> + break;
> >>> }
> >>> }
> >>> +
> >>> + for (; p < end; p++)
> >>> + if (*(unsigned char *)p == c)
> >>> + return (void *)p;
> >>> +
> >>> return NULL;
> >>> }
> >>> EXPORT_SYMBOL(memchr);
> >>
>

2022-07-12 15:57:21

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On Tue, Jul 12, 2022 at 4:58 PM Yu-Jen Chang <[email protected]> wrote:
> Andrey Semashev <[email protected]> 於 2022年7月11日 週一 晚上11:00寫道:
> > On 7/11/22 17:52, Yu-Jen Chang wrote:
> > > Andrey Semashev <[email protected]> 於 2022年7月11日 週一 凌晨4:01寫道:
> > >> On 7/10/22 17:28, Yu-Jen Chang wrote:

...

> > >>> + for (; p <= end - 8; p += 8) {
> > >>> + val = *(u64 *)p ^ mask;
> > >>
> > >> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> > >> targets support (efficient) unaligned loads, do they?
> > >
> > > I think it works if p is not aligned to 8 or 4 bytes.
> > >
> > > Let's say the string is 10 bytes. The for loop here will search the first
> > > 8 bytes. If the target character is in the last 2 bytes, the second for
> > > loop will find it. It also work like this on 32-bit machine.
> >
> > I think you're missing the point. Loads at unaligned addresses may not
> > be allowed by hardware using conventional load instructions or may be
> > inefficient. Given that this memchr implementation is used as a fallback
> > when no hardware-specific version is available, you should be
> > conservative wrt. hardware capabilities and behavior. You should
> > probably have a pre-alignment loop.
>
> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.

Still far from what can be accepted. Have you had a chance to read how
strscpy() is implemented? Do you understand why it's done that way?

> void *memchr(const void *p, int c, size_t length)
> {
> u64 mask, val;
> const void *end = p + length;
> c &= 0xff;

> while ((long ) p & (sizeof(long) - 1)) {
> if (p >= end)
> return NULL;
> if (*(unsigned char *)p == c)
> return (void *) p;
> p++;
> }
> if (p <= end - 8) {
> mask = c;
> MEMCHR_MASK_GEN(mask);
>
> for (; p <= end - 8; p += 8) {

Why did you decide that this code will run explicitly on a 64-bit arch?

> val = *(u64*)p ^ mask;
> if ((val + 0xfefefefefefefeffull)
> & (~val & 0x8080808080808080ull))
> break;
> }
> }
>
> for (; p < end; p++)
> if (*(unsigned char *)p == c)
> return (void *)p;
>
> return NULL;
> }

--
With Best Regards,
Andy Shevchenko

2022-07-13 09:46:18

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 2/2] lib/string.c: Optimize memchr()

From: Yu-Jen Chang
> Sent: 12 July 2022 15:59
...
> > I think you're missing the point. Loads at unaligned addresses may not
> > be allowed by hardware using conventional load instructions or may be
> > inefficient. Given that this memchr implementation is used as a fallback
> > when no hardware-specific version is available, you should be
> > conservative wrt. hardware capabilities and behavior. You should
> > probably have a pre-alignment loop.
>
> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.

That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.

...
> for (; p <= end - 8; p += 8) {
> val = *(u64*)p ^ mask;
> if ((val + 0xfefefefefefefeffull)
> & (~val & 0x8080808080808080ull))
> break;

I would add a couple of comments, like:
// Convert to check for zero byte.
// Standard check for a zero byte in a word.
(But not the big 4-line explanation you had.)

It is also worth looking at how that code compiles
on 32bit arch that don't have a carry flag.
That is everything based on MIPS, including riscv.
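
For example, the loop body annotated roughly as suggested (a sketch;
the constants are the ones already in the patch, with ull suffixes
added):

	for (; p <= end - 8; p += 8) {
		/* Convert to check for zero byte. */
		val = *(u64 *)p ^ mask;
		/* Standard check for a zero byte in a word. */
		if ((val + 0xfefefefefefefeffull) &
		    (~val & 0x8080808080808080ull))
			break;
	}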

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-07-13 09:52:56

by Andrey Semashev

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On 7/13/22 12:39, David Laight wrote:
> From: Yu-Jen Chang
>> Sent: 12 July 2022 15:59
> ...
>>> I think you're missing the point. Loads at unaligned addresses may not
>>> be allowed by hardware using conventional load instructions or may be
>>> inefficient. Given that this memchr implementation is used as a fallback
>>> when no hardware-specific version is available, you should be
>>> conservative wrt. hardware capabilities and behavior. You should
>>> probably have a pre-alignment loop.
>>
>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
>
> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
>
> ...
>> for (; p <= end - 8; p += 8) {
>> val = *(u64*)p ^ mask;
>> if ((val + 0xfefefefefefefeffull)
>> & (~val & 0x8080808080808080ull))
>> break;
>
> I would add a couple of comments, like:
> // Convert to check for zero byte.
> // Standard check for a zero byte in a word.
> (But not the big 4 line explanation you had.
>
> It is also worth looking at how that code compiles
> on 32bit arch that don't have a carry flag.
> That is everything based on MIPS, including riscv.

It may be worth looking at how glibc does it:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329fc302;hb=refs/heads/release/2.35/master#l46

They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
think memchr in the kernel should follow this.
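
A sketch of a word-size-agnostic variant in that spirit, using
unsigned long and BITS_PER_LONG so that 32-bit targets use 32-bit
words (the macro names here are only illustrative, not from any
existing header):

#if BITS_PER_LONG == 64
#define MEMCHR_ONES	0x0101010101010101UL
#define MEMCHR_HIGHS	0x8080808080808080UL
#else
#define MEMCHR_ONES	0x01010101UL
#define MEMCHR_HIGHS	0x80808080UL
#endif

	/* Word-at-a-time loop using the native word size. */
	for (; p <= end - sizeof(long); p += sizeof(long)) {
		unsigned long val = *(unsigned long *)p ^ mask;

		/* Standard check for a zero byte in a word. */
		if ((val - MEMCHR_ONES) & ~val & MEMCHR_HIGHS)
			break;
	}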

2022-07-13 10:08:30

by Andrey Semashev

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On 7/13/22 12:49, Andrey Semashev wrote:
> On 7/13/22 12:39, David Laight wrote:
>> From: Yu-Jen Chang
>>> Sent: 12 July 2022 15:59
>> ...
>>>> I think you're missing the point. Loads at unaligned addresses may not
>>>> be allowed by hardware using conventional load instructions or may be
>>>> inefficient. Given that this memchr implementation is used as a fallback
>>>> when no hardware-specific version is available, you should be
>>>> conservative wrt. hardware capabilities and behavior. You should
>>>> probably have a pre-alignment loop.
>>>
>>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
>>
>> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
>>
>> ...
>>> for (; p <= end - 8; p += 8) {
>>> val = *(u64*)p ^ mask;
>>> if ((val + 0xfefefefefefefeffull)
>>> & (~val & 0x8080808080808080ull))
>>> break;
>>
>> I would add a couple of comments, like:
>> // Convert to check for zero byte.
>> // Standard check for a zero byte in a word.
>> (But not the big 4 line explanation you had.
>>
>> It is also worth looking at how that code compiles
>> on 32bit arch that don't have a carry flag.
>> That is everything based on MIPS, including riscv.
>
> It may be worth looking at how glibc does it:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329fc302;hb=refs/heads/release/2.35/master#l46
>
> They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> think memchr in the kernel should follow this.

Also, if by chance this optimization is aimed for x86-64, it may be
worth adding an arch-specific version that uses ERMS.

2022-07-13 10:45:27

by Andrey Semashev

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On 7/13/22 12:39, David Laight wrote:
> From: Yu-Jen Chang
>> Sent: 12 July 2022 15:59
> ...
>>> I think you're missing the point. Loads at unaligned addresses may not
>>> be allowed by hardware using conventional load instructions or may be
>>> inefficient. Given that this memchr implementation is used as a fallback
>>> when no hardware-specific version is available, you should be
>>> conservative wrt. hardware capabilities and behavior. You should
>>> probably have a pre-alignment loop.
>>
>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
>
> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.

If there is a pre-alignment loop, there won't be unaligned loads, so
there shouldn't be the need for the HAS_EFFICIENT_UNALIGNED_ACCESS
check. Unless I misunderstand what HAS_EFFICIENT_UNALIGNED_ACCESS indicates.

2022-07-13 11:05:17

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 2/2] lib/string.c: Optimize memchr()

From: Andrey Semashev
> Sent: 13 July 2022 11:03
>
> On 7/13/22 12:49, Andrey Semashev wrote:
> > On 7/13/22 12:39, David Laight wrote:
> >> From: Yu-Jen Chang
> >>> Sent: 12 July 2022 15:59
> >> ...
> >>>> I think you're missing the point. Loads at unaligned addresses may not
> >>>> be allowed by hardware using conventional load instructions or may be
> >>>> inefficient. Given that this memchr implementation is used as a fallback
> >>>> when no hardware-specific version is available, you should be
> >>>> conservative wrt. hardware capabilities and behavior. You should
> >>>> probably have a pre-alignment loop.
> >>>
> >>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
> >>
> >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> >>
> >> ...
> >>> for (; p <= end - 8; p += 8) {
> >>> val = *(u64*)p ^ mask;
> >>> if ((val + 0xfefefefefefefeffull)
> >>> & (~val & 0x8080808080808080ull))
> >>> break;
> >>
> >> I would add a couple of comments, like:
> >> // Convert to check for zero byte.
> >> // Standard check for a zero byte in a word.
> >> (But not the big 4 line explanation you had.
> >>
> >> It is also worth looking at how that code compiles
> >> on 32bit arch that don't have a carry flag.
> >> That is everything based on MIPS, including riscv.
> >
> > It may be worth looking at how glibc does it:
> >
> >
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
> fc302;hb=refs/heads/release/2.35/master#l46
> >
> > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> > think memchr in the kernel should follow this.
>
> Also, if by chance this optimization is aimed for x86-64, it may be
> worth adding an arch-specific version that uses ERMS.

Don't believe everything you see in glibc.
The common cases in the kernel are different from the ones they
tend to optimise for..

You might try using:
#define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
for the mask and the two constants.
Then make all the variables 'long'.

I'm not at all sure what the test for fast multiply is about.
It may be very historic, for modern cpu generating the 64bit
constant is likely to be more problematic (check arm64).
If the input value is 'unsigned char' (or masked) then the
compiler may decide to do the repeated <<= itself.
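
A sketch of the multiply-based mask generation being described
(GEN_MASK is the name suggested above; on a 32-bit target the cast
truncates the constant to 0x01010101, so the same macro works for
both word sizes):

#define GEN_MASK(x)	((x) * (unsigned long)0x0101010101010101ull)

	/* Broadcast the byte value c into every byte of a long. */
	mask = GEN_MASK(c & 0xff);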

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-07-21 05:27:33

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

Hi Yu-Jen,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.19-rc7 next-20220720]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Yu-Jen-Chang/Optimize-memchr/20220710-223118
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b1c428b6c3684ee8ddf4137d68b3e8d51d2a700f
config: arc-buildonly-randconfig-r005-20220719 (https://download.01.org/0day-ci/archive/20220721/[email protected]/config)
compiler: arceb-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/307ee542c2bd8378b2225910f3f9b982df7a7669
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Yu-Jen-Chang/Optimize-memchr/20220710-223118
git checkout 307ee542c2bd8378b2225910f3f9b982df7a7669
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> lib/string.c:902:7: error: conflicting types for 'memchr'; have 'void *(const void *, int, long unsigned int)'
902 | void *memchr(const void *p, int c, unsigned long length)
| ^~~~~~
In file included from lib/string.c:19:
include/linux/string.h:162:15: note: previous declaration of 'memchr' with type 'void *(const void *, int, __kernel_size_t)' {aka 'void *(const void *, int, unsigned int)'}
162 | extern void * memchr(const void *,int,__kernel_size_t);
| ^~~~~~


vim +902 lib/string.c

891
892 #ifndef __HAVE_ARCH_MEMCHR
893 /**
894 * memchr - Find a character in an area of memory.
895 * @p: The memory area
896 * @c: The byte to search for
897 * @length: The size of the area.
898 *
899 * returns the address of the first occurrence of @c, or %NULL
900 * if @c is not found
901 */
> 902 void *memchr(const void *p, int c, unsigned long length)
903 {
904 u64 mask, val;
905 const void *end = p + length;
906
907 c &= 0xff;
908 if (p <= end - 8) {
909 mask = c;
910 MEMCHR_MASK_GEN(mask);
911
912 for (; p <= end - 8; p += 8) {
913 val = *(u64 *)p ^ mask;
914 if ((val + 0xfefefefefefefeffu) &
915 (~val & 0x8080808080808080u))
916 break;
917 }
918 }
919
920 for (; p < end; p++)
921 if (*(unsigned char *)p == c)
922 return (void *)p;
923
924 return NULL;
925 }
926 EXPORT_SYMBOL(memchr);
927 #endif
928

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-07-22 16:23:00

by Yu-Jen Chang

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On Wed, Jul 13, 2022 at 6:24 PM David Laight <[email protected]> wrote:
>
> From: Andrey Semashev
> > Sent: 13 July 2022 11:03
> >
> > On 7/13/22 12:49, Andrey Semashev wrote:
> > > On 7/13/22 12:39, David Laight wrote:
> > >> From: Yu-Jen Chang
> > >>> Sent: 12 July 2022 15:59
> > >> ...
> > >>>> I think you're missing the point. Loads at unaligned addresses may not
> > >>>> be allowed by hardware using conventional load instructions or may be
> > >>>> inefficient. Given that this memchr implementation is used as a fallback
> > >>>> when no hardware-specific version is available, you should be
> > >>>> conservative wrt. hardware capabilities and behavior. You should
> > >>>> probably have a pre-alignment loop.
> > >>>
> > >>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
> > >>
> > >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> > >>
> > >> ...
> > >>> for (; p <= end - 8; p += 8) {
> > >>> val = *(u64*)p ^ mask;
> > >>> if ((val + 0xfefefefefefefeffull)
> > >>> & (~val & 0x8080808080808080ull))
> > >>> break;
> > >>
> > >> I would add a couple of comments, like:
> > >> // Convert to check for zero byte.
> > >> // Standard check for a zero byte in a word.
> > >> (But not the big 4 line explanation you had.
> > >>
> > >> It is also worth looking at how that code compiles
> > >> on 32bit arch that don't have a carry flag.
> > >> That is everything based on MIPS, including riscv.
> > >
> > > It may be worth looking at how glibc does it:
> > >
> > >
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
> > fc302;hb=refs/heads/release/2.35/master#l46
> > >
> > > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> > > think memchr in the kernel should follow this.
> >
> > Also, if by chance this optimization is aimed for x86-64, it may be
> > worth adding an arch-specific version that uses ERMS.
>
> Don't believe everything you see in glibc.
> The common cases in the kernel are different from the ones they
> tend to optimise for..
>
> You might try using:
> #define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
> for the mask and the two constants.
> Then make all the variables 'long'.
>
> I'm not at all sure what the test for fast multiply is about.
> It may be very historic, for modern cpu generating the 64bit
> constant is likely to be more problematic (check arm64).
> If the input value is 'unsigned char' (or masked) then the
> compiler may decide to do the repeated <<= itself.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


I have rewritten my code as follows. I use several macros to
generate the mask and detect the target char for 32-bit or 64-bit.
The DETECT_CHAR macro is the same as this part:

(val + 0xfefefefefefefeffull) & (~val & 0x8080808080808080ull)

0xfefefefefefefeffull is the 2's complement of 0x0101010101010101ULL,
so the addition here is exactly a subtraction.
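In other words, modulo 2^64, val + 0xfefefefefefefeff gives the same
result as val - 0x0101010101010101, because 0xfefefefefefefeff =
2^64 - 0x0101010101010101.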

If the CPU architecture is unable to perform unaligned accesses
efficiently, the pre-alignment loop will align the string first.


#define LBLOCKSIZE (sizeof(long))
#if BITS_PER_LONG == 64

#define DETECT_CHAR(X) \
	(((X) - 0x0101010101010101ULL) & ~(X) & 0x8080808080808080ULL)

#define MEMCHR_MASK_GEN(mask)			\
	do {					\
		(mask) |= (mask) << 8;		\
		(mask) |= (mask) << 16;		\
		(mask) |= (mask) << 32;		\
	} while (0)

#else

#define DETECT_CHAR(X) (((X) - 0x01010101UL) & ~(X) & 0x80808080UL)

#define MEMCHR_MASK_GEN(mask)			\
	do {					\
		(mask) |= (mask) << 8;		\
		(mask) |= (mask) << 16;		\
	} while (0)

#endif

void *memchr(const void *p, int c, size_t length)
{
	unsigned long mask, val;
	const void *end = p + length;

	c &= 0xff;
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	while ((long)p & (sizeof(long) - 1)) {
		if (p >= end)
			return NULL;
		if (*(unsigned char *)p == c)
			return (void *)p;
		p++;
	}
#endif
	if (p <= end - LBLOCKSIZE) {
		mask = c;
		MEMCHR_MASK_GEN(mask);

		for (; p <= end - LBLOCKSIZE; p += LBLOCKSIZE) {
			val = *(unsigned long *)p ^ mask;
			if (DETECT_CHAR(val))
				break;
		}
	}

	for (; p < end; p++)
		if (*(unsigned char *)p == c)
			return (void *)p;

	return NULL;
}
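
As a quick sanity check of the intended behaviour, a hypothetical
test (not part of the patch) could cover an unaligned start and a
match that falls near the end of the searched range:

	char buf[16] = "abcdefghiXklmno";
	void *hit, *miss;

	hit = memchr(buf + 1, 'X', 10);		/* unaligned start */
	miss = memchr(buf + 1, 'Z', 10);	/* byte not present */
	BUG_ON(hit != buf + 9 || miss != NULL);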

2022-07-29 15:45:16

by Yu-Jen Chang

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib/string.c: Optimize memchr()

On Sat, Jul 23, 2022 at 12:56 AM Andy Shevchenko
<[email protected]> wrote:
>
>
>
> On Friday, July 22, 2022, Yu-Jen Chang <[email protected]> wrote:
>>
>> On Wed, Jul 13, 2022 at 6:24 PM David Laight <[email protected]> wrote:
>> >
>> > From: Andrey Semashev
>> > > Sent: 13 July 2022 11:03
>> > >
>> > > On 7/13/22 12:49, Andrey Semashev wrote:
>> > > > On 7/13/22 12:39, David Laight wrote:
>> > > >> From: Yu-Jen Chang
>> > > >>> Sent: 12 July 2022 15:59
>> > > >> ...
>> > > >>>> I think you're missing the point. Loads at unaligned addresses may not
>> > > >>>> be allowed by hardware using conventional load instructions or may be
>> > > >>>> inefficient. Given that this memchr implementation is used as a fallback
>> > > >>>> when no hardware-specific version is available, you should be
>> > > >>>> conservative wrt. hardware capabilities and behavior. You should
>> > > >>>> probably have a pre-alignment loop.
>> > > >>>
>> > > >>> Got it. I add pre-alignment loop. It aligns the address to 8 or 4bytes.
>> > > >>
>> > > >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
>> > > >>
>> > > >> ...
>> > > >>> for (; p <= end - 8; p += 8) {
>> > > >>> val = *(u64*)p ^ mask;
>> > > >>> if ((val + 0xfefefefefefefeffull)
>> > > >>> & (~val & 0x8080808080808080ull))
>> > > >>> break;
>> > > >>
>> > > >> I would add a couple of comments, like:
>> > > >> // Convert to check for zero byte.
>> > > >> // Standard check for a zero byte in a word.
>> > > >> (But not the big 4 line explanation you had.
>> > > >>
>> > > >> It is also worth looking at how that code compiles
>> > > >> on 32bit arch that don't have a carry flag.
>> > > >> That is everything based on MIPS, including riscv.
>> > > >
>> > > > It may be worth looking at how glibc does it:
>> > > >
>> > > >
>> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
>> > > fc302;hb=refs/heads/release/2.35/master#l46
>> > > >
>> > > > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
>> > > > think memchr in the kernel should follow this.
>> > >
>> > > Also, if by chance this optimization is aimed for x86-64, it may be
>> > > worth adding an arch-specific version that uses ERMS.
>> >
>> > Don't believe everything you see in glibc.
>> > The common cases in the kernel are different from the ones they
>> > tend to optimise for..
>> >
>> > You might try using:
>> > #define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
>> > for the mask and the two constants.
>> > Then make all the variables 'long'.
>> >
>> > I'm not at all sure what the test for fast multiply is about.
>> > It may be very historic, for modern cpu generating the 64bit
>> > constant is likely to be more problematic (check arm64).
>> > If the input value is 'unsigned char' (or masked) then the
>> > compiler may decide to do the repeated <<= itself.
>> >
>> > David
>> >
>> > -
>> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> > Registration No: 1397386 (Wales)
>>
>>
>> I have rewrite my code as the following. I use several macros
>
>
>
> My gosh, have you checked what strscpy() does under the hood?
>
> Why can’t you just reuse parts of it?
>
>>
>> to
>> generate the mask and detect target char for 32-bit or 64-bit.
>> The DETECT_CHAR macro is the same as this part:
>>
>> (val + 0xfefefefefefefeffull) & (~val & 0x8080808080808080ull)
>>
>> The 0xfefefefefefefeffull is the 2's complement of 0x0101010101010101ULL.
>> I perform subtraction here exactly.
>>
>> If the CPU architecture is unable to perform unaligned access
>> efficiently, the pre-alignment loop will align the string at first.
>>
>>
>> #define LBLOCKSIZE (sizeof(long))
>> #if BITS_PER_LONG == 64
>>
>> #define DETECT_CHAR(X) \
>> (((X) - 0x0101010101010101ULL) & ~(X) & 0x8080808080808080ULL)
>>
>> #define MEMCHR_MASK_GEN(mask) \
>> do { \
>> (mask) |= (mask) << 8; \
>> (mask) |= (mask) << 16; \
>> (mask) |= (mask) << 32; \
>> } while (0)
>>
>> #else
>>
>> #define DETECT_CHAR(X) (((X)-0x01010101UL) & ~(X)&0x80808080UL)
>>
>> #define MEMCHR_MASK_GEN(mask) \
>> do { \
>> (mask) |= (mask) << 8; \
>> (mask) |= (mask) << 16; \
>> } while (0)
>>
>> #endif
>>
>> void *memchr(const void *p, int c, size_t length)
>> {
>> unsigned long mask, val;
>> const void *end = p + length;
>> c &= 0xff;
>> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> while ((long)p & (sizeof(long) - 1)) {
>> if (p >= end)
>> return NULL;
>> if (*(unsigned char *)p == c)
>> return (void *)p;
>> p++;
>> }
>> #endif
>> if (p <= end - LBLOCKSIZE) {
>> mask = c;
>> MEMCHR_MASK_GEN(mask);
>>
>> for (; p <= end - LBLOCKSIZE; p += LBLOCKSIZE) {
>> val = *(unsigned long *)p ^ mask;
>> if (DETECT_CHAR(val))
>> break;
>> }
>> }
>>
>> for (; p < end; p++)
>> if (*(unsigned char *)p == c)
>> return (void *)p;
>>
>> return NULL;
>> }
>
>
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

I reused some parts of strscpy() and use read_word_at_a_time()
to read the string and has_zero() to check whether the target char
is in the current word.

Yu-Jen Chang


#if BITS_PER_LONG == 64

#define MEMCHR_MASK_GEN(mask)			\
	do {					\
		(mask) |= (mask) << 8;		\
		(mask) |= (mask) << 16;		\
		(mask) |= (mask) << 32;		\
	} while (0)

#else

#define MEMCHR_MASK_GEN(mask)			\
	do {					\
		(mask) |= (mask) << 8;		\
		(mask) |= (mask) << 16;		\
	} while (0)

#endif

void *memchr(const void *p, int c, size_t length)
{
	unsigned long mask;
	unsigned char *src = (unsigned char *)p;
	const void *end = p + length;
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
	size_t max = length;

	c &= 0xff;

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	/*
	 * If src is unaligned, don't cross a page boundary,
	 * since we don't know if the next page is mapped.
	 */
	if ((long)p & (sizeof(long) - 1)) {
		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));

		if (limit < max)
			max = limit;
	}
#else
	/* If src is unaligned, don't do word-at-a-time. */
	if (((long)p) & (sizeof(long) - 1))
		max = 0;
#endif
	mask = c;
	MEMCHR_MASK_GEN(mask);
	for (; max >= sizeof(unsigned long);
	     max -= sizeof(unsigned long), src += sizeof(unsigned long)) {
		unsigned long data, result;

		data = read_word_at_a_time(src);
		data = data ^ mask;
		if (has_zero(data, &result, &constants))
			break;
	}

	for (; src < end; src++)
		if (*src == c)
			return (void *)src;

	return NULL;
}