2022-10-25 02:34:22

by Nathan Moinvaziri

[permalink] [raw]
Subject: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

From fcb0159ee74908f92adc34143657d8ca56e9a811 Mon Sep 17 00:00:00 2001
From: Nathan Moinvaziri <[email protected]>
Date: Mon, 24 Oct 2022 16:37:59 -0700
Subject: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if
chars match.

With strings where many characters match exactly each character is needlessly
converted to lowercase before comparing. This patch improves the comparison
by only converting to lowercase after checking that the characters don't match.

The more characters that match exactly the better performance we expect versus
the old function.

When running tests using Quick Benchmark with two matching 256 character
strings these changes result in anywhere between ~6-9x speed improvement.

* We use unsigned char instead of int similar to strncasecmp.
* We only subtract c1 - c2 when they are not equal.

Reviewed-by: Sergey Markelov <[email protected]>
Reviewed-by: Steve Tucker <[email protected]>
---
lib/string.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 3371d26a0e39..51ad56db1b5d 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -64,13 +64,20 @@ EXPORT_SYMBOL(strncasecmp);
#ifndef __HAVE_ARCH_STRCASECMP
int strcasecmp(const char *s1, const char *s2)
{
- int c1, c2;
+ /* Yes, Virginia, it had better be unsigned */
+ unsigned char c1, c2;

do {
- c1 = tolower(*s1++);
- c2 = tolower(*s2++);
- } while (c1 == c2 && c1 != 0);
- return c1 - c2;
+ c1 = *s1++;
+ c2 = *s2++;
+ if (c1 != c2) {
+ c1 = tolower(c1);
+ c2 = tolower(c2);
+ if (c1 != c2)
+ return (int)c1 - (int)c2;
+ }
+ } while (c1 != 0);
+ return 0;
}
EXPORT_SYMBOL(strcasecmp);
#endif
--
2.37.2.windows.2


2022-10-25 09:04:05

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:
>
> From fcb0159ee74908f92adc34143657d8ca56e9a811 Mon Sep 17 00:00:00 2001
> From: Nathan Moinvaziri <[email protected]>
> Date: Mon, 24 Oct 2022 16:37:59 -0700
> Subject: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if
> chars match.

Why is the above in the commit message?

> With strings where many characters match exactly each character is needlessly
> converted to lowercase before comparing. This patch improves the comparison
> by only converting to lowercase after checking that the characters don't match.
>
> The more characters that match exactly the better performance we expect versus
> the old function.
>
> When running tests using Quick Benchmark with two matching 256 character
> strings these changes result in anywhere between ~6-9x speed improvement.
>
> * We use unsigned char instead of int similar to strncasecmp.
> * We only subtract c1 - c2 when they are not equal.

Nobody can take it. Please, read Submitting Patches documentation and
fix your contribution accordingly.

...

> do {
> - c1 = tolower(*s1++);
> - c2 = tolower(*s2++);
> - } while (c1 == c2 && c1 != 0);
> - return c1 - c2;
> + c1 = *s1++;
> + c2 = *s2++;
> + if (c1 != c2) {
> + c1 = tolower(c1);
> + c2 = tolower(c2);
> + if (c1 != c2)
> + return (int)c1 - (int)c2;
> + }
> + } while (c1 != 0);

You tell us that this is more preformant, but have not provided the
numbers. Can we see those, please?

Note, that you basically trash CPU cache lines when characters are not
equal, and before doing that you have a branching. I'm unsure that
your way is more performant than the original one.

--
With Best Regards,
Andy Shevchenko

2022-10-25 09:09:39

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On Tue, Oct 25, 2022 at 11:00:36AM +0300, Andy Shevchenko wrote:
> On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:

...

> > When running tests using Quick Benchmark with two matching 256 character
> > strings these changes result in anywhere between ~6-9x speed improvement.
> >
> > * We use unsigned char instead of int similar to strncasecmp.
> > * We only subtract c1 - c2 when they are not equal.

...

> You tell us that this is more preformant, but have not provided the
> numbers. Can we see those, please?

So, I have read carefully and see the reference to some QuickBenchmark I have
no idea about. What I meant here is to have numbers provided by an (open
source) tool (maybe even in-kernel test case) that anybody can test on their
machines. You also missed details about how you run, what the data set has been
used, etc.

> Note, that you basically trash CPU cache lines when characters are not
> equal, and before doing that you have a branching. I'm unsure that
> your way is more performant than the original one.

--
With Best Regards,
Andy Shevchenko



2022-10-25 18:05:19

by Nathan Moinvaziri

[permalink] [raw]
Subject: RE: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

Hi Andy,

I appreciate your quick feedback!

I have done as you suggested and published my results this time using Google benchmark:
https://github.com/nmoinvaz/strcasecmp

After you review it, and if you still think the patch is worthwhile then I can fix the other problems you mentioned for the original patch. If you think it is not worth it, then I understand.

Thanks again,
Nathan

-----Original Message-----
From: Andy Shevchenko <[email protected]>
Sent: Tuesday, October 25, 2022 2:04 AM
To: Nathan Moinvaziri <[email protected]>
Cc: [email protected]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On Tue, Oct 25, 2022 at 11:00:36AM +0300, Andy Shevchenko wrote:
> On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:

...

> > When running tests using Quick Benchmark with two matching 256
> > character strings these changes result in anywhere between ~6-9x speed improvement.
> >
> > * We use unsigned char instead of int similar to strncasecmp.
> > * We only subtract c1 - c2 when they are not equal.

...

> You tell us that this is more preformant, but have not provided the
> numbers. Can we see those, please?

So, I have read carefully and see the reference to some QuickBenchmark I have no idea about. What I meant here is to have numbers provided by an (open
source) tool (maybe even in-kernel test case) that anybody can test on their machines. You also missed details about how you run, what the data set has been used, etc.

> Note, that you basically trash CPU cache lines when characters are not
> equal, and before doing that you have a branching. I'm unsure that
> your way is more performant than the original one.

--
With Best Regards,
Andy Shevchenko



2022-10-25 19:26:30

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On Tue, Oct 25, 2022 at 8:53 PM Nathan Moinvaziri <[email protected]> wrote:
>
> Hi Andy,
>
> I appreciate your quick feedback!
>
> I have done as you suggested and published my results this time using Google benchmark:
> https://github.com/nmoinvaz/strcasecmp

Thank you for sharing!

Looks promising, but may I suggest a few things:
1) have you considered the word-at-a-time use (like strscpy() does)?
2) instead of using tolower() on both sides, have you considered
(with the above in mind) to use XOR over words and if they are not 0,
check if the result is one of possible combinations of 0x20 and then
by excluding the non-letters from the range you may find the
difference?

So, I think it's a good exercise for the twiddling of bits.

> After you review it, and if you still think the patch is worthwhile then I can fix the other problems you mentioned for the original patch. If you think it is not worth it, then I understand.

P.S. Avoid top-posting in the Linux kernel mailing lists!

> -----Original Message-----
> From: Andy Shevchenko <[email protected]>
> Sent: Tuesday, October 25, 2022 2:04 AM
> To: Nathan Moinvaziri <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match
>
> On Tue, Oct 25, 2022 at 11:00:36AM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:
>
> ...
>
> > > When running tests using Quick Benchmark with two matching 256
> > > character strings these changes result in anywhere between ~6-9x speed improvement.
> > >
> > > * We use unsigned char instead of int similar to strncasecmp.
> > > * We only subtract c1 - c2 when they are not equal.
>
> ...
>
> > You tell us that this is more preformant, but have not provided the
> > numbers. Can we see those, please?
>
> So, I have read carefully and see the reference to some QuickBenchmark I have no idea about. What I meant here is to have numbers provided by an (open
> source) tool (maybe even in-kernel test case) that anybody can test on their machines. You also missed details about how you run, what the data set has been used, etc.
>
> > Note, that you basically trash CPU cache lines when characters are not
> > equal, and before doing that you have a branching. I'm unsure that
> > your way is more performant than the original one.


--
With Best Regards,
Andy Shevchenko

2022-10-25 19:44:51

by Christophe JAILLET

[permalink] [raw]
Subject: Re: RE: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

Le 25/10/2022 à 19:53, Nathan Moinvaziri a écrit :
> Hi Andy,
>
> I appreciate your quick feedback!
>
> I have done as you suggested and published my results this time using Google benchmark:
> https://github.com/nmoinvaz/strcasecmp

Hi,
the algorithm on github is not the same as the one posted here.

IIUC, the one on github is wrong. If you compare 2 strings that are the
same, they will have the same length, and "if (c1 == c2) continue;" will
go one past the end of the strings. And the result will be <0 or 0 or >0
depending the the char *after* the trailing \0.

On the other side, the results of the benchmark on github are likely not
accurate with the algorithm posted here, because there is one more test
in each loop ("while (c1 != 0)") as long as the 2 strings are the same.
On github this test is skipped because you will go through the "continue"

CJ

>
> After you review it, and if you still think the patch is worthwhile then I can fix the other problems you mentioned for the original patch. If you think it is not worth it, then I understand.
>
> Thanks again,
> Nathan
>
> -----Original Message-----
> From: Andy Shevchenko <[email protected]>
> Sent: Tuesday, October 25, 2022 2:04 AM
> To: Nathan Moinvaziri <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match
>
> On Tue, Oct 25, 2022 at 11:00:36AM +0300, Andy Shevchenko wrote:
>> On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:
>
> ...
>
>>> When running tests using Quick Benchmark with two matching 256
>>> character strings these changes result in anywhere between ~6-9x speed improvement.
>>>
>>> * We use unsigned char instead of int similar to strncasecmp.
>>> * We only subtract c1 - c2 when they are not equal.
>
> ...
>
>> You tell us that this is more preformant, but have not provided the
>> numbers. Can we see those, please?
>
> So, I have read carefully and see the reference to some QuickBenchmark I have no idea about. What I meant here is to have numbers provided by an (open
> source) tool (maybe even in-kernel test case) that anybody can test on their machines. You also missed details about how you run, what the data set has been used, etc.
>
>> Note, that you basically trash CPU cache lines when characters are not
>> equal, and before doing that you have a branching. I'm unsure that
>> your way is more performant than the original one.
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
>


2022-10-25 20:09:32

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On 25/10/2022 10.00, Andy Shevchenko wrote:
> On Tue, Oct 25, 2022 at 4:46 AM Nathan Moinvaziri <[email protected]> wrote:
>>
>> From fcb0159ee74908f92adc34143657d8ca56e9a811 Mon Sep 17 00:00:00 2001
>> From: Nathan Moinvaziri <[email protected]>
>> Date: Mon, 24 Oct 2022 16:37:59 -0700
>> Subject: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if
>> chars match.
>
> Why is the above in the commit message?
>
>> With strings where many characters match exactly each character is needlessly
>> converted to lowercase before comparing. This patch improves the comparison
>> by only converting to lowercase after checking that the characters don't match.
>>
>> The more characters that match exactly the better performance we expect versus
>> the old function.

> You tell us that this is more preformant, but have not provided the
> numbers. Can we see those, please?
>
> Note, that you basically trash CPU cache lines when characters are not
> equal, and before doing that you have a branching. I'm unsure that
> your way is more performant than the original one.
>

Are there any code paths in the kernel where strcasecmp performance
matters? strcmp, sure, but strcasecmp or strncasecmp? I don't think so.
If anything, we should nuke the complication in strncasecmp(), and then
make strcasecmp() simply do strncasecmp(a, b, SIZE_MAX).

Rasmus

2022-10-25 23:11:16

by Nathan Moinvaziri

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On 10/25/2022 12:55 PM, Rasmus Villemoes wrote:
> Are there any code paths in the kernel where strcasecmp performance
> matters? strcmp, sure, but strcasecmp or strncasecmp? I don't think so.
> If anything, we should nuke the complication in strncasecmp(), and then
> make strcasecmp() simply do strncasecmp(a, b, SIZE_MAX).

It looks like several of the string functions could be collapsed in that
way.

Nathan

2022-10-26 00:20:43

by Nathan Moinvaziri

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On 10/25/2022 12:32 PM, Christophe JAILLET wrote:
> Hi,
> the algorithm on github is not the same as the one posted here.
>
> IIUC, the one on github is wrong. If you compare 2 strings that are
> the same, they will have the same length, and "if (c1 == c2)
> continue;" will go one past the end of the strings. And the result
> will be <0 or 0 or >0 depending the the char *after* the trailing \0.
>
> On the other side, the results of the benchmark on github are likely
> not accurate with the algorithm posted here, because there is one more
> test in each loop ("while (c1 != 0)") as long as the 2 strings are the
> same.
> On github this test is skipped because you will go through the "continue"
>
> CJ

Hi CJ,

Thanks for catching that, I had changed it at the last second. I have
updated the code and the benchmarks to what I initially proposed in the
patch. Results are about +/-1% from previously.

Nathan

2022-10-27 03:59:00

by Nathan Moinvaziri

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On 10/25/2022 12:19 PM, Andy Shevchenko wrote:
> Looks promising, but may I suggest a few things:
> 1) have you considered the word-at-a-time use (like strscpy() does)?

Only briefly at the beginning of the function to check for an identical
comparison and the added check hurt performance for strings that were
not identical.

On 10/25/2022 12:19 PM, Andy Shevchenko wrote:

> 2) instead of using tolower() on both sides, have you considered
> (with the above in mind) to use XOR over words and if they are not 0,
> check if the result is one of possible combinations of 0x20 and then
> by excluding the non-letters from the range you may find the
> difference?

I'm not sure what you mean about the possible combinations of the space
character. I have not investigated this method.

...

According to my previous findings the check for c1 != c2 does perform
better for strings that are at least 25% or more the same. I was able to
get even more performance out of it by changing tolower() to use a
different hash table than the one used for the is*() functions. By using
a pre-generated hash table for both islower() and isupper() it is
possible to remove the branch where ever those functions are used,
including in strcasecmp. This method I've seen employed in the Android
code base and also in cURL. Using it would add additional 2x256 bytes to
the code size for the tables.

I've put together a Quick Benchmark that shows the comparison between
the different methods:

https://quick-bench.com/q/l5DkYQO-CcMxQUu5MjZiqZ8M-Y0

Nathan



2022-10-27 06:42:48

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH] lib/string.c: Improve strcasecmp speed by not lowering if chars match

On Thu, Oct 27, 2022 at 6:29 AM Nathan Moinvaziri <[email protected]> wrote:
> On 10/25/2022 12:19 PM, Andy Shevchenko wrote:
> > Looks promising, but may I suggest a few things:
> > 1) have you considered the word-at-a-time use (like strscpy() does)?
>
> Only briefly at the beginning of the function to check for an identical
> comparison and the added check hurt performance for strings that were
> not identical.
>
> On 10/25/2022 12:19 PM, Andy Shevchenko wrote:
>
> > 2) instead of using tolower() on both sides, have you considered
> > (with the above in mind) to use XOR over words and if they are not 0,
> > check if the result is one of possible combinations of 0x20 and then
> > by excluding the non-letters from the range you may find the
> > difference?
>
> I'm not sure what you mean about the possible combinations of the space
> character. I have not investigated this method.

'a' xor 'A' == 0x20 (same for all the letters.
That's why we have a specific _tolower() in vsprintf.c.

> According to my previous findings the check for c1 != c2 does perform
> better for strings that are at least 25% or more the same. I was able to
> get even more performance out of it by changing tolower() to use a
> different hash table than the one used for the is*() functions. By using
> a pre-generated hash table for both islower() and isupper() it is
> possible to remove the branch where ever those functions are used,
> including in strcasecmp. This method I've seen employed in the Android
> code base and also in cURL. Using it would add additional 2x256 bytes to
> the code size for the tables.

Rasmus raised a good question, where do we actually need the
performant strcasecmp()?

--
With Best Regards,
Andy Shevchenko