2003-08-12 14:08:09

by Yoshinori Sato

[permalink] [raw]
Subject: generic strncpy - off-by-one error

zero fill count is off-by-one error

--- lib/string.c~ 2003-08-09 20:30:36.000000000 +0900
+++ lib/string.c 2003-08-12 22:55:47.000000000 +0900
@@ -89,7 +89,8 @@

while (count && (*dest++ = *src++) != '\0')
count--;
- while (count) {
+ count--;
+ while (count > 0) {
*dest++ = 0;
count--;
}

--
Yoshinori Sato
<[email protected]>


2003-08-12 14:39:12

by Willy Tarreau

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, Aug 12, 2003 at 11:07:59PM +0900, Yoshinori Sato wrote:
> zero fill count is off-by-one error

I disagree here. With your code, if count becomes 0 within the first while(),
you set it to (unsigned)(-1) (because count is size_t), and the second loop
will add this number of zeroes after dest (4 billion on 32 bits archs).

The original code seems OK to me.

Cheers,
Willy

> --- lib/string.c~ 2003-08-09 20:30:36.000000000 +0900
> +++ lib/string.c 2003-08-12 22:55:47.000000000 +0900
> @@ -89,7 +89,8 @@
>
> while (count && (*dest++ = *src++) != '\0')
> count--;
> - while (count) {
> + count--;
> + while (count > 0) {
> *dest++ = 0;
> count--;
> }

2003-08-12 14:51:27

by Yoshinori Sato

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

I made a mistake. This is a good patch.

--- lib/string.c~ 2003-08-09 20:30:36.000000000 +0900
+++ lib/string.c 2003-08-12 23:27:08.000000000 +0900
@@ -89,7 +89,7 @@

while (count && (*dest++ = *src++) != '\0')
count--;
- while (count) {
+ while (count > 1) {
*dest++ = 0;
count--;
}

--
Yoshinori Sato
<[email protected]>

2003-08-12 15:03:42

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 12 Aug 2003 23:50:06 +0900, Yoshinori Sato <[email protected]> said:

> - while (count) {
> + while (count > 1) {

Given that count is a size_t, which seems to be derived from 'unsigned int' or
'unsigned long' on every platform, how are these any different?


Attachments:
(No filename) (226.00 B)

2003-08-12 15:52:57

by William Gallafent

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 12 Aug 2003 [email protected] wrote:

> On Tue, 12 Aug 2003 23:50:06 +0900, Yoshinori Sato
> <[email protected]> said:
> > - while (count) {
> > + while (count > 1) {
>
> Given that count is a size_t, which seems to be derived from 'unsigned int'
> or 'unsigned long' on every platform, how are these any different?

Er, consider the case of count == 1. Fenceposts can be dangerous things.

--
Bill Gallafent.

2003-08-12 16:17:03

by Anthony Dominic Truong

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 2003-08-12 at 23:54, William Gallafent wrote:

On Tue, 12 Aug 2003 [email protected] wrote:

> On Tue, 12 Aug 2003 23:50:06 +0900, Yoshinori Sato
> <[email protected]> said:
> > - while (count) {
> > + while (count > 1) {
>
> Given that count is a size_t, which seems to be derived from 'unsigned
int'
> or 'unsigned long' on every platform, how are these any different?

Er, consider the case of count == 1. Fenceposts can be dangerous things.

--
Bill Gallafent.
-


Hello,
This is the code I got from 2.4.20:
char * strncpy(char * dest,const char *src,size_t count)
{
char *tmp = dest;

while (count-- && (*dest++ = *src++) != '\0')
/* nothing */;

return tmp;
}

I don't see any problem with this code, and if we don't need to NULL-pad
the dest string, we do not have to. It is not in the definition of
strncpy(). So we don't need the second while {};
I'm hoping we're looking at the same thing.

Regards,
Anthony Dominic Truong.




Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation. This transmission is intended only for the use of the
addressee(s) listed above. Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.


2003-08-12 16:19:58

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 12 Aug 2003 16:54:05 BST, William Gallafent said:

> Er, consider the case of count == 1. Fenceposts can be dangerous things.

Amen, given the original code and the first attempt to fix it. I was actually
trolling for "accidentally correct" versus "intentionally correct".

/Valdis (who has posted "convince me the code is right THIS time" waay too
many times on waaay too many lists already today - including one count-the-on-bits
scheme that only worked for an all-zeros input.. ;)


Attachments:
(No filename) (226.00 B)

2003-08-12 16:09:41

by Andreas Schwab

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

[email protected] writes:

|> On Tue, 12 Aug 2003 23:50:06 +0900, Yoshinori Sato <[email protected]> said:
|>
|> > - while (count) {
|> > + while (count > 1) {
|>
|> Given that count is a size_t, which seems to be derived from 'unsigned int' or
|> 'unsigned long' on every platform, how are these any different?

Let's suppose count == 1.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 N?rnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."


Attachments:
(No filename) (188.00 B)

2003-08-12 16:24:48

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

>I don't see any problem with this code, and if we don't need to
>NULL-pad the dest string, we do not have to. It is not in the
>definition of strncpy().

Wrong: http://www.opengroup.org/onlinepubs/007908799/xsh/strncpy.html

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2003-08-12 16:44:50

by Anthony Dominic Truong

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Wed, 2003-08-13 at 00:24, Bernd Petrovitsch wrote:

>I don't see any problem with this code, and if we don't need to
>NULL-pad the dest string, we do not have to. It is not in the
>definition of strncpy().

Wrong: http://www.opengroup.org/onlinepubs/007908799/xsh/strncpy.html

Bernd

Hello,
Thanks for pointing that out to me. However, I don't think the kernel
strncpy implementation is exactly the same as that of Standard C lib
implementation. Iwas just looking at it from the kernel code context.
There's a point in doing it the "kernel" way, to save precious CPU
cycles from being wasted otherwise.

Regards,
Anthony Dominic Truong.



Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation. This transmission is intended only for the use of the
addressee(s) listed above. Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.


2003-08-12 17:18:24

by Alan

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Maw, 2003-08-12 at 02:56, Anthony Truong wrote:
> Thanks for pointing that out to me. However, I don't think the kernel
> strncpy implementation is exactly the same as that of Standard C lib

It is true it doesnt need to be

> implementation. Iwas just looking at it from the kernel code context.
> There's a point in doing it the "kernel" way, to save precious CPU
> cycles from being wasted otherwise.

CPU cycles, got lots of those 8). If its going to do anything it might
be to reference an extra cache line. For people who dont need padding
2.6 has strlcpy. Lots of drivers assume strncpy fills the entire block

2003-08-12 21:54:17

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 2003-08-12 at 19:14, Alan Cox wrote:
> On Maw, 2003-08-12 at 02:56, Anthony Truong wrote:
> > Thanks for pointing that out to me. However, I don't think the kernel
> > strncpy implementation is exactly the same as that of Standard C lib
>
> It is true it doesnt need to be

Nevertheless it is defined the inefficient way by the C-Standard and
lots of people actually really know this.
So IMHO (re)implementing a function called "strncpy" differently doesn't
make much sense.

> > implementation. Iwas just looking at it from the kernel code context.

The kernel is implemented in C. And the function strncpy() _is_ already
defined by C. So please use at least another name. Given the existence
of strlcpy() - http://www.courtesan.com/todd/papers/strlcpy.html- this
problem already has been solved.

> > There's a point in doing it the "kernel" way, to save precious CPU
> > cycles from being wasted otherwise.
>
> CPU cycles, got lots of those 8). If its going to do anything it might
> be to reference an extra cache line. For people who dont need padding
> 2.6 has strlcpy. Lots of drivers assume strncpy fills the entire block

ACK. To use strlcpy() is correct solution (if padding is not required).

Bernd

2003-08-13 02:28:55

by Albert Cahalan

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

You're all wrong. This is some kind of programming
test for sure!

Let us imagine that glibc has a correct version.
By exhaustive testing, I found a version that works.
Here it is, along with the test code:

//////////////////////////////////////////////////////
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

// first correct implementation!
char * strncpy_good(char *dest, const char *src, size_t count){
char *tmp = dest;
memset(dest,'\0',count);
while (count-- && (*tmp++ = *src++))
;
return dest;
}

static char ref[32];
static char hmm[32];

static char s0[] = "";
static char s1[] = "1";
static char s2[] = "12";
static char s6[] = "123456";
static char s7[] = "1234567";
static char s8[] = "12345678";
static char s9[] = "123456789";

static void tester(const char *src, size_t count){
memset(ref,'%',sizeof ref);
memset(hmm,'%',sizeof hmm);
strncpy (ref+2,src,count);
strncpy_good(hmm+2,src,count);
if(memcmp(ref,hmm,sizeof hmm)){
printf("error @ size %d\n",(int)count);
}
}

int main(int argc, char *argv[]){
int i = 15;
while(i--){
tester(s0,i);
tester(s1,i);
tester(s2,i);
tester(s6,i);
tester(s7,i);
tester(s8,i);
tester(s9,i);
}
return 0;
}
////////////////////////////////////////////////////////////


2003-08-13 02:47:56

by Erik Andersen

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue Aug 12, 2003 at 10:18:21PM -0400, Albert Cahalan wrote:
> You're all wrong. This is some kind of programming
> test for sure!
>
> Let us imagine that glibc has a correct version.
> By exhaustive testing, I found a version that works.
> Here it is, along with the test code:
>
> //////////////////////////////////////////////////////
> #define _GNU_SOURCE
> #include <string.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> // first correct implementation!
> char * strncpy_good(char *dest, const char *src, size_t count){
> char *tmp = dest;
> memset(dest,'\0',count);
> while (count-- && (*tmp++ = *src++))
> ;
> return dest;
> }

char *strncpy(char * s1, const char * s2, size_t n)
{
register char *s = s1;
while (n) {
if ((*s = *s2) != 0) s2++;
++s;
--n;
}
return s1;
}

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2003-08-13 03:49:10

by Albert Cahalan

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, 2003-08-12 at 22:47, Erik Andersen wrote:
> On Tue Aug 12, 2003 at 10:18:21PM -0400, Albert Cahalan wrote:
> > You're all wrong. This is some kind of programming
> > test for sure!
> >
> > Let us imagine that glibc has a correct version.
> > By exhaustive testing, I found a version that works.
> > Here it is, along with the test code:
> >
> > //////////////////////////////////////////////////////
> > #define _GNU_SOURCE
> > #include <string.h>
> > #include <stdlib.h>
> > #include <stdio.h>
> >
> > // first correct implementation!
> > char * strncpy_good(char *dest, const char *src, size_t count){
> > char *tmp = dest;
> > memset(dest,'\0',count);
> > while (count-- && (*tmp++ = *src++))
> > ;
> > return dest;
> > }
>
> char *strncpy(char * s1, const char * s2, size_t n)
> {
> register char *s = s1;
> while (n) {
> if ((*s = *s2) != 0) s2++;
> ++s;
> --n;
> }
> return s1;
> }

That's excellent. On ppc I count 12 instructions,
4 of which would go away for typical usage if inlined.
Annoyingly, gcc doesn't get the same assembly from my
attempt at that general idea:

char * strncpy_5(char *dest, const char *src, size_t count){
char *tmp = dest;
while (count--){
if(( *tmp++ = *src )) src++;
}
return dest;
}

I suppose that gcc could use a bug report.


2003-08-13 03:57:54

by Nick Piggin

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Albert Cahalan wrote:

>On Tue, 2003-08-12 at 22:47, Erik Andersen wrote:
>
>>On Tue Aug 12, 2003 at 10:18:21PM -0400, Albert Cahalan wrote:
>>
>>>You're all wrong. This is some kind of programming
>>>test for sure!
>>>
>>>Let us imagine that glibc has a correct version.
>>>By exhaustive testing, I found a version that works.
>>>Here it is, along with the test code:
>>>
>>>//////////////////////////////////////////////////////
>>>#define _GNU_SOURCE
>>>#include <string.h>
>>>#include <stdlib.h>
>>>#include <stdio.h>
>>>
>>>// first correct implementation!
>>>char * strncpy_good(char *dest, const char *src, size_t count){
>>> char *tmp = dest;
>>> memset(dest,'\0',count);
>>> while (count-- && (*tmp++ = *src++))
>>> ;
>>> return dest;
>>>}
>>>
>>char *strncpy(char * s1, const char * s2, size_t n)
>>{
>> register char *s = s1;
>> while (n) {
>> if ((*s = *s2) != 0) s2++;
>> ++s;
>> --n;
>> }
>> return s1;
>>}
>>
>
>That's excellent. On ppc I count 12 instructions,
>4 of which would go away for typical usage if inlined.
>Annoyingly, gcc doesn't get the same assembly from my
>attempt at that general idea:
>
>char * strncpy_5(char *dest, const char *src, size_t count){
> char *tmp = dest;
> while (count--){
> if(( *tmp++ = *src )) src++;
> }
> return dest;
>}
>
>I suppose that gcc could use a bug report.
>

Its has different semantics though. Well taken as a whole they
are the same. When your loop finishes, count will be -1.


2003-08-13 05:19:08

by Willy Tarreau

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Tue, Aug 12, 2003 at 11:38:31PM -0400, Albert Cahalan wrote:
> That's excellent. On ppc I count 12 instructions,
> 4 of which would go away for typical usage if inlined.
> Annoyingly, gcc doesn't get the same assembly from my
> attempt at that general idea:
>
> char * strncpy_5(char *dest, const char *src, size_t count){
> char *tmp = dest;
> while (count--){
> if(( *tmp++ = *src )) src++;
> }
> return dest;
> }
>
> I suppose that gcc could use a bug report.

I often noticed that using '++' and '--' within or just before assignments
and/or comparisons often break the code and make it suboptimal. C provides
enough flexibility to code what you think nearly at the instruction level.
Since 'while' loops often start with a jump to the end, you can sometimes help
the compiler by enclosing them within an 'if' statement such as below. BTW, in
your case, count ends with -1.

I've absolutely not tried this one, but it could produce different code on your
PPC, and can trivially be derived to cleaner constructs. I proceeded the same
way when I wrote my own optimized strlcpy() implementation which is 45 bytes
long and copies 1 char per CPU cycle on i686.

char *strncpy(char *dest, const char *src, size_t count)
{
if (count) {
char *tmp = dest;
while (1) {
*tmp = *src;
if (*src) src++;
tmp++;
if (!count--) break;
}
}
return dest;
}

Cheers,
Willy

2003-08-13 17:57:57

by Anthony Dominic Truong

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Wed, 2003-08-13 at 13:18, Willy Tarreau wrote:

On Tue, Aug 12, 2003 at 11:38:31PM -0400, Albert Cahalan wrote:
> That's excellent. On ppc I count 12 instructions,
> 4 of which would go away for typical usage if inlined.
> Annoyingly, gcc doesn't get the same assembly from my
> attempt at that general idea:
>
> char * strncpy_5(char *dest, const char *src, size_t count){
> char *tmp = dest;
> while (count--){
> if(( *tmp++ = *src )) src++;
> }
> return dest;
> }
>
> I suppose that gcc could use a bug report.

I often noticed that using '++' and '--' within or just before
assignments
and/or comparisons often break the code and make it suboptimal. C
provides
enough flexibility to code what you think nearly at the instruction
level.
Since 'while' loops often start with a jump to the end, you can
sometimes help
the compiler by enclosing them within an 'if' statement such as below.
BTW, in
your case, count ends with -1.

I've absolutely not tried this one, but it could produce different code
on your
PPC, and can trivially be derived to cleaner constructs. I proceeded the
same
way when I wrote my own optimized strlcpy() implementation which is 45
bytes
long and copies 1 char per CPU cycle on i686.

char *strncpy(char *dest, const char *src, size_t count)
{
if (count) {
char *tmp = dest;
while (1) {
*tmp = *src;
if (*src) src++;
tmp++;
if (!count--) break;
}
}
return dest;
}

Cheers,
Willy


Hello,
This reminds me of my days in school learning how to program in C/C++.
I first learnt that in
while (count-- && (*dest++ = *src++));
the first operand of && will be evaluated first, and then the second
iff the first is evaluated to TRUE.
count-- : count will be checked to see if it is not 0 first, and if it
is not, it will get decremented. If it is, the while loop is ended.
Therefore, how could count go to -1?
If there is a bug in the gcc, then should we fix it rather than messing
with our coding trying to please it?
We don't have to do if (count) before the loop because a count of 0 is
already caught in while (count-- ......).

Again, maybe what I learnt in school about C/C++ programming is all
wrong.:-)

:-)
Anthony Dominic Truong.




Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation. This transmission is intended only for the use of the
addressee(s) listed above. Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.


2003-08-13 18:50:10

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Erik Andersen wrote:

> char *strncpy(char * s1, const char * s2, size_t n)
> {
> register char *s = s1;
> while (n) {
> if ((*s = *s2) != 0) s2++;
> ++s;
> --n;
> }
> return s1;
> }



Nice!



How about this:


char *strncpy(char * s1, const char * s2, size_t n)
{
register char *s = s1;

while (n && *s2) {
n--;
*s++ = *s2++;
}
while (n--) {
*s++ = 0;
}
return s1;
}



This reminds me a lot of the ORIGINAL, although I didn't pay much
attention to it at the time, so I don't remember. It may be that the
original had "n--" in the while () condition of the first loop, rather
than inside the loop.

I THINK the original complaint was that n would be off by 1 upon exiting
the first loop. The fix is to only decrement n when n is nonzero.

If s2 is short enough, then we'll exit the first loop on the nul byte
and fill in the rest in the second loop. Since n is only decremented
with we actually write to s, we will only ever write n bytes. No
off-by-one.

If s2 is too long, the first loop will exit on n being zero, and since
it doesn't get decremented in that case, it'll be zero upon entering the
second loop, thus bypassing it properly.

Erik's code is actually quite elegant, and its efficiency is probably
essentially the same as my first loop. But my second loop would
probably be faster at doing the zero fill.


Now, consider this for the second loop!

while (n&3) {
*s++ = 0;
n--;
}
l = n>>2;
while (l--) {
*((int *)s)++ = 0;
}
n &= 3;
while (n--) {
*s++ = 0;
}


This is only a win for relatively long nul padding. How often is the
padding long enough?

2003-08-14 09:39:50

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

[ Trimmed the recipient list. ]

> -----Original Message-----
> From: Timothy Miller [mailto:[email protected]]
> Sent: Wednesday, August 13, 2003 21:04
> To: [email protected]
> Cc: Albert Cahalan; linux-kernel mailing list;
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: generic strncpy - off-by-one error
>
> Erik Andersen wrote:
>
> > char *strncpy(char * s1, const char * s2, size_t n)
> > {
> > register char *s = s1;
> > while (n) {
> > if ((*s = *s2) != 0) s2++;
> > ++s;
> > --n;
> > }
> > return s1;
> > }
>
>
>
> Nice!

I can but agree.

> How about this:
>
>
> char *strncpy(char * s1, const char * s2, size_t n)
> {
> register char *s = s1;
>
> while (n && *s2) {
> n--;
> *s++ = *s2++;
> }
> while (n--) {
> *s++ = 0;
> }
> return s1;
> }

This may be improved further:

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;

while (count) {
if (*src == '\0')
break;

*tmp++ = *src++;
count--;
}

while (count) {
*tmp++ = '\0';
count--;
}

return dest;
}

Moving the check for *src == '\0' into the first loop seems
to let the compiler reuse the object code a little better
(may depend on the optimizer). Also note that your version
of the second loop needs an explicit comparison with -1,
whereas mine uses an implicit comparison with 0.

Testing on the CRIS architecture, your version is 24 instructions,
whereas mine is 18. For comparison, Eric's one is 12 and the
currently used implementation is 26 (when corrected for the
off-by-one error by comparing with > 1 rather than != 0 in the
second loop).

Here is another version that I think is quite pleasing
aesthetically (YMMV) since the loops are so similar (it is 21
instructions on CRIS):

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;

for (; count && *src; count--)
*tmp++ = *src++;

for (; count; count--)
*tmp++ = '\0';

return dest;
}

This is probably the version I would choose if I were to decide
as the object code generated for the actual loops are optimal in
this version (at least for CRIS).

> This reminds me a lot of the ORIGINAL, although I didn't pay much
> attention to it at the time, so I don't remember. It may be that
> the original had "n--" in the while () condition of the first
> loop, rather than inside the loop.
>
> I THINK the original complaint was that n would be off by 1
> upon exiting the first loop. The fix is to only decrement n
> when n is nonzero.
>
> If s2 is short enough, then we'll exit the first loop on the nul byte
> and fill in the rest in the second loop. Since n is only decremented
> with we actually write to s, we will only ever write n bytes. No
> off-by-one.
>
> If s2 is too long, the first loop will exit on n being zero,
> and since it doesn't get decremented in that case, it'll be
> zero upon entering the second loop, thus bypassing it properly.
>
> Erik's code is actually quite elegant, and its efficiency is probably
> essentially the same as my first loop. But my second loop would
> probably be faster at doing the zero fill.
>
>
> Now, consider this for the second loop!
>
> while (n&3) {

I think sizeof(int)-1 would be better than 3. ;)
And long would probably be better for the 64-bit architectures?

> *s++ = 0;
> n--;
> }
> l = n>>2;
> while (l--) {
> *((int *)s)++ = 0;
> }
> n &= 3;
> while (n--) {
> *s++ = 0;
> }
>
> This is only a win for relatively long nul padding. How often is the
> padding long enough?

I guess another way would be to replace the second loop with
memset(s, '\0', n), but that would probably only be a win for
quite long paddings.

//Peter

2003-08-14 19:45:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Thu, Aug 14, 2003 at 11:34:50AM +0200, Peter Kjellerstedt wrote:

> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> while (count) {
> if (*src == '\0')
> break;
>
> *tmp++ = *src++;
> count--;
> }
>
> while (count) {
> *tmp++ = '\0';
> count--;
> }
>
> return dest;
> }
>
> Moving the check for *src == '\0' into the first loop seems
> to let the compiler reuse the object code a little better
> (may depend on the optimizer). Also note that your version
> of the second loop needs an explicit comparison with -1,
> whereas mine uses an implicit comparison with 0.

Well, if you're at this level of comparison, then I may object that
'count' is evaluated twice when jumping from one loop to the other one.

*Algorithmically* speaking, the most optimal code should be :

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;
if (unlikely(!count))
goto end;
loop1:
if ((*tmp = *src++) == '\0')
goto next;
tmp++;
if (likely(--count))
goto loop1;
else
goto end;
loop2:
*tmp = '\0';
next:
tmp++;
if (likely(--count))
goto loop2;
end:
return dest;
}

(sorry for those who hate gotos). Look at it : no test is performed twice, no
byte is read nor written twice in any case. The assembly code produced is
optimal on x86 (well, in fact we could delete exactly one conditionnal jump
because the compiler doesn't know how to reuse them for several tests). 16
inlinable instructions (= excluding function in/out) which can be executed 2 at
a time if your CPU has 2 pipelines. about 3 cycles/copied byte, 2 cycles/filled
byte.

Unfortunately, it fools branch predictors and prefetch mechanisms found in
today's processors, so it results in slower code than yours (at least on
athlon and P3). Perhaps it would be fast on older processors, I don't know.

All that demonstrates that whatever your efforts in helping the optimizer, the
only meaningful result is the benchmark. Number of bytes and speculations on
the reusability of information between lines of codes are not what makes our
programs fast anymore :-(

> Testing on the CRIS architecture, your version is 24 instructions,
> whereas mine is 18. For comparison, Eric's one is 12 and the
> currently used implementation is 26 (when corrected for the
> off-by-one error by comparing with > 1 rather than != 0 in the
> second loop).

Just out of curiosity, can the CRIS architecture execute multiple instructions
per cycle ? have you tried to determine the dependencies between the
instructions ? Did you time the different functions ? (yours is rather fast on
x86 BTW).

To conclude, I would say that if we want to implement really fast low-level
primitives such as str*, mem*, ... there's nothing better than assembly
associated with benchmarks on different CPU architectures and models.

But I don't know if strncpy() is used enough to justify that...

Regards,
Willy

2003-08-14 20:09:46

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Peter Kjellerstedt wrote:

>>
>>Nice!
>
>
> I can but agree.
>
>
>>How about this:
>>
>>
>>char *strncpy(char * s1, const char * s2, size_t n)
>>{
>> register char *s = s1;
>>
>> while (n && *s2) {
>> n--;
>> *s++ = *s2++;
>> }
>> while (n--) {
>> *s++ = 0;
>> }
>> return s1;
>>}
>
>
> This may be improved further:
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> while (count) {
> if (*src == '\0')
> break;
>
> *tmp++ = *src++;
> count--;
> }
>
> while (count) {
> *tmp++ = '\0';
> count--;
> }
>
> return dest;
> }
>
> Moving the check for *src == '\0' into the first loop seems
> to let the compiler reuse the object code a little better
> (may depend on the optimizer).

While I can understand that certain architectures may benefit from that
alteration, I am curious as to what SPECIFICALLY it is doing that is
different. How do they differ?

> Also note that your version
> of the second loop needs an explicit comparison with -1,
> whereas mine uses an implicit comparison with 0.

I don't understand why you say I need an explicit comparison with -1.
My first loop exits either with the number of bytes remaining in the
buffer or with zero if it's copied count number of bytes.

The second loop WOULD require a comparison with -1 IF the "count--" were
not inside of the loop body. As it IS in the loop body, there is no
need for that. My second loop has an implicit comparison against zero.

> Testing on the CRIS architecture, your version is 24 instructions,
> whereas mine is 18. For comparison, Eric's one is 12 and the
> currently used implementation is 26 (when corrected for the
> off-by-one error by comparing with > 1 rather than != 0 in the
> second loop).

If I understand it correctly, the corrected original needs that explicit
comparison because the decrement is in the loop conditional. My
implementation (and yours) corrects this by moving the decrement into
the body of the loop.

Also, while instruction count is sometimes a good indication of
algorithm efficiency, I would argue that our two-loop version is
probably the same speed as the single-loop version for copying
characters, but ours is faster for zeroing the rest of the target buffer.

>
> Here is another version that I think is quite pleasing
> aesthetically (YMMV) since the loops are so similar (it is 21
> instructions on CRIS):
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> for (; count && *src; count--)
> *tmp++ = *src++;
>
> for (; count; count--)
> *tmp++ = '\0';
>
> return dest;
> }

I agree that this is definately a more elegant look to the code, and I
would prefer what you have done here. But what puzzles me is that this
is functionally and logically equivalent to my code.

So, this code:

for (A; B; C) {}

is the same as this:

A;
while (B) {
...
C;
}


So why is it that this mere syntactic difference causes the compiler to
produce a better result?


> This is probably the version I would choose if I were to decide
> as the object code generated for the actual loops are optimal in
> this version (at least for CRIS).

Sounds wise to me.

>
>>This reminds me a lot of the ORIGINAL, although I didn't pay much
>>attention to it at the time, so I don't remember. It may be that
>>the original had "n--" in the while () condition of the first
>>loop, rather than inside the loop.
>>
>>I THINK the original complaint was that n would be off by 1
>>upon exiting the first loop. The fix is to only decrement n
>>when n is nonzero.
>>
>>If s2 is short enough, then we'll exit the first loop on the nul byte
>>and fill in the rest in the second loop. Since n is only decremented
>>with we actually write to s, we will only ever write n bytes. No
>>off-by-one.
>>
>>If s2 is too long, the first loop will exit on n being zero,
>>and since it doesn't get decremented in that case, it'll be
>>zero upon entering the second loop, thus bypassing it properly.
>>
>>Erik's code is actually quite elegant, and its efficiency is probably
>>essentially the same as my first loop. But my second loop would
>>probably be faster at doing the zero fill.
>>
>>
>>Now, consider this for the second loop!
>>
>> while (n&3) {
>
>
> I think sizeof(int)-1 would be better than 3. ;)
> And long would probably be better for the 64-bit architectures?

Yeah, I know. I deal with 32-bit vs 64-bit all the time. I was just
using this as an example and leaving it as an exercise for the reader to
infer the 64-bit case.

Also, this approach would be good for some architectures (x86, PPC,
etc.), but may be either an incredible improvement or no help at all for
some architectures which do weird things with addressing (like alpha
with its sparse vs. dense addressing).

Also, some compilers may not migrate operations around so intelligently,
so it might help, for instance, to move the "n &= 3;" to between the "l
= n>>2;" and the middle loop, because it gives the CPU time to compute l
for the middle loop while computing n.


>
>> *s++ = 0;
>> n--;
>> }
>> l = n>>2;
>> while (l--) {
>> *((int *)s)++ = 0;
>> }
>> n &= 3;
>> while (n--) {
>> *s++ = 0;
>> }
>>
>>This is only a win for relatively long nul padding. How often is the
>>padding long enough?
>
>
> I guess another way would be to replace the second loop with
> memset(s, '\0', n), but that would probably only be a win for
> quite long paddings.


That depends entirely on how memset is written. What IS in memset?

If it's inlined, and it's very well optimized, then it's probably a
great win.

2003-08-15 10:00:03

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Willy Tarreau [mailto:[email protected]]
> Sent: Thursday, August 14, 2003 21:45
> To: Peter Kjellerstedt
> Cc: 'Timothy Miller'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> On Thu, Aug 14, 2003 at 11:34:50AM +0200, Peter Kjellerstedt wrote:
>
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count) {
> > if (*src == '\0')
> > break;
> >
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > return dest;
> > }
> >
> > Moving the check for *src == '\0' into the first loop seems
> > to let the compiler reuse the object code a little better
> > (may depend on the optimizer). Also note that your version
> > of the second loop needs an explicit comparison with -1,
> > whereas mine uses an implicit comparison with 0.
>
> Well, if you're at this level of comparison, then I may object that
> 'count' is evaluated twice when jumping from one loop to the
> other one.
>
> *Algorithmically* speaking, the most optimal code should be :
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
> if (unlikely(!count))
> goto end;
> loop1:
> if ((*tmp = *src++) == '\0')
> goto next;
> tmp++;
> if (likely(--count))
> goto loop1;
> else
> goto end;
> loop2:
> *tmp = '\0';
> next:
> tmp++;
> if (likely(--count))
> goto loop2;
> end:
> return dest;
> }

I hope we do not opt to go with this version. :)

> (sorry for those who hate gotos). Look at it : no test is
> performed twice, no byte is read nor written twice in any
> case. The assembly code produced is optimal on x86 (well,
> in fact we could delete exactly one conditionnal jump because
> the compiler doesn't know how to reuse them for several tests).
> 16 inlinable instructions (= excluding function in/out) which
> can be executed 2 at a time if your CPU has 2 pipelines. about
> 3 cycles/copied byte, 2 cycles/filled byte.

It does not give optimal code for CRIS for the second loop.
It can easily be fixed by combining loop2 and next (which
will cause tmp to be assigned twice when changing loop),
but I know this was not the intent of this exercise.

> Unfortunately, it fools branch predictors and prefetch
> mechanisms found in today's processors, so it results in
> slower code than yours (at least on athlon and P3). Perhaps
> it would be fast on older processors, I don't know.
>
> All that demonstrates that whatever your efforts in helping
> the optimizer, the only meaningful result is the benchmark.
> Number of bytes and speculations on the reusability of
> information between lines of codes are not what makes our
> programs fast anymore :-(

It was easier some years ago, wasn't it? ;)

> > Testing on the CRIS architecture, your version is 24 instructions,
> > whereas mine is 18. For comparison, Eric's one is 12 and the
> > currently used implementation is 26 (when corrected for the
> > off-by-one error by comparing with > 1 rather than != 0 in the
> > second loop).
>
> Just out of curiosity, can the CRIS architecture execute
> multiple instructions per cycle ? have you tried to determine

No it cannot. :( Our ETRAX 100LX processor which uses the
CRIS architecture is a 100 MIPS processor without most of
today's fancy processor optimizations like multiple instructions,
branch prediction etc. It is designed for embedded products.

> the dependencies between the instructions ? Did you time the
> different functions ? (yours is rather fast on x86 BTW).

No I didn't, but I have done it now.

I made a program that called the respective strncpy()
implementation a number of times copying a 100 byte string
into a 1000 byte buffer. For each function I tested with
copying 50, 100, 200 and 1000 bytes. I have attached the
source to this mail if anyone wants to make sure I did not
screw up the benchmarking too much... ;)

This table shows the times in seconds to call each function
500,000 times for CRIS:

50 100 200 1000
Original 3.123906 6.138616 9.146204 33.323051
Eric's 2.602632 5.105367 9.637701 45.910798
Timothy's 3.125003 6.128571 9.147905 33.325337
My first 2.868396 5.626149 8.138134 28.276760
My second (for) 2.622412 5.129988 7.636364 27.786745
Algorithmic 2.597808 5.119332 7.627300 27.785999

Going by these numbers, Eric's version is a good candidate
for the copying phase, but loses out badly when it comes to
the filling. The best versions here seem to be my version
with the for loops and your algorithmic version which both
give almost identical numbers (but look below for more on
the algorithmic version).

This table shows the times in seconds to call each function
20,000,000 times for my P4 2.53GHz:

50 100 200 1000
Original 2.900444 5.475311 9.095701 37.462796
Eric's 2.988404 5.637352 9.576857 37.580639
Timothy's 2.929508 5.534824 9.157147 36.836899
My first 2.951068 5.511169 8.321381 29.669017
My second (for) 2.921675 5.531486 8.296728 28.940855
Algorithmic 3.911937 7.343235 12.699275 54.288284

The interesting thing here is that the original code wins
the copying phase. But the numbers for the copying phase for
the first five versions are so similar, that I doubt one can
say one is better than the other. However, when it comes to
the filling phase my versions are back in the lead. These
numbers also shows very clearly what you stated above about
branch predictions, as the algorithmic version loses out
badly here.

All these numbers taken together, I would say that my version
with the for loops seems to be the overall winner. :)

> To conclude, I would say that if we want to implement really
> fast low-level primitives such as str*, mem*, ... there's
> nothing better than assembly associated with benchmarks on
> different CPU architectures and models.
>
> But I don't know if strncpy() is used enough to justify that...

I agree

> Regards,
> Willy

//Peter


Attachments:
testa.c (4.84 kB)

2003-08-15 09:58:58

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Timothy Miller [mailto:[email protected]]
> Sent: Thursday, August 14, 2003 22:24
> To: Peter Kjellerstedt
> Cc: linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
>
> >>
> >>Nice!
> >
> >
> > I can but agree.
> >
> >
> >>How about this:
> >>
> >>
> >>char *strncpy(char * s1, const char * s2, size_t n)
> >>{
> >> register char *s = s1;
> >>
> >> while (n && *s2) {
> >> n--;
> >> *s++ = *s2++;
> >> }
> >> while (n--) {
> >> *s++ = 0;
> >> }
> >> return s1;
> >>}
> >
> >
> > This may be improved further:
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count) {
> > if (*src == '\0')
> > break;
> >
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > return dest;
> > }
> >
> > Moving the check for *src == '\0' into the first loop seems
> > to let the compiler reuse the object code a little better
> > (may depend on the optimizer).
>
> While I can understand that certain architectures may benefit
> from that alteration, I am curious as to what SPECIFICALLY it
> is doing that is different. How do they differ?

I do not know if this will give you anything, but here is
the disassembled CRIS version of your first while loop
(note that the branch instructions have one delay slot,
and that dest, src and count are in $r10, $r11 and $r12):

while (count && *src)
805d8: 6cc6 test.d $r12
805da: 1830 beq 805f4
805dc: 6a96 move.d $r10,$r9
805de: 8b0b test.b [$r11]
805e0: 1230 beq 805f4
805e2: 0f05 nop
{
count--;
805e4: 81c2 subq 1,$r12
*tmp++ = *src++;
805e6: 4bde move.b [$r11+],$r13
805e8: 6cc6 test.d $r12
805ea: 0830 beq 805f4
805ec: c9df move.b $r13,[$r9+]
805ee: 8b0b test.b [$r11]
805f0: f320 bne 805e4
805f2: 0f05 nop
}

And here is my first while loop:

while (count)
8062c: 6cc6 test.d $r12
8062e: 1030 beq 80640
80630: 6a96 move.d $r10,$r9
{
count--;
if (!(*tmp++ = *src++))
80632: 4bde move.b [$r11+],$r13
80634: c9db move.b $r13,[$r9]
80636: 890f test.b [$r9+]
80638: 0630 beq 80640
8063a: 81c2 subq 1,$r12
break;
8063c: f520 bne 80632
8063e: 0f05 nop
}

> > Also note that your version
> > of the second loop needs an explicit comparison with -1,
> > whereas mine uses an implicit comparison with 0.
>
> I don't understand why you say I need an explicit comparison
> with -1. My first loop exits either with the number of bytes
> remaining in the buffer or with zero if it's copied count
> number of bytes.

I was talking about the second loop. The object code the compiler
produces for your version actually tests the count variable after
it decreases it, which is why it tests for -1.

> The second loop WOULD require a comparison with -1 IF the
> "count--" were not inside of the loop body. As it IS in the
> loop body, there is no need for that. My second loop has an
> implicit comparison against zero.

Hmm, your second loop from above looks like:

while (n--) {
*s++ = 0;
}

whereas mine looks like:

while (count) {
*tmp++ = '\0';
count--;
}

You seem to be referring to my version, where what you say is true.

> > Testing on the CRIS architecture, your version is 24 instructions,
> > whereas mine is 18. For comparison, Eric's one is 12 and the
> > currently used implementation is 26 (when corrected for the
> > off-by-one error by comparing with > 1 rather than != 0 in the
> > second loop).
>
> If I understand it correctly, the corrected original needs
> that explicit comparison because the decrement is in the loop
> conditional. My implementation (and yours) corrects this by
> moving the decrement into the body of the loop.
>
> Also, while instruction count is sometimes a good indication of
> algorithm efficiency, I would argue that our two-loop version is
> probably the same speed as the single-loop version for copying
> characters, but ours is faster for zeroing the rest of the
> target buffer.

Seems to be the case, yes.

> > Here is another version that I think is quite pleasing
> > aesthetically (YMMV) since the loops are so similar (it is 21
> > instructions on CRIS):
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > for (; count && *src; count--)
> > *tmp++ = *src++;
> >
> > for (; count; count--)
> > *tmp++ = '\0';
> >
> > return dest;
> > }
>
> I agree that this is definately a more elegant look to the
> code, and I would prefer what you have done here. But what
> puzzles me is that this is functionally and logically
> equivalent to my code.
>
> So, this code:
>
> for (A; B; C) {}
>
> is the same as this:
>
> A;
> while (B) {
> ...
> C;
> }
>
>
> So why is it that this mere syntactic difference causes the
> compiler to produce a better result?

I wish I new. Actually in the CRIS case, it seems to be an
optimizer thing. If I change your first loop from

while (n && *s2) {
n--;
*s++ = *s2++;
}

to

while (n && *s2) {
*s++ = *s2++;
n--;
}

it gives the expected object code, i.e., the same as my
first for loop. So here is a modified version of your
code that gives exactly the same object code (for CRIS)
as my version with the for loops:

char *strncpy(char * s1, const char * s2, size_t n)
{
register char *s = s1;

while (n && *s2) {
*s++ = *s2++;
n--;
}
while (n) {
*s++ = 0;
n--;
}
return s1;
}

//Peter

2003-08-15 17:32:28

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Peter Kjellerstedt wrote:

>>While I can understand that certain architectures may benefit
>>from that alteration, I am curious as to what SPECIFICALLY it
>>is doing that is different. How do they differ?
>
>
> I do not know if this will give you anything, but here is
> the disassembled CRIS version of your first while loop
> (note that the branch instructions have one delay slot,
> and that dest, src and count are in $r10, $r11 and $r12):
>
> while (count && *src)
> 805d8: 6cc6 test.d $r12
> 805da: 1830 beq 805f4
> 805dc: 6a96 move.d $r10,$r9

Ok, test count and exit if count is zero.

I take it that this move is the copy of the dst to a temporary variable?

> 805de: 8b0b test.b [$r11]
> 805e0: 1230 beq 805f4
> 805e2: 0f05 nop

Test *s and exit if zero.

> {
> count--;
> 805e4: 81c2 subq 1,$r12


> *tmp++ = *src++;
> 805e6: 4bde move.b [$r11+],$r13

Fetch *s and increment. This is redundant. Why didn't it know to fetch
*s above and keep it in a register?

> 805e8: 6cc6 test.d $r12
> 805ea: 0830 beq 805f4
> 805ec: c9df move.b $r13,[$r9+]

Test count and exit if count is zero. Why is it doing this again? Why
not jump back to the top? Oh, wait... in that case, for the loop to
exit, it would jump to the top and then jump back to past the end,
rather than just failing one branch.

I see that the delay slot was filled with the store to *dst. But is
there also some pipeline latency that makes this worth while?

> 805ee: 8b0b test.b [$r11]
> 805f0: f320 bne 805e4
> 805f2: 0f05 nop

Test *src AGAIN and loop if nonzero. Redundant.

> }
>
> And here is my first while loop:
>
> while (count)
> 8062c: 6cc6 test.d $r12
> 8062e: 1030 beq 80640
> 80630: 6a96 move.d $r10,$r9

Test count and bypass if zero. Also, copy dst.

> {
> count--;
> if (!(*tmp++ = *src++))
> 80632: 4bde move.b [$r11+],$r13
> 80634: c9db move.b $r13,[$r9]

Move data

> 80636: 890f test.b [$r9+]

Interesting how (a) it moved the increment to the test, and (b) it makes
a redundant read of the dest.

Does this arch not get condition codes from things other than test and
compare?

And why didn't it test $r13 instead? Wouldn't that have been a lot more
efficient?

> 80638: 0630 beq 80640
> 8063a: 81c2 subq 1,$r12

Exit on zero and decrement count.

> break;
> 8063c: f520 bne 80632
> 8063e: 0f05 nop

Loop.

> }
>
>
>>>Also note that your version
>>>of the second loop needs an explicit comparison with -1,
>>>whereas mine uses an implicit comparison with 0.
>>
>>I don't understand why you say I need an explicit comparison
>>with -1. My first loop exits either with the number of bytes
>>remaining in the buffer or with zero if it's copied count
>>number of bytes.
>
>
> I was talking about the second loop. The object code the compiler
> produces for your version actually tests the count variable after
> it decreases it, which is why it tests for -1.

Why does it do that? Is that somehow more optimal? Why doesn't it test
BEFORE it decrements? Isn't that more straightforward?

In any event, couldn't that be fixed by moving the decrement into the loop?

>
>>The second loop WOULD require a comparison with -1 IF the
>>"count--" were not inside of the loop body. As it IS in the
>>loop body, there is no need for that. My second loop has an
>>implicit comparison against zero.
>
>
> Hmm, your second loop from above looks like:
>
> while (n--) {
> *s++ = 0;
> }
>
> whereas mine looks like:
>
> while (count) {
> *tmp++ = '\0';
> count--;
> }
>
> You seem to be referring to my version, where what you say is true.

And what I was saying was that since what the while should test is the
value before the decrement, I assumed that the test would be before the
decrement.


>>
>>I agree that this is definately a more elegant look to the
>>code, and I would prefer what you have done here. But what
>>puzzles me is that this is functionally and logically
>>equivalent to my code.
>>
>>So, this code:
>>
>>for (A; B; C) {}
>>
>>is the same as this:
>>
>>A;
>>while (B) {
>> ...
>> C;
>>}
>>
>>
>>So why is it that this mere syntactic difference causes the
>>compiler to produce a better result?
>
>
> I wish I new. Actually in the CRIS case, it seems to be an
> optimizer thing. If I change your first loop from
>
> while (n && *s2) {
> n--;
> *s++ = *s2++;
> }
>
> to
>
> while (n && *s2) {
> *s++ = *s2++;
> n--;
> }
>
> it gives the expected object code, i.e., the same as my
> first for loop. So here is a modified version of your
> code that gives exactly the same object code (for CRIS)
> as my version with the for loops:

A decent optimizer should know that the "n--" and the "*s++ = *s2++" are
independent and migrate them to optimal positions.

Furthermore, I put the "n--" first because I assumed that the compiler
might be stupid about it. The idea is to order the statements so that
the results of a given instruction are being computed while the next one
is being decoded.

My assumption was that the code would branch back to the loop test
(rather than make a second one) at the end of the loop and then test n.
Therefore, decrementing n earlier would be a win in terms of having
the result earlier for the test, reducing pipline stall.

>
> char *strncpy(char * s1, const char * s2, size_t n)
> {
> register char *s = s1;
>
> while (n && *s2) {
> *s++ = *s2++;
> n--;
> }
> while (n) {
> *s++ = 0;
> n--;
> }
> return s1;
> }
>
> //Peter
>
>


2003-08-15 17:37:20

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Peter Kjellerstedt wrote:

> Timothy's 3.125003 6.128571 9.147905 33.325337


Which of mine did you test? The one with the single-byte fill, or the
one with the multiple fill loops that does words for the bulk of the fills?

With some minor tweaks to eliminate compiler stupidity which compares
against -1, that might win on the fill phase. No?


2003-08-16 08:20:09

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Timothy Miller [mailto:[email protected]]
> Sent: Friday, August 15, 2003 19:52
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
> > Timothy's 3.125003 6.128571 9.147905 33.325337
>
> Which of mine did you test? The one with the single-byte
> fill, or the one with the multiple fill loops that does
> words for the bulk of the fills?

Stupid me. I only tested the single byte fill version.

> With some minor tweaks to eliminate compiler stupidity which compares
> against -1, that might win on the fill phase. No?

Here are the numbers for my for loop version and your multi
byte fill version for CRIS:

For loops 2.867568 5.620561 8.128734 28.286289
Multi byte fill 2.868031 5.670782 6.312027 11.336015

And here are the numbers for my P4:

For loops 3.060262 5.927378 8.796814 30.659818
Multi byte fill 3.126607 5.898459 7.096685 13.135379

So there is no doubt that the multi byte version is a clear
winner (which was expected, I suppose).

Here is the code that I used:

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;

while (count && *src) {
*tmp++ = *src++;
count--;
}

if (count) {
size_t count2;

while (count & (sizeof(long) - 1)) {
*tmp++ = '\0';
count--;
}

count2 = count / sizeof(long);
while (count2) {
*((long *)tmp)++ = '\0';
count2--;
}

count &= (sizeof(long) - 1);
while (count) {
*tmp++ = '\0';
count--;
}
}

return dest;
}

//Peter

2003-08-16 08:41:37

by Daniel Forrest

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
>
> Here is the code that I used:
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> while (count && *src) {
> *tmp++ = *src++;
> count--;
> }
>
> if (count) {
> size_t count2;
>
> while (count & (sizeof(long) - 1)) {

Shouldn't this be:

while (tmp & (sizeof(long) - 1)) {

> *tmp++ = '\0';
> count--;
> }
>
> count2 = count / sizeof(long);
> while (count2) {
> *((long *)tmp)++ = '\0';
> count2--;
> }
>
> count &= (sizeof(long) - 1);
> while (count) {
> *tmp++ = '\0';
> count--;
> }
> }
>
> return dest;
> }

--
Dan

2003-08-16 09:24:55

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Daniel Forrest [mailto:[email protected]]
> Sent: Saturday, August 16, 2003 10:41
> To: Peter Kjellerstedt
> Cc: 'Timothy Miller'; 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
> >
> > Here is the code that I used:
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count && *src) {
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > if (count) {
> > size_t count2;
> >
> > while (count & (sizeof(long) - 1)) {
>
> Shouldn't this be:
>
> while (tmp & (sizeof(long) - 1)) {

Actually, it should be:

while (count && ((long)tmp & (sizeof(long) - 1)))

> > *tmp++ = '\0';
> > count--;
> > }
> >
> > count2 = count / sizeof(long);
> > while (count2) {
> > *((long *)tmp)++ = '\0';
> > count2--;
> > }
> >
> > count &= (sizeof(long) - 1);
> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> > }
> >
> > return dest;
> > }
>
> --
> Dan

//Peter

2003-08-16 10:04:22

by Daniel Forrest

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error

On Sat, Aug 16, 2003 at 11:19:30AM +0200, Peter Kjellerstedt wrote:
>
> Actually, it should be:
>
> while (count && ((long)tmp & (sizeof(long) - 1)))
>

Oops, you're right, I forgot that the count could be small.

But, now that I think of it, maybe this would be best...

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;

while (count && *src) {
*tmp++ = *src++;
count--;
}

- if (count) {
+ if (count >= sizeof(long)) {
size_t count2;

- while (count && ((long)tmp & (sizeof(long) - 1))) {
+ while ((long)tmp & (sizeof(long) - 1)) {
*tmp++ = '\0';
count--;
}

count2 = count / sizeof(long);
while (count2) {
*((long *)tmp)++ = '\0';
count2--;
}

count &= (sizeof(long) - 1);
- while (count) {
- *tmp++ = '\0';
- count--;
- }
+ }
+
+ while (count) {
+ *tmp++ = '\0';
+ count--;
}

return dest;
}

--
Dan

2003-08-16 20:13:24

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Timothy Miller [mailto:[email protected]]
> Sent: Friday, August 15, 2003 19:47
> To: Peter Kjellerstedt
> Cc: linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
>
> >>While I can understand that certain architectures may benefit
> >>from that alteration, I am curious as to what SPECIFICALLY it
> >>is doing that is different. How do they differ?
> >
> > I do not know if this will give you anything, but here is
> > the disassembled CRIS version of your first while loop
> > (note that the branch instructions have one delay slot,
> > and that dest, src and count are in $r10, $r11 and $r12):
> >
> > while (count && *src)
> > 805d8: 6cc6 test.d $r12
> > 805da: 1830 beq 805f4
> > 805dc: 6a96 move.d $r10,$r9
>
> Ok, test count and exit if count is zero.
>
> I take it that this move is the copy of the dst to a
> temporary variable?

Yes.

> > 805de: 8b0b test.b [$r11]
> > 805e0: 1230 beq 805f4
> > 805e2: 0f05 nop
>
> Test *s and exit if zero.
>
> > {
> > count--;
> > 805e4: 81c2 subq 1,$r12
>
>
> > *tmp++ = *src++;
> > 805e6: 4bde move.b [$r11+],$r13
>
> Fetch *s and increment. This is redundant. Why didn't it
> know to fetch *s above and keep it in a register?

I think this is the way gcc produces while loops (not exactly
my area of expertise).

> > 805e8: 6cc6 test.d $r12
> > 805ea: 0830 beq 805f4
> > 805ec: c9df move.b $r13,[$r9+]
>
> Test count and exit if count is zero. Why is it doing this
> again? Why not jump back to the top? Oh, wait... in that
> case, for the loop to exit, it would jump to the top and then
> jump back to past the end, rather than just failing one branch.

Yes.

> I see that the delay slot was filled with the store to *dst. But is
> there also some pipeline latency that makes this worth while?

As far as I know this should be a win.

> > 805ee: 8b0b test.b [$r11]
> > 805f0: f320 bne 805e4
> > 805f2: 0f05 nop
>
> Test *src AGAIN and loop if nonzero. Redundant.

No. Note that $r11 was increased after the move at 805e6,
so what it tests here is the next byte to be copied. Once
the loop is running this is the only test of *src. However,
the code isn't optimal as it causes two reads of *src, but
it was not trivial to modify the assembler code to get
optimal behavior for this version, so I can understand
that the compiler did not make it.

> > }
> >
> > And here is my first while loop:
> >
> > while (count)
> > 8062c: 6cc6 test.d $r12
> > 8062e: 1030 beq 80640
> > 80630: 6a96 move.d $r10,$r9
>
> Test count and bypass if zero. Also, copy dst.
>
> > {
> > count--;
> > if (!(*tmp++ = *src++))
> > 80632: 4bde move.b [$r11+],$r13
> > 80634: c9db move.b $r13,[$r9]
>
> Move data
>
> > 80636: 890f test.b [$r9+]
>
> Interesting how (a) it moved the increment to the test, and
> (b) it makes a redundant read of the dest.
>
> Does this arch not get condition codes from things other than
> test and compare?

Only moves TO a register results in the status flags being
changed. But I do not know why the compiler doesn't produce
either

move.b [$r11+],$r13
move.b $r13,[$r9+]
test.b $r13

or even better

move.b [$r11+],$r13
move.b $r13,[$r9+]

for this code, as I believe they should be equivalent, but with
better cycle counts. I will ask our compiler guy next week...

> And why didn't it test $r13 instead? Wouldn't that have been
> a lot more efficient?

One rather than two cycles.

> > 80638: 0630 beq 80640
> > 8063a: 81c2 subq 1,$r12
>
> Exit on zero and decrement count.
>
> > break;
> > 8063c: f520 bne 80632
> > 8063e: 0f05 nop
>
> Loop.
>
> > }
> >
> >
> >>>Also note that your version
> >>>of the second loop needs an explicit comparison with -1,
> >>>whereas mine uses an implicit comparison with 0.
> >>
> >>I don't understand why you say I need an explicit comparison
> >>with -1. My first loop exits either with the number of bytes
> >>remaining in the buffer or with zero if it's copied count
> >>number of bytes.
> >
> >
> > I was talking about the second loop. The object code the compiler
> > produces for your version actually tests the count variable after
> > it decreases it, which is why it tests for -1.
>
> Why does it do that? Is that somehow more optimal? Why
> doesn't it test BEFORE it decrements? Isn't that more
> straightforward?

It would be hard to do in assembler, since after the test
instruction is executed one must do the branch instruction,
and there is no way to do the decrementation in-between
without affecting the status flags.

We can easily change the code and move the decrementation into
it since we do not care what happens to count after the loop
(or can take the changed value into account), but the compiler
cannot (easily) do this as it would change the value count has
after the loop (-1 vs. 0).

> In any event, couldn't that be fixed by moving the decrement
> into the loop?
>
> >
> >>The second loop WOULD require a comparison with -1 IF the
> >>"count--" were not inside of the loop body. As it IS in the
> >>loop body, there is no need for that. My second loop has an
> >>implicit comparison against zero.
> >
> >
> > Hmm, your second loop from above looks like:
> >
> > while (n--) {
> > *s++ = 0;
> > }
> >
> > whereas mine looks like:
> >
> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > You seem to be referring to my version, where what you say is true.
>
> And what I was saying was that since what the while should
> test is the value before the decrement, I assumed that the
> test would be before the decrement.

See my answer above.

> >>I agree that this is definately a more elegant look to the
> >>code, and I would prefer what you have done here. But what
> >>puzzles me is that this is functionally and logically
> >>equivalent to my code.
> >>
> >>So, this code:
> >>
> >>for (A; B; C) {}
> >>
> >>is the same as this:
> >>
> >>A;
> >>while (B) {
> >> ...
> >> C;
> >>}
> >>
> >>
> >>So why is it that this mere syntactic difference causes the
> >>compiler to produce a better result?
> >
> >
> > I wish I new. Actually in the CRIS case, it seems to be an
> > optimizer thing. If I change your first loop from
> >
> > while (n && *s2) {
> > n--;
> > *s++ = *s2++;
> > }
> >
> > to
> >
> > while (n && *s2) {
> > *s++ = *s2++;
> > n--;
> > }
> >
> > it gives the expected object code, i.e., the same as my
> > first for loop. So here is a modified version of your
> > code that gives exactly the same object code (for CRIS)
> > as my version with the for loops:
>
> A decent optimizer should know that the "n--" and the "*s++ =
> *s2++" are independent and migrate them to optimal positions.

I agree. I think I will have to ask our compiler guy about
this too...

> Furthermore, I put the "n--" first because I assumed that the
> compiler might be stupid about it. The idea is to order the
> statements so that the results of a given instruction are being
> computed while the next one is being decoded.
>
> My assumption was that the code would branch back to the loop test
> (rather than make a second one) at the end of the loop and
> then test n.

Assumptions does not seem to work when dealing with the
compiler. ;) My tests with benchmarking this code has clearly
taught me that...

> Therefore, decrementing n earlier would be a win in terms of having
> the result earlier for the test, reducing pipline stall.

I guess trying to optimize generic code is not very easy
as what is optimal on one architecture may be bad on some
other. However, it should be possible to get better results
than what the current implementation does.

//Peter

2003-08-16 21:15:33

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

> -----Original Message-----
> From: Daniel Forrest [mailto:[email protected]]
> Sent: Saturday, August 16, 2003 12:04
> To: Peter Kjellerstedt
> Cc: 'Daniel Forrest'; 'Timothy Miller'; 'Willy Tarreau';
> linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> On Sat, Aug 16, 2003 at 11:19:30AM +0200, Peter Kjellerstedt wrote:
> >
> > Actually, it should be:
> >
> > while (count && ((long)tmp & (sizeof(long) - 1)))
> >
>
> Oops, you're right, I forgot that the count could be small.
>
> But, now that I think of it, maybe this would be best...
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> while (count && *src) {
> *tmp++ = *src++;
> count--;
> }
>
> - if (count) {
> + if (count >= sizeof(long)) {
> size_t count2;
>
> - while (count && ((long)tmp & (sizeof(long) - 1))) {
> + while ((long)tmp & (sizeof(long) - 1)) {
> *tmp++ = '\0';
> count--;
> }
>
> count2 = count / sizeof(long);
> while (count2) {
> *((long *)tmp)++ = '\0';
> count2--;
> }
>
> count &= (sizeof(long) - 1);
> - while (count) {
> - *tmp++ = '\0';
> - count--;
> - }
> + }
> +
> + while (count) {
> + *tmp++ = '\0';
> + count--;
> }
>
> return dest;
> }
>
> --
> Dan

Looks good to me. However, this only optimizes the filling
part. So why not try to optimize the copying part too? Here
is a version that optimizes the (hopefully somewhat common)
case of both source and destination being long aligned
(UNALIGNED() and DETECT_NUL() were borrowed from newlib's
strncpy()):

#include <limits.h>

/* Non-zero if either x or y is not aligned on a "long" boundary. */
#define UNALIGNED(x, y) \
(((long)x & (sizeof(long) - 1)) | ((long)y & (sizeof(long) - 1)))

/* DETECT_NUL() is non-zero if x contains a NUL byte. */
#if LONG_MAX == 2147483647L
#define DETECT_NUL(x) (((x) - 0x01010101) & ~(x) & 0x80808080)
#else
#if LONG_MAX == 9223372036854775807L
#define DETECT_NUL(x) (((x) - 0x0101010101010101) & ~(x) & 0x8080808080808080)
#else
#error long int is not a 32bit or 64bit type.
#endif
#endif

char *strncpy(char *dest, const char *src, size_t count)
{
char *tmp = dest;

if (!UNALIGNED(src, tmp)) {
while (count >= sizeof(long)) {
if (DETECT_NUL(*((long*)src)))
break;

*((long *)tmp)++ = *((long *)src)++;
count -= sizeof(long);
}
}

while (count) {
count--;
if (!(*tmp++ = *src++))
break;
}

if (count) {
size_t count2;

if (count >= sizeof(long)) {
while ((long)tmp & (sizeof(long) - 1)) {
*tmp++ = '\0';
count--;
}

count2 = count / sizeof(long);
while (count2) {
*((long *)tmp)++ = '\0';
count2--;
}
}

count &= (sizeof(long) - 1);
while (count) {
*tmp++ = '\0';
count--;
}
}

return dest;
}

And here are some benchmarking numbers for the CRIS
architecture. "Multi copy & fill" is the code above.
"Multi copy+memset" is the same as above, but with the
filling replaced by a call to memset().

Multi copy+memset 1.374258 2.627357 3.125354 4.614480
Multi copy & fill 1.339129 2.607850 3.260540 8.303503
Multi byte fill 2.336643 4.619374 5.290972 10.326644
For loops 2.828718 5.591865 8.110286 40.684943
strncpy (glibc) 2.201317 4.273052 7.281693 31.474109
strncpy (uClibc) 2.311860 4.588357 9.112990 45.387527

And the numbers for my P4:

Multi copy+memset 1.284029 2.111502 4.672941 8.470845
Multi copy & fill 1.203095 2.160514 3.265682 8.645152
Multi byte fill 3.161438 6.249649 7.144416 12.687760
For loops 3.202569 6.028025 9.410537 33.492053
strncpy (glibc) 2.359909 3.943612 6.952230 29.604865
strncpy (uClibc) 3.234713 5.943762 10.137144 39.651053

For unaligned source or destination the "Multi copy & fill"
would degenerate into "Multi byte fill". However, for
architectures like ix86 and CRIS that can do unaligned long
access, it would be a win to remove the UNALIGNED() check,
and use long word copying all the time.

Then whether using memset() or your filling is a win depends
on the architecture and how many bytes needs to be filled.
For a slow processor with little function call overhead (like
CRIS), using memset seems to be a win almost immediately.
However, for a fast processor like my P4, the call to memset
is not a win until some 1500 bytes need to be filled.

//Peter

2003-08-18 15:52:13

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Peter Kjellerstedt wrote:

>
> For loops 2.867568 5.620561 8.128734 28.286289
> Multi byte fill 2.868031 5.670782 6.312027 11.336015
>
> And here are the numbers for my P4:
>
> For loops 3.060262 5.927378 8.796814 30.659818
> Multi byte fill 3.126607 5.898459 7.096685 13.135379
>
> So there is no doubt that the multi byte version is a clear
> winner (which was expected, I suppose).

Cool! Hey, is this just an exercise, or are we actually going to use
this? I would be very happy to have something I contributed to put into
the kernel. :)

>
> Here is the code that I used:
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
> char *tmp = dest;
>
> while (count && *src) {
> *tmp++ = *src++;
> count--;
> }
>
> if (count) {

Good idea... bad to do so many checks if count is zero. On the other
hand, if count is rarely zero, then it's a loss. Maybe benchmark with
and without?

> size_t count2;
>
> while (count & (sizeof(long) - 1)) {
> *tmp++ = '\0';
> count--;
> }
>
> count2 = count / sizeof(long);

I know that a good compiler should migrate code to help the CPU
pipeline, but how about moving this "count2 = " line up to before the
first fill loop. See if that helps any. Always good to precompute well
in advance.

> while (count2) {
> *((long *)tmp)++ = '\0';
> count2--;
> }
>
> count &= (sizeof(long) - 1);

And move this to before the middle fill loop.

> while (count) {
> *tmp++ = '\0';
> count--;
> }
> }
>
> return dest;
> }
>
> //Peter
>
>


2003-08-18 16:01:21

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Daniel Forrest wrote:
> On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:


> Shouldn't this be:
>
> while (tmp & (sizeof(long) - 1)) {
>
>
>> *tmp++ = '\0';
>> count--;
>> }

Oh, yeah! That's right. We need to check the address. Also need to
cast tmp to (int) or something (doesn't matter what it's cast to,
because we only care about the lower 2 or 3 bits).

Peter, please see if this makes any speed difference. But it definately
needs this fix.


Frankly, I'm surprised it works. In fact, it might not, but it's hard
to tell from the tests just benchmarks.


Also, if you're doing dense addressing on Alpha, and you do byte
accesses the addresses for char are byte addresses, but the code does
read-modify-write to memory for byte accesses, because in that mode, you
can only do 32-bit and 64-bit accesses. The performance increase could
be even greater for Alpha than for x86.


For Sparc, we might be able to do something with VIS instructions,
although I don't know what the setup overhead is. Sun's memcpy and
memset only use VIS when the size is greater than 512, because
otherwise, it's not worth it.


I don't know enough about PowerPC other than the proper use of the
"eieio" instruction. :)

2003-08-18 16:24:44

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Daniel Forrest wrote:

>
> - if (count) {
> + if (count >= sizeof(long)) {
> size_t count2;
>

I like this size check here, but the comparison should be to some number
greater than sizeof(long). There is a threshold below which it's not
worth it to do all of the extra loops. If you're going to fill only
four bytes, it's probably best to do it just using the last loop.

Maybe through some trial and error, we could determine what that
threshold is. I'm betting it's something around 2* or 3* word size.


[snip]

> +
> + while (count) {
> + *tmp++ = '\0';
> + count--;
> }
>
> return dest;
> }
>


2003-08-18 18:25:32

by Timothy Miller

[permalink] [raw]
Subject: Re: generic strncpy - off-by-one error



Peter Kjellerstedt wrote:

> For unaligned source or destination the "Multi copy & fill"
> would degenerate into "Multi byte fill". However, for
> architectures like ix86 and CRIS that can do unaligned long
> access, it would be a win to remove the UNALIGNED() check,
> and use long word copying all the time.

In fact, it's possible to do the copy even if the source and dest are
not aligned. It requires holding pieces of source in a register and
doing shifts. If the CPU is much faster than the memory, this can be a
huge win.


> Then whether using memset() or your filling is a win depends
> on the architecture and how many bytes needs to be filled.
> For a slow processor with little function call overhead (like
> CRIS), using memset seems to be a win almost immediately.
> However, for a fast processor like my P4, the call to memset
> is not a win until some 1500 bytes need to be filled.

What is in memset that isn't in the fill code I suggested?


2003-08-20 07:48:20

by Peter Kjellerstedt

[permalink] [raw]
Subject: RE: generic strncpy - off-by-one error

[ I combined all my answers to your four mails into this one. ]

> -----Original Message-----
> From: Timothy Miller [mailto:[email protected]]
> Sent: Monday, August 18, 2003 18:07
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
> > For loops 2.867568 5.620561 8.128734 28.286289
> > Multi byte fill 2.868031 5.670782 6.312027 11.336015
> >
> > And here are the numbers for my P4:
> >
> > For loops 3.060262 5.927378 8.796814 30.659818
> > Multi byte fill 3.126607 5.898459 7.096685 13.135379
> >
> > So there is no doubt that the multi byte version is a clear
> > winner (which was expected, I suppose).
>
> Cool! Hey, is this just an exercise, or are we actually going to
> use this? I would be very happy to have something I contributed
> to put into the kernel. :)

I have no idea. But it sure seems like the generic strncpy() could
use some improvement (and probably strcpy() too). However, I guess
one can argue that it is better to keep the generic implementations
as simple as possible, and then have each architecture implement
their own optimized versions.

> > Here is the code that I used:
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count && *src) {
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > if (count) {
>
> Good idea... bad to do so many checks if count is zero. On the
> other hand, if count is rarely zero, then it's a loss. Maybe
> benchmark with and without?

Actually, I think this change would be lost in the noise in
the measurements.

> > size_t count2;
> >
> > while (count & (sizeof(long) - 1)) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > count2 = count / sizeof(long);
>
> I know that a good compiler should migrate code to help the CPU
> pipeline, but how about moving this "count2 = " line up to before
> the first fill loop. See if that helps any. Always good to
> precompute well in advance.

Cannot do that as the first loop modifies count.

> > while (count2) {
> > *((long *)tmp)++ = '\0';
> > count2--;
> > }
> >
> > count &= (sizeof(long) - 1);
>
> And move this to before the middle fill loop.

In my later implementations this is not possible any longer.

> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> > }
> >
> > return dest;
> > }


> Daniel Forrest wrote:
> > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
>
> > Shouldn't this be:
> >
> > while (tmp & (sizeof(long) - 1)) {
> >
> >
> >> *tmp++ = '\0';
> >> count--;
> >> }
>
> Oh, yeah! That's right. We need to check the address. Also need to
> cast tmp to (int) or something (doesn't matter what it's cast to,
> because we only care about the lower 2 or 3 bits).
>
> Peter, please see if this makes any speed difference. But it
> definately needs this fix.

Yes, I added it in my later versions.

> Frankly, I'm surprised it works. In fact, it might not, but
> it's hard to tell from the tests just benchmarks.

I actually added verification to my benchmarking program too,
so the later versions I mailed are verified to work the same
as the standard glibc implementation at least.

> Also, if you're doing dense addressing on Alpha, and you do byte
> accesses the addresses for char are byte addresses, but the code
> does read-modify-write to memory for byte accesses, because in
> that mode, you can only do 32-bit and 64-bit accesses. The
> performance increase could be even greater for Alpha than for x86.
>
> For Sparc, we might be able to do something with VIS instructions,
> although I don't know what the setup overhead is. Sun's memcpy and
> memset only use VIS when the size is greater than 512, because
> otherwise, it's not worth it.
>
> I don't know enough about PowerPC other than the proper use of the
> "eieio" instruction. :)

Remember that many architectures already have their own architecture
specific implementations. Also note that most of them have not been
modified to do the null-padding... The following architectures have
their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
ppc, ppc64, s390 and sh. The following use the generic implementation:
arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
x86_64.


> Daniel Forrest wrote:
>
> >
> > - if (count) {
> > + if (count >= sizeof(long)) {
> > size_t count2;
> >
>
> I like this size check here, but the comparison should be to
> some number greater than sizeof(long). There is a threshold
> below which it's not worth it to do all of the extra loops.
> If you're going to fill only four bytes, it's probably best
> to do it just using the last loop.
>
> Maybe through some trial and error, we could determine what that
> threshold is. I'm betting it's something around 2* or 3* word size.

The problem is that this probably varies a lot between different
architectures, and also processor speeds. Probably best left for
the architecture specific implementations.


> Peter Kjellerstedt wrote:
> > For unaligned source or destination the "Multi copy & fill"
> > would degenerate into "Multi byte fill". However, for
> > architectures like ix86 and CRIS that can do unaligned long
> > access, it would be a win to remove the UNALIGNED() check,
> > and use long word copying all the time.
>
> In fact, it's possible to do the copy even if the source and
> dest are not aligned. It requires holding pieces of source in
> a register and doing shifts. If the CPU is much faster than
> the memory, this can be a huge win.

This is true. However, this too is probably best left for the
architecture specific implementations.

> > Then whether using memset() or your filling is a win depends
> > on the architecture and how many bytes needs to be filled.
> > For a slow processor with little function call overhead (like
> > CRIS), using memset seems to be a win almost immediately.
> > However, for a fast processor like my P4, the call to memset
> > is not a win until some 1500 bytes need to be filled.
>
> What is in memset that isn't in the fill code I suggested?

In the CRIS case it sets 12 registers to zero and uses the
movem instruction to copy them to memory at once (movem is
normally used to store/restore registers to/from the stack
on function entry/exit). Thus it can clear 48 bytes each
time through the loop. And of course the rest of the function
is hand crafted to give the best performance for any count.

//Peter