2023-08-30 22:03:25

by Ammar Faizi

Subject: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`

Simplify memcmp() on the x86-64 arch.

The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
implement the memcmp() function.

%rdi = source 1
%rsi = source 2
%rcx = length

Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
tools/include/nolibc/string.h | 2 ++
2 files changed, 21 insertions(+)

diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -214,4 +214,23 @@ __asm__ (
"retq\n"
);

+#define NOLIBC_ARCH_HAS_MEMCMP
+static int memcmp(const void *s1, const void *s2, size_t n)
+{
+ const unsigned char *p1 = s1;
+ const unsigned char *p2 = s2;
+
+ if (!n)
+ return 0;
+
+ __asm__ volatile (
+ "rep cmpsb"
+ : "+D"(p2), "+S"(p1), "+c"(n)
+ : "m"(*(const unsigned char (*)[n])s1),
+ "m"(*(const unsigned char (*)[n])s2)
+ );
+
+ return p1[-1] - p2[-1];
+}
+
#endif /* _NOLIBC_ARCH_X86_64_H */
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 1bad6121ef8c4ab5..3c941289d5dd0985 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -15,6 +15,7 @@ static void *malloc(size_t len);
* As much as possible, please keep functions alphabetically sorted.
*/

+#ifndef NOLIBC_ARCH_HAS_MEMCMP
static __attribute__((unused))
int memcmp(const void *s1, const void *s2, size_t n)
{
@@ -26,6 +27,7 @@ int memcmp(const void *s1, const void *s2, size_t n)
}
return c1;
}
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCMP */

static __attribute__((unused))
void *_nolibc_memcpy_up(void *dst, const void *src, size_t len)
--
Ammar Faizi



2023-09-01 12:45:02

by Willy Tarreau

Subject: Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`

On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 11:26:57PM +0200, Willy Tarreau wrote:
> > Out of curiosity, given that you implemented the 3 other ones directly
> > in an asm statement, is there a particular reason this one mixes a bit
> > of C and asm ?
>
> Because this one may be unused. The others are explicitly exported.

Makes sense, indeed.
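
Put differently: a function defined from a top-level __asm__ block is
always emitted into the object file whether anything calls it or not,
while a static C function that is never referenced is simply dropped by
the compiler. A minimal sketch of the difference, with made-up symbol
names (not the actual nolibc ones):

```
/* Always present in the object file, referenced or not: */
__asm__ (
".global always_emitted\n"
"always_emitted:\n"
"	xorl %eax, %eax\n"
"	retq\n"
);

/* Dropped by the compiler when nothing calls it: */
static __attribute__((unused)) int maybe_dropped(void)
{
	return 0;
}
```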

> > It would probably be something around this, in the same vein:
> >
> > memcmp:
> > xchg %esi,%eax // source1
> > mov %rdx,%rcx // count
> > rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
> > seta %al // 0 if src2 <= src1, 1 if src2 > src1
> > sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> > movsx %al, %eax // sign extend to %eax
> > ret
> >
> > Note that the output logic could have to be revisited, I'm not certain but
> > at first glance it looks valid.
>
> After thinking about this more, I think I'll drop the memcmp() patch
> because it will prevent optimization when comparing a small value.
>
> For example, without __asm__:
>
> memcmp(var, "abcd", 4);
>
> may compile to:
>
> cmpl $0x64636261, %reg
> ...something...
>
> But with __asm__, the compiler can't do that. Thus, it's not worth
> optimizing the memcmp() in this case.

Ah you're totally right!

Willy

2023-09-04 13:49:04

by David Laight

Subject: RE: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`

From: Willy Tarreau
> Sent: 30 August 2023 22:27
>
> On Wed, Aug 30, 2023 at 08:57:24PM +0700, Ammar Faizi wrote:
> > Simplify memcmp() on the x86-64 arch.
> >
> > The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
> > implement the memcmp() function.
> >
> > %rdi = source 1
> > %rsi = source 2
> > %rcx = length
> >
> > Signed-off-by: Ammar Faizi <[email protected]>
> > ---
> > tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
> > tools/include/nolibc/string.h | 2 ++
> > 2 files changed, 21 insertions(+)
> >
> > diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
> > index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
> > --- a/tools/include/nolibc/arch-x86_64.h
> > +++ b/tools/include/nolibc/arch-x86_64.h
> > @@ -214,4 +214,23 @@ __asm__ (
> > "retq\n"
> > );
> >
> > +#define NOLIBC_ARCH_HAS_MEMCMP
> > +static int memcmp(const void *s1, const void *s2, size_t n)
> > +{
> > + const unsigned char *p1 = s1;
> > + const unsigned char *p2 = s2;
> > +
> > + if (!n)
> > + return 0;
> > +
> > + __asm__ volatile (
> > + "rep cmpsb"
> > + : "+D"(p2), "+S"(p1), "+c"(n)
> > + : "m"(*(const unsigned char (*)[n])s1),
> > + "m"(*(const unsigned char (*)[n])s2)
> > + );
> > +
> > + return p1[-1] - p2[-1];
> > +}
>
> Out of curiosity, given that you implemented the 3 other ones directly
> in an asm statement, is there a particular reason this one mixes a bit
> of C and asm ? It would probably be something around this, in the same
> vein:
>
> memcmp:
> xchg %esi,%eax // source1

Aren't the arguments in %rdi, %rsi and %rdx, so you only
need to move the count (below)?
(Looks like you copied memchr())

David

> mov %rdx,%rcx // count
> rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
> seta %al // 0 if src2 <= src1, 1 if src2 > src1
> sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> movsx %al, %eax // sign extend to %eax
> ret
>
> Note that the output logic could have to be revisited, I'm not certain but
> at first glance it looks valid.
>
> Regards,
> Willy

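For completeness, folding David's remark into Willy's sequence would give
something along these lines, written nolibc-style as a top-level __asm__
block. This is only a sketch against the thread, not code that was
submitted: under the SysV ABI, s1, s2 and n already arrive in %rdi, %rsi
and %rdx, so only the count needs moving, and the two pointers are
swapped first because cmpsb subtracts the byte at (%rdi) from the byte
at (%rsi) while memcmp() wants the sign of *s1 - *s2.

```
#include <stddef.h>

int memcmp(const void *s1, const void *s2, size_t n);

__asm__ (
".global memcmp\n"
"memcmp:\n"
"	xchgq  %rsi, %rdi\n"    /* flags must reflect *s1 - *s2               */
"	movq   %rdx, %rcx\n"    /* count                                      */
"	xorl   %eax, %eax\n"    /* default result 0; also sets ZF for n == 0  */
"	repe   cmpsb\n"         /* %rcx == 0 skips the loop and keeps ZF set  */
"	je     1f\n"            /* every byte matched (or nothing compared)   */
"	seta   %al\n"           /* 1 if the mismatching *s1 byte is greater   */
"	sbb    $0, %al\n"       /* turns into -1 if it is smaller             */
"	movsbl %al, %eax\n"     /* sign-extend into the int return value      */
"1:	retq\n"
);
```

The n == 0 case rides on the xorl: with %rcx zero, `repe cmpsb` performs
no iteration and leaves the flags untouched, so the ZF set by the xorl
takes the early exit and the zeroed %eax is returned as-is.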

2023-09-05 19:38:29

by Ammar Faizi

Subject: Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`

On Fri, Sep 01, 2023 at 05:35:08AM +0200, Willy Tarreau wrote:
> On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> > After thinking about this more, I think I'll drop the memcmp() patch
> > because it will prevent optimization when comparing a small value.
> >
> > For example, without __asm__:
> >
> > memcmp(var, "abcd", 4);
> >
> > may compile to:
> >
> > cmpl $0x64636261, %reg
> > ...something...
> >
> > But with __asm__, the compiler can't do that. Thus, it's not worth
> > optimizing the memcmp() in this case.
>
> Ah you're totally right!

So, it turns out that assumption is wrong. The compiler cannot optimize
the current memcmp() into that. I just posted a question on Stack Overflow:

https://stackoverflow.com/questions/77020562/what-prevents-the-compiler-from-optimizing-a-hand-written-memcmp

Given:
```
bool test_data(void *data)
{
	return memcmp(data, "abcd", 4) == 0;
}
```

The result when using the default <string.h> memcmp() (good):
```
test_data:
	cmpl	$1684234849, (%rdi)
	sete	%al
	ret
```

The result when using the nolibc memcmp() (bad):
```
test_data:
	cmpb	$97, (%rdi)
	jne	.L5
	cmpb	$98, 1(%rdi)
	jne	.L5
	cmpb	$99, 2(%rdi)
	jne	.L5
	cmpb	$100, 3(%rdi)
	sete	%al
	ret
.L5:
	xorl	%eax, %eax
	ret
```

Link: https://godbolt.org/z/TT94r3bvf

This is because, apart from honoring the input length, the current
nolibc `memcmp()` must stop comparing as soon as it finds a
non-matching byte. Imagine what happens if we call:

```
char xstr[] = {'a', 'b', 'x'};
test_data(xstr);
```

In that case, a dword cmp would read past the end of xstr, and in
particular circumstances that out-of-bounds read can even lead to a
segfault.

What the current nolibc memcmp() does, from the C language's point of view:

1) Compare one byte at a time.
2) Must stop comparing as soon as it finds a non-matching byte.

Because of point (2), the compiler is not allowed to optimize the
nolibc memcmp() into wider loads; otherwise, it could read past the end
of a buffer and hit a segfault. That also means it cannot vectorize the
memcmp() loop.

On the other hand, memcpy() and memset() don't have such a restriction,
so they can be vectorized.
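
The contrast is visible with a plain byte-by-byte copy loop: nothing in
it depends on the data being copied, so the compiler is free to widen or
vectorize it (a generic illustration with a made-up name, not the nolibc
source):

```
#include <stddef.h>

void *copy_up(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* No data-dependent early exit, so the compiler may legally
	 * turn this loop into wider (or vectorized) loads and stores.
	 */
	while (len--)
		*d++ = *s++;
	return dst;
}
```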

The real memcmp() assumes that both sources are at least `n` bytes in
size, which allows for wider loads. The current nolibc memcmp()
implementation doesn't express that assumption in the C code.

IOW, calling the real built-in memcmp() is undefined behavior for this code:
```
char x = 'q';
return memcmp(&x, "abcd", 4);
```
but the current nolibc memcmp() is well-defined behavior (well, it must
be, given what the C code expresses).

We can improve the nolibc memcmp() by loading the sources through a
wider type (ulong, uint, ushort), as sketched below. But that's another
story for another RFC patchset.
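
To give a rough idea of what that could look like (a sketch only, not a
proposed patch; the function name is made up and __builtin_bswap64 is a
GCC/Clang builtin): once the C code itself assumes that both buffers
really are `n` bytes long, the bulk of the comparison can go through
8-byte loads, with a byte swap keeping memcmp()'s ordering on the first
mismatching chunk:

```
#include <stddef.h>
#include <stdint.h>
#include <string.h>	/* the fixed-size memcpy() calls become plain loads */

static int memcmp_wide(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1;
	const unsigned char *p2 = s2;

	/* Both buffers are assumed to be readable for the full n bytes,
	 * which is what ISO C requires from memcmp() callers anyway.
	 */
	while (n >= sizeof(uint64_t)) {
		uint64_t a, b;

		memcpy(&a, p1, sizeof(a));
		memcpy(&b, p2, sizeof(b));
		if (a != b) {
			/* Little-endian load: swap so that the first
			 * differing byte in memory order decides the
			 * unsigned comparison, as memcmp() requires.
			 */
			a = __builtin_bswap64(a);
			b = __builtin_bswap64(b);
			return a < b ? -1 : 1;
		}
		p1 += sizeof(a);
		p2 += sizeof(b);
		n  -= sizeof(a);
	}

	while (n--) {
		int c = *p1++ - *p2++;

		if (c)
			return c;
	}
	return 0;
}
```

The point is only that the wider loads now live in the C code itself
instead of relying on an optimization the compiler is not allowed to
make on the byte-by-byte version.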

--
Ammar Faizi