2019-03-24 01:46:16

by Sultan Alsawaf

[permalink] [raw]
Subject: [RFC] string: Use faster alternatives when constant arguments are used

From: Sultan Alsawaf <[email protected]>

When strcpy, strcat, and strcmp are used with a literal string, they can
be optimized to memcpy or memcmp calls. These alternatives are faster
since knowing the length of a string argument beforehand allows
traversal through the string word at a time without being concerned
about looking for the terminating zero character. In some cases, the
replaced calls to memcpy or memcmp can even be optimized out completely
for a significant speed up.

Signed-off-by: Sultan Alsawaf <[email protected]>
---
include/linux/string.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..65db58342 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,31 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
}

+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcpy((dest), (src), strlen(src) + 1), \
+ (strcpy)((dest), (src)))
+
+#define strcat(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcpy(strchr((dest), '\0'), (src), strlen(src) + 1), \
+ (strcat)((dest), (src)))
+
+#define strcmp(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(dest), \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ (strcmp)((dest), (src)), \
+ memcmp((dest), (src), strlen(dest) + 1)), \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcmp((dest), (src), strlen(src) + 1), \
+ (strcmp)((dest), (src))))
+
#endif /* _LINUX_STRING_H_ */
--
2.21.0



2019-03-24 02:31:47

by Sultan Alsawaf

[permalink] [raw]
Subject: [RFCv2] string: Use faster alternatives when constant arguments are used

I messed up the return value for strcat in the first patch. Here's a fixed
version, ready for some scathing reviews.

From: Sultan Alsawaf <[email protected]>

When strcpy, strcat, and strcmp are used with a literal string, they can
be optimized to memcpy or memcmp calls. These alternatives are faster
since knowing the length of a string argument beforehand allows
traversal through the string word at a time without being concerned
about looking for the terminating zero character. In some cases, the
replaced calls to memcpy or memcmp can even be optimized out completely
for a significant speed up.

Signed-off-by: Sultan Alsawaf <[email protected]>
---
include/linux/string.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..59c301c0e 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
}

+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcpy((dest), (src), strlen(src) + 1), \
+ (strcpy)((dest), (src)))
+
+#define strcat(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ ({ \
+ memcpy(strchr((dest), '\0'), (src), strlen(src) + 1); \
+ (dest); \
+ }), \
+ (strcat)((dest), (src)))
+
+#define strcmp(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(dest), \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ (strcmp)((dest), (src)), \
+ memcmp((dest), (src), strlen(dest) + 1)), \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcmp((dest), (src), strlen(src) + 1), \
+ (strcmp)((dest), (src))))
+
#endif /* _LINUX_STRING_H_ */
--
2.21.0


2019-03-24 03:32:51

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On Sat, Mar 23, 2019 at 07:24:06PM -0700, Sultan Alsawaf wrote:
> I messed up the return value for strcat in the first patch. Here's a fixed
> version, ready for some scathing reviews.
>
> From: Sultan Alsawaf <[email protected]>
>
> When strcpy, strcat, and strcmp are used with a literal string, they can
> be optimized to memcpy or memcmp calls. These alternatives are faster
> since knowing the length of a string argument beforehand allows
> traversal through the string word at a time without being concerned
> about looking for the terminating zero character. In some cases, the
> replaced calls to memcpy or memcmp can even be optimized out completely
> for a significant speed up.
>
> Signed-off-by: Sultan Alsawaf <[email protected]>
> ---
> include/linux/string.h | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 7927b875f..59c301c0e 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix
> return strncmp(str, prefix, len) == 0 ? len : 0;
> }
>
> +/*
> + * Replace some common string helpers with faster alternatives when one of the
> + * arguments is a constant (i.e., literal string). This uses strlen instead of
> + * sizeof for calculating the string length in order to silence compiler
> + * warnings that may arise due to what the compiler thinks is incorrect sizeof
> + * usage. The strlen calls on constants are folded into scalar values at compile
> + * time, so performance is not reduced by using strlen.
> + */
> +#define strcpy(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + memcpy((dest), (src), strlen(src) + 1), \
> + (strcpy)((dest), (src)))
> +
> +#define strcat(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + ({ \
> + memcpy(strchr((dest), '\0'), (src), strlen(src) + 1); \
> + (dest); \
> + }), \
> + (strcat)((dest), (src)))
> +
> +#define strcmp(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(dest), \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + (strcmp)((dest), (src)), \
> + memcmp((dest), (src), strlen(dest) + 1)), \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + memcmp((dest), (src), strlen(src) + 1), \
> + (strcmp)((dest), (src))))
> +
> #endif /* _LINUX_STRING_H_ */
> --
> 2.21.0
>

Explicitly cc'ing some folks who have touched include/linux/string.h in
the past and might want to take a look at this.

Nathan

2019-03-24 07:19:31

by Sultan Alsawaf

[permalink] [raw]
Subject: Re: [RFC v3] string: Use faster alternatives when constant arguments are used

On Sat, Mar 23, 2019 at 08:31:53PM -0700, Nathan Chancellor wrote:
> Explicitly cc'ing some folks who have touched include/linux/string.h in
> the past and might want to take a look at this.
>
> Nathan

Thanks. One last revision with some nitpicks fixed is attached, though I doubt
it'll influence any comments on the concept of this anyway.

---
include/linux/string.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..40a8a9436 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
}

+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ memcpy((dest), (src), strlen(src) + 1), \
+ (strcpy)((dest), (src)))
+
+#define strcat(dest, src) \
+ __builtin_choose_expr(__builtin_constant_p(src), \
+ ({ \
+ memcpy((dest) + strlen(dest), (src), strlen(src) + 1); \
+ (dest); \
+ }), \
+ (strcat)((dest), (src)))
+
+#define strcmp(left, right) \
+ __builtin_choose_expr(__builtin_constant_p(left), \
+ __builtin_choose_expr(__builtin_constant_p(right), \
+ (strcmp)((left), (right)), \
+ memcmp((left), (right), strlen(left) + 1)), \
+ __builtin_choose_expr(__builtin_constant_p(right), \
+ memcmp((left), (right), strlen(right) + 1), \
+ (strcmp)((left), (right))))
+
#endif /* _LINUX_STRING_H_ */
--
2.21.0


2019-03-24 21:19:01

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On 24/03/2019 03.24, Sultan Alsawaf wrote:
> I messed up the return value for strcat in the first patch. Here's a fixed
> version, ready for some scathing reviews.
>
> From: Sultan Alsawaf <[email protected]>
>
> When strcpy, strcat, and strcmp are used with a literal string, they can
> be optimized to memcpy or memcmp calls.

gcc already knows the semantics of these functions and can optimize
accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
compiles

void f(char *buf) { strcpy(buf, "1"); }
void g(char *buf) { strcpy(buf, "12"); }
void h(char *buf) { strcpy(buf, "123456"); }

into

0000000000000000 <f>:
0: b8 31 00 00 00 mov $0x31,%eax
5: 66 89 07 mov %ax,(%rdi)
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)

0000000000000010 <g>:
10: b8 31 32 00 00 mov $0x3231,%eax
15: c6 47 02 00 movb $0x0,0x2(%rdi)
19: 66 89 07 mov %ax,(%rdi)
1c: c3 retq
1d: 0f 1f 00 nopl (%rax)

0000000000000020 <h>:
20: b8 35 36 00 00 mov $0x3635,%eax
25: c7 07 31 32 33 34 movl $0x34333231,(%rdi)
2b: c6 47 06 00 movb $0x0,0x6(%rdi)
2f: 66 89 47 04 mov %ax,0x4(%rdi)
33: c3 retq

These alternatives are faster
> since knowing the length of a string argument beforehand allows
> traversal through the string word at a time

For strcmp(string, "someliteral"), gcc cannot (generate code that does)
read from string beyond a/the first nul byte

>
> +/*
> + * Replace some common string helpers with faster alternatives when one of the
> + * arguments is a constant (i.e., literal string). This uses strlen instead of
> + * sizeof for calculating the string length in order to silence compiler
> + * warnings that may arise due to what the compiler thinks is incorrect sizeof
> + * usage. The strlen calls on constants are folded into scalar values at compile
> + * time, so performance is not reduced by using strlen.
> + */
> +#define strcpy(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + memcpy((dest), (src), strlen(src) + 1), \
> + (strcpy)((dest), (src)))

Does this even compile? It's a well-known (or perhaps
not-so-well-known?) pitfall that __builtin_constant_p() is not
guaranteed to be usable in __builtin_choose_expr() - the latter only
accepts bona fide integer constant expressions, while evaluation of
__builtin_constant_p can be delayed until various optimization phases.

> +#define strcat(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + ({ \
> + memcpy(strchr((dest), '\0'), (src), strlen(src) + 1); \
> + (dest); \
> + }), \
> + (strcat)((dest), (src)))
> +
> +#define strcmp(dest, src) \
> + __builtin_choose_expr(__builtin_constant_p(dest), \
> + __builtin_choose_expr(__builtin_constant_p(src), \
> + (strcmp)((dest), (src)), \
> + memcmp((dest), (src), strlen(dest) + 1)), \

This seems to be buggy - you don't know that src is at least as long as
dest. And arguing "if it's shorter, there's a nul byte, which will
differ from dest at that point, so memcmp will/should stop" means that
the whole word-at-a-time argument would be out.

Aside from all that, you'd need multiple-evaluation guards. Also, it's a
good idea to give an example of some piece of actual kernel code that
would be compiled differently/better, by simply showing the difference
in disassembly. But even just a toy example as above would be good -
then you might have seen yourself that gcc already recognizes these
functions.

Rasmus

2019-03-24 22:32:58

by Sultan Alsawaf

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On Sun, Mar 24, 2019 at 10:17:49PM +0100, Rasmus Villemoes wrote:
> gcc already knows the semantics of these functions and can optimize
> accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
> compiles

The example you gave appears to get optimized accordingly, but there are
numerous sane counterexamples that don't get optimized.

On arm64, with GCC 8.3.0, let's look at the simple strcmp usage in
kernel/trace/preemptirq_delay_test.c:

static int preemptirq_delay_run(void *data)
{
unsigned long flags;

if (!strcmp(test_mode, "irq")) {
local_irq_save(flags);
busy_wait(delay);
local_irq_restore(flags);
} else if (!strcmp(test_mode, "preempt")) {
preempt_disable();
busy_wait(delay);
preempt_enable();
}

return 0;
}

This is what happens without my patch:

preemptirq_delay_run:
.LFB2517:
.cfi_startproc
stp x29, x30, [sp, -32]!
.cfi_def_cfa_offset 32
.cfi_offset 29, -32
.cfi_offset 30, -24
adrp x1, .LC0
add x1, x1, :lo12:.LC0
mov x29, sp
stp x19, x20, [sp, 16]
.cfi_offset 19, -16
.cfi_offset 20, -8
adrp x19, .LANCHOR0
add x19, x19, :lo12:.LANCHOR0
mov x0, x19
> bl strcmp
cbz w0, .L22
adrp x1, .LC1
mov x0, x19
add x1, x1, :lo12:.LC1
> bl strcmp
cbz w0, .L23

The calls to strcmp are not optimized out. Here is what this snippet looks like
after my patch:

preemptirq_delay_run:
.LFB2517:
.cfi_startproc
stp x29, x30, [sp, -32]!
.cfi_def_cfa_offset 32
.cfi_offset 29, -32
.cfi_offset 30, -24
adrp x0, .LANCHOR0
mov w1, 29289
mov x29, sp
ldr w2, [x0, #:lo12:.LANCHOR0]
movk w1, 0x71, lsl 16
add x3, x0, :lo12:.LANCHOR0
cmp w2, w1
beq .L23
ldr x1, [x0, #:lo12:.LANCHOR0]
mov x0, 29296
movk x0, 0x6565, lsl 16
movk x0, 0x706d, lsl 32
movk x0, 0x74, lsl 48
cmp x1, x0
beq .L24

The calls to strcmp were optimized out completely, not even replaced by a call
to memcmp.

I can produce lots of kernel examples that don't seem to follow basic str*
optimization, which is what prompted me to write this in the first place.

> Does this even compile? It's a well-known (or perhaps
> not-so-well-known?) pitfall that __builtin_constant_p() is not
> guaranteed to be usable in __builtin_choose_expr() - the latter only
> accepts bona fide integer constant expressions, while evaluation of
> __builtin_constant_p can be delayed until various optimization phases.

It not only compiles, but also seems to work correctly from what I've observed.

> This seems to be buggy - you don't know that src is at least as long as
> dest. And arguing "if it's shorter, there's a nul byte, which will
> differ from dest at that point, so memcmp will/should stop" means that
> the whole word-at-a-time argument would be out.

You mean reading the last word of a string could read out of bounds of the
string when using memcmp? There are lots of memcmp instances using a literal
string in the kernel; I don't think it would be hard to find one that violates
what you've pointed out, so I'm not really sure why it's a problem.

After a few minutes of grepping, it seems like the memcmp usage in
drivers/md/dm-integrity.c can read out of bounds on its arguments:
} else if (!memcmp(opt_string, "internal_hash:", strlen("internal_hash:"))) {
r = get_alg_and_key(opt_string, &ic->internal_hash_alg, &ti->error,
"Invalid internal_hash argument");
if (r)
goto bad;
} else if (!memcmp(opt_string, "journal_crypt:", strlen("journal_crypt:"))) {
r = get_alg_and_key(opt_string, &ic->journal_crypt_alg, &ti->error,
"Invalid journal_crypt argument");
if (r)
goto bad;
} else if (!memcmp(opt_string, "journal_mac:", strlen("journal_mac:"))) {

Where opt_string is a dynamically-set argument with no specified minimum length.

Sultan

2019-03-25 21:25:28

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On 24/03/2019 23.32, Sultan Alsawaf wrote:
> On Sun, Mar 24, 2019 at 10:17:49PM +0100, Rasmus Villemoes wrote:
>> gcc already knows the semantics of these functions and can optimize
>> accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
>> compiles
>
> The example you gave appears to get optimized accordingly, but there are
> numerous sane counterexamples that don't get optimized.
>
> On arm64, with GCC 8.3.0, let's look at the simple strcmp usage in
> kernel/trace/preemptirq_delay_test.c:
>
> static int preemptirq_delay_run(void *data)
> {
> unsigned long flags;
>
> if (!strcmp(test_mode, "irq")) {
> local_irq_save(flags);
> busy_wait(delay);
> local_irq_restore(flags);
> } else if (!strcmp(test_mode, "preempt")) {
> preempt_disable();
> busy_wait(delay);
> preempt_enable();
> }
>
> return 0;
> }
>
> This is what happens without my patch:

Excellent, this is the kind of examples I was looking for. On x86-64,
there's no callq strcmp in sight (just a repz cmpsb loop). With your
patch, the code gets compiled similar to the arm64 case (it's a rather
convenient case which can be done as single 4- or 8-byte comparisons).

> I can produce lots of kernel examples that don't seem to follow basic str*
> optimization, which is what prompted me to write this in the first place.

Probably because the compiler doesn't want to introduce memory accesses
that are not allowed by the C language.

>> Does this even compile? It's a well-known (or perhaps
>> not-so-well-known?) pitfall that __builtin_constant_p() is not
>> guaranteed to be usable in __builtin_choose_expr() - the latter only
>> accepts bona fide integer constant expressions, while evaluation of
>> __builtin_constant_p can be delayed until various optimization phases.
>
> It not only compiles, but also seems to work correctly from what I've observed.

Interesting. I'll have to look into how that can be.

>> This seems to be buggy - you don't know that src is at least as long as
>> dest. And arguing "if it's shorter, there's a nul byte, which will
>> differ from dest at that point, so memcmp will/should stop" means that
>> the whole word-at-a-time argument would be out.
>
> You mean reading the last word of a string could read out of bounds of the
> string when using memcmp? There are lots of memcmp instances using a literal
> string in the kernel; I don't think it would be hard to find one that violates
> what you've pointed out, so I'm not really sure why it's a problem.

Existing code may have bugs. Or existing code may use a dubious pattern
which happens to be ok due to the context it appears in, but becomes
buggy when copy-pasted elsewhere.

> After a few minutes of grepping, it seems like the memcmp usage in
> drivers/md/dm-integrity.c can read out of bounds on its arguments:
> } else if (!memcmp(opt_string, "internal_hash:", strlen("internal_hash:"))) {
> r = get_alg_and_key(opt_string, &ic->internal_hash_alg, &ti->error,
> "Invalid internal_hash argument");
> if (r)
> goto bad;

Yeah, this actually looks buggy. Maybe we don't have any memcmp()
implementations that do word-at-a-time (it would mostly make sense on BE
architectures). Or maybe there's something here that guarantees
sufficient slack after the nul byte, so we never read across a page
boundary (which is how bad things could happen).

Something like

char stackbuffer[32];
fill_stackbuffer_somehow();
if (memcmp(stackbuffer, "literal", strlen("literal"))

should be sorta ok, even if memcmp might end up reading uninitialized bytes.

What I'm worried about is your patch changing every single strcmp(,
"literal") into a memcmp, with absolutely no way of knowing or checking
anything about the other buffer. And actually, it doesn't have to be a
BE arch with a word-at-a-time memcmp. If (as is usually the case) the
strcmp() result is compared to zero, after you change

!strcmp(buf, "literal")

into

!memcmp(buf, "literal", 8)

the compiler may (exactly as you want it to) change that into a single
8-byte load (or two 4-byte loads) and comparisons to literals, no
memcmp() involved. And how do you know that _that_ is ok, for every one
of the hundreds, if not thousands, of instances in the tree?

So, for strcpy(), I think gcc does optimize across all arches. For
strcat(), I dunno, it probably doesn't have many users and users of
strcat() obviously do not care about performance anyway. And for
strcmp(), some arch backends already optimize pretty well, some may be
lacking (either because the arch does not have suitable string ops, or
because nobody implemented the optimizations in gcc yet), and converting
blindly to memcmp() seems to be somewhat dangerous - a few pre-existing
"open-coded" such instances doesn't really justify a tree-wide conversion.

Rasmus

2019-03-30 23:02:33

by Sultan Alsawaf

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On Mon, Mar 25, 2019 at 10:24:00PM +0100, Rasmus Villemoes wrote:
> What I'm worried about is your patch changing every single strcmp(,
> "literal") into a memcmp, with absolutely no way of knowing or checking
> anything about the other buffer. And actually, it doesn't have to be a
> BE arch with a word-at-a-time memcmp.
> If (as is usually the case) the strcmp() result is compared to zero, after you
> change
>
> !strcmp(buf, "literal")
>
> into
>
> !memcmp(buf, "literal", 8)
>
> the compiler may (exactly as you want it to) change that into a single
> 8-byte load (or two 4-byte loads) and comparisons to literals, no
> memcmp() involved. And how do you know that _that_ is ok, for every one
> of the hundreds, if not thousands, of instances in the tree?

When would this not be ok though? From what I've always known,

strcmp(terminated_buf1, terminated_buf2)

is equivalent to

memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf1))

and

memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf2))

regardless of whether or not one side is a literal (my patch just leverages the
compiler's ability to recognize strlen called on literals and optimize it out).
The latter memcmp instances would indeed perform worse than the first strcmp
when neither arguments are literals, but I don't see what makes the memcmp usage
"dangerous". How can the memcmps cross a page boundary when memcmp itself will
only read in large buffers of data at word boundaries?

And if there are concerns for some arches but not others, then couldn't this be
a feasible optimization for those which would work well with it?

Sultan

2019-04-01 20:44:13

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On 30/03/2019 23.59, Sultan Alsawaf wrote:
> How can the memcmps cross a page boundary when memcmp itself will
> only read in large buffers of data at word boundaries?

Consider your patch replacing !strcmp(buf, "123") by !memcmp(buf, "123",
4). buf is known to point to a nul-terminated string. But it may point
at, say, the second-last byte in a page, with the last byte in that page
being a nul byte, and the following page being MMIO or unmapped or all
kinds of bad things. On e.g. x86 where unaligned accesses are cheap, and
seeing that you're only comparing for equality, gcc is likely to compile
the memcmp version into

*(u32*)buf == 0x00333231

because you've told the compiler that there's no problem accessing four
bytes starting at buf. Boom. Even without unaligned access being cheap
this can happen; suppose the length is 8 instead, and gcc somehow knows
that buf is four-byte aligned (and in this case it happens to point four
bytes before a page boundary), so it could compile the memcmp(,,8) into

*(u32*)(buf+4) == secondword && *(u32*)buf == firstword

(or do the comparisons in the "natural" order, but it might still do
both loads first).

> And if there are concerns for some arches but not others, then couldn't this be
> a feasible optimization for those which would work well with it?

No. First, these are concerns for all arches. Second, if you can find
some particular place where string parsing/matching is in any way
performance relevant and not just done once during driver init or
whatnot, maybe the maintainers of that file would take a patch
hand-optimizing some strcmps to memcmps, or, depending on what the code
does, perhaps replacing the whole *cmp logic with a custom hash table.

But a patch implicitly and silently touching thousands of lines of code,
without an analysis of why none of the above is a problem for any of
those lines, for any .config, arch, compiler version? No.

Rasmus


2019-04-01 23:56:31

by Sultan Alsawaf

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On Mon, Apr 01, 2019 at 10:43:13PM +0200, Rasmus Villemoes wrote:
> Consider your patch replacing !strcmp(buf, "123") by !memcmp(buf, "123",
> 4). buf is known to point to a nul-terminated string. But it may point
> at, say, the second-last byte in a page, with the last byte in that page
> being a nul byte, and the following page being MMIO or unmapped or all
> kinds of bad things. On e.g. x86 where unaligned accesses are cheap, and
> seeing that you're only comparing for equality, gcc is likely to compile
> the memcmp version into
>
> *(u32*)buf == 0x00333231
>
> because you've told the compiler that there's no problem accessing four
> bytes starting at buf. Boom. Even without unaligned access being cheap
> this can happen; suppose the length is 8 instead, and gcc somehow knows
> that buf is four-byte aligned (and in this case it happens to point four
> bytes before a page boundary), so it could compile the memcmp(,,8) into
>
> *(u32*)(buf+4) == secondword && *(u32*)buf == firstword
>
> (or do the comparisons in the "natural" order, but it might still do
> both loads first).

Ok, this is the first solid counter I've seen against my patch. I hadn't
considered unaligned word-sized accesses. Thanks.

> > And if there are concerns for some arches but not others, then couldn't this be
> > a feasible optimization for those which would work well with it?
>
> No. First, these are concerns for all arches. Second, if you can find
> some particular place where string parsing/matching is in any way
> performance relevant and not just done once during driver init or
> whatnot, maybe the maintainers of that file would take a patch
> hand-optimizing some strcmps to memcmps, or, depending on what the code
> does, perhaps replacing the whole *cmp logic with a custom hash table.

FWIW, I didn't have a specific place in the kernel in mind that heavily relied
on str* operations and needed optimization. I just thought it could be a "free"
optimization, but apparently not.

> But a patch implicitly and silently touching thousands of lines of code,
> without an analysis of why none of the above is a problem for any of
> those lines, for any .config, arch, compiler version? No.

Well, I thought it could be proven to be correct regardless of the context,
which would imply that making such a global change touching thousands lof lines
of code would be safe. But sadly it isn't.

This does bring into question a greater problem though: usage of memcmp
throughout the kernel where the size provided is not the size of the smaller
container being compared. This usage of memcmp (which is quite easy to find all
across the kernel) could run into the unaligned memory access across a page
boundary that you gave as an example, no?

Sultan

2019-04-02 08:19:36

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFCv2] string: Use faster alternatives when constant arguments are used

On 02/04/2019 01.55, Sultan Alsawaf wrote:
> On Mon, Apr 01, 2019 at 10:43:13PM +0200, Rasmus Villemoes wrote:

>> No. First, these are concerns for all arches. Second, if you can find
>> some particular place where string parsing/matching is in any way
>> performance relevant and not just done once during driver init or
>> whatnot, maybe the maintainers of that file would take a patch
>> hand-optimizing some strcmps to memcmps, or, depending on what the code
>> does, perhaps replacing the whole *cmp logic with a custom hash table.
>
> FWIW, I didn't have a specific place in the kernel in mind that heavily relied
> on str* operations and needed optimization. I just thought it could be a "free"
> optimization, but apparently not.

Well, it wouldn't be completely free; at the very least, .text size
would usually increase, and for the majority of sites that are
execute-at-most-once-per-full-moon that's actually a small overall
pessimization. But the point is moot, I think.

>> But a patch implicitly and silently touching thousands of lines of code,
>> without an analysis of why none of the above is a problem for any of
>> those lines, for any .config, arch, compiler version? No.
>
> Well, I thought it could be proven to be correct regardless of the context,
> which would imply that making such a global change touching thousands lof lines
> of code would be safe. But sadly it isn't.
>
> This does bring into question a greater problem though: usage of memcmp
> throughout the kernel where the size provided is not the size of the smaller
> container being compared. This usage of memcmp (which is quite easy to find all
> across the kernel) could run into the unaligned memory access across a page
> boundary that you gave as an example, no?

Yes, as I said, I think you may have identified some actual bugs. In
some cases, the runtime string is known to live in an array of
sufficient size, so we would at most be reading uninitialized bytes (but
not making decisions based on them) - that could cause a sanitizer
warning, but otherwise harmless (stuff like this is a rather common
source of false positives with valgrind). In other cases, there's no
such guarantee. It really needs to be decided on a case-by-case basis.
And if in doubt, I'd change the code to use strcmp(), strncmp(),
strstarts() as appropriate - that's _much_ more readable, and avoids
having to worry, and would usually avoid duplicating the literal string
(which is error-prone when the code gets copy-pasted to handle a new
option).

Rasmus