LinuxLists.cc - [PATCH] lib/strscpy: remove word-at-a-time optimization.

2018-01-09 16:39:26

Subject: [PATCH] lib/strscpy: remove word-at-a-time optimization.

strscpy() performs the word-at-a-time optimistic reads. So it may
may access the memory past the end of the object, which is perfectly fine
since strscpy() doesn't use that (past-the-end) data and makes sure the
optimistic read won't cross a page boundary.

But KASAN doesn't know anything about that so it will complain.
There are several possible ways to address this issue, but none
are perfect. See https://lkml.kernel.org/r/[email protected]

It seems the best solution is to simply disable word-at-a-time
optimization. My trivial testing shows that byte-at-a-time
could be up to x4.3 times slower than word-at-a-time.
It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
in a performance critical paths to copy large amounts of data,
so it shouldn't matter anyway.

Fixes: 30035e45753b7 ("string: provide strscpy()")
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: <[email protected]>
---
lib/string.c | 38 --------------------------------------
1 file changed, 38 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..6205dd71aa0f 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,7 +29,6 @@
#include <linux/errno.h>

#include <asm/byteorder.h>
-#include <asm/word-at-a-time.h>
#include <asm/page.h>

#ifndef __HAVE_ARCH_STRNCASECMP
@@ -177,45 +176,8 @@ EXPORT_SYMBOL(strlcpy);
*/
ssize_t strscpy(char *dest, const char *src, size_t count)
{
- const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
- size_t max = count;
long res = 0;

- if (count == 0)
- return -E2BIG;
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
- /*
- * If src is unaligned, don't cross a page boundary,
- * since we don't know if the next page is mapped.
- */
- if ((long)src & (sizeof(long) - 1)) {
- size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
- if (limit < max)
- max = limit;
- }
-#else
- /* If src or dest is unaligned, don't do word-at-a-time. */
- if (((long) dest | (long) src) & (sizeof(long) - 1))
- max = 0;
-#endif
-
- while (max >= sizeof(unsigned long)) {
- unsigned long c, data;
-
- c = *(unsigned long *)(src+res);
- if (has_zero(c, &data, &constants)) {
- data = prep_zero_mask(c, data, &constants);
- data = create_zero_mask(data);
- *(unsigned long *)(dest+res) = c & zero_bytemask(data);
- return res + find_zero(data);
- }
- *(unsigned long *)(dest+res) = c;
- res += sizeof(unsigned long);
- count -= sizeof(unsigned long);
- max -= sizeof(unsigned long);
- }
-
while (count) {
char c;

--
2.13.6

2018-01-09 16:47:02

by Andrey Ryabinin

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

Attached user space program I used to see the difference.
Usage:
gcc -02 -o strscpy strscpy_test.c
./strscpy {b|w} src_str_len count

src_str_len - length of source string in between 1-4096
count - how many strscpy() to execute.

Also I've noticed something strange. I'm not sure why, but certain
src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
for byte-at-a-time copy:

$ perf stat ./strscpy b 29 10000000

Performance counter stats for './strscpy b 29 10000000':

165.354974 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
48 page-faults:u # 0.290 K/sec
640,475,981 cycles:u # 3.873 GHz
2,500,090,080 instructions:u # 3.90 insn per cycle
640,017,126 branches:u # 3870.565 M/sec
1,589 branch-misses:u # 0.00% of all branches

0.165568346 seconds time elapsed

Performance counter stats for './strscpy b 30 10000000':

250.835659 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
46 page-faults:u # 0.183 K/sec
974,528,780 cycles:u # 3.885 GHz
2,580,090,165 instructions:u # 2.65 insn per cycle
660,017,211 branches:u # 2631.273 M/sec
14,488,234 branch-misses:u # 2.20% of all branches

0.251147341 seconds time elapsed

Performance counter stats for './strscpy b 31 10000000':

176.598368 task-clock:u (msec) # 0.997 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
46 page-faults:u # 0.260 K/sec
681,367,948 cycles:u # 3.858 GHz
2,660,090,092 instructions:u # 3.90 insn per cycle
680,017,138 branches:u # 3850.642 M/sec
1,817 branch-misses:u # 0.00% of all branches

0.177150181 seconds time elapsed

Attachments:

strscpy_test.c (3.21 kB)

2018-01-24 08:48:59

by Rasmus Villemoes

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On 2018-01-09 17:37, Andrey Ryabinin wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/[email protected]
>
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.
>
> Fixes: 30035e45753b7 ("string: provide strscpy()")
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Cc: <[email protected]>
>

Acked-by: Rasmus Villemoes <[email protected]>

Your microbenchmark even favours word-at-a-time slightly, since in
practice I think at least one of src or dst will be unaligned a lot of
the time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, it's still
a little more expensive than doing aligned access. And since strscpy is
not called that often, I expect some of the ~300 bytes of instruction
cache it occupies can be put to better use elsewhere.

Rasmus

2018-01-24 08:54:54

by Rasmus Villemoes

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On 2018-01-09 17:47, Andrey Ryabinin wrote:
> Attached user space program I used to see the difference.
> Usage:
> gcc -02 -o strscpy strscpy_test.c
> ./strscpy {b|w} src_str_len count
>
> src_str_len - length of source string in between 1-4096
> count - how many strscpy() to execute.
>
> Also I've noticed something strange. I'm not sure why, but certain
> src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
> for byte-at-a-time copy:

I see something similar, but at the 30->31 transition, and the
branch-misses remain at 1-3% for higher values, until 42 where it drops
back to 0%. Anyway, I highly doubt we do a lot of string copies of
strings longer then 32.

$ perf stat ./strscpy_test b 30 10000000

Performance counter stats for './strscpy_test b 30 10000000':

156,777082 task-clock (msec) # 0,999 CPUs
utilized
0 context-switches # 0,000 K/sec

0 cpu-migrations # 0,000 K/sec

48 page-faults # 0,306 K/sec

584.646.177 cycles # 3,729 GHz

<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
2.580.599.614 instructions # 4,41 insns per
cycle
660.114.283 branches # 4210,528 M/sec

4.891 branch-misses # 0,00% of all
branches

0,156970910 seconds time elapsed

$ perf stat ./strscpy_test b 31 10000000

Performance counter stats for './strscpy_test b 31 10000000':

258,533250 task-clock (msec) # 0,999 CPUs
utilized
0 context-switches # 0,000 K/sec

0 cpu-migrations # 0,000 K/sec

50 page-faults # 0,193 K/sec

965.505.138 cycles # 3,735 GHz

<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
2.660.773.463 instructions # 2,76 insns per
cycle
680.141.051 branches # 2630,768 M/sec

19.150.367 branch-misses # 2,82% of all
branches

0,258725192 seconds time elapsed

Rasmus

2018-01-24 15:54:19

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Tue, Jan 9, 2018 at 6:37 PM, Andrey Ryabinin <[email protected]> wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/[email protected]
>
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.

What prevents you to revert the patch? After this one the all
advantages of the function becomes useless AFAIU.

--
With Best Regards,
Andy Shevchenko

2018-01-24 17:53:25

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
<[email protected]> wrote:
>
> I see something similar, but at the 30->31 transition, and the
> branch-misses remain at 1-3% for higher values, until 42 where it drops
> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> strings longer then 32.

So I really dislike that microbenchmark, because it just has the same
length all the time. Which is very wrong, and makes the benchmark
pointless. A big part of this all is branch mispredicts, you shouldn't
just hand it the pattern on a plate.

Anyway, the reason I really dislike the patch is not because I think
strscpy() is all that important, but I *do* think that the
word-at-a-time thing is conceptually something we do care about, and I
hate removing it just because of KASAN not understanding it.

So I'd *much* rather have some way to tell KASAN that word-at-a-time
is going on. Because that approach definitely makes a difference in
other places.

Linus

2018-01-25 08:34:01

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> <[email protected]> wrote:
>>
>> I see something similar, but at the 30->31 transition, and the
>> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> strings longer then 32.
>
> So I really dislike that microbenchmark, because it just has the same
> length all the time. Which is very wrong, and makes the benchmark
> pointless. A big part of this all is branch mispredicts, you shouldn't
> just hand it the pattern on a plate.
>
> Anyway, the reason I really dislike the patch is not because I think
> strscpy() is all that important, but I *do* think that the
> word-at-a-time thing is conceptually something we do care about, and I
> hate removing it just because of KASAN not understanding it.
>
> So I'd *much* rather have some way to tell KASAN that word-at-a-time
> is going on. Because that approach definitely makes a difference in
> other places.

The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
once" part will affect codegen here, though.
But if word-at-a-time thing is conceptually something we do care
about, we could also introduce something like READ_PARTIALLY_VALID(),
which would check that at least first byte of the read is valid and
that it does not cross heap block boundary (but outside of KASAN is a
normal read).

2018-01-25 08:43:01

by David Laight

[permalink] [raw]

Subject: RE: [PATCH] lib/strscpy: remove word-at-a-time optimization.

From: Dmitry Vyukov [mailto:[email protected]]
> Sent: 25 January 2018 08:33
>
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <[email protected]> wrote:
> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> > <[email protected]> wrote:
> >>
> >> I see something similar, but at the 30->31 transition, and the
> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> >> strings longer then 32.
> >
> > So I really dislike that microbenchmark, because it just has the same
> > length all the time. Which is very wrong, and makes the benchmark
> > pointless. A big part of this all is branch mispredicts, you shouldn't
> > just hand it the pattern on a plate.
> >
> > Anyway, the reason I really dislike the patch is not because I think
> > strscpy() is all that important, but I *do* think that the
> > word-at-a-time thing is conceptually something we do care about, and I
> > hate removing it just because of KASAN not understanding it.
> >
> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
> > is going on. Because that approach definitely makes a difference in
> > other places.
>
>
> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
> once" part will affect codegen here, though.
> But if word-at-a-time thing is conceptually something we do care
> about, we could also introduce something like READ_PARTIALLY_VALID(),
> which would check that at least first byte of the read is valid and
> that it does not cross heap block boundary (but outside of KASAN is a
> normal read).

The first byte might not have been written either.
For example, doing a strlen() on a misaligned string you might read
the aligned word containing the first byte and adjust the value so
that the initial byte(s) are not zero.
After scanning for a zero byte the length would be corrected.

David

2018-01-25 09:09:15

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Thu, Jan 25, 2018 at 9:42 AM, David Laight <[email protected]> wrote:
> From: Dmitry Vyukov [mailto:[email protected]]
>> Sent: 25 January 2018 08:33
>>
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <[email protected]> wrote:
>> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
>> > <[email protected]> wrote:
>> >>
>> >> I see something similar, but at the 30->31 transition, and the
>> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> >> strings longer then 32.
>> >
>> > So I really dislike that microbenchmark, because it just has the same
>> > length all the time. Which is very wrong, and makes the benchmark
>> > pointless. A big part of this all is branch mispredicts, you shouldn't
>> > just hand it the pattern on a plate.
>> >
>> > Anyway, the reason I really dislike the patch is not because I think
>> > strscpy() is all that important, but I *do* think that the
>> > word-at-a-time thing is conceptually something we do care about, and I
>> > hate removing it just because of KASAN not understanding it.
>> >
>> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> > is going on. Because that approach definitely makes a difference in
>> > other places.
>>
>>
>> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
>> once" part will affect codegen here, though.
>> But if word-at-a-time thing is conceptually something we do care
>> about, we could also introduce something like READ_PARTIALLY_VALID(),
>> which would check that at least first byte of the read is valid and
>> that it does not cross heap block boundary (but outside of KASAN is a
>> normal read).
>
> The first byte might not have been written either.
> For example, doing a strlen() on a misaligned string you might read
> the aligned word containing the first byte and adjust the value so
> that the initial byte(s) are not zero.
> After scanning for a zero byte the length would be corrected.

Was the first byte at least kmalloc-ed? That's what KASAN checks, it
does not care about "written". KMSAN can detect uses of uninit data,
but that's another story.

2018-01-25 17:56:32

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <[email protected]> wrote:
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> is going on. Because that approach definitely makes a difference in
>> other places.
>
> The other option was to use READ_ONCE_NOCHECK().

How about just using the same accessor that we do for the dcache case.
That gives a reasonable example of the whole word-at-a-time model, and
should be good.

COMPLETELY UNTESTED patch attached. This needs to be verified.

It does limit the word-at-a-time code to the architectures that select
both HAVE_EFFICIENT_UNALIGNED_ACCESS and DCACHE_WORD_ACCESS, but that
seems a reasonable choice anyway.

Linus

Attachments:

patch.diff (1.61 kB)

2018-01-25 19:15:14

by Andrey Ryabinin

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On 01/25/2018 08:55 PM, Linus Torvalds wrote:
> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <[email protected]> wrote:
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>> is going on. Because that approach definitely makes a difference in
>>> other places.
>>
>> The other option was to use READ_ONCE_NOCHECK().
>
> How about just using the same accessor that we do for the dcache case.
> That gives a reasonable example of the whole word-at-a-time model, and
> should be good.
>

If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
than it should be fine. We don't want completely unchecked read of a source string.

But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
So I was going to send something like the hunk bellow (split in several patches).

Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
to use it to load *cs only to shut up KASAN.

---
fs/dcache.c | 2 +-
include/linux/compiler.h | 11 +++++++++++
lib/string.c | 2 +-
3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
unsigned long a,b,mask;

for (;;) {
- a = *(unsigned long *)cs;
+ a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
b = load_unaligned_zeropad(ct);
if (tcount < sizeof(unsigned long))
break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
* required ordering.
*/
#include <asm/barrier.h>
+#include <linux/kasan-checks.h>

#define __READ_ONCE(x, check) \
({ \
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
*/
#define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)

+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x) \
+({ \
+ kasan_check_read(&(x), 1); \
+ READ_ONCE_NOCHECK(x); \
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
#define WRITE_ONCE(x, val) \
({ \
union { typeof(x) __val; char __c[1]; } __u = \
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
while (max >= sizeof(unsigned long)) {
unsigned long c, data;

- c = *(unsigned long *)(src+res);
+ c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
--
2.13.6

2018-01-30 09:13:16

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.

On Thu, Jan 25, 2018 at 8:13 PM, Andrey Ryabinin
<[email protected]> wrote:
> On 01/25/2018 08:55 PM, Linus Torvalds wrote:
>> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <[email protected]> wrote:
>>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>>
>>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>>> is going on. Because that approach definitely makes a difference in
>>>> other places.
>>>
>>> The other option was to use READ_ONCE_NOCHECK().
>>
>> How about just using the same accessor that we do for the dcache case.
>> That gives a reasonable example of the whole word-at-a-time model, and
>> should be good.
>>
>
> If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
> than it should be fine. We don't want completely unchecked read of a source string.
>
> But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
> So I was going to send something like the hunk bellow (split in several patches).
>
> Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
> to use it to load *cs only to shut up KASAN.
>
>
> ---
> fs/dcache.c | 2 +-
> include/linux/compiler.h | 11 +++++++++++
> lib/string.c | 2 +-
> 3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7df1df81ff..6aa7be55a96d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
> unsigned long a,b,mask;
>
> for (;;) {
> - a = *(unsigned long *)cs;
> + a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
> b = load_unaligned_zeropad(ct);
> if (tcount < sizeof(unsigned long))
> break;
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 52e611ab9a6c..85b63c2e196e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> * required ordering.
> */
> #include <asm/barrier.h>
> +#include <linux/kasan-checks.h>
>
> #define __READ_ONCE(x, check) \
> ({ \
> @@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> */
> #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
>
> +#ifdef CONFIG_KASAN
> +#define READ_PARTIAL_CHECK(x) \
> +({ \
> + kasan_check_read(&(x), 1); \
> + READ_ONCE_NOCHECK(x); \
> +})
> +#else
> +#define READ_PARTIAL_CHECK(x) (x)
> +#endif
> +
> #define WRITE_ONCE(x, val) \
> ({ \
> union { typeof(x) __val; char __c[1]; } __u = \
> diff --git a/lib/string.c b/lib/string.c
> index 64a9e33f1daa..2396856e4c56 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
> while (max >= sizeof(unsigned long)) {
> unsigned long c, data;
>
> - c = *(unsigned long *)(src+res);
> + c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
> if (has_zero(c, &data, &constants)) {
> data = prep_zero_mask(c, data, &constants);
> data = create_zero_mask(data);

Looks good to me a general way to support word-at-a-time pattern.

This will also get rid of this in fs/dcache.c:

if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
kasan_unpoison_shadow(dname,
round_up(name->len + 1, sizeof(unsigned long)));