Hi,
We would like to report a “potential” bug in the KMSAN instrumentation
which has been found during the root-cause analysis of another bug
discovered by our modified version of syzkaller.
======================================================
description: Possible incorrect handling of fault injection inside
KMSAN instrumentation
affected file: mm/kmsan/shadow.c
kernel version: 6.2.0-rc5
kernel commit: 41c66f47061608dc1fd493eebce198f0e74cc2d7
git tree: kmsan
kernel config: https://syzkaller.appspot.com/text?tag=KernelConfig&x=a9a22da1efde3af6.
The config has Fault Injection (FI) turned on, which is important in
this case.
======================================================
We reported the “supposed” bug discovered by our fuzzer here:
https://groups.google.com/u/1/g/syzkaller/c/_83qwErVKlA. Initially, we
presumed that the vzalloc() call (refer to Jiri Slaby’s comment on
that thread) fails due to fault injection (refer to the reproducer
attached). However, we were confused to see that the allocation
failure triggers a crash, though clearly the driver code checks for
allocation failures. Nonetheless, we reported the crash to the
developers. Following Jiri’s comments, who evidently had the same
impression as ours, we started investigating. Below is our
observation.
======================================================
TL;DR:
kmsan's allocation of shadow or origin memory in
kmsan_vmap_pages_range_noflush() fails silently due to fault injection
(FI). KMSAN sort of “swallows” the allocation failure, and moves on.
When either of them is later accessed while updating the metadata,
there are no checks to test the validity of the respective pointers,
which results in a page fault.
======================================================
Detail explanation:
- In drivers/tty/n_tty.c:1879 (n_tty_open) , the driver calls vzalloc
to allocate memory for ldata.
- This triggers the KMSAN instrumentation to allocate the
corresponding shadow and origin memory in mm/kmsan/shadow.c:236
(kmsan_vmap_pages_range_noflush) .
- This allocation of the shadow memory fails (through fault
injection). KMSAN checks for failure, frees the allocated memory and
returns. Note: There is no return value signaling the error.
Additionally, the pages for shadow and origin memory are not mapped at
the addresses where KMSAN expects them to be (in fact, there are no
pages that could be mapped at all since the allocation failed).
- The allocation of the actual memory for the driver is successful.
Therefore, vzalloc (from 1.) returns a valid pointer (not NULL).
- After checking that the allocation succeeded
(drivers/tty/n_tty.c:1880), the driver tries to dereference ldata and
write to one of the fields at drivers/tty/n_tty.c:1883 (n_tty_open).
- This triggers the KMSAN instrumentation to update the shadow/origin
memory according to the write by calling
__msan_metadata_ptr_for_store_8 which subsequently calls
mm/kmsan/shadow.c:81 (kmsan_get_shadow_origin_ptr).
- Since the address that the driver is trying to access is with the
vmalloc range, this function will only calculate the offset of this
pointer from the base of the vmalloc range and add this to the base of
the shadow/origin memory range to retrieve the pointer for the
corresponding shadow/origin memory. Note: there are no checks ensuring
that this memory is actually mapped.
- Next, after the return of __msan_metadata_ptr_for_store_8 , the
instrumentation will try to update the shadow memory (or origin, we
are not entirely confident which of the two. We think it is the
shadow, but it also does not really change anything). Since this
memory is not mapped, it leads to the crash.
======================================================
Our conclusions/Questions:
- Should KMSAN fail silently? Probably not. Otherwise, the
instrumentation always needs to check whether shadow/origin memory
exists.
- Should KMSAN even be tested using fault injection? We are not sure.
On one hand, the primary purpose of FI should be testing the
application code. But also, inducing faults inside instrumentation
clearly helps to find mistakes in that, too.
- What is a fix for this? Should a failure in the KMSAN
instrumentation be propagated up so that the kernel allocator
(vzalloc() in this case) can “pretend” to fail, too?
--
Thanks and Regards,
Dipanjan
On Sat, Apr 8, 2023 at 5:51 PM Dipanjan Das <[email protected]> wrote:
>
> Hi,
Hi Dipanjan, thanks a lot for the elaborate analysis!
> kmsan's allocation of shadow or origin memory in
> kmsan_vmap_pages_range_noflush() fails silently due to fault injection
> (FI). KMSAN sort of “swallows” the allocation failure, and moves on.
> When either of them is later accessed while updating the metadata,
> there are no checks to test the validity of the respective pointers,
> which results in a page fault.
You are absolutely right.
> Our conclusions/Questions:
>
> - Should KMSAN fail silently? Probably not. Otherwise, the
> instrumentation always needs to check whether shadow/origin memory
> exists.
KMSAN shouldn't fail silently in any case.
kmsan_vmap_pages_range_noflush() used to have KMSAN_WARN_ON() to catch
such cases, but unfortunately I've failed to check the return values
of the kcalloc() calls.
> - Should KMSAN even be tested using fault injection? We are not sure.
At least our deployment of KMSAN on syzbot uses fault injection, so
having the two play well together is important.
> On one hand, the primary purpose of FI should be testing the
> application code. But also, inducing faults inside instrumentation
> clearly helps to find mistakes in that, too.
At first I had an idea of having a special GFP flag that prohibits
fault injections for the tool's allocations.
But this would just shift the allocations failures right, making them
harder to detect, because they will occur less often.
We'd better handle the failures properly instead.
> - What is a fix for this? Should a failure in the KMSAN
> instrumentation be propagated up so that the kernel allocator
> (vzalloc() in this case) can “pretend” to fail, too?
Yes, I think so.
Here are two patches that fix the problem:
- https://github.com/google/kmsan/commit/b793a6d5a1c1258326b0f53d6e3ac8aa3eeb3499
- for kmsan_vmap_pages_range_noflush();
- https://github.com/google/kmsan/commit/cb9e33e0cd7ff735bc302ff69c02274f24060cff
- for kmsan_ioremap_page_range()
Can you please try them out?
Alex
On Wed, Apr 12, 2023 at 7:39 AM Alexander Potapenko <[email protected]> wrote:
> Here are two patches that fix the problem:
> - https://github.com/google/kmsan/commit/b793a6d5a1c1258326b0f53d6e3ac8aa3eeb3499
> - for kmsan_vmap_pages_range_noflush();
> - https://github.com/google/kmsan/commit/cb9e33e0cd7ff735bc302ff69c02274f24060cff
> - for kmsan_ioremap_page_range()
>
> Can you please try them out?
The second patch needs a small modification.
The return value of `__vmap_pages_range_noflush` at Line 181
(https://github.com/google/kmsan/commit/cb9e33e0cd7ff735bc302ff69c02274f24060cff#diff-6c23520766ef70571c16b74ed93474716645c7ba81dc07028c076b6fd5ad2731R181)
should also be assigned to `mapped`. With this modification, the patch
works.
--
Thanks and Regards,
Dipanjan
On Wed, Apr 12, 2023 at 8:24 PM Dipanjan Das
<[email protected]> wrote:
>
> On Wed, Apr 12, 2023 at 7:39 AM Alexander Potapenko <[email protected]> wrote:
>
> > Here are two patches that fix the problem:
> > - https://github.com/google/kmsan/commit/b793a6d5a1c1258326b0f53d6e3ac8aa3eeb3499
> > - for kmsan_vmap_pages_range_noflush();
> > - https://github.com/google/kmsan/commit/cb9e33e0cd7ff735bc302ff69c02274f24060cff
> > - for kmsan_ioremap_page_range()
> >
> > Can you please try them out?
>
> The second patch needs a small modification.
>
> The return value of `__vmap_pages_range_noflush` at Line 181
> (https://github.com/google/kmsan/commit/cb9e33e0cd7ff735bc302ff69c02274f24060cff#diff-6c23520766ef70571c16b74ed93474716645c7ba81dc07028c076b6fd5ad2731R181)
> should also be assigned to `mapped`. With this modification, the patch
> works.
Good catch, thanks!
I'll send an updated version.