2022-07-01 14:43:41

by Alexander Potapenko

Subject: [PATCH v4 00/45] Add KernelMemorySanitizer infrastructure

KernelMemorySanitizer (KMSAN) is a detector of errors related to uses of
uninitialized memory. It relies on compile-time Clang instrumentation
(similar to userspace MSan [1]) and tracks the state of every bit of
kernel memory, which allows it to report an error whenever an
uninitialized value is used in a condition, dereferenced, or escapes to
userspace, USB or DMA.
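
As a hedged illustration (hypothetical struct and function names, not
taken from any of these patches), this is the kind of bug KMSAN is
built to catch: the uninitialized field is silently carried around
until it is actually branched on.

#include <linux/errno.h>
#include <linux/slab.h>

struct params {
	int enabled;
	int timeout;
};

static int apply_params(void)
{
	struct params *p = kmalloc(sizeof(*p), GFP_KERNEL);
	int ret = 0;

	if (!p)
		return -ENOMEM;
	p->enabled = 1;	/* p->timeout is never written */
	/* Branching on an uninitialized value: KMSAN reports here. */
	if (p->timeout > 0)
		ret = p->timeout;
	kfree(p);
	return ret;
}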

KMSAN has reported more than 300 bugs in the past few years (recently
fixed bugs: [2]), most of them with the help of syzkaller. Such bugs
keep getting introduced into the kernel despite new compiler warnings
and other analyses (the 5.16 cycle already resulted in several
KMSAN-reported bugs, e.g. [3]). Mitigations like total stack and heap
initialization are unfortunately very far from being deployable.

The proposed patchset contains the KMSAN runtime implementation together
with small changes to other subsystems needed to make KMSAN work.

The latter changes fall into several categories:

1. Changes and refactorings of existing code required to add KMSAN:
- [01/45] x86: add missing include to sparsemem.h
- [02/45] stackdepot: reserve 5 extra bits in depot_stack_handle_t
- [03/45] instrumented.h: allow instrumenting both sides of copy_from_user()
- [04/45] x86: asm: instrument usercopy in get_user() and __put_user_size()
- [05/45] asm-generic: instrument usercopy in cacheflush.h
- [10/45] libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE

2. KMSAN-related declarations in generic code, KMSAN runtime library,
docs and configs:
- [06/45] kmsan: add ReST documentation
- [07/45] kmsan: introduce __no_sanitize_memory and __no_kmsan_checks
- [09/45] x86: kmsan: pgtable: reduce vmalloc space
- [11/45] kmsan: add KMSAN runtime core
- [13/45] MAINTAINERS: add entry for KMSAN
- [25/45] kmsan: add tests for KMSAN
- [32/45] objtool: kmsan: list KMSAN API functions as uaccess-safe
- [36/45] x86: kmsan: use __msan_ string functions where possible
- [45/45] x86: kmsan: enable KMSAN builds for x86

3. Adding hooks to various subsystems to notify KMSAN about memory
state changes:
- [14/45] mm: kmsan: maintain KMSAN metadata for page operations
- [15/45] mm: kmsan: call KMSAN hooks from SLUB code
- [16/45] kmsan: handle task creation and exiting
- [17/45] init: kmsan: call KMSAN initialization routines
- [18/45] instrumented.h: add KMSAN support
- [20/45] kmsan: add iomap support
- [21/45] Input: libps2: mark data received in __ps2_command() as initialized
- [22/45] dma: kmsan: unpoison DMA mappings
- [35/45] x86: kmsan: handle open-coded assembly in lib/iomem.c
- [37/45] x86: kmsan: sync metadata pages on page fault

4. Changes that prevent false reports by explicitly initializing memory,
disabling optimized code that may trick KMSAN, and selectively skipping
instrumentation:
- [08/45] kmsan: mark noinstr as __no_sanitize_memory
- [12/45] kmsan: disable instrumentation of unsupported common kernel code
- [19/45] kmsan: unpoison @tlb in arch_tlb_gather_mmu()
- [23/45] virtio: kmsan: check/unpoison scatterlist in vring_map_one_sg()
- [24/45] kmsan: handle memory sent to/from USB
- [26/45] kmsan: disable strscpy() optimization under KMSAN
- [27/45] crypto: kmsan: disable accelerated configs under KMSAN
- [28/45] kmsan: disable physical page merging in biovec
- [29/45] block: kmsan: skip bio block merging logic for KMSAN
- [30/45] kcov: kmsan: unpoison area->list in kcov_remote_area_put()
- [31/45] security: kmsan: fix interoperability with auto-initialization
- [33/45] x86: kmsan: disable instrumentation of unsupported code
- [34/45] x86: kmsan: skip shadow checks in __switch_to()
- [38/45] x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN
- [39/45] x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS
- [40/45] x86: kmsan: don't instrument stack walking functions
- [41/45] entry: kmsan: introduce kmsan_unpoison_entry_regs()

5. Fixes for bugs detected with CONFIG_KMSAN_CHECK_PARAM_RETVAL:
- [42/45] bpf: kmsan: initialize BPF registers with zeroes
- [43/45] namei: initialize parameters passed to step_into()
- [44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface


This patchset allows one to boot and run a defconfig+KMSAN kernel under
QEMU without known false positives. It does not, however, guarantee the
absence of false positives in drivers for certain devices or in less
tested subsystems, although KMSAN is actively tested on syzbot with a
large config.

The biggest difference between this patch series and v3 is the
introduction of CONFIG_KMSAN_CHECK_PARAM_RETVAL, which maps to the
-fsanitize-memory-param-retval compiler flag and enforces conservative
checks of most kernel function parameters passed by value. As discussed
in [4], passing uninitialized values as function parameters is
considered undefined behavior, so KMSAN now reports such cases as
errors. Several newly added patches fix known manifestations of these
errors.
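
To illustrate the difference (hypothetical functions, not taken from the
series): without -fsanitize-memory-param-retval the report below would
only be issued inside the callee, at the point where the value is used;
with CONFIG_KMSAN_CHECK_PARAM_RETVAL=y the call itself is flagged.

#include <linux/printk.h>

static void start_engine(int flags)
{
	/* Without param-retval the report is issued here, at the use. */
	if (flags & 1)
		pr_info("starting\n");
}

static void example(void)
{
	int flags;	/* never initialized */

	/* With param-retval the call itself is reported. */
	start_engine(flags);
}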

The patchset was generated relative to Linux v5.19-rc4. The most
up-to-date KMSAN tree currently resides at
https://github.com/google/kmsan/. One may find it handy to review these
patches in Gerrit [5].

A huge thanks goes to the reviewers of the RFC patch series sent to LKML
in 2020 ([6]).

[1] https://clang.llvm.org/docs/MemorySanitizer.html
[2] https://syzkaller.appspot.com/upstream/fixed?manager=ci-upstream-kmsan-gce
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/12604/
[6] https://lore.kernel.org/all/[email protected]/

Alexander Potapenko (44):
stackdepot: reserve 5 extra bits in depot_stack_handle_t
instrumented.h: allow instrumenting both sides of copy_from_user()
x86: asm: instrument usercopy in get_user() and __put_user_size()
asm-generic: instrument usercopy in cacheflush.h
kmsan: add ReST documentation
kmsan: introduce __no_sanitize_memory and __no_kmsan_checks
kmsan: mark noinstr as __no_sanitize_memory
x86: kmsan: pgtable: reduce vmalloc space
libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE
kmsan: add KMSAN runtime core
kmsan: disable instrumentation of unsupported common kernel code
MAINTAINERS: add entry for KMSAN
mm: kmsan: maintain KMSAN metadata for page operations
mm: kmsan: call KMSAN hooks from SLUB code
kmsan: handle task creation and exiting
init: kmsan: call KMSAN initialization routines
instrumented.h: add KMSAN support
kmsan: unpoison @tlb in arch_tlb_gather_mmu()
kmsan: add iomap support
Input: libps2: mark data received in __ps2_command() as initialized
dma: kmsan: unpoison DMA mappings
virtio: kmsan: check/unpoison scatterlist in vring_map_one_sg()
kmsan: handle memory sent to/from USB
kmsan: add tests for KMSAN
kmsan: disable strscpy() optimization under KMSAN
crypto: kmsan: disable accelerated configs under KMSAN
kmsan: disable physical page merging in biovec
block: kmsan: skip bio block merging logic for KMSAN
kcov: kmsan: unpoison area->list in kcov_remote_area_put()
security: kmsan: fix interoperability with auto-initialization
objtool: kmsan: list KMSAN API functions as uaccess-safe
x86: kmsan: disable instrumentation of unsupported code
x86: kmsan: skip shadow checks in __switch_to()
x86: kmsan: handle open-coded assembly in lib/iomem.c
x86: kmsan: use __msan_ string functions where possible
x86: kmsan: sync metadata pages on page fault
x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for
KASAN/KMSAN
x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS
x86: kmsan: don't instrument stack walking functions
entry: kmsan: introduce kmsan_unpoison_entry_regs()
bpf: kmsan: initialize BPF registers with zeroes
namei: initialize parameters passed to step_into()
mm: fs: initialize fsdata passed to write_begin/write_end interface
x86: kmsan: enable KMSAN builds for x86

Dmitry Vyukov (1):
x86: add missing include to sparsemem.h

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kmsan.rst | 422 ++++++++++++++++++
MAINTAINERS | 12 +
Makefile | 1 +
arch/s390/lib/uaccess.c | 3 +-
arch/x86/Kconfig | 9 +-
arch/x86/boot/Makefile | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/entry/vdso/Makefile | 3 +
arch/x86/include/asm/checksum.h | 16 +-
arch/x86/include/asm/kmsan.h | 55 +++
arch/x86/include/asm/page_64.h | 12 +
arch/x86/include/asm/pgtable_64_types.h | 41 +-
arch/x86/include/asm/sparsemem.h | 2 +
arch/x86/include/asm/string_64.h | 23 +-
arch/x86/include/asm/uaccess.h | 7 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/dumpstack.c | 6 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/unwind_frame.c | 11 +
arch/x86/lib/Makefile | 2 +
arch/x86/lib/iomem.c | 5 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/fault.c | 23 +-
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/ioremap.c | 3 +
arch/x86/realmode/rm/Makefile | 1 +
block/bio.c | 2 +
block/blk.h | 7 +
crypto/Kconfig | 30 ++
drivers/firmware/efi/libstub/Makefile | 1 +
drivers/input/serio/libps2.c | 5 +-
drivers/net/Kconfig | 1 +
drivers/nvdimm/nd.h | 2 +-
drivers/nvdimm/pfn_devs.c | 2 +-
drivers/usb/core/urb.c | 2 +
drivers/virtio/virtio_ring.c | 10 +-
fs/buffer.c | 4 +-
fs/namei.c | 10 +-
include/asm-generic/cacheflush.h | 9 +-
include/linux/compiler-clang.h | 23 +
include/linux/compiler-gcc.h | 6 +
include/linux/compiler_types.h | 3 +-
include/linux/fortify-string.h | 2 +
include/linux/highmem.h | 3 +
include/linux/instrumented.h | 26 +-
include/linux/kmsan-checks.h | 83 ++++
include/linux/kmsan.h | 362 ++++++++++++++++
include/linux/mm_types.h | 12 +
include/linux/sched.h | 5 +
include/linux/stackdepot.h | 8 +
include/linux/uaccess.h | 19 +-
init/main.c | 3 +
kernel/Makefile | 1 +
kernel/bpf/core.c | 2 +-
kernel/dma/mapping.c | 9 +-
kernel/entry/common.c | 5 +
kernel/exit.c | 2 +
kernel/fork.c | 2 +
kernel/kcov.c | 7 +
kernel/locking/Makefile | 3 +-
lib/Kconfig.debug | 1 +
lib/Kconfig.kmsan | 62 +++
lib/Makefile | 3 +
lib/iomap.c | 44 ++
lib/iov_iter.c | 9 +-
lib/stackdepot.c | 29 +-
lib/string.c | 8 +
lib/usercopy.c | 3 +-
mm/Makefile | 1 +
mm/filemap.c | 2 +-
mm/internal.h | 6 +
mm/kasan/common.c | 2 +-
mm/kmsan/Makefile | 28 ++
mm/kmsan/core.c | 468 ++++++++++++++++++++
mm/kmsan/hooks.c | 395 +++++++++++++++++
mm/kmsan/init.c | 238 ++++++++++
mm/kmsan/instrumentation.c | 271 ++++++++++++
mm/kmsan/kmsan.h | 195 +++++++++
mm/kmsan/kmsan_test.c | 552 ++++++++++++++++++++++++
mm/kmsan/report.c | 211 +++++++++
mm/kmsan/shadow.c | 297 +++++++++++++
mm/memory.c | 2 +
mm/mmu_gather.c | 10 +
mm/page_alloc.c | 18 +
mm/slab.h | 1 +
mm/slub.c | 18 +
mm/vmalloc.c | 20 +-
scripts/Makefile.kmsan | 8 +
scripts/Makefile.lib | 9 +
security/Kconfig.hardening | 4 +
tools/objtool/check.c | 20 +
93 files changed, 4222 insertions(+), 52 deletions(-)
create mode 100644 Documentation/dev-tools/kmsan.rst
create mode 100644 arch/x86/include/asm/kmsan.h
create mode 100644 include/linux/kmsan-checks.h
create mode 100644 include/linux/kmsan.h
create mode 100644 lib/Kconfig.kmsan
create mode 100644 mm/kmsan/Makefile
create mode 100644 mm/kmsan/core.c
create mode 100644 mm/kmsan/hooks.c
create mode 100644 mm/kmsan/init.c
create mode 100644 mm/kmsan/instrumentation.c
create mode 100644 mm/kmsan/kmsan.h
create mode 100644 mm/kmsan/kmsan_test.c
create mode 100644 mm/kmsan/report.c
create mode 100644 mm/kmsan/shadow.c
create mode 100644 scripts/Makefile.kmsan

--
2.37.0.rc0.161.g10f37bed90-goog


2022-07-01 14:44:27

by Alexander Potapenko

Subject: [PATCH v4 17/45] init: kmsan: call KMSAN initialization routines

kmsan_init_shadow() scans the mappings created at boot time and creates
metadata pages for those mappings.

When the memblock allocator returns pages to pagealloc, we reserve 2/3
of those pages and use them as metadata for the remaining 1/3. Once KMSAN
starts, every page allocated by pagealloc has its associated shadow and
origin pages.

kmsan_init_runtime() initializes the bookkeeping for init_task and
enables KMSAN.

Signed-off-by: Alexander Potapenko <[email protected]>
---
v2:
-- move mm/kmsan/init.c and kmsan_memblock_free_pages() to this patch
-- print a warning that KMSAN is a debugging tool (per Greg K-H's
request)

v4:
-- change sizeof(type) to sizeof(*ptr)
-- replace occurrences of |var| with @var
-- swap init: and kmsan: in the subject
-- do not export __init functions

Link: https://linux-review.googlesource.com/id/I7bc53706141275914326df2345881ffe0cdd16bd
---
include/linux/kmsan.h | 48 +++++++++
init/main.c | 3 +
mm/kmsan/Makefile | 3 +-
mm/kmsan/init.c | 238 ++++++++++++++++++++++++++++++++++++++++++
mm/kmsan/kmsan.h | 3 +
mm/kmsan/shadow.c | 36 +++++++
mm/page_alloc.c | 3 +
7 files changed, 333 insertions(+), 1 deletion(-)
create mode 100644 mm/kmsan/init.c

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index b71e2032222e9..82fd564cc72e7 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -51,6 +51,40 @@ void kmsan_task_create(struct task_struct *task);
*/
void kmsan_task_exit(struct task_struct *task);

+/**
+ * kmsan_init_shadow() - Initialize KMSAN shadow at boot time.
+ *
+ * Allocate and initialize KMSAN metadata for early allocations.
+ */
+void __init kmsan_init_shadow(void);
+
+/**
+ * kmsan_init_runtime() - Initialize KMSAN state and enable KMSAN.
+ */
+void __init kmsan_init_runtime(void);
+
+/**
+ * kmsan_memblock_free_pages() - handle freeing of memblock pages.
+ * @page: struct page to free.
+ * @order: order of @page.
+ *
+ * Freed pages are either returned to buddy allocator or held back to be used
+ * as metadata pages.
+ */
+bool __init kmsan_memblock_free_pages(struct page *page, unsigned int order);
+
+/**
+ * kmsan_task_create() - Initialize KMSAN state for the task.
+ * @task: task to initialize.
+ */
+void kmsan_task_create(struct task_struct *task);
+
+/**
+ * kmsan_task_exit() - Notify KMSAN that a task has exited.
+ * @task: task about to finish.
+ */
+void kmsan_task_exit(struct task_struct *task);
+
/**
* kmsan_alloc_page() - Notify KMSAN about an alloc_pages() call.
* @page: struct page pointer returned by alloc_pages().
@@ -172,6 +206,20 @@ void kmsan_iounmap_page_range(unsigned long start, unsigned long end);

#else

+static inline void kmsan_init_shadow(void)
+{
+}
+
+static inline void kmsan_init_runtime(void)
+{
+}
+
+static inline bool kmsan_memblock_free_pages(struct page *page,
+ unsigned int order)
+{
+ return true;
+}
+
static inline void kmsan_task_create(struct task_struct *task)
{
}
diff --git a/init/main.c b/init/main.c
index 0ee39cdcfcac9..7ba48a9ff1d53 100644
--- a/init/main.c
+++ b/init/main.c
@@ -34,6 +34,7 @@
#include <linux/percpu.h>
#include <linux/kmod.h>
#include <linux/kprobes.h>
+#include <linux/kmsan.h>
#include <linux/vmalloc.h>
#include <linux/kernel_stat.h>
#include <linux/start_kernel.h>
@@ -835,6 +836,7 @@ static void __init mm_init(void)
init_mem_debugging_and_hardening();
kfence_alloc_pool();
report_meminit();
+ kmsan_init_shadow();
stack_depot_early_init();
mem_init();
mem_init_print_info();
@@ -852,6 +854,7 @@ static void __init mm_init(void)
init_espfix_bsp();
/* Should be run after espfix64 is set up. */
pti_init();
+ kmsan_init_runtime();
}

#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
diff --git a/mm/kmsan/Makefile b/mm/kmsan/Makefile
index 550ad8625e4f9..401acb1a491ce 100644
--- a/mm/kmsan/Makefile
+++ b/mm/kmsan/Makefile
@@ -3,7 +3,7 @@
# Makefile for KernelMemorySanitizer (KMSAN).
#
#
-obj-y := core.o instrumentation.o hooks.o report.o shadow.o
+obj-y := core.o instrumentation.o init.o hooks.o report.o shadow.o

KMSAN_SANITIZE := n
KCOV_INSTRUMENT := n
@@ -18,6 +18,7 @@ CFLAGS_REMOVE.o = $(CC_FLAGS_FTRACE)

CFLAGS_core.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_hooks.o := $(CC_FLAGS_KMSAN_RUNTIME)
+CFLAGS_init.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_instrumentation.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_report.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_shadow.o := $(CC_FLAGS_KMSAN_RUNTIME)
diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
new file mode 100644
index 0000000000000..abbf595a1e359
--- /dev/null
+++ b/mm/kmsan/init.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KMSAN initialization routines.
+ *
+ * Copyright (C) 2017-2021 Google LLC
+ * Author: Alexander Potapenko <[email protected]>
+ *
+ */
+
+#include "kmsan.h"
+
+#include <asm/sections.h>
+#include <linux/mm.h>
+#include <linux/memblock.h>
+
+#include "../internal.h"
+
+#define NUM_FUTURE_RANGES 128
+struct start_end_pair {
+ u64 start, end;
+};
+
+static struct start_end_pair start_end_pairs[NUM_FUTURE_RANGES] __initdata;
+static int future_index __initdata;
+
+/*
+ * Record a range of memory for which the metadata pages will be created once
+ * the page allocator becomes available.
+ */
+static void __init kmsan_record_future_shadow_range(void *start, void *end)
+{
+ u64 nstart = (u64)start, nend = (u64)end, cstart, cend;
+ bool merged = false;
+ int i;
+
+ KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
+ KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
+ nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
+ nend = ALIGN(nend, PAGE_SIZE);
+
+ /*
+ * Scan the existing ranges to see if any of them overlaps with
+ * [start, end). In that case, merge the two ranges instead of
+ * creating a new one.
+ * The number of ranges is less than 20, so there is no need to organize
+ * them into a more intelligent data structure.
+ */
+ for (i = 0; i < future_index; i++) {
+ cstart = start_end_pairs[i].start;
+ cend = start_end_pairs[i].end;
+ if ((cstart < nstart && cend < nstart) ||
+ (cstart > nend && cend > nend))
+ /* ranges are disjoint - do not merge */
+ continue;
+ start_end_pairs[i].start = min(nstart, cstart);
+ start_end_pairs[i].end = max(nend, cend);
+ merged = true;
+ break;
+ }
+ if (merged)
+ return;
+ start_end_pairs[future_index].start = nstart;
+ start_end_pairs[future_index].end = nend;
+ future_index++;
+}
+
+/*
+ * Initialize the shadow for existing mappings during kernel initialization.
+ * These include kernel text/data sections, NODE_DATA and future ranges
+ * registered while creating other data (e.g. percpu).
+ *
+ * Allocations via memblock can be only done before slab is initialized.
+ */
+void __init kmsan_init_shadow(void)
+{
+ const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+ phys_addr_t p_start, p_end;
+ int nid;
+ u64 i;
+
+ for_each_reserved_mem_range(i, &p_start, &p_end)
+ kmsan_record_future_shadow_range(phys_to_virt(p_start),
+ phys_to_virt(p_end));
+ /* Allocate shadow for .data */
+ kmsan_record_future_shadow_range(_sdata, _edata);
+
+ for_each_online_node(nid)
+ kmsan_record_future_shadow_range(
+ NODE_DATA(nid), (char *)NODE_DATA(nid) + nd_size);
+
+ for (i = 0; i < future_index; i++)
+ kmsan_init_alloc_meta_for_range(
+ (void *)start_end_pairs[i].start,
+ (void *)start_end_pairs[i].end);
+}
+
+struct page_pair {
+ struct page *shadow, *origin;
+};
+static struct page_pair held_back[MAX_ORDER] __initdata;
+
+/*
+ * Eager metadata allocation. When the memblock allocator is freeing pages to
+ * pagealloc, we use 2/3 of them as metadata for the remaining 1/3.
+ * We store the pointers to the returned blocks of pages in held_back[] grouped
+ * by their order: when kmsan_memblock_free_pages() is called for the first
+ * time with a certain order, it is reserved as a shadow block, for the second
+ * time - as an origin block. On the third time the incoming block receives its
+ * shadow and origin ranges from the previously saved shadow and origin blocks,
+ * after which held_back[order] can be used again.
+ *
+ * At the very end there may be leftover blocks in held_back[]. They are
+ * collected later by kmsan_memblock_discard().
+ */
+bool kmsan_memblock_free_pages(struct page *page, unsigned int order)
+{
+ struct page *shadow, *origin;
+
+ if (!held_back[order].shadow) {
+ held_back[order].shadow = page;
+ return false;
+ }
+ if (!held_back[order].origin) {
+ held_back[order].origin = page;
+ return false;
+ }
+ shadow = held_back[order].shadow;
+ origin = held_back[order].origin;
+ kmsan_setup_meta(page, shadow, origin, order);
+
+ held_back[order].shadow = NULL;
+ held_back[order].origin = NULL;
+ return true;
+}
+
+#define MAX_BLOCKS 8
+struct smallstack {
+ struct page *items[MAX_BLOCKS];
+ int index;
+ int order;
+};
+
+static struct smallstack collect = {
+ .index = 0,
+ .order = MAX_ORDER,
+};
+
+static void smallstack_push(struct smallstack *stack, struct page *pages)
+{
+ KMSAN_WARN_ON(stack->index == MAX_BLOCKS);
+ stack->items[stack->index] = pages;
+ stack->index++;
+}
+#undef MAX_BLOCKS
+
+static struct page *smallstack_pop(struct smallstack *stack)
+{
+ struct page *ret;
+
+ KMSAN_WARN_ON(stack->index == 0);
+ stack->index--;
+ ret = stack->items[stack->index];
+ stack->items[stack->index] = NULL;
+ return ret;
+}
+
+static void do_collection(void)
+{
+ struct page *page, *shadow, *origin;
+
+ while (collect.index >= 3) {
+ page = smallstack_pop(&collect);
+ shadow = smallstack_pop(&collect);
+ origin = smallstack_pop(&collect);
+ kmsan_setup_meta(page, shadow, origin, collect.order);
+ __free_pages_core(page, collect.order);
+ }
+}
+
+static void collect_split(void)
+{
+ struct smallstack tmp = {
+ .order = collect.order - 1,
+ .index = 0,
+ };
+ struct page *page;
+
+ if (!collect.order)
+ return;
+ while (collect.index) {
+ page = smallstack_pop(&collect);
+ smallstack_push(&tmp, &page[0]);
+ smallstack_push(&tmp, &page[1 << tmp.order]);
+ }
+ __memcpy(&collect, &tmp, sizeof(tmp));
+}
+
+/*
+ * Memblock is about to go away. Split the page blocks left over in held_back[]
+ * and return 1/3 of that memory to the system.
+ */
+static void kmsan_memblock_discard(void)
+{
+ int i;
+
+ /*
+ * For each order=N:
+ * - push held_back[N].shadow and .origin to @collect;
+ * - while there are >= 3 elements in @collect, do garbage collection:
+ * - pop 3 ranges from @collect;
+ * - use two of them as shadow and origin for the third one;
+ * - repeat;
+ * - split each remaining element from @collect into 2 ranges of
+ * order=N-1,
+ * - repeat.
+ */
+ collect.order = MAX_ORDER - 1;
+ for (i = MAX_ORDER - 1; i >= 0; i--) {
+ if (held_back[i].shadow)
+ smallstack_push(&collect, held_back[i].shadow);
+ if (held_back[i].origin)
+ smallstack_push(&collect, held_back[i].origin);
+ held_back[i].shadow = NULL;
+ held_back[i].origin = NULL;
+ do_collection();
+ collect_split();
+ }
+}
+
+void __init kmsan_init_runtime(void)
+{
+ /* Assuming current is init_task */
+ kmsan_internal_task_create(current);
+ kmsan_memblock_discard();
+ pr_info("Starting KernelMemorySanitizer\n");
+ pr_info("ATTENTION: KMSAN is a debugging tool! Do not use it on production machines!\n");
+ kmsan_enabled = true;
+}
diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index c7fb8666607e2..2f17912ef863f 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -66,6 +66,7 @@ struct shadow_origin_ptr {
struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size,
bool store);
void *kmsan_get_metadata(void *addr, bool is_origin);
+void __init kmsan_init_alloc_meta_for_range(void *start, void *end);

enum kmsan_bug_reason {
REASON_ANY,
@@ -188,5 +189,7 @@ bool kmsan_internal_is_module_addr(void *vaddr);
bool kmsan_internal_is_vmalloc_addr(void *addr);

struct page *kmsan_vmalloc_to_page_or_null(void *vaddr);
+void kmsan_setup_meta(struct page *page, struct page *shadow,
+ struct page *origin, int order);

#endif /* __MM_KMSAN_KMSAN_H */
diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 416cb85487a1a..7b254c30d42cc 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -259,3 +259,39 @@ void kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
kfree(s_pages);
kfree(o_pages);
}
+
+/* Allocate metadata for pages allocated at boot time. */
+void __init kmsan_init_alloc_meta_for_range(void *start, void *end)
+{
+ struct page *shadow_p, *origin_p;
+ void *shadow, *origin;
+ struct page *page;
+ u64 addr, size;
+
+ start = (void *)ALIGN_DOWN((u64)start, PAGE_SIZE);
+ size = ALIGN((u64)end - (u64)start, PAGE_SIZE);
+ shadow = memblock_alloc(size, PAGE_SIZE);
+ origin = memblock_alloc(size, PAGE_SIZE);
+ for (addr = 0; addr < size; addr += PAGE_SIZE) {
+ page = virt_to_page_or_null((char *)start + addr);
+ shadow_p = virt_to_page_or_null((char *)shadow + addr);
+ set_no_shadow_origin_page(shadow_p);
+ shadow_page_for(page) = shadow_p;
+ origin_p = virt_to_page_or_null((char *)origin + addr);
+ set_no_shadow_origin_page(origin_p);
+ origin_page_for(page) = origin_p;
+ }
+}
+
+void kmsan_setup_meta(struct page *page, struct page *shadow,
+ struct page *origin, int order)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ set_no_shadow_origin_page(&shadow[i]);
+ set_no_shadow_origin_page(&origin[i]);
+ shadow_page_for(&page[i]) = &shadow[i];
+ origin_page_for(&page[i]) = &origin[i];
+ }
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 785459251145e..e8d5a0b2a3264 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1731,6 +1731,9 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
{
if (early_page_uninitialised(pfn))
return;
+ if (!kmsan_memblock_free_pages(page, order))
+ /* KMSAN will take care of these pages. */
+ return;
__free_pages_core(page, order);
}

--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 14:44:54

by Alexander Potapenko

Subject: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

Functions implementing the a_ops->write_end() interface accept the
`void *fsdata` parameter that is supposed to be initialized by the
corresponding a_ops->write_begin() (which accepts `void **fsdata`).

However, not all a_ops->write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops->write_end(),
resulting in undefined behavior.

Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops
implementations.

This patch covers only the following cases found by running x86 KMSAN
under syzkaller:

- generic_perform_write()
- cont_expand_zero() and generic_cont_expand_simple()
- page_symlink()

Other cases of passing uninitialized fsdata may persist in the codebase.
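
For reference, a hedged sketch of the calling pattern this patch hardens
(simplified, no error unwinding; signatures as in the v5.19 a_ops
interface; the function name is hypothetical). If ->write_begin() returns
without touching *fsdata, the old code handed garbage to ->write_end();
initializing the variable at its declaration removes that path.

#include <linux/fs.h>
#include <linux/pagemap.h>

static int write_one_chunk(struct file *file, struct address_space *mapping,
			   loff_t pos, unsigned int len)
{
	const struct address_space_operations *aops = mapping->a_ops;
	struct page *page;
	void *fsdata = NULL;	/* was: void *fsdata; */
	int err;

	err = aops->write_begin(file, mapping, pos, len, &page, &fsdata);
	if (err)
		return err;
	/* ... copy the new data into @page ... */
	return aops->write_end(file, mapping, pos, len, len, page, fsdata);
}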

Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/I414f0ee3a164c9c335d91d82ce4558f6f2841471
---
fs/buffer.c | 4 ++--
fs/namei.c | 2 +-
mm/filemap.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 898c7f301b1b9..d014009cff941 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2349,7 +2349,7 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size)
struct address_space *mapping = inode->i_mapping;
const struct address_space_operations *aops = mapping->a_ops;
struct page *page;
- void *fsdata;
+ void *fsdata = NULL;
int err;

err = inode_newsize_ok(inode, size);
@@ -2375,7 +2375,7 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
const struct address_space_operations *aops = mapping->a_ops;
unsigned int blocksize = i_blocksize(inode);
struct page *page;
- void *fsdata;
+ void *fsdata = NULL;
pgoff_t index, curidx;
loff_t curpos;
unsigned zerofrom, offset, len;
diff --git a/fs/namei.c b/fs/namei.c
index 6b39dfd3b41bc..5e3ff9d65f502 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5051,7 +5051,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
const struct address_space_operations *aops = mapping->a_ops;
bool nofs = !mapping_gfp_constraint(mapping, __GFP_FS);
struct page *page;
- void *fsdata;
+ void *fsdata = NULL;
int err;
unsigned int flags;

diff --git a/mm/filemap.c b/mm/filemap.c
index ffdfbc8b0e3ca..72467f00f1916 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3753,7 +3753,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
- void *fsdata;
+ void *fsdata = NULL;

offset = (pos & (PAGE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_SIZE - offset,
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 14:59:22

by Alexander Potapenko

Subject: [PATCH v4 37/45] x86: kmsan: sync metadata pages on page fault

KMSAN assumes that shadow and origin pages for every allocated page are
accessible. For pages in the [VMALLOC_START, VMALLOC_END) range, those
metadata pages start at KMSAN_VMALLOC_SHADOW_START and
KMSAN_VMALLOC_ORIGIN_START, so a larger memory region must be synced.

Signed-off-by: Alexander Potapenko <[email protected]>

---

v2:
-- addressed reports from kernel test robot <[email protected]>

Link: https://linux-review.googlesource.com/id/Ia5bd541e54f1ecc11b86666c3ec87c62ac0bdfb8
---
arch/x86/mm/fault.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d042..d07fe0801f203 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -260,7 +260,7 @@ static noinline int vmalloc_fault(unsigned long address)
}
NOKPROBE_SYMBOL(vmalloc_fault);

-void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+static void __arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
unsigned long addr;

@@ -284,6 +284,27 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
}
}

+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+ __arch_sync_kernel_mappings(start, end);
+#ifdef CONFIG_KMSAN
+ /*
+ * KMSAN maintains two additional metadata page mappings for the
+ * [VMALLOC_START, VMALLOC_END) range. These mappings start at
+ * KMSAN_VMALLOC_SHADOW_START and KMSAN_VMALLOC_ORIGIN_START and
+ * have to be synced together with the vmalloc memory mapping.
+ */
+ if (start >= VMALLOC_START && end < VMALLOC_END) {
+ __arch_sync_kernel_mappings(
+ start - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START,
+ end - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START);
+ __arch_sync_kernel_mappings(
+ start - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START,
+ end - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START);
+ }
+#endif
+}
+
static bool low_pfn(unsigned long pfn)
{
return pfn < max_low_pfn;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 14:59:45

by Alexander Potapenko

Subject: [PATCH v4 39/45] x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS

dentry_string_cmp() calls read_word_at_a_time(), which might read
uninitialized bytes to optimize string comparisons.
Disabling CONFIG_DCACHE_WORD_ACCESS should prohibit this optimization,
as well as (probably) similar ones.
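
For context, a hedged sketch of the access pattern in question
(hypothetical helper, not the actual dentry_string_cmp() code; assumes
little-endian, 0 < len <= sizeof(long), and @name residing in a buffer
of at least sizeof(long) bytes): the masked comparison never depends on
the trailing bytes, but the word-sized load still touches memory that
was never written, which is what KMSAN may complain about.

#include <linux/string.h>
#include <linux/types.h>

static bool name_matches(const char *name, unsigned int len, unsigned long key)
{
	unsigned long a, mask;

	/* Word-at-a-time load: bytes past name[len] may be uninitialized. */
	memcpy(&a, name, sizeof(a));
	/* Keep only the first @len bytes (little-endian, 0 < len <= 8). */
	mask = ~0UL >> (8 * (sizeof(a) - len));
	return ((a ^ key) & mask) == 0;
}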

Suggested-by: Andrey Konovalov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/I4c0073224ac2897cafb8c037362c49dda9cfa133
---
arch/x86/Kconfig | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4a5d0a0f54dea..aadbb16a59f01 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -129,7 +129,9 @@ config X86
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
select CLOCKSOURCE_WATCHDOG
- select DCACHE_WORD_ACCESS
+ # Word-size accesses may read uninitialized data past the trailing \0
+ # in strings and cause false KMSAN reports.
+ select DCACHE_WORD_ACCESS if !KMSAN
select DYNAMIC_SIGFRAME
select EDAC_ATOMIC_SCRUB
select EDAC_SUPPORT
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:00:20

by Alexander Potapenko

Subject: [PATCH v4 27/45] crypto: kmsan: disable accelerated configs under KMSAN

KMSAN is unable to understand when initialized values come from assembly.
Disable accelerated configs in KMSAN builds to prevent false positive
reports.

Signed-off-by: Alexander Potapenko <[email protected]>

---

Link: https://linux-review.googlesource.com/id/Idb2334bf3a1b68b31b399709baefaa763038cc50
---
crypto/Kconfig | 30 ++++++++++++++++++++++++++++++
drivers/net/Kconfig | 1 +
2 files changed, 31 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 1d44893a997ba..7ddda6072ef35 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -298,6 +298,7 @@ config CRYPTO_CURVE25519
config CRYPTO_CURVE25519_X86
tristate "x86_64 accelerated Curve25519 scalar multiplication library"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_LIB_CURVE25519_GENERIC
select CRYPTO_ARCH_HAVE_LIB_CURVE25519

@@ -346,11 +347,13 @@ config CRYPTO_AEGIS128
config CRYPTO_AEGIS128_SIMD
bool "Support SIMD acceleration for AEGIS-128"
depends on CRYPTO_AEGIS128 && ((ARM || ARM64) && KERNEL_MODE_NEON)
+ depends on !KMSAN # avoid false positives from assembly
default y

config CRYPTO_AEGIS128_AESNI_SSE2
tristate "AEGIS-128 AEAD algorithm (x86_64 AESNI+SSE2 implementation)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_AEAD
select CRYPTO_SIMD
help
@@ -487,6 +490,7 @@ config CRYPTO_NHPOLY1305
config CRYPTO_NHPOLY1305_SSE2
tristate "NHPoly1305 hash function (x86_64 SSE2 implementation)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_NHPOLY1305
help
SSE2 optimized implementation of the hash function used by the
@@ -495,6 +499,7 @@ config CRYPTO_NHPOLY1305_SSE2
config CRYPTO_NHPOLY1305_AVX2
tristate "NHPoly1305 hash function (x86_64 AVX2 implementation)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_NHPOLY1305
help
AVX2 optimized implementation of the hash function used by the
@@ -608,6 +613,7 @@ config CRYPTO_CRC32C
config CRYPTO_CRC32C_INTEL
tristate "CRC32c INTEL hardware acceleration"
depends on X86
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_HASH
help
In Intel processor with SSE4.2 supported, the processor will
@@ -648,6 +654,7 @@ config CRYPTO_CRC32
config CRYPTO_CRC32_PCLMUL
tristate "CRC32 PCLMULQDQ hardware acceleration"
depends on X86
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_HASH
select CRC32
help
@@ -713,6 +720,7 @@ config CRYPTO_BLAKE2S
config CRYPTO_BLAKE2S_X86
tristate "BLAKE2s digest algorithm (x86 accelerated version)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_LIB_BLAKE2S_GENERIC
select CRYPTO_ARCH_HAVE_LIB_BLAKE2S

@@ -727,6 +735,7 @@ config CRYPTO_CRCT10DIF
config CRYPTO_CRCT10DIF_PCLMUL
tristate "CRCT10DIF PCLMULQDQ hardware acceleration"
depends on X86 && 64BIT && CRC_T10DIF
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_HASH
help
For x86_64 processors with SSE4.2 and PCLMULQDQ supported,
@@ -779,6 +788,7 @@ config CRYPTO_POLY1305
config CRYPTO_POLY1305_X86_64
tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_LIB_POLY1305_GENERIC
select CRYPTO_ARCH_HAVE_LIB_POLY1305
help
@@ -867,6 +877,7 @@ config CRYPTO_SHA1
config CRYPTO_SHA1_SSSE3
tristate "SHA1 digest algorithm (SSSE3/AVX/AVX2/SHA-NI)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SHA1
select CRYPTO_HASH
help
@@ -878,6 +889,7 @@ config CRYPTO_SHA1_SSSE3
config CRYPTO_SHA256_SSSE3
tristate "SHA256 digest algorithm (SSSE3/AVX/AVX2/SHA-NI)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SHA256
select CRYPTO_HASH
help
@@ -890,6 +902,7 @@ config CRYPTO_SHA256_SSSE3
config CRYPTO_SHA512_SSSE3
tristate "SHA512 digest algorithm (SSSE3/AVX/AVX2)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SHA512
select CRYPTO_HASH
help
@@ -1065,6 +1078,7 @@ config CRYPTO_WP512
config CRYPTO_GHASH_CLMUL_NI_INTEL
tristate "GHASH hash function (CLMUL-NI accelerated)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_CRYPTD
help
This is the x86_64 CLMUL-NI accelerated implementation of
@@ -1115,6 +1129,7 @@ config CRYPTO_AES_TI
config CRYPTO_AES_NI_INTEL
tristate "AES cipher algorithms (AES-NI)"
depends on X86
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_AEAD
select CRYPTO_LIB_AES
select CRYPTO_ALGAPI
@@ -1239,6 +1254,7 @@ config CRYPTO_BLOWFISH_COMMON
config CRYPTO_BLOWFISH_X86_64
tristate "Blowfish cipher algorithm (x86_64)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_BLOWFISH_COMMON
imply CRYPTO_CTR
@@ -1269,6 +1285,7 @@ config CRYPTO_CAMELLIA
config CRYPTO_CAMELLIA_X86_64
tristate "Camellia cipher algorithm (x86_64)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
imply CRYPTO_CTR
help
@@ -1285,6 +1302,7 @@ config CRYPTO_CAMELLIA_X86_64
config CRYPTO_CAMELLIA_AESNI_AVX_X86_64
tristate "Camellia cipher algorithm (x86_64/AES-NI/AVX)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_CAMELLIA_X86_64
select CRYPTO_SIMD
@@ -1303,6 +1321,7 @@ config CRYPTO_CAMELLIA_AESNI_AVX_X86_64
config CRYPTO_CAMELLIA_AESNI_AVX2_X86_64
tristate "Camellia cipher algorithm (x86_64/AES-NI/AVX2)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_CAMELLIA_AESNI_AVX_X86_64
help
Camellia cipher algorithm module (x86_64/AES-NI/AVX2).
@@ -1348,6 +1367,7 @@ config CRYPTO_CAST5
config CRYPTO_CAST5_AVX_X86_64
tristate "CAST5 (CAST-128) cipher algorithm (x86_64/AVX)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_CAST5
select CRYPTO_CAST_COMMON
@@ -1371,6 +1391,7 @@ config CRYPTO_CAST6
config CRYPTO_CAST6_AVX_X86_64
tristate "CAST6 (CAST-256) cipher algorithm (x86_64/AVX)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_CAST6
select CRYPTO_CAST_COMMON
@@ -1404,6 +1425,7 @@ config CRYPTO_DES_SPARC64
config CRYPTO_DES3_EDE_X86_64
tristate "Triple DES EDE cipher algorithm (x86-64)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_LIB_DES
imply CRYPTO_CTR
@@ -1461,6 +1483,7 @@ config CRYPTO_CHACHA20
config CRYPTO_CHACHA20_X86_64
tristate "ChaCha stream cipher algorithms (x86_64/SSSE3/AVX2/AVX-512VL)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_LIB_CHACHA_GENERIC
select CRYPTO_ARCH_HAVE_LIB_CHACHA
@@ -1504,6 +1527,7 @@ config CRYPTO_SERPENT
config CRYPTO_SERPENT_SSE2_X86_64
tristate "Serpent cipher algorithm (x86_64/SSE2)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
select CRYPTO_SIMD
@@ -1523,6 +1547,7 @@ config CRYPTO_SERPENT_SSE2_X86_64
config CRYPTO_SERPENT_SSE2_586
tristate "Serpent cipher algorithm (i586/SSE2)"
depends on X86 && !64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
select CRYPTO_SIMD
@@ -1542,6 +1567,7 @@ config CRYPTO_SERPENT_SSE2_586
config CRYPTO_SERPENT_AVX_X86_64
tristate "Serpent cipher algorithm (x86_64/AVX)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
select CRYPTO_SIMD
@@ -1562,6 +1588,7 @@ config CRYPTO_SERPENT_AVX_X86_64
config CRYPTO_SERPENT_AVX2_X86_64
tristate "Serpent cipher algorithm (x86_64/AVX2)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SERPENT_AVX_X86_64
help
Serpent cipher algorithm, by Anderson, Biham & Knudsen.
@@ -1706,6 +1733,7 @@ config CRYPTO_TWOFISH_586
config CRYPTO_TWOFISH_X86_64
tristate "Twofish cipher algorithm (x86_64)"
depends on (X86 || UML_X86) && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_ALGAPI
select CRYPTO_TWOFISH_COMMON
imply CRYPTO_CTR
@@ -1723,6 +1751,7 @@ config CRYPTO_TWOFISH_X86_64
config CRYPTO_TWOFISH_X86_64_3WAY
tristate "Twofish cipher algorithm (x86_64, 3-way parallel)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_TWOFISH_COMMON
select CRYPTO_TWOFISH_X86_64
@@ -1743,6 +1772,7 @@ config CRYPTO_TWOFISH_X86_64_3WAY
config CRYPTO_TWOFISH_AVX_X86_64
tristate "Twofish cipher algorithm (x86_64/AVX)"
depends on X86 && 64BIT
+ depends on !KMSAN # avoid false positives from assembly
select CRYPTO_SKCIPHER
select CRYPTO_SIMD
select CRYPTO_TWOFISH_COMMON
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index b2a4f998c180e..fed89b6981759 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -76,6 +76,7 @@ config WIREGUARD
tristate "WireGuard secure network tunnel"
depends on NET && INET
depends on IPV6 || !IPV6
+ depends on !KMSAN # KMSAN doesn't support the crypto configs below
select NET_UDP_TUNNEL
select DST_CACHE
select CRYPTO
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:01:20

by Alexander Potapenko

Subject: [PATCH v4 41/45] entry: kmsan: introduce kmsan_unpoison_entry_regs()

The struct pt_regs passed into IRQ entry code is set up by uninstrumented
asm functions, so KMSAN may not notice that the registers are
initialized.

kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs,
preventing potential false positives. Unlike kmsan_unpoison_memory(),
it can be called even when kmsan_in_runtime() returns true, which is
often the case in IRQ entry code.

Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/Ibfd7018ac847fd8e5491681f508ba5d14e4669cf
---
include/linux/kmsan.h | 15 +++++++++++++++
kernel/entry/common.c | 5 +++++
mm/kmsan/hooks.c | 27 +++++++++++++++++++++++++++
3 files changed, 47 insertions(+)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index e8b5c306c4aa1..c4412622b9a78 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -246,6 +246,17 @@ void kmsan_handle_dma_sg(struct scatterlist *sg, int nents,
*/
void kmsan_handle_urb(const struct urb *urb, bool is_out);

+/**
+ * kmsan_unpoison_entry_regs() - Handle pt_regs in low-level entry code.
+ * @regs: struct pt_regs pointer received from assembly code.
+ *
+ * KMSAN unpoisons the contents of the passed pt_regs, preventing potential
+ * false positive reports. Unlike kmsan_unpoison_memory(),
+ * kmsan_unpoison_entry_regs() can be called from the regions where
+ * kmsan_in_runtime() returns true, which is the case in early entry code.
+ */
+void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
+
#else

static inline void kmsan_init_shadow(void)
@@ -342,6 +353,10 @@ static inline void kmsan_handle_urb(const struct urb *urb, bool is_out)
{
}

+static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs)
+{
+}
+
#endif

#endif /* _LINUX_KMSAN_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 032f164abe7ce..055d3bdb0442c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,7 @@
#include <linux/resume_user_mode.h>
#include <linux/highmem.h>
#include <linux/jump_label.h>
+#include <linux/kmsan.h>
#include <linux/livepatch.h>
#include <linux/audit.h>
#include <linux/tick.h>
@@ -24,6 +25,7 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
user_exit_irqoff();

instrumentation_begin();
+ kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
instrumentation_end();
}
@@ -352,6 +354,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
lockdep_hardirqs_off(CALLER_ADDR0);
rcu_irq_enter();
instrumentation_begin();
+ kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
instrumentation_end();

@@ -367,6 +370,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
*/
lockdep_hardirqs_off(CALLER_ADDR0);
instrumentation_begin();
+ kmsan_unpoison_entry_regs(regs);
rcu_irq_enter_check_tick();
trace_hardirqs_off_finish();
instrumentation_end();
@@ -452,6 +456,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
rcu_nmi_enter();

instrumentation_begin();
+ kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 9aecbf2825837..c7528bcbb2f91 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -358,6 +358,33 @@ void kmsan_unpoison_memory(const void *address, size_t size)
}
EXPORT_SYMBOL(kmsan_unpoison_memory);

+/*
+ * Version of kmsan_unpoison_memory() that can be called from within the KMSAN
+ * runtime.
+ *
+ * Non-instrumented IRQ entry functions receive struct pt_regs from assembly
+ * code. Those regs need to be unpoisoned, otherwise using them will result in
+ * false positives.
+ * Using kmsan_unpoison_memory() is not an option in entry code, because the
+ * return value of in_task() is inconsistent - as a result, certain calls to
+ * kmsan_unpoison_memory() are ignored. kmsan_unpoison_entry_regs() ensures that
+ * the registers are unpoisoned even if kmsan_in_runtime() is true in the early
+ * entry code.
+ */
+void kmsan_unpoison_entry_regs(const struct pt_regs *regs)
+{
+ unsigned long ua_flags;
+
+ if (!kmsan_enabled)
+ return;
+
+ ua_flags = user_access_save();
+ kmsan_internal_unpoison_memory((void *)regs, sizeof(*regs),
+ KMSAN_POISON_NOCHECK);
+ user_access_restore(ua_flags);
+}
+EXPORT_SYMBOL(kmsan_unpoison_entry_regs);
+
void kmsan_check_memory(const void *addr, size_t size)
{
if (!kmsan_enabled)
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:02:07

by Alexander Potapenko

Subject: [PATCH v4 29/45] block: kmsan: skip bio block merging logic for KMSAN

KMSAN doesn't allow treating adjacent memory pages as contiguous if they
were allocated by separate alloc_pages() calls.
The block layer, however, does exactly that: adjacent pages may end up
being used together. To prevent this, make page_is_mergeable() return
false under KMSAN.

Suggested-by: Eric Biggers <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>

---

v4:
-- swap block: and kmsan: in the subject

Link: https://linux-review.googlesource.com/id/Ie29cc2464c70032347c32ab2a22e1e7a0b37b905
---
block/bio.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 51c99f2c5c908..ce6b3c82159a6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -867,6 +867,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
return false;

*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
+ if (!*same_page && IS_ENABLED(CONFIG_KMSAN))
+ return false;
if (*same_page)
return true;
return (bv->bv_page + bv_end / PAGE_SIZE) == (page + off / PAGE_SIZE);
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:02:31

by Alexander Potapenko

Subject: [PATCH v4 33/45] x86: kmsan: disable instrumentation of unsupported code

Instrumenting some files with KMSAN will result in the kernel failing to
link or boot, or crashing at runtime, for various reasons (e.g. infinite
recursion caused by instrumentation hooks calling instrumented code
again).

Completely omit KMSAN instrumentation in the following places:
- arch/x86/boot and arch/x86/realmode/rm, as KMSAN doesn't work for i386;
- arch/x86/entry/vdso, which isn't linked with KMSAN runtime;
- three files in arch/x86/kernel - boot problems;
- arch/x86/mm/cpu_entry_area.c - recursion.

Signed-off-by: Alexander Potapenko <[email protected]>
---
v2:
-- moved the patch earlier in the series so that KMSAN can compile
-- split off the non-x86 part into a separate patch

v3:
-- added a comment to lib/Makefile

Link: https://linux-review.googlesource.com/id/Id5e5c4a9f9d53c24a35ebb633b814c414628d81b
---
arch/x86/boot/Makefile | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/entry/vdso/Makefile | 3 +++
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/mm/Makefile | 2 ++
arch/x86/realmode/rm/Makefile | 1 +
lib/Makefile | 2 ++
8 files changed, 13 insertions(+)

diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index b5aecb524a8aa..d5623232b763f 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -12,6 +12,7 @@
# Sanitizer runtimes are unavailable and cannot be linked for early boot code.
KASAN_SANITIZE := n
KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y

# Kernel does not boot with kcov instrumentation here.
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 19e1905dcbf6f..8d0d4d89a00ae 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -20,6 +20,7 @@
# Sanitizer runtimes are unavailable and cannot be linked for early boot code.
KASAN_SANITIZE := n
KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y

# Prevents link failures: __sanitizer_cov_trace_pc() is not linked in.
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index c2a8b76ae0bce..645bd919f9845 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -11,6 +11,9 @@ include $(srctree)/lib/vdso/Makefile

# Sanitizer runtimes are unavailable and cannot be linked here.
KASAN_SANITIZE := n
+KMSAN_SANITIZE_vclock_gettime.o := n
+KMSAN_SANITIZE_vgetcpu.o := n
+
UBSAN_SANITIZE := n
KCSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4c8b6ae802ac3..4f2617721d3dc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -33,6 +33,8 @@ KASAN_SANITIZE_sev.o := n
# With some compiler versions the generated code results in boot hangs, caused
# by several compilation units. To be safe, disable all instrumentation.
KCSAN_SANITIZE := n
+KMSAN_SANITIZE_head$(BITS).o := n
+KMSAN_SANITIZE_nmi.o := n

OBJECT_FILES_NON_STANDARD_test_nx.o := y

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9661e3e802be5..f10a921ee7565 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -12,6 +12,7 @@ endif
# If these files are instrumented, boot hangs during the first second.
KCOV_INSTRUMENT_common.o := n
KCOV_INSTRUMENT_perf_event.o := n
+KMSAN_SANITIZE_common.o := n

# As above, instrumenting secondary CPU boot code causes boot hangs.
KCSAN_SANITIZE_common.o := n
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f8220fd2c169a..39c0700c9955c 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -12,6 +12,8 @@ KASAN_SANITIZE_mem_encrypt_identity.o := n
# Disable KCSAN entirely, because otherwise we get warnings that some functions
# reference __initdata sections.
KCSAN_SANITIZE := n
+# Avoid recursion by not calling KMSAN hooks for CEA code.
+KMSAN_SANITIZE_cpu_entry_area.o := n

ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_mem_encrypt.o = -pg
diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
index 83f1b6a56449f..f614009d3e4e2 100644
--- a/arch/x86/realmode/rm/Makefile
+++ b/arch/x86/realmode/rm/Makefile
@@ -10,6 +10,7 @@
# Sanitizer runtimes are unavailable and cannot be linked here.
KASAN_SANITIZE := n
KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y

# Prevents link failures: __sanitizer_cov_trace_pc() is not linked in.
diff --git a/lib/Makefile b/lib/Makefile
index 5056769d00bb6..73fea85b76365 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -272,6 +272,8 @@ obj-$(CONFIG_POLYNOMIAL) += polynomial.o
CFLAGS_stackdepot.o += -fno-builtin
obj-$(CONFIG_STACKDEPOT) += stackdepot.o
KASAN_SANITIZE_stackdepot.o := n
+# In particular, instrumenting stackdepot.c with KMSAN will result in infinite
+# recursion.
KMSAN_SANITIZE_stackdepot.o := n
KCOV_INSTRUMENT_stackdepot.o := n

--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:03:34

by Alexander Potapenko

Subject: [PATCH v4 30/45] kcov: kmsan: unpoison area->list in kcov_remote_area_put()

KMSAN does not instrument kernel/kcov.c for performance reasons (with
CONFIG_KCOV=y virtually every place in the kernel invokes kcov
instrumentation). Therefore the tool may miss writes from kcov.c that
initialize memory.

When CONFIG_DEBUG_LIST is enabled, list pointers from kernel/kcov.c are
passed to instrumented helpers in lib/list_debug.c, resulting in false
positives.

To work around these reports, we unpoison the contents of area->list after
initializing it.

Signed-off-by: Alexander Potapenko <[email protected]>
---

v4:
-- change sizeof(type) to sizeof(*ptr)
-- swap kcov: and kmsan: in the subject

Link: https://linux-review.googlesource.com/id/Ie17f2ee47a7af58f5cdf716d585ebf0769348a5a
---
kernel/kcov.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/kcov.c b/kernel/kcov.c
index e19c84b02452e..e5cd09fd8a050 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/hashtable.h>
#include <linux/init.h>
+#include <linux/kmsan-checks.h>
#include <linux/mm.h>
#include <linux/preempt.h>
#include <linux/printk.h>
@@ -152,6 +153,12 @@ static void kcov_remote_area_put(struct kcov_remote_area *area,
INIT_LIST_HEAD(&area->list);
area->size = size;
list_add(&area->list, &kcov_remote_areas);
+ /*
+ * KMSAN doesn't instrument this file, so it may not know area->list
+ * is initialized. Unpoison it explicitly to avoid reports in
+ * kcov_remote_area_get().
+ */
+ kmsan_unpoison_memory(&area->list, sizeof(area->list));
}

static notrace bool check_kcov_mode(enum kcov_mode needed_mode, struct task_struct *t)
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:03:35

by Alexander Potapenko

Subject: [PATCH v4 35/45] x86: kmsan: handle open-coded assembly in lib/iomem.c

KMSAN cannot intercept memory accesses within asm() statements.
That's why we add kmsan_unpoison_memory() and kmsan_check_memory() calls
to tell it how to handle memory copied from/to I/O memory.

Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/Icb16bf17269087e475debf07a7fe7d4bebc3df23
---
arch/x86/lib/iomem.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/lib/iomem.c b/arch/x86/lib/iomem.c
index 3e2f33fc33de2..e0411a3774d49 100644
--- a/arch/x86/lib/iomem.c
+++ b/arch/x86/lib/iomem.c
@@ -1,6 +1,7 @@
#include <linux/string.h>
#include <linux/module.h>
#include <linux/io.h>
+#include <linux/kmsan-checks.h>

#define movs(type,to,from) \
asm volatile("movs" type:"=&D" (to), "=&S" (from):"0" (to), "1" (from):"memory")
@@ -37,6 +38,8 @@ static void string_memcpy_fromio(void *to, const volatile void __iomem *from, si
n-=2;
}
rep_movs(to, (const void *)from, n);
+ /* KMSAN must treat values read from devices as initialized. */
+ kmsan_unpoison_memory(to, n);
}

static void string_memcpy_toio(volatile void __iomem *to, const void *from, size_t n)
@@ -44,6 +47,8 @@ static void string_memcpy_toio(volatile void __iomem *to, const void *from, size
if (unlikely(!n))
return;

+ /* Make sure uninitialized memory isn't copied to devices. */
+ kmsan_check_memory(from, n);
/* Align any unaligned destination IO */
if (unlikely(1 & (unsigned long)to)) {
movs("b", to, from);
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:04:23

by Alexander Potapenko

Subject: [PATCH v4 31/45] security: kmsan: fix interoperability with auto-initialization

Heap and stack initialization is great, but not when we are trying to
catch uses of uninitialized memory. When the kernel is built with KMSAN,
having kernel memory initialization enabled may introduce false
negatives.

We disable CONFIG_INIT_STACK_ALL_PATTERN and CONFIG_INIT_STACK_ALL_ZERO
under CONFIG_KMSAN, making it impossible to auto-initialize stack
variables in KMSAN builds. We also disable CONFIG_INIT_ON_ALLOC_DEFAULT_ON
and CONFIG_INIT_ON_FREE_DEFAULT_ON to prevent accidental use of heap
auto-initialization.

We do, however, still let users enable heap auto-initialization at boot
time (by setting init_on_alloc=1 or init_on_free=1), in which case a
warning is printed.

Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/I86608dd867018683a14ae1870f1928ad925f42e9
---
mm/page_alloc.c | 4 ++++
security/Kconfig.hardening | 4 ++++
2 files changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8d5a0b2a3264..3a0a5e204df7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -854,6 +854,10 @@ void init_mem_debugging_and_hardening(void)
else
static_branch_disable(&init_on_free);

+ if (IS_ENABLED(CONFIG_KMSAN) &&
+ (_init_on_alloc_enabled_early || _init_on_free_enabled_early))
+ pr_info("mem auto-init: please make sure init_on_alloc and init_on_free are disabled when running KMSAN\n");
+
#ifdef CONFIG_DEBUG_PAGEALLOC
if (!debug_pagealloc_enabled())
return;
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index bd2aabb2c60f9..2739a6776454e 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -106,6 +106,7 @@ choice
config INIT_STACK_ALL_PATTERN
bool "pattern-init everything (strongest)"
depends on CC_HAS_AUTO_VAR_INIT_PATTERN
+ depends on !KMSAN
help
Initializes everything on the stack (including padding)
with a specific debug value. This is intended to eliminate
@@ -124,6 +125,7 @@ choice
config INIT_STACK_ALL_ZERO
bool "zero-init everything (strongest and safest)"
depends on CC_HAS_AUTO_VAR_INIT_ZERO
+ depends on !KMSAN
help
Initializes everything on the stack (including padding)
with a zero value. This is intended to eliminate all
@@ -218,6 +220,7 @@ config STACKLEAK_RUNTIME_DISABLE

config INIT_ON_ALLOC_DEFAULT_ON
bool "Enable heap memory zeroing on allocation by default"
+ depends on !KMSAN
help
This has the effect of setting "init_on_alloc=1" on the kernel
command line. This can be disabled with "init_on_alloc=0".
@@ -230,6 +233,7 @@ config INIT_ON_ALLOC_DEFAULT_ON

config INIT_ON_FREE_DEFAULT_ON
bool "Enable heap memory zeroing on free by default"
+ depends on !KMSAN
help
This has the effect of setting "init_on_free=1" on the kernel
command line. This can be disabled with "init_on_free=0".
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:04:24

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 24/45] kmsan: handle memory sent to/from USB

Depending on the value of is_out, kmsan_handle_urb() either marks the
data copied to the kernel from a USB device as initialized, or checks
the data sent to the device for being initialized.

Signed-off-by: Alexander Potapenko <[email protected]>

---
v2:
-- move kmsan_handle_urb() implementation to this patch

Link: https://linux-review.googlesource.com/id/Ifa67fb72015d4de14c30e971556f99fc8b2ee506
---
drivers/usb/core/urb.c | 2 ++
include/linux/kmsan.h | 15 +++++++++++++++
mm/kmsan/hooks.c | 17 +++++++++++++++++
3 files changed, 34 insertions(+)

diff --git a/drivers/usb/core/urb.c b/drivers/usb/core/urb.c
index 33d62d7e3929f..1fe3f23205624 100644
--- a/drivers/usb/core/urb.c
+++ b/drivers/usb/core/urb.c
@@ -8,6 +8,7 @@
#include <linux/bitops.h>
#include <linux/slab.h>
#include <linux/log2.h>
+#include <linux/kmsan-checks.h>
#include <linux/usb.h>
#include <linux/wait.h>
#include <linux/usb/hcd.h>
@@ -426,6 +427,7 @@ int usb_submit_urb(struct urb *urb, gfp_t mem_flags)
URB_SETUP_MAP_SINGLE | URB_SETUP_MAP_LOCAL |
URB_DMA_SG_COMBINED);
urb->transfer_flags |= (is_out ? URB_DIR_OUT : URB_DIR_IN);
+ kmsan_handle_urb(urb, is_out);

if (xfertype != USB_ENDPOINT_XFER_CONTROL &&
dev->state < USB_STATE_CONFIGURED)
diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index 55fe673ee1e84..e8b5c306c4aa1 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -19,6 +19,7 @@ struct page;
struct kmem_cache;
struct task_struct;
struct scatterlist;
+struct urb;

#ifdef CONFIG_KMSAN

@@ -235,6 +236,16 @@ void kmsan_handle_dma(struct page *page, size_t offset, size_t size,
void kmsan_handle_dma_sg(struct scatterlist *sg, int nents,
enum dma_data_direction dir);

+/**
+ * kmsan_handle_urb() - Handle a USB data transfer.
+ * @urb: struct urb pointer.
+ * @is_out: data transfer direction (true means output to hardware).
+ *
+ * If @is_out is true, KMSAN checks the transfer buffer of @urb. Otherwise,
+ * KMSAN initializes the transfer buffer.
+ */
+void kmsan_handle_urb(const struct urb *urb, bool is_out);
+
#else

static inline void kmsan_init_shadow(void)
@@ -327,6 +338,10 @@ static inline void kmsan_handle_dma_sg(struct scatterlist *sg, int nents,
{
}

+static inline void kmsan_handle_urb(const struct urb *urb, bool is_out)
+{
+}
+
#endif

#endif /* _LINUX_KMSAN_H */
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 8a6947a2a2f22..9aecbf2825837 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -17,6 +17,7 @@
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/usb.h>

#include "../internal.h"
#include "../slab.h"
@@ -252,6 +253,22 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
}
EXPORT_SYMBOL(kmsan_copy_to_user);

+/* Helper function to check an URB. */
+void kmsan_handle_urb(const struct urb *urb, bool is_out)
+{
+ if (!urb)
+ return;
+ if (is_out)
+ kmsan_internal_check_memory(urb->transfer_buffer,
+ urb->transfer_buffer_length,
+ /*user_addr*/ 0, REASON_SUBMIT_URB);
+ else
+ kmsan_internal_unpoison_memory(urb->transfer_buffer,
+ urb->transfer_buffer_length,
+ /*checked*/ false);
+}
+EXPORT_SYMBOL(kmsan_handle_urb);
+
static void kmsan_handle_dma_page(const void *addr, size_t size,
enum dma_data_direction dir)
{
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:04:33

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 25/45] kmsan: add tests for KMSAN

The testing module triggers KMSAN warnings in different cases and checks
that the errors are properly reported, using console probes to capture
the tool's output.

Signed-off-by: Alexander Potapenko <[email protected]>
---
v2:
-- add memcpy tests

v4:
-- change sizeof(type) to sizeof(*ptr)
-- add test expectations for CONFIG_KMSAN_CHECK_PARAM_RETVAL

Link: https://linux-review.googlesource.com/id/I49c3f59014cc37fd13541c80beb0b75a75244650
---
lib/Kconfig.kmsan | 12 +
mm/kmsan/Makefile | 4 +
mm/kmsan/kmsan_test.c | 552 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 568 insertions(+)
create mode 100644 mm/kmsan/kmsan_test.c

diff --git a/lib/Kconfig.kmsan b/lib/Kconfig.kmsan
index 8f768d4034e3c..f56ed7f7c7090 100644
--- a/lib/Kconfig.kmsan
+++ b/lib/Kconfig.kmsan
@@ -47,4 +47,16 @@ config KMSAN_CHECK_PARAM_RETVAL
may potentially report errors in corner cases when non-instrumented
functions call instrumented ones.

+config KMSAN_KUNIT_TEST
+ tristate "KMSAN integration test suite" if !KUNIT_ALL_TESTS
+ default KUNIT_ALL_TESTS
+ depends on TRACEPOINTS && KUNIT
+ help
+ Test suite for KMSAN, testing various error detection scenarios,
+ and checking that reports are correctly output to console.
+
+ Say Y here if you want the test to be built into the kernel and run
+ during boot; say M if you want the test to build as a module; say N
+ if you are unsure.
+
endif
diff --git a/mm/kmsan/Makefile b/mm/kmsan/Makefile
index 401acb1a491ce..98eab2856626f 100644
--- a/mm/kmsan/Makefile
+++ b/mm/kmsan/Makefile
@@ -22,3 +22,7 @@ CFLAGS_init.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_instrumentation.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_report.o := $(CC_FLAGS_KMSAN_RUNTIME)
CFLAGS_shadow.o := $(CC_FLAGS_KMSAN_RUNTIME)
+
+obj-$(CONFIG_KMSAN_KUNIT_TEST) += kmsan_test.o
+KMSAN_SANITIZE_kmsan_test.o := y
+CFLAGS_kmsan_test.o += $(call cc-disable-warning, uninitialized)
diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
new file mode 100644
index 0000000000000..1b8da71ae0d4f
--- /dev/null
+++ b/mm/kmsan/kmsan_test.c
@@ -0,0 +1,552 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for KMSAN.
+ * For each test case checks the presence (or absence) of generated reports.
+ * Relies on 'console' tracepoint to capture reports as they appear in the
+ * kernel log.
+ *
+ * Copyright (C) 2021-2022, Google LLC.
+ * Author: Alexander Potapenko <[email protected]>
+ *
+ */
+
+#include <kunit/test.h>
+#include "kmsan.h"
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/kmsan.h>
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+#include <linux/tracepoint.h>
+#include <trace/events/printk.h>
+
+static DEFINE_PER_CPU(int, per_cpu_var);
+
+/* Report as observed from console. */
+static struct {
+ spinlock_t lock;
+ bool available;
+ bool ignore; /* Stop console output collection. */
+ char header[256];
+} observed = {
+ .lock = __SPIN_LOCK_UNLOCKED(observed.lock),
+};
+
+/* Probe for console output: obtains observed lines of interest. */
+static void probe_console(void *ignore, const char *buf, size_t len)
+{
+ unsigned long flags;
+
+ if (observed.ignore)
+ return;
+ spin_lock_irqsave(&observed.lock, flags);
+
+ if (strnstr(buf, "BUG: KMSAN: ", len)) {
+ /*
+ * KMSAN report and related to the test.
+ *
+ * The provided @buf is not NUL-terminated; copy no more than
+ * @len bytes and let strscpy() add the missing NUL-terminator.
+ */
+ strscpy(observed.header, buf,
+ min(len + 1, sizeof(observed.header)));
+ WRITE_ONCE(observed.available, true);
+ observed.ignore = true;
+ }
+ spin_unlock_irqrestore(&observed.lock, flags);
+}
+
+/* Check if a report related to the test exists. */
+static bool report_available(void)
+{
+ return READ_ONCE(observed.available);
+}
+
+/* Information we expect in a report. */
+struct expect_report {
+ const char *error_type; /* Error type. */
+ /*
+ * Kernel symbol from the error header, or NULL if no report is
+ * expected.
+ */
+ const char *symbol;
+};
+
+/* Check observed report matches information in @r. */
+static bool report_matches(const struct expect_report *r)
+{
+ typeof(observed.header) expected_header;
+ unsigned long flags;
+ bool ret = false;
+ const char *end;
+ char *cur;
+
+ /* Double-checked locking. */
+ if (!report_available() || !r->symbol)
+ return (!report_available() && !r->symbol);
+
+ /* Generate expected report contents. */
+
+ /* Title */
+ cur = expected_header;
+ end = &expected_header[sizeof(expected_header) - 1];
+
+ cur += scnprintf(cur, end - cur, "BUG: KMSAN: %s", r->error_type);
+
+ scnprintf(cur, end - cur, " in %s", r->symbol);
+ /* The exact offset won't match, remove it; also strip module name. */
+ cur = strchr(expected_header, '+');
+ if (cur)
+ *cur = '\0';
+
+ spin_lock_irqsave(&observed.lock, flags);
+ if (!report_available())
+ goto out; /* A new report is being captured. */
+
+ /* Finally match expected output to what we actually observed. */
+ ret = strstr(observed.header, expected_header);
+out:
+ spin_unlock_irqrestore(&observed.lock, flags);
+
+ return ret;
+}
+
+/* ===== Test cases ===== */
+
+/* Prevent replacing branch with select in LLVM. */
+static noinline void check_true(char *arg)
+{
+ pr_info("%s is true\n", arg);
+}
+
+static noinline void check_false(char *arg)
+{
+ pr_info("%s is false\n", arg);
+}
+
+#define USE(x) \
+ do { \
+ if (x) \
+ check_true(#x); \
+ else \
+ check_false(#x); \
+ } while (0)
+
+#define EXPECTATION_ETYPE_FN(e, reason, fn) \
+ struct expect_report e = { \
+ .error_type = reason, \
+ .symbol = fn, \
+ }
+
+#define EXPECTATION_NO_REPORT(e) EXPECTATION_ETYPE_FN(e, NULL, NULL)
+#define EXPECTATION_UNINIT_VALUE_FN(e, fn) \
+ EXPECTATION_ETYPE_FN(e, "uninit-value", fn)
+#define EXPECTATION_UNINIT_VALUE(e) EXPECTATION_UNINIT_VALUE_FN(e, __func__)
+#define EXPECTATION_USE_AFTER_FREE(e) \
+ EXPECTATION_ETYPE_FN(e, "use-after-free", __func__)
+
+/* Test case: ensure that kmalloc() returns uninitialized memory. */
+static void test_uninit_kmalloc(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE(expect);
+ int *ptr;
+
+ kunit_info(test, "uninitialized kmalloc test (UMR report)\n");
+ ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);
+ USE(*ptr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that kmalloc'ed memory becomes initialized after memset().
+ */
+static void test_init_kmalloc(struct kunit *test)
+{
+ EXPECTATION_NO_REPORT(expect);
+ int *ptr;
+
+ kunit_info(test, "initialized kmalloc test (no reports)\n");
+ ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);
+ memset(ptr, 0, sizeof(*ptr));
+ USE(*ptr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test case: ensure that kzalloc() returns initialized memory. */
+static void test_init_kzalloc(struct kunit *test)
+{
+ EXPECTATION_NO_REPORT(expect);
+ int *ptr;
+
+ kunit_info(test, "initialized kzalloc test (no reports)\n");
+ ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);
+ USE(*ptr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test case: ensure that local variables are uninitialized by default. */
+static void test_uninit_stack_var(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE(expect);
+ volatile int cond;
+
+ kunit_info(test, "uninitialized stack variable (UMR report)\n");
+ USE(cond);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test case: ensure that local variables with initializers are initialized. */
+static void test_init_stack_var(struct kunit *test)
+{
+ EXPECTATION_NO_REPORT(expect);
+ volatile int cond = 1;
+
+ kunit_info(test, "initialized stack variable (no reports)\n");
+ USE(cond);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static noinline void two_param_fn_2(int arg1, int arg2)
+{
+ USE(arg1);
+ USE(arg2);
+}
+
+static noinline void one_param_fn(int arg)
+{
+ two_param_fn_2(arg, arg);
+ USE(arg);
+}
+
+static noinline void two_param_fn(int arg1, int arg2)
+{
+ int init = 0;
+
+ one_param_fn(init);
+ USE(arg1);
+ USE(arg2);
+}
+
+static void test_params(struct kunit *test)
+{
+#ifdef CONFIG_KMSAN_CHECK_PARAM_RETVAL
+ /*
+ * With eager param/retval checking enabled, KMSAN will report an error
+ * before the call to two_param_fn().
+ */
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_params");
+#else
+ EXPECTATION_UNINIT_VALUE_FN(expect, "two_param_fn");
+#endif
+ volatile int uninit, init = 1;
+
+ kunit_info(test,
+ "uninit passed through a function parameter (UMR report)\n");
+ two_param_fn(uninit, init);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static int signed_sum3(int a, int b, int c)
+{
+ return a + b + c;
+}
+
+/*
+ * Test case: ensure that uninitialized values are tracked through function
+ * arguments.
+ */
+static void test_uninit_multiple_params(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE(expect);
+ volatile char b = 3, c;
+ volatile int a;
+
+ kunit_info(test, "uninitialized local passed to fn (UMR report)\n");
+ USE(signed_sum3(a, b, c));
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Helper function to make an array uninitialized. */
+static noinline void do_uninit_local_array(char *array, int start, int stop)
+{
+ volatile char uninit;
+ int i;
+
+ for (i = start; i < stop; i++)
+ array[i] = uninit;
+}
+
+/*
+ * Test case: ensure kmsan_check_memory() reports an error when checking
+ * uninitialized memory.
+ */
+static void test_uninit_kmsan_check_memory(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_uninit_kmsan_check_memory");
+ volatile char local_array[8];
+
+ kunit_info(
+ test,
+ "kmsan_check_memory() called on uninit local (UMR report)\n");
+ do_uninit_local_array((char *)local_array, 5, 7);
+
+ kmsan_check_memory((char *)local_array, 8);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: check that a virtual memory range created with vmap() from
+ * initialized pages is still considered as initialized.
+ */
+static void test_init_kmsan_vmap_vunmap(struct kunit *test)
+{
+ EXPECTATION_NO_REPORT(expect);
+ const int npages = 2;
+ struct page **pages;
+ void *vbuf;
+ int i;
+
+ kunit_info(test, "pages initialized via vmap (no reports)\n");
+
+ pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
+ for (i = 0; i < npages; i++)
+ pages[i] = alloc_page(GFP_KERNEL);
+ vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+ memset(vbuf, 0xfe, npages * PAGE_SIZE);
+ for (i = 0; i < npages; i++)
+ kmsan_check_memory(page_address(pages[i]), PAGE_SIZE);
+
+ if (vbuf)
+ vunmap(vbuf);
+ for (i = 0; i < npages; i++)
+ if (pages[i])
+ __free_page(pages[i]);
+ kfree(pages);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that memset() can initialize a buffer allocated via
+ * vmalloc().
+ */
+static void test_init_vmalloc(struct kunit *test)
+{
+ EXPECTATION_NO_REPORT(expect);
+ int npages = 8, i;
+ char *buf;
+
+ kunit_info(test, "vmalloc buffer can be initialized (no reports)\n");
+ buf = vmalloc(PAGE_SIZE * npages);
+ buf[0] = 1;
+ memset(buf, 0xfe, PAGE_SIZE * npages);
+ USE(buf[0]);
+ for (i = 0; i < npages; i++)
+ kmsan_check_memory(&buf[PAGE_SIZE * i], PAGE_SIZE);
+ vfree(buf);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test case: ensure that use-after-free reporting works. */
+static void test_uaf(struct kunit *test)
+{
+ EXPECTATION_USE_AFTER_FREE(expect);
+ volatile int value;
+ volatile int *var;
+
+ kunit_info(test, "use-after-free in kmalloc-ed buffer (UMR report)\n");
+ var = kmalloc(80, GFP_KERNEL);
+ var[3] = 0xfeedface;
+ kfree((int *)var);
+ /* Copy the invalid value before checking it. */
+ value = var[3];
+ USE(value);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that uninitialized values are propagated through per-CPU
+ * memory.
+ */
+static void test_percpu_propagate(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE(expect);
+ volatile int uninit, check;
+
+ kunit_info(test,
+ "uninit local stored to per_cpu memory (UMR report)\n");
+
+ this_cpu_write(per_cpu_var, uninit);
+ check = this_cpu_read(per_cpu_var);
+ USE(check);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that passing uninitialized values to printk() leads to an
+ * error report.
+ */
+static void test_printk(struct kunit *test)
+{
+#ifdef CONFIG_KMSAN_CHECK_PARAM_RETVAL
+ /*
+ * With eager param/retval checking enabled, KMSAN will report an error
+ * before the call to pr_info().
+ */
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_printk");
+#else
+ EXPECTATION_UNINIT_VALUE_FN(expect, "number");
+#endif
+ volatile int uninit;
+
+ kunit_info(test, "uninit local passed to pr_info() (UMR report)\n");
+ pr_info("%px contains %d\n", &uninit, uninit);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that memcpy() correctly copies uninitialized values between
+ * aligned `src` and `dst`.
+ */
+static void test_memcpy_aligned_to_aligned(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_aligned");
+ volatile int uninit_src;
+ volatile int dst = 0;
+
+ kunit_info(test, "memcpy()ing aligned uninit src to aligned dst (UMR report)\n");
+ memcpy((void *)&dst, (void *)&uninit_src, sizeof(uninit_src));
+ kmsan_check_memory((void *)&dst, sizeof(dst));
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that memcpy() correctly copies uninitialized values between
+ * aligned `src` and unaligned `dst`.
+ *
+ * Copying aligned 4-byte value to an unaligned one leads to touching two
+ * aligned 4-byte values. This test case checks that KMSAN correctly reports an
+ * error on the first of the two values.
+ */
+static void test_memcpy_aligned_to_unaligned(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_unaligned");
+ volatile int uninit_src;
+ volatile char dst[8] = {0};
+
+ kunit_info(test, "memcpy()ing aligned uninit src to unaligned dst (UMR report)\n");
+ memcpy((void *)&dst[1], (void *)&uninit_src, sizeof(uninit_src));
+ kmsan_check_memory((void *)dst, 4);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * Test case: ensure that memcpy() correctly copies uninitialized values between
+ * aligned `src` and unaligned `dst`.
+ *
+ * Copying aligned 4-byte value to an unaligned one leads to touching two
+ * aligned 4-byte values. This test case checks that KMSAN correctly reports an
+ * error on the second of the two values.
+ */
+static void test_memcpy_aligned_to_unaligned2(struct kunit *test)
+{
+ EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_unaligned2");
+ volatile int uninit_src;
+ volatile char dst[8] = {0};
+
+ kunit_info(test, "memcpy()ing aligned uninit src to unaligned dst - part 2 (UMR report)\n");
+ memcpy((void *)&dst[1], (void *)&uninit_src, sizeof(uninit_src));
+ kmsan_check_memory((void *)&dst[4], sizeof(uninit_src));
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static struct kunit_case kmsan_test_cases[] = {
+ KUNIT_CASE(test_uninit_kmalloc),
+ KUNIT_CASE(test_init_kmalloc),
+ KUNIT_CASE(test_init_kzalloc),
+ KUNIT_CASE(test_uninit_stack_var),
+ KUNIT_CASE(test_init_stack_var),
+ KUNIT_CASE(test_params),
+ KUNIT_CASE(test_uninit_multiple_params),
+ KUNIT_CASE(test_uninit_kmsan_check_memory),
+ KUNIT_CASE(test_init_kmsan_vmap_vunmap),
+ KUNIT_CASE(test_init_vmalloc),
+ KUNIT_CASE(test_uaf),
+ KUNIT_CASE(test_percpu_propagate),
+ KUNIT_CASE(test_printk),
+ KUNIT_CASE(test_memcpy_aligned_to_aligned),
+ KUNIT_CASE(test_memcpy_aligned_to_unaligned),
+ KUNIT_CASE(test_memcpy_aligned_to_unaligned2),
+ {},
+};
+
+/* ===== End test cases ===== */
+
+static int test_init(struct kunit *test)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ observed.header[0] = '\0';
+ observed.ignore = false;
+ observed.available = false;
+ spin_unlock_irqrestore(&observed.lock, flags);
+
+ return 0;
+}
+
+static void test_exit(struct kunit *test)
+{
+}
+
+static struct kunit_suite kmsan_test_suite = {
+ .name = "kmsan",
+ .test_cases = kmsan_test_cases,
+ .init = test_init,
+ .exit = test_exit,
+};
+static struct kunit_suite *kmsan_test_suites[] = { &kmsan_test_suite, NULL };
+
+static void register_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ check_trace_callback_type_console(probe_console);
+ if (!strcmp(tp->name, "console"))
+ WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
+}
+
+static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ if (!strcmp(tp->name, "console"))
+ tracepoint_probe_unregister(tp, probe_console, NULL);
+}
+
+/*
+ * We only want to do tracepoints setup and teardown once, therefore we have to
+ * customize the init and exit functions and cannot rely on kunit_test_suite().
+ */
+static int __init kmsan_test_init(void)
+{
+ /*
+ * Because we want to be able to build the test as a module, we need to
+ * iterate through all known tracepoints, since the static registration
+ * won't work here.
+ */
+ for_each_kernel_tracepoint(register_tracepoints, NULL);
+ return __kunit_test_suites_init(kmsan_test_suites);
+}
+
+static void kmsan_test_exit(void)
+{
+ __kunit_test_suites_exit(kmsan_test_suites);
+ for_each_kernel_tracepoint(unregister_tracepoints, NULL);
+ tracepoint_synchronize_unregister();
+}
+
+late_initcall_sync(kmsan_test_init);
+module_exit(kmsan_test_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Alexander Potapenko <[email protected]>");
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:04:37

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 28/45] kmsan: disable physical page merging in biovec

KMSAN metadata for adjacent physical pages may not be adjacent, so
accessing such pages together may lead to metadata corruption.
We disable merging pages in biovec to prevent such corruption.

Signed-off-by: Alexander Potapenko <[email protected]>
---

Link: https://linux-review.googlesource.com/id/Iece16041be5ee47904fbc98121b105e5be5fea5c
---
block/blk.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/block/blk.h b/block/blk.h
index 434017701403f..96309a98a60e3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -93,6 +93,13 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;

+ /*
+ * Merging adjacent physical pages may not work correctly under KMSAN
+ * if their metadata pages aren't adjacent. Just disable merging.
+ */
+ if (IS_ENABLED(CONFIG_KMSAN))
+ return false;
+
if (addr1 + vec1->bv_len != addr2)
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page))
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:05:47

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 42/45] bpf: kmsan: initialize BPF registers with zeroes

When executing BPF programs, certain registers may get passed
uninitialized to helper functions. E.g. when performing a JMP_CALL,
registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
many of them are actually used.

Passing uninitialized values as function parameters is technically
undefined behavior, so we work around it by always initializing the
registers.
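
A minimal illustration of the problem (plain C; helper() and ctx are
placeholders, not actual BPF names):

	u64 regs[MAX_BPF_EXT_REG];	/* previously left uninitialized */

	regs[BPF_REG_1] = (u64)(unsigned long)ctx;	/* the program only sets R1 */
	/*
	 * A JMP_CALL always passes R1-R5, so R2-R5 reach the helper
	 * uninitialized; with CONFIG_KMSAN_CHECK_PARAM_RETVAL this is
	 * reported at the call site.
	 */
	helper(regs[BPF_REG_1], regs[BPF_REG_2], regs[BPF_REG_3],
	       regs[BPF_REG_4], regs[BPF_REG_5]);

Zero-initializing the array gives the unused registers a defined value,
which avoids both the report and the undefined behavior.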

Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/I40f39d26232b14816c14ba64a0ea4a8f336f2675
---
kernel/bpf/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5f6f3f829b368..0ba7dd90a2ab3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2039,7 +2039,7 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \
{ \
u64 stack[stack_size / sizeof(u64)]; \
- u64 regs[MAX_BPF_EXT_REG]; \
+ u64 regs[MAX_BPF_EXT_REG] = {}; \
\
FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \
ARG1 = (u64) (unsigned long) ctx; \
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:17:17

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 38/45] x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN

This is needed to allow memory tools like KASAN and KMSAN to see the
memory accesses from the checksum code. Without CONFIG_GENERIC_CSUM the
tools can't see memory accesses originating from handwritten assembly
code.
For KASAN this is a matter of detecting more bugs; for KMSAN, using the C
implementation also helps avoid false positives originating from
seemingly uninitialized checksum values.
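
For example (sketch; buff and len come from a hypothetical caller):

	/*
	 * With CONFIG_GENERIC_CSUM this is the C implementation from
	 * lib/checksum.c, so KMSAN sees every load and propagates the shadow
	 * of *buff into the result.  The handwritten asm version is invisible
	 * to KMSAN, so the checksum it returns may look uninitialized even
	 * when the buffer is fully initialized.
	 */
	__wsum sum = csum_partial(buff, len, 0);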

Signed-off-by: Alexander Potapenko <[email protected]>

---

Link: https://linux-review.googlesource.com/id/I3e95247be55b1112af59dbba07e8cbf34e50a581
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/checksum.h | 16 ++++++++++------
arch/x86/lib/Makefile | 2 ++
3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df66..4a5d0a0f54dea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -324,6 +324,10 @@ config GENERIC_ISA_DMA
def_bool y
depends on ISA_DMA_API

+config GENERIC_CSUM
+ bool
+ default y if KMSAN || KASAN
+
config GENERIC_BUG
def_bool y
depends on BUG
diff --git a/arch/x86/include/asm/checksum.h b/arch/x86/include/asm/checksum.h
index bca625a60186c..6df6ece8a28ec 100644
--- a/arch/x86/include/asm/checksum.h
+++ b/arch/x86/include/asm/checksum.h
@@ -1,9 +1,13 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
-#define HAVE_CSUM_COPY_USER
-#define _HAVE_ARCH_CSUM_AND_COPY
-#ifdef CONFIG_X86_32
-# include <asm/checksum_32.h>
+#ifdef CONFIG_GENERIC_CSUM
+# include <asm-generic/checksum.h>
#else
-# include <asm/checksum_64.h>
+# define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
+# define HAVE_CSUM_COPY_USER
+# define _HAVE_ARCH_CSUM_AND_COPY
+# ifdef CONFIG_X86_32
+# include <asm/checksum_32.h>
+# else
+# include <asm/checksum_64.h>
+# endif
#endif
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f76747862bd2e..7ba5f61d72735 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -65,7 +65,9 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
else
obj-y += iomap_copy_64.o
+ifneq ($(CONFIG_GENERIC_CSUM),y)
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
+endif
lib-y += clear_page_64.o copy_page_64.o
lib-y += memmove_64.o memset_64.o
lib-y += copy_user_64.o
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:17:40

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 45/45] x86: kmsan: enable KMSAN builds for x86

Make KMSAN usable by adding the necessary Kconfig bits.

Also declare x86-specific functions checking address validity
in arch/x86/include/asm/kmsan.h.

Signed-off-by: Alexander Potapenko <[email protected]>
---
v4:
-- per Marco Elver's request, create arch/x86/include/asm/kmsan.h
and move arch-specific inline functions there.

Link: https://linux-review.googlesource.com/id/I1d295ce8159ce15faa496d20089d953a919c125e
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/kmsan.h | 55 ++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
create mode 100644 arch/x86/include/asm/kmsan.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index aadbb16a59f01..d1a601111b277 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -169,6 +169,7 @@ config X86
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if X86_64
select HAVE_ARCH_KFENCE
+ select HAVE_ARCH_KMSAN if X86_64
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h
new file mode 100644
index 0000000000000..a790b865d0a68
--- /dev/null
+++ b/arch/x86/include/asm/kmsan.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * x86 KMSAN support.
+ *
+ * Copyright (C) 2022, Google LLC
+ * Author: Alexander Potapenko <[email protected]>
+ */
+
+#ifndef _ASM_X86_KMSAN_H
+#define _ASM_X86_KMSAN_H
+
+#ifndef MODULE
+
+#include <asm/processor.h>
+#include <linux/mmzone.h>
+
+/*
+ * Taken from arch/x86/mm/physaddr.h to avoid using an instrumented version.
+ */
+static inline bool kmsan_phys_addr_valid(unsigned long addr)
+{
+ if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT))
+ return !(addr >> boot_cpu_data.x86_phys_bits);
+ else
+ return true;
+}
+
+/*
+ * Taken from arch/x86/mm/physaddr.c to avoid using an instrumented version.
+ */
+static inline bool kmsan_virt_addr_valid(void *addr)
+{
+ unsigned long x = (unsigned long)addr;
+ unsigned long y = x - __START_KERNEL_map;
+
+ /* use the carry flag to determine if x was < __START_KERNEL_map */
+ if (unlikely(x > y)) {
+ x = y + phys_base;
+
+ if (y >= KERNEL_IMAGE_SIZE)
+ return false;
+ } else {
+ x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+ /* carry flag will be set if starting x was >= PAGE_OFFSET */
+ if ((x > y) || !kmsan_phys_addr_valid(x))
+ return false;
+ }
+
+ return pfn_valid(x >> PAGE_SHIFT);
+}
+
+#endif /* !MODULE */
+
+#endif /* _ASM_X86_KMSAN_H */
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:18:27

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 36/45] x86: kmsan: use __msan_ string functions where possible

Unless stated otherwise (by explicitly calling __memcpy(), __memset() or
__memmove()), we want all string functions to call their __msan_ versions
(e.g. __msan_memcpy() instead of memcpy()), so that shadow and origin
values are updated accordingly.

The bootloader must still use the default string functions to avoid crashes.
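
In instrumented translation units this boils down to (sketch; dst, src
and len are placeholders):

	memcpy(dst, src, len);		/* expands to __msan_memcpy(dst, src, len),
					   which copies the data together with its
					   shadow and origins */
	__memcpy(dst, src, len);	/* raw copy, KMSAN metadata is not updated */

Files compiled without __SANITIZE_MEMORY__ (e.g. the boot code) keep
using the default functions.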

Signed-off-by: Alexander Potapenko <[email protected]>
---

Link: https://linux-review.googlesource.com/id/I7ca9bd6b4f5c9b9816404862ae87ca7984395f33
---
arch/x86/include/asm/string_64.h | 23 +++++++++++++++++++++--
include/linux/fortify-string.h | 2 ++
2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 6e450827f677a..3b87d889b6e16 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -11,11 +11,23 @@
function. */

#define __HAVE_ARCH_MEMCPY 1
+#if defined(__SANITIZE_MEMORY__)
+#undef memcpy
+void *__msan_memcpy(void *dst, const void *src, size_t size);
+#define memcpy __msan_memcpy
+#else
extern void *memcpy(void *to, const void *from, size_t len);
+#endif
extern void *__memcpy(void *to, const void *from, size_t len);

#define __HAVE_ARCH_MEMSET
+#if defined(__SANITIZE_MEMORY__)
+extern void *__msan_memset(void *s, int c, size_t n);
+#undef memset
+#define memset __msan_memset
+#else
void *memset(void *s, int c, size_t n);
+#endif
void *__memset(void *s, int c, size_t n);

#define __HAVE_ARCH_MEMSET16
@@ -55,7 +67,13 @@ static inline void *memset64(uint64_t *s, uint64_t v, size_t n)
}

#define __HAVE_ARCH_MEMMOVE
+#if defined(__SANITIZE_MEMORY__)
+#undef memmove
+void *__msan_memmove(void *dest, const void *src, size_t len);
+#define memmove __msan_memmove
+#else
void *memmove(void *dest, const void *src, size_t count);
+#endif
void *__memmove(void *dest, const void *src, size_t count);

int memcmp(const void *cs, const void *ct, size_t count);
@@ -64,8 +82,7 @@ char *strcpy(char *dest, const char *src);
char *strcat(char *dest, const char *src);
int strcmp(const char *cs, const char *ct);

-#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
-
+#if (defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__))
/*
* For files that not instrumented (e.g. mm/slub.c) we
* should use not instrumented version of mem* functions.
@@ -73,7 +90,9 @@ int strcmp(const char *cs, const char *ct);

#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
+#undef memmove
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)

#ifndef __NO_FORTIFY
diff --git a/include/linux/fortify-string.h b/include/linux/fortify-string.h
index 3b401fa0f3746..6c8a1a29d0b63 100644
--- a/include/linux/fortify-string.h
+++ b/include/linux/fortify-string.h
@@ -285,8 +285,10 @@ __FORTIFY_INLINE void fortify_memset_chk(__kernel_size_t size,
* __builtin_object_size() must be captured here to avoid evaluating argument
* side-effects further into the macro layers.
*/
+#ifndef CONFIG_KMSAN
#define memset(p, c, s) __fortify_memset_chk(p, c, s, \
__builtin_object_size(p, 0), __builtin_object_size(p, 1))
+#endif

/*
* To make sure the compiler can enforce protection against buffer overflows,
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-01 15:20:37

by Alexander Potapenko

[permalink] [raw]
Subject: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

Under certain circumstances initialization of `unsigned seq` and
`struct inode *inode` passed into step_into() may be skipped.
In particular, if the call to lookup_fast() in walk_component()
returns NULL, and lookup_slow() returns a valid dentry, then the
`seq` and `inode` will remain uninitialized until the call to
step_into() (see [1] for more info).
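
In simplified form, the problematic path is:

	/* walk_component(), before this patch: */
	struct inode *inode;	/* not initialized */
	unsigned seq;		/* not initialized */

	dentry = lookup_fast(nd, &inode, &seq);	/* may return NULL without
						   writing inode/seq */
	if (!dentry)
		dentry = lookup_slow(...);	/* inode/seq still uninitialized */
	...
	return step_into(nd, flags, dentry, inode, seq);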

Right now step_into() does not use these uninitialized values,
yet passing uninitialized values to functions is considered undefined
behavior (see [2]). To fix that, we initialize `seq` and `inode` at
definition.

[1] https://github.com/ClangBuiltLinux/linux/issues/1648#issuecomment-1146608063
[2] https://lore.kernel.org/linux-toolchains/CAHk-=whjz3wO8zD+itoerphWem+JZz4uS3myf6u1Wd6epGRgmQ@mail.gmail.com/

Cc: Evgenii Stepanov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Segher Boessenkool <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Buka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
---
Link: https://linux-review.googlesource.com/id/I94d4e8cc1f0ecc7174659e9506ce96aaf2201d0a
---
fs/namei.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3b..6b39dfd3b41bc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1995,8 +1995,8 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
- unsigned seq;
+ struct inode *inode = NULL;
+ unsigned seq = 0;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
@@ -3393,8 +3393,8 @@ static const char *open_last_lookups(struct nameidata *nd,
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
- unsigned seq;
- struct inode *inode;
+ unsigned seq = 0;
+ struct inode *inode = NULL;
struct dentry *dentry;
const char *res;

--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-02 17:40:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Fri, Jul 1, 2022 at 7:25 AM Alexander Potapenko <[email protected]> wrote:
>
> Under certain circumstances initialization of `unsigned seq` and
> `struct inode *inode` passed into step_into() may be skipped.
> In particular, if the call to lookup_fast() in walk_component()
> returns NULL, and lookup_slow() returns a valid dentry, then the
> `seq` and `inode` will remain uninitialized until the call to
> step_into() (see [1] for more info).

So while I think this needs to be fixed, I think I'd really prefer to
make the initialization and/or usage rules stricter or at least
clearer.

For example, looking around, I think "handle_dotdot()" has the exact
same kind of issue, where follow_dotdot[_rcu]() doesn't initialize
seq/inode for certain cases, and it's *really* hard to see exactly
what the rules are.

It turns out that the rules are that seq/inode only get initialized if
these routines return a non-NULL and non-error result.

Now, that is true for all of these cases - both follow_dotdot*() and
lookup_fast(). Possibly others.

But the reason follow_dotdot*() doesn't cause the same issue is that
the caller actually does the checks that avoid it, and doesn't pass
down the uninitialized cases.

Now, the other part of the rule is that they only get _used_ for
LOOKUP_RCU cases, where they are used to validate the lookup after
we've finalized things.

Of course, sometimes the "only get used for LOOKUP_RCU" is very very
unclear, because even without being an RCU lookup, step_into() will
save it into nd->inode/seq. So the values were "used", and
initializing them makes them valid, but then *that* copy must not then
be used unless RCU was set.

Also, sometimes the LOOKUP_RCU check is in the caller, and has
actually been cleared, so by the time the actual use comes around, you
just have to trust that it was a RCU lookup (ie
legitimize_links/root()).

So it all seems to work, and this patch then gets rid of one
particular odd case, but I think this patch basically hides the
compiler warning without really clarifying the code or the rules.

Anyway, what I'm building up to here is that I think we should
*document* this a bit more. and then make those initializations then
be about that documentation. I also get the feeling that
"nd->inode/nd->seq" should also be initialized.

Right now we have those quite subtle rules about "set vs use", and
while a lot of the uses are conditional on LOOKUP_RCU, that makes the
code correct, but doesn't solve the "pass uninitialized values as
arguments" case.

I also think it's very unclear when nd->inode/nd->seq are initialized,
and the compiler warning only caught the case where they were *set*
(but by arguments that weren't initialized), but didn't necessarily
catch the case where they weren't set at all in the first place and
then passed around.

End result:

- I think I'd like path_init() (or set_nameidata) to actually
initialize nd->inode and nd->seq unconditionally too.

Right now, they get initialized only for that LOOKUP_RCU case.
Pretty much exactly the same issue as the one this patch tries to
solve, except the compiler didn't notice because it's all indirect
through those structure fields and it just didn't track far enough.

- I suspect it would be good to initialize them to actual invalid
values (rather than NULL/0 - particularly the sequence number)

- I look at that follow_dotdot*() caller case, and think "that looks
very similar to the lookup_fast() case, but then we have *very*
different initialization rules".

Al - can you please take a quick look?

Linus

2022-07-03 04:30:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote:
> On Fri, Jul 1, 2022 at 7:25 AM Alexander Potapenko <[email protected]> wrote:
> >
> > Under certain circumstances initialization of `unsigned seq` and
> > `struct inode *inode` passed into step_into() may be skipped.
> > In particular, if the call to lookup_fast() in walk_component()
> > returns NULL, and lookup_slow() returns a valid dentry, then the
> > `seq` and `inode` will remain uninitialized until the call to
> > step_into() (see [1] for more info).

> So while I think this needs to be fixed, I think I'd really prefer to
> make the initialization and/or usage rules stricter or at least
> clearer.

Disclaimer: the bits below are nowhere near what I consider a decent
explanation; this might serve as the first approximation, but I really
need to get some sleep before I get it into coherent shape. 4 hours
of sleep today...

The rules are
* no pathname resolution without successful path_init().
IOW, path_init() failure is an instant fuck off.
* path_init() success sets nd->inode. In all cases.
* nd->inode must be set - LOOKUP_RCU or not, we simply cannot
proceed without it.

* in non-RCU mode nd->inode must be equal to nd->path.dentry->d_inode.
* in RCU mode nd->inode must be equal to a value observed in
nd->path.dentry->d_inode while nd->path.dentry->d_seq had been equal to
nd->seq.

* step_into() gets a dentry/inode/seq triple. In non-RCU
mode inode and seq are ignored; in RCU mode they must satisfy the
same relationship we have for nd->path.dentry/nd->inode/nd->seq.

> Of course, sometimes the "only get used for LOOKUP_RCU" is very very
> unclear, because even without being an RCU lookup, step_into() will
> save it into nd->inode/seq. So the values were "used", and
> initializing them makes them valid, but then *that* copy must not then
> be used unless RCU was set.

You are misreading that (and I admit that it badly needs documentation).
The whole point of step_into() is to move over to new place. nd->inode
*MUST* be set on success, no matter what.

> - I look at that follow_dotdot*() caller case, and think "that looks
> very similar to the lookup_fast() case, but then we have *very*
> different initialization rules".

follow_dotdot() might as well lose inodep and seqp arguments - everything
would've worked just as well without those. We would've gotten the same
complaints about uninitialized values passed to step_into(), though.

This
if (unlikely(!parent))
error = step_into(nd, WALK_NOFOLLOW,
nd->path.dentry, nd->inode, nd->seq);
in handle_dots() probably contributes to confusion - it's the "we
have stepped on .. in the root, just jump into whatever's mounted on
it" case. In non-RCU case it looks like a use of nd->seq in non-RCU
mode; however, in that case step_into() will end up ignoring the
last two arguments.

I'll post something more coherent after I get some sleep. Sorry... ;-/

2022-07-04 02:57:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote:

> Al - can you please take a quick look?

FWIW, trying to write a coherent documentation had its usual effect...
The thing is, we don't really need to fetch the inode that early.
All we really care about is that in RCU mode ->d_seq gets sampled
before we fetch ->d_inode *and* we don't treat "it looks negative"
as hard -ENOENT in case of ->d_seq mismatch.

Which can be bloody well left to step_into(). So we don't need
to pass it inode argument at all - just dentry and seq. Makes
a bunch of functions simpler as well...

It does *not* deal with the "uninitialized" seq argument in
!RCU case; I'll handle that in the followup, but that's a separate
story, IMO (and very clearly a false positive).

Cumulative diff follows; splitup is in #work.namei. Comments?

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..7f4f61ade9e3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1467,7 +1467,7 @@ EXPORT_SYMBOL(follow_down);
* we meet a managed dentry that would need blocking.
*/
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode, unsigned *seqp)
+ unsigned *seqp)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1497,13 +1497,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
*seqp = read_seqcount_begin(&dentry->d_seq);
- *inode = dentry->d_inode;
- /*
- * We don't need to re-check ->d_seq after this
- * ->d_inode read - there will be an RCU delay
- * between mount hash removal and ->mnt_root
- * becoming unpinned.
- */
flags = dentry->d_flags;
continue;
}
@@ -1515,8 +1508,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode,
- unsigned int *seqp)
+ struct path *path, unsigned int *seqp)
{
bool jumped;
int ret;
@@ -1525,9 +1517,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
unsigned int seq = *seqp;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ if (likely(__follow_mount_rcu(nd, path, seqp)))
return 0;
if (!try_to_unlazy_next(nd, dentry, seq))
return -ECHILD;
@@ -1547,7 +1537,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
} else {
- *inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;
@@ -1607,9 +1596,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}

-static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd, unsigned *seqp)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1628,22 +1615,11 @@ static struct dentry *lookup_fast(struct nameidata *nd,
return NULL;
}

- /*
- * This sequence count validates that the inode matches
- * the dentry name information from lookup.
- */
- *inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return ERR_PTR(-ECHILD);
-
- /*
+ /*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
- *
- * The memory barrier in read_seqcount_begin of child is
- * enough, we can use __read_seqcount_retry here.
*/
- if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
+ if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);

*seqp = seq;
@@ -1838,13 +1814,21 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* for the common case.
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode, unsigned seq)
+ struct dentry *dentry, unsigned seq)
{
struct path path;
- int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ struct inode *inode;
+ int err = handle_mounts(nd, dentry, &path, &seq);

if (err < 0)
return ERR_PTR(err);
+ inode = path.dentry->d_inode;
+ if (unlikely(!inode)) {
+ if ((nd->flags & LOOKUP_RCU) &&
+ read_seqcount_retry(&path.dentry->d_seq, seq))
+ return ERR_PTR(-ECHILD);
+ return ERR_PTR(-ENOENT);
+ }
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
@@ -1870,9 +1854,7 @@ static const char *step_into(struct nameidata *nd, int flags,
return pick_link(nd, &path, inode, seq, flags);
}

-static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot_rcu(struct nameidata *nd, unsigned *seqp)
{
struct dentry *parent, *old;

@@ -1895,7 +1877,6 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
}
old = nd->path.dentry;
parent = old->d_parent;
- *inodep = parent->d_inode;
*seqp = read_seqcount_begin(&parent->d_seq);
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
@@ -1910,9 +1891,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return NULL;
}

-static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot(struct nameidata *nd, unsigned *seqp)
{
struct dentry *parent;

@@ -1937,7 +1916,6 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
return ERR_PTR(-ENOENT);
}
*seqp = 0;
- *inodep = parent->d_inode;
return parent;

in_root:
@@ -1952,7 +1930,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
const char *error = NULL;
struct dentry *parent;
- struct inode *inode;
unsigned seq;

if (!nd->root.mnt) {
@@ -1961,17 +1938,17 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode, &seq);
+ parent = follow_dotdot_rcu(nd, &seq);
else
- parent = follow_dotdot(nd, &inode, &seq);
+ parent = follow_dotdot(nd, &seq);
if (IS_ERR(parent))
return ERR_CAST(parent);
if (unlikely(!parent))
error = step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
+ nd->path.dentry, nd->seq);
else
error = step_into(nd, WALK_NOFOLLOW,
- parent, inode, seq);
+ parent, seq);
if (unlikely(error))
return error;

@@ -1995,7 +1972,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
unsigned seq;
/*
* "." and ".." are special - ".." especially so because it has
@@ -2007,7 +1983,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &seq);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -2017,7 +1993,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode, seq);
+ return step_into(nd, flags, dentry, seq);
}

/*
@@ -2473,8 +2449,7 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq));
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry, nd->seq));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3394,7 +3369,6 @@ static const char *open_last_lookups(struct nameidata *nd,
int open_flag = op->open_flag;
bool got_write = false;
unsigned seq;
- struct inode *inode;
struct dentry *dentry;
const char *res;

@@ -3410,7 +3384,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &seq);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
@@ -3464,7 +3438,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
+ res = step_into(nd, WALK_TRAILING, dentry, seq);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;

2022-07-04 08:37:07

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 4:53 AM Al Viro <[email protected]> wrote:
>
> On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote:
>
> > Al - can you please take a quick look?
>
> FWIW, trying to write a coherent documentation had its usual effect...
> The thing is, we don't really need to fetch the inode that early.
> All we really care about is that in RCU mode ->d_seq gets sampled
> before we fetch ->d_inode *and* we don't treat "it looks negative"
> as hard -ENOENT in case of ->d_seq mismatch.
>
> Which can be bloody well left to step_into(). So we don't need
> to pass it inode argument at all - just dentry and seq. Makes
> a bunch of functions simpler as well...
>
> It does *not* deal with the "uninitialized" seq argument in
> !RCU case; I'll handle that in the followup, but that's a separate
> story, IMO (and very clearly a false positive).

I can confirm that your patch fixes KMSAN reports on inode, yet the
following reports still persist:

=====================================================
BUG: KMSAN: uninit-value in walk_component+0x5e7/0x6c0 fs/namei.c:1996
walk_component+0x5e7/0x6c0 fs/namei.c:1996
lookup_last fs/namei.c:2445
path_lookupat+0x27d/0x6f0 fs/namei.c:2468
filename_lookup+0x24c/0x800 fs/namei.c:2497
kern_path+0x79/0x3a0 fs/namei.c:2587
init_stat+0x72/0x13f fs/init.c:132
clean_path+0x74/0x24c init/initramfs.c:339
do_name+0x12d/0xc17 init/initramfs.c:371
write_buffer init/initramfs.c:457
unpack_to_rootfs+0x49a/0xd9e init/initramfs.c:510
do_populate_rootfs+0x57/0x40f init/initramfs.c:699
async_run_entry_fn+0x8f/0x400 kernel/async.c:127
process_one_work+0xb27/0x13e0 kernel/workqueue.c:2289
worker_thread+0x1076/0x1d60 kernel/workqueue.c:2436
kthread+0x31b/0x430 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 ??:?

Local variable seq created at:
walk_component+0x46/0x6c0 fs/namei.c:1981
lookup_last fs/namei.c:2445
path_lookupat+0x27d/0x6f0 fs/namei.c:2468

CPU: 0 PID: 10 Comm: kworker/u9:0 Tainted: G B
5.19.0-rc4-00059-gcf2d25715943-dirty #103
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: events_unbound async_run_entry_fn
=====================================================

What makes you think they are false positives? Is the scenario I
described above:

"""
In particular, if the call to lookup_fast() in walk_component()
returns NULL, and lookup_slow() returns a valid dentry, then the
`seq` and `inode` will remain uninitialized until the call to
step_into()
"""

impossible?

> Cumulative diff follows; splitup is in #work.namei. Comments?
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1f28d3f463c3..7f4f61ade9e3 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1467,7 +1467,7 @@ EXPORT_SYMBOL(follow_down);
> * we meet a managed dentry that would need blocking.
> */
> static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
> - struct inode **inode, unsigned *seqp)
> + unsigned *seqp)
> {
> struct dentry *dentry = path->dentry;
> unsigned int flags = dentry->d_flags;
> @@ -1497,13 +1497,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
> dentry = path->dentry = mounted->mnt.mnt_root;
> nd->state |= ND_JUMPED;
> *seqp = read_seqcount_begin(&dentry->d_seq);
> - *inode = dentry->d_inode;
> - /*
> - * We don't need to re-check ->d_seq after this
> - * ->d_inode read - there will be an RCU delay
> - * between mount hash removal and ->mnt_root
> - * becoming unpinned.
> - */
> flags = dentry->d_flags;
> continue;
> }
> @@ -1515,8 +1508,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
> }
>
> static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
> - struct path *path, struct inode **inode,
> - unsigned int *seqp)
> + struct path *path, unsigned int *seqp)
> {
> bool jumped;
> int ret;
> @@ -1525,9 +1517,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
> path->dentry = dentry;
> if (nd->flags & LOOKUP_RCU) {
> unsigned int seq = *seqp;
> - if (unlikely(!*inode))
> - return -ENOENT;
> - if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
> + if (likely(__follow_mount_rcu(nd, path, seqp)))
> return 0;
> if (!try_to_unlazy_next(nd, dentry, seq))
> return -ECHILD;
> @@ -1547,7 +1537,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
> if (path->mnt != nd->path.mnt)
> mntput(path->mnt);
> } else {
> - *inode = d_backing_inode(path->dentry);
> *seqp = 0; /* out of RCU mode, so the value doesn't matter */
> }
> return ret;
> @@ -1607,9 +1596,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
> return dentry;
> }
>
> -static struct dentry *lookup_fast(struct nameidata *nd,
> - struct inode **inode,
> - unsigned *seqp)
> +static struct dentry *lookup_fast(struct nameidata *nd, unsigned *seqp)
> {
> struct dentry *dentry, *parent = nd->path.dentry;
> int status = 1;
> @@ -1628,22 +1615,11 @@ static struct dentry *lookup_fast(struct nameidata *nd,
> return NULL;
> }
>
> - /*
> - * This sequence count validates that the inode matches
> - * the dentry name information from lookup.
> - */
> - *inode = d_backing_inode(dentry);
> - if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
> - return ERR_PTR(-ECHILD);
> -
> - /*
> + /*
> * This sequence count validates that the parent had no
> * changes while we did the lookup of the dentry above.
> - *
> - * The memory barrier in read_seqcount_begin of child is
> - * enough, we can use __read_seqcount_retry here.
> */
> - if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
> + if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
> return ERR_PTR(-ECHILD);
>
> *seqp = seq;
> @@ -1838,13 +1814,21 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
> * for the common case.
> */
> static const char *step_into(struct nameidata *nd, int flags,
> - struct dentry *dentry, struct inode *inode, unsigned seq)
> + struct dentry *dentry, unsigned seq)
> {
> struct path path;
> - int err = handle_mounts(nd, dentry, &path, &inode, &seq);
> + struct inode *inode;
> + int err = handle_mounts(nd, dentry, &path, &seq);
>
> if (err < 0)
> return ERR_PTR(err);
> + inode = path.dentry->d_inode;
> + if (unlikely(!inode)) {
> + if ((nd->flags & LOOKUP_RCU) &&
> + read_seqcount_retry(&path.dentry->d_seq, seq))
> + return ERR_PTR(-ECHILD);
> + return ERR_PTR(-ENOENT);
> + }
> if (likely(!d_is_symlink(path.dentry)) ||
> ((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
> (flags & WALK_NOFOLLOW)) {
> @@ -1870,9 +1854,7 @@ static const char *step_into(struct nameidata *nd, int flags,
> return pick_link(nd, &path, inode, seq, flags);
> }
>
> -static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
> - struct inode **inodep,
> - unsigned *seqp)
> +static struct dentry *follow_dotdot_rcu(struct nameidata *nd, unsigned *seqp)
> {
> struct dentry *parent, *old;
>
> @@ -1895,7 +1877,6 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
> }
> old = nd->path.dentry;
> parent = old->d_parent;
> - *inodep = parent->d_inode;
> *seqp = read_seqcount_begin(&parent->d_seq);
> if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
> return ERR_PTR(-ECHILD);
> @@ -1910,9 +1891,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
> return NULL;
> }
>
> -static struct dentry *follow_dotdot(struct nameidata *nd,
> - struct inode **inodep,
> - unsigned *seqp)
> +static struct dentry *follow_dotdot(struct nameidata *nd, unsigned *seqp)
> {
> struct dentry *parent;
>
> @@ -1937,7 +1916,6 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
> return ERR_PTR(-ENOENT);
> }
> *seqp = 0;
> - *inodep = parent->d_inode;
> return parent;
>
> in_root:
> @@ -1952,7 +1930,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
> if (type == LAST_DOTDOT) {
> const char *error = NULL;
> struct dentry *parent;
> - struct inode *inode;
> unsigned seq;
>
> if (!nd->root.mnt) {
> @@ -1961,17 +1938,17 @@ static const char *handle_dots(struct nameidata *nd, int type)
> return error;
> }
> if (nd->flags & LOOKUP_RCU)
> - parent = follow_dotdot_rcu(nd, &inode, &seq);
> + parent = follow_dotdot_rcu(nd, &seq);
> else
> - parent = follow_dotdot(nd, &inode, &seq);
> + parent = follow_dotdot(nd, &seq);
> if (IS_ERR(parent))
> return ERR_CAST(parent);
> if (unlikely(!parent))
> error = step_into(nd, WALK_NOFOLLOW,
> - nd->path.dentry, nd->inode, nd->seq);
> + nd->path.dentry, nd->seq);
> else
> error = step_into(nd, WALK_NOFOLLOW,
> - parent, inode, seq);
> + parent, seq);
> if (unlikely(error))
> return error;
>
> @@ -1995,7 +1972,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
> static const char *walk_component(struct nameidata *nd, int flags)
> {
> struct dentry *dentry;
> - struct inode *inode;
> unsigned seq;
> /*
> * "." and ".." are special - ".." especially so because it has
> @@ -2007,7 +1983,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
> put_link(nd);
> return handle_dots(nd, nd->last_type);
> }
> - dentry = lookup_fast(nd, &inode, &seq);
> + dentry = lookup_fast(nd, &seq);
> if (IS_ERR(dentry))
> return ERR_CAST(dentry);
> if (unlikely(!dentry)) {
> @@ -2017,7 +1993,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
> }
> if (!(flags & WALK_MORE) && nd->depth)
> put_link(nd);
> - return step_into(nd, flags, dentry, inode, seq);
> + return step_into(nd, flags, dentry, seq);
> }
>
> /*
> @@ -2473,8 +2449,7 @@ static int handle_lookup_down(struct nameidata *nd)
> {
> if (!(nd->flags & LOOKUP_RCU))
> dget(nd->path.dentry);
> - return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
> - nd->path.dentry, nd->inode, nd->seq));
> + return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry, nd->seq));
> }
>
> /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> @@ -3394,7 +3369,6 @@ static const char *open_last_lookups(struct nameidata *nd,
> int open_flag = op->open_flag;
> bool got_write = false;
> unsigned seq;
> - struct inode *inode;
> struct dentry *dentry;
> const char *res;
>
> @@ -3410,7 +3384,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> if (nd->last.name[nd->last.len])
> nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> /* we _can_ be in RCU mode here */
> - dentry = lookup_fast(nd, &inode, &seq);
> + dentry = lookup_fast(nd, &seq);
> if (IS_ERR(dentry))
> return ERR_CAST(dentry);
> if (likely(dentry))
> @@ -3464,7 +3438,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> finish_lookup:
> if (nd->depth)
> put_link(nd);
> - res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
> + res = step_into(nd, WALK_TRAILING, dentry, seq);
> if (unlikely(res))
> nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
> return res;



--
Alexander Potapenko
Software Engineer


2022-07-04 13:56:39

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote:
> On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote:
>
> > What makes you think they are false positives? Is the scenario I
> > described above:
> >
> > """
> > In particular, if the call to lookup_fast() in walk_component()
> > returns NULL, and lookup_slow() returns a valid dentry, then the
> > `seq` and `inode` will remain uninitialized until the call to
> > step_into()
> > """
> >
> > impossible?
>
> Suppose step_into() has been called in non-RCU mode. The first
> thing it does is
> int err = handle_mounts(nd, dentry, &path, &seq);
> if (err < 0)
> return ERR_PTR(err);
>
> And handle_mounts() in non-RCU mode is
> path->mnt = nd->path.mnt;
> path->dentry = dentry;
> if (nd->flags & LOOKUP_RCU) {
> [unreachable code]
> }
> [code not touching seqp]
> if (unlikely(ret)) {
> [code not touching seqp]
> } else {
> *seqp = 0; /* out of RCU mode, so the value doesn't matter */
> }
> return ret;

Make that
[code assigning ret a non-negative value and never using seqp]
if (unlikely(ret)) {
[code never using seqp or ret]
} else {
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;

so if (err < 0) in the caller is equivalent to if (err).

2022-07-04 14:24:50

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote:

> What makes you think they are false positives? Is the scenario I
> described above:
>
> """
> In particular, if the call to lookup_fast() in walk_component()
> returns NULL, and lookup_slow() returns a valid dentry, then the
> `seq` and `inode` will remain uninitialized until the call to
> step_into()
> """
>
> impossible?

Suppose step_into() has been called in non-RCU mode. The first
thing it does is
int err = handle_mounts(nd, dentry, &path, &seq);
if (err < 0)
return ERR_PTR(err);

And handle_mounts() in non-RCU mode is
path->mnt = nd->path.mnt;
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
[unreachable code]
}
[code not touching seqp]
if (unlikely(ret)) {
[code not touching seqp]
} else {
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;

In other words, the value the seq argument of step_into() used to have ends up
never being fetched and, in case step_into() gets past that if (err < 0),
that value is replaced with zero before any further accesses.

So it's a false positive; yes, strictly speaking the compiler is allowed
to do anything whatsoever if it manages to prove that the value is
uninitialized. Realistically, though, especially since unsigned int
is not allowed any trapping representations...

If you want a test stripped of VFS specifics, consider this:

int g(int n, _Bool flag)
{
if (!flag)
n = 0;
return n + 1;
}

int f(int n, _Bool flag)
{
int x;

if (flag)
x = n + 2;
return g(x, flag);
}

Do your tools trigger on it?

2022-07-04 15:59:45

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 3:44 PM Al Viro <[email protected]> wrote:
>
> On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote:
>
> > What makes you think they are false positives? Is the scenario I
> > described above:
> >
> > """
> > In particular, if the call to lookup_fast() in walk_component()
> > returns NULL, and lookup_slow() returns a valid dentry, then the
> > `seq` and `inode` will remain uninitialized until the call to
> > step_into()
> > """
> >
> > impossible?
>
> Suppose step_into() has been called in non-RCU mode. The first
> thing it does is
> int err = handle_mounts(nd, dentry, &path, &seq);
> if (err < 0)
> return ERR_PTR(err);
>
> And handle_mounts() in non-RCU mode is
> path->mnt = nd->path.mnt;
> path->dentry = dentry;
> if (nd->flags & LOOKUP_RCU) {
> [unreachable code]
> }
> [code not touching seqp]
> if (unlikely(ret)) {
> [code not touching seqp]
> } else {
> *seqp = 0; /* out of RCU mode, so the value doesn't matter */
> }
> return ret;
>
> In other words, the value the seq argument of step_into() used to have ends up
> never being fetched and, in case step_into() gets past that if (err < 0),
> that value is replaced with zero before any further accesses.

Oh, I see. That is actually what had been discussed here:
https://lore.kernel.org/linux-toolchains/[email protected]/
Indeed, step_into() in its current implementation does not use `seq`
(which is noted in the patch description ;)), but the question is
whether we want to catch such cases regardless of that.
One of the reasons to do so is standard compliance - passing an
uninitialized value to a function is UB in C11, as Segher pointed out
here: https://lore.kernel.org/linux-toolchains/[email protected]/
The compilers may not be smart enough to take advantage of this _yet_,
but I wouldn't underestimate their ability to evolve (especially that
of Clang).
I also believe it's fragile to rely on the callee to ignore certain
parameters: it may be doing so today, but if someone changes
step_into() tomorrow we may miss it.

If I am reading Linus's message here (and the following one from him
in the same thread):
https://lore.kernel.org/linux-toolchains/CAHk-=whjz3wO8zD+itoerphWem+JZz4uS3myf6u1Wd6epGRgmQ@mail.gmail.com/
correctly, we should be reporting uninitialized values passed to
functions, unless those values dissolve after inlining.
While this is a bit of a vague criterion, at least for Clang we always
know that KMSAN instrumentation is applied after inlining, so the
reports we see are due to values that are actually passed between
functions.

> So it's a false positive; yes, strictly speaking the compiler is allowed
> to do anything whatsoever if it manages to prove that the value is
> uninitialized. Realistically, though, especially since unsigned int
> is not allowed any trapping representations...
>
> If you want a test stripped of VFS specifics, consider this:
>
> int g(int n, _Bool flag)
> {
> if (!flag)
> n = 0;
> return n + 1;
> }
>
> int f(int n, _Bool flag)
> {
> int x;
>
> if (flag)
> x = n + 2;
> return g(x, flag);
> }
>
> Do your tools trigger on it?

Currently KMSAN has two modes of operation controlled by
CONFIG_KMSAN_CHECK_PARAM_RETVAL.
When enabled, that config enforces checks of function parameters at
call sites (by applying Clang's -fsanitize-memory-param-retval flag).
In that mode the tool would report the attempt to call `g(x, false)`
if g() survives inlining.
In the case CONFIG_KMSAN_CHECK_PARAM_RETVAL=n, KMSAN won't be
reporting the error.
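(To show what that call-site check conceptually looks like for your f()/g()
example above -- a rough sketch only, kmsan_check_arg() is a made-up name
standing in for the shadow check the compiler emits, not the actual KMSAN
runtime API:)

int f(int n, _Bool flag)
{
	int x;			/* shadow of x starts fully poisoned */

	if (flag)
		x = n + 2;	/* shadow of x copied from shadow of n */

	/* inserted by -fsanitize-memory-param-retval at the call site;
	   reports if the shadow of x says it may be uninitialized */
	kmsan_check_arg(&x, sizeof(x));
	return g(x, flag);
}

If g() gets inlined instead, x is either assigned n + 2 or zeroed by the
inlined "if (!flag)" branch before it is read, so no uninitialized use
survives -- that's the "dissolves after inlining" case from above.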

Based on the mentioned discussion I decided to make
CONFIG_KMSAN_CHECK_PARAM_RETVAL=y the default option.
So far it only reported a handful of errors (i.e. enforcing this rule
shouldn't be very problematic for the kernel), but it simplifies
handling of calls between instrumented and non-instrumented functions
that occur e.g. at syscall entry points: knowing that passed-by-value
arguments are checked at call sites, we can assume they are
initialized in the callees.


HTH,
Alex

--
Alexander Potapenko
Software Engineer



This e-mail is confidential. If you received this communication by
mistake, please don't forward it to anyone else, please erase all
copies and attachments, and please let me know that it has gone to the
wrong person.

2022-07-04 16:13:05

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote:
> On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote:
>
> > What makes you think they are false positives? Is the scenario I
> > described above:
> >
> > """
> > In particular, if the call to lookup_fast() in walk_component()
> > returns NULL, and lookup_slow() returns a valid dentry, then the
> > `seq` and `inode` will remain uninitialized until the call to
> > step_into()
> > """
> >
> > impossible?
>
> Suppose step_into() has been called in non-RCU mode. The first
> thing it does is
> int err = handle_mounts(nd, dentry, &path, &seq);
> if (err < 0)
> return ERR_PTR(err);
>
> And handle_mounts() in non-RCU mode is
> path->mnt = nd->path.mnt;
> path->dentry = dentry;
> if (nd->flags & LOOKUP_RCU) {
> [unreachable code]
> }
> [code not touching seqp]
> if (unlikely(ret)) {
> [code not touching seqp]
> } else {
> *seqp = 0; /* out of RCU mode, so the value doesn't matter */
> }
> return ret;
>
> In other words, the value the seq argument of step_into() used to have ends up
> never being fetched and, in case step_into() gets past that if (err < 0),
> that value is replaced with zero before any further accesses.
>
> So it's a false positive; yes, strictly speaking the compiler is allowed
> to do anything whatsoever if it manages to prove that the value is
> uninitialized. Realistically, though, especially since unsigned int
> is not allowed any trapping representations...

FWIW, updated (and yet untested) branch is in #work.namei. Compared to the
previous, we store sampled ->d_seq of the next dentry in nd->next_seq,
rather than bothering with local variables. AFAICS, it ends up with
better code that way. And both ->seq and ->next_seq are zeroed at the
moments when we switch to non-RCU mode (as well as non-RCU path_init()).

IMO it looks saner that way. NOTE: it still needs to be tested and probably
reordered and massaged; it's not for merge at the moment. Current cumulative
diff follows:

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..b12c6b53579d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -567,7 +567,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags, state;
- unsigned seq, m_seq, r_seq;
+ unsigned seq, next_seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -772,6 +772,7 @@ static bool try_to_unlazy(struct nameidata *nd)
goto out;
if (unlikely(!legitimize_root(nd)))
goto out;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
BUG_ON(nd->inode != parent->d_inode);
return true;
@@ -780,6 +781,7 @@ static bool try_to_unlazy(struct nameidata *nd)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
}
@@ -788,7 +790,6 @@ static bool try_to_unlazy(struct nameidata *nd)
* try_to_unlazy_next - try to switch to ref-walk mode.
* @nd: nameidata pathwalk data
* @dentry: next dentry to step into
- * @seq: seq number to check @dentry against
* Returns: true on success, false on failure
*
* Similar to try_to_unlazy(), but here we have the next dentry already
@@ -797,7 +798,7 @@ static bool try_to_unlazy(struct nameidata *nd)
* Nothing should touch nameidata between try_to_unlazy_next() failure and
* terminate_walk().
*/
-static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsigned seq)
+static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
{
BUG_ON(!(nd->flags & LOOKUP_RCU));

@@ -818,7 +819,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!lockref_get_not_dead(&dentry->d_lockref)))
goto out;
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
+ if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq)))
goto out_dput;
/*
* Sequence counts matched. Now make sure that the root is
@@ -826,6 +827,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!legitimize_root(nd)))
goto out_dput;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return true;

@@ -834,9 +836,11 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
out1:
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
out_dput:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
dput(dentry);
return false;
@@ -1466,8 +1470,7 @@ EXPORT_SYMBOL(follow_down);
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
*/
-static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode, unsigned *seqp)
+static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1496,14 +1499,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
path->mnt = &mounted->mnt;
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
- *seqp = read_seqcount_begin(&dentry->d_seq);
- *inode = dentry->d_inode;
- /*
- * We don't need to re-check ->d_seq after this
- * ->d_inode read - there will be an RCU delay
- * between mount hash removal and ->mnt_root
- * becoming unpinned.
- */
+ nd->next_seq = read_seqcount_begin(&dentry->d_seq);
flags = dentry->d_flags;
continue;
}
@@ -1515,8 +1511,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode,
- unsigned int *seqp)
+ struct path *path)
{
bool jumped;
int ret;
@@ -1524,16 +1519,15 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->mnt = nd->path.mnt;
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
- unsigned int seq = *seqp;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ unsigned int seq = nd->next_seq;
+ if (likely(__follow_mount_rcu(nd, path)))
return 0;
- if (!try_to_unlazy_next(nd, dentry, seq))
- return -ECHILD;
- // *path might've been clobbered by __follow_mount_rcu()
+ // *path and nd->next_seq might've been just clobbered
path->mnt = nd->path.mnt;
path->dentry = dentry;
+ nd->next_seq = seq;
+ if (!try_to_unlazy_next(nd, dentry))
+ return -ECHILD;
}
ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
if (jumped) {
@@ -1546,9 +1540,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
dput(path->dentry);
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
- } else {
- *inode = d_backing_inode(path->dentry);
- *seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;
}
@@ -1607,9 +1598,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}

-static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1620,37 +1609,24 @@ static struct dentry *lookup_fast(struct nameidata *nd,
* going to fall back to non-racy lookup.
*/
if (nd->flags & LOOKUP_RCU) {
- unsigned seq;
- dentry = __d_lookup_rcu(parent, &nd->last, &seq);
+ dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
if (unlikely(!dentry)) {
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
return NULL;
}

- /*
- * This sequence count validates that the inode matches
- * the dentry name information from lookup.
- */
- *inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return ERR_PTR(-ECHILD);
-
- /*
+ /*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
- *
- * The memory barrier in read_seqcount_begin of child is
- * enough, we can use __read_seqcount_retry here.
*/
- if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
+ if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);

- *seqp = seq;
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
return dentry;
- if (!try_to_unlazy_next(nd, dentry, seq))
+ if (!try_to_unlazy_next(nd, dentry))
return ERR_PTR(-ECHILD);
if (status == -ECHILD)
/* we'd been told to redo it in non-rcu mode */
@@ -1731,7 +1707,7 @@ static inline int may_lookup(struct user_namespace *mnt_userns,
return inode_permission(mnt_userns, nd->inode, MAY_EXEC);
}

-static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
+static int reserve_stack(struct nameidata *nd, struct path *link)
{
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS))
return -ELOOP;
@@ -1746,7 +1722,7 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
if (nd->flags & LOOKUP_RCU) {
// we need to grab link before we do unlazy. And we can't skip
// unlazy even if we fail to grab the link - cleanup needs it
- bool grabbed_link = legitimize_path(nd, link, seq);
+ bool grabbed_link = legitimize_path(nd, link, nd->next_seq);

if (!try_to_unlazy(nd) || !grabbed_link)
return -ECHILD;
@@ -1760,11 +1736,11 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};

static const char *pick_link(struct nameidata *nd, struct path *link,
- struct inode *inode, unsigned seq, int flags)
+ struct inode *inode, int flags)
{
struct saved *last;
const char *res;
- int error = reserve_stack(nd, link, seq);
+ int error = reserve_stack(nd, link);

if (unlikely(error)) {
if (!(nd->flags & LOOKUP_RCU))
@@ -1774,7 +1750,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- last->seq = seq;
+ last->seq = nd->next_seq;

if (flags & WALK_TRAILING) {
error = may_follow_link(nd, inode);
@@ -1836,15 +1812,25 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* to do this check without having to look at inode->i_op,
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
+ *
+ * NOTE: nd->next_seq must be the validated sampled dentry->d_seq
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode, unsigned seq)
+ struct dentry *dentry)
{
struct path path;
- int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ struct inode *inode;
+ int err = handle_mounts(nd, dentry, &path);

if (err < 0)
return ERR_PTR(err);
+ inode = path.dentry->d_inode;
+ if (unlikely(!inode)) {
+ if ((nd->flags & LOOKUP_RCU) &&
+ read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
+ return ERR_PTR(-ECHILD);
+ return ERR_PTR(-ENOENT);
+ }
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
@@ -1856,23 +1842,21 @@ static const char *step_into(struct nameidata *nd, int flags,
}
nd->path = path;
nd->inode = inode;
- nd->seq = seq;
+ nd->seq = nd->next_seq;
return NULL;
}
if (nd->flags & LOOKUP_RCU) {
/* make sure that d_is_symlink above matches inode */
- if (read_seqcount_retry(&path.dentry->d_seq, seq))
+ if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
return ERR_PTR(-ECHILD);
} else {
if (path.mnt == nd->path.mnt)
mntget(path.mnt);
}
- return pick_link(nd, &path, inode, seq, flags);
+ return pick_link(nd, &path, inode, flags);
}

-static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
{
struct dentry *parent, *old;

@@ -1895,8 +1879,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
}
old = nd->path.dentry;
parent = old->d_parent;
- *inodep = parent->d_inode;
- *seqp = read_seqcount_begin(&parent->d_seq);
+ nd->next_seq = read_seqcount_begin(&parent->d_seq);
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
if (unlikely(!path_connected(nd->path.mnt, parent)))
@@ -1907,12 +1890,11 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return ERR_PTR(-ECHILD);
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-ECHILD);
- return NULL;
+ nd->next_seq = nd->seq;
+ return nd->path.dentry;
}

-static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot(struct nameidata *nd)
{
struct dentry *parent;

@@ -1936,15 +1918,12 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
dput(parent);
return ERR_PTR(-ENOENT);
}
- *seqp = 0;
- *inodep = parent->d_inode;
return parent;

in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
- dget(nd->path.dentry);
- return NULL;
+ return dget(nd->path.dentry);
}

static const char *handle_dots(struct nameidata *nd, int type)
@@ -1952,8 +1931,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
const char *error = NULL;
struct dentry *parent;
- struct inode *inode;
- unsigned seq;

if (!nd->root.mnt) {
error = ERR_PTR(set_root(nd));
@@ -1961,17 +1938,12 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode, &seq);
+ parent = follow_dotdot_rcu(nd);
else
- parent = follow_dotdot(nd, &inode, &seq);
+ parent = follow_dotdot(nd);
if (IS_ERR(parent))
return ERR_CAST(parent);
- if (unlikely(!parent))
- error = step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
- else
- error = step_into(nd, WALK_NOFOLLOW,
- parent, inode, seq);
+ error = step_into(nd, WALK_NOFOLLOW, parent);
if (unlikely(error))
return error;

@@ -1995,8 +1967,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
- unsigned seq;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
@@ -2007,7 +1977,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -2017,7 +1987,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode, seq);
+ return step_into(nd, flags, dentry);
}

/*
@@ -2372,6 +2342,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
flags &= ~LOOKUP_RCU;
if (flags & LOOKUP_RCU)
rcu_read_lock();
+ else
+ nd->seq = nd->next_seq = 0;

nd->flags = flags;
nd->state |= ND_JUMPED;
@@ -2473,8 +2445,8 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq));
+ nd->next_seq = nd->seq;
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3393,8 +3365,6 @@ static const char *open_last_lookups(struct nameidata *nd,
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
- unsigned seq;
- struct inode *inode;
struct dentry *dentry;
const char *res;

@@ -3410,7 +3380,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
@@ -3464,7 +3434,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
+ res = step_into(nd, WALK_TRAILING, dentry);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;

2022-07-04 16:13:07

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote:
> This e-mail is confidential. If you received this communication by
> mistake, please don't forward it to anyone else, please erase all
> copies and attachments, and please let me know that it has gone to the
> wrong person.

This is not compatible with Linux kernel development, sorry.

Now deleted.

2022-07-04 16:40:03

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 6:03 PM Greg Kroah-Hartman
<[email protected]> wrote:
>
> On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote:
> > This e-mail is confidential. If you received this communication by
> > mistake, please don't forward it to anyone else, please erase all
> > copies and attachments, and please let me know that it has gone to the
> > wrong person.
>
> This is not compatible with Linux kernel development, sorry.
>
> Now deleted.

Sorry, I shouldn't have added those to public emails.
Apologies for the inconvenience.

--
Alexander Potapenko
Software Engineer


2022-07-04 17:00:56

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 6:00 PM Al Viro <[email protected]> wrote:
>
> On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote:
> > On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote:
> >
> > > What makes you think they are false positives? Is the scenario I
> > > described above:
> > >
> > > """
> > > In particular, if the call to lookup_fast() in walk_component()
> > > returns NULL, and lookup_slow() returns a valid dentry, then the
> > > `seq` and `inode` will remain uninitialized until the call to
> > > step_into()
> > > """
> > >
> > > impossible?
> >
> > Suppose step_into() has been called in non-RCU mode. The first
> > thing it does is
> > int err = handle_mounts(nd, dentry, &path, &seq);
> > if (err < 0)
> > return ERR_PTR(err);
> >
> > And handle_mounts() in non-RCU mode is
> > path->mnt = nd->path.mnt;
> > path->dentry = dentry;
> > if (nd->flags & LOOKUP_RCU) {
> > [unreachable code]
> > }
> > [code not touching seqp]
> > if (unlikely(ret)) {
> > [code not touching seqp]
> > } else {
> > *seqp = 0; /* out of RCU mode, so the value doesn't matter */
> > }
> > return ret;
> >
> > In other words, the value the seq argument of step_into() used to have ends up
> > never being fetched and, in case step_into() gets past that if (err < 0),
> > that value is replaced with zero before any further accesses.
> >
> > So it's a false positive; yes, strictly speaking the compiler is allowed
> > to do anything whatsoever if it manages to prove that the value is
> > uninitialized. Realistically, though, especially since unsigned int
> > is not allowed any trapping representations...
>
> FWIW, updated (and yet untested) branch is in #work.namei. Compared to the
> previous, we store sampled ->d_seq of the next dentry in nd->next_seq,
> rather than bothering with local variables. AFAICS, it ends up with
> better code that way. And both ->seq and ->next_seq are zeroed at the
> moments when we switch to non-RCU mode (as well as non-RCU path_init()).
>
> IMO it looks saner that way. NOTE: it still needs to be tested and probably
> reordered and massaged; it's not for merge at the moment. Current cumulative
> diff follows:

I confirm all KMSAN reports are gone as a result of applying this patch.


--
Alexander Potapenko
Software Engineer


2022-07-04 17:39:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Sun, Jul 3, 2022 at 7:53 PM Al Viro <[email protected]> wrote:
>
> FWIW, trying to write a coherent documentation had its usual effect...
> The thing is, we don't really need to fetch the inode that early.

Hmm. I like the patch, but as I was reading through it, I had a question...

In particular, I'd like it even more if each step when the sequence
number is updated also had a comment about what then protects the
previous sequence number up to and over that new sequence point.

For example, in __follow_mount_rcu(), when we jump to a new mount
point, that sequence has

*seqp = read_seqcount_begin(&dentry->d_seq);

to reset the sequence number to the new path we jumped into.

But I don't actually see what checks the previous sequence number in
that path. We just reset it to the new one.

In contrast, in lookup_fast(), we get the new sequence number from
__d_lookup_rcu(), and then after getting the new one and before
"instantiating" it, we will revalidate the parent sequence number.

So lookup_fast() has that "chain of sequence numbers".

For __follow_mount_rcu it looks like validating the previous sequence
number is left to the caller, which then does try_to_unlazy_next().

So when reading this code, my reaction was that it really would have
been much nicer to have that kind of clear "handoff" of one sequence
number domain to the next that lookup_fast() has.

IOW, I think it would be lovely to clarify the sequence number handoff.

I only quickly scanned your second patch for this, it does seem to at
least collect it all into try_to_unlazy_next().

So maybe you already looked at exactly this, but it would be good to
be quite explicit about the sequence number logic because it's "a bit
opaque" to say the least.

Linus

2022-07-04 18:44:10

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote:
> One of the reasons to do so is standard compliance - passing an
> uninitialized value to a function is UB in C11, as Segher pointed out
> here: https://lore.kernel.org/linux-toolchains/[email protected]/
> The compilers may not be smart enough to take advantage of this _yet_,
> but I wouldn't underestimate their ability to evolve (especially that
> of Clang).

GCC doesn't currently detect this UB, and doesn't even warn or error for
this, although that shouldn't be hard to do: it is all completely local.
An error is warranted here, and then you won't ever get UB either.

> I also believe it's fragile to rely on the callee to ignore certain
> parameters: it may be doing so today, but if someone changes
> step_into() tomorrow we may miss it.

There isn't any choice usually, this is C, do you want varargs? :-)

But yes, you always should only pass "safe" values; callers should do
their part, and not assume the callee will do in the future as it does
now. Defensive programming is mostly about defending your own sanity!


Segher

2022-07-04 19:31:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 12:03 PM Al Viro <[email protected]> wrote:
>
> Anyway, I've thrown a mount_lock check in there, running xfstests to
> see how it goes...

So my reaction had been that it would be good to just do something like this:

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..25c4bcc91142 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1493,11 +1493,18 @@ static bool __follow_mount_rcu(struct n...
if (flags & DCACHE_MOUNTED) {
struct mount *mounted = __lookup_mnt(path->mnt, dentry);
if (mounted) {
+ struct dentry *old_dentry = dentry;
+ unsigned old_seq = *seqp;
+
path->mnt = &mounted->mnt;
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
*seqp = read_seqcount_begin(&dentry->d_seq);
*inode = dentry->d_inode;
+
+ if (read_seqcount_retry(&old_dentry->d_seq, old_seq))
+ return false;
+
/*
* We don't need to re-check ->d_seq after this
* ->d_inode read - there will be an RCU delay

but the above is just whitespace-damaged random monkey-scribbling by
yours truly.

More like a "shouldn't we do something like this" than a serious
patch, in other words.

IOW, it has *NOT* had a lot of real thought behind it. Purely a
"shouldn't we always clearly check the old sequence number after we've
picked up the new one?"

Linus

2022-07-04 19:56:26

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 10:36:05AM -0700, Linus Torvalds wrote:

> For example, in __follow_mount_rcu(), when we jump to a new mount
> point, and that sequence has
>
> *seqp = read_seqcount_begin(&dentry->d_seq);
>
> to reset the sequence number to the new path we jumped into.
>
> But I don't actually see what checks the previous sequence number in
> that path. We just reset it to the new one.

Theoretically it could be a problem. We have /mnt/foo/bar and
/mnt/baz/bar. Something's mounted on /mnt/foo, hiding /mnt/foo/bar.
We start a pathwalk for /mnt/baz/bar,
someone umounts /mnt/foo and swaps /mnt/foo to /mnt/baz before
we get there. We are doomed to get -ECHILD from an attempt to
legitimize in the end, no matter what. However, we might get
a hard error (-ENOENT, for example) before that, if we pick up
the old mount that used to be on top of /mnt/foo (now /mnt/baz)
and had been detached before the damn thing had become /mnt/baz
and notice that there's no "bar" in its root.

It used to be impossible (rename would've failed if the target had
been non-empty and had we managed to empty it first, well, there's
your point when -ENOENT would've been accurate). With exchange...
Yes, it's a possible race.

Might need to add
if (read_seqretry(&mount_lock, nd->m_seq))
return false;
in there. And yes, it's a nice demonstration of how subtle and
brittle RCU pathwalk is - nobody noticed this bit of fun back when
RENAME_EXCHANGE had been added... It got a lot more readable these
days, but...

> For __follow_mount_rcu it looks like validating the previous sequence
> number is left to the caller, which then does try_to_unlazy_next().

Not really - the caller goes there only if we have __follow_mount_rcu()
say "it's too tricky for me, get out of RCU mode and deal with it
there".

Anyway, I've thrown a mount_lock check in there, running xfstests to
see how it goes...

2022-07-04 20:09:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 12:16:24PM -0700, Linus Torvalds wrote:
> On Mon, Jul 4, 2022 at 12:03 PM Al Viro <[email protected]> wrote:
> >
> > Anyway, I've thrown a mount_lock check in there, running xfstests to
> > see how it goes...
>
> So my reaction had been that it would be good to just do something like this:
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1f28d3f463c3..25c4bcc91142 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1493,11 +1493,18 @@ static bool __follow_mount_rcu(struct n...
> if (flags & DCACHE_MOUNTED) {
> struct mount *mounted = __lookup_mnt(path->mnt, dentry);
> if (mounted) {
> + struct dentry *old_dentry = dentry;
> + unsigned old_seq = *seqp;
> +
> path->mnt = &mounted->mnt;
> dentry = path->dentry = mounted->mnt.mnt_root;
> nd->state |= ND_JUMPED;
> *seqp = read_seqcount_begin(&dentry->d_seq);
> *inode = dentry->d_inode;
> +
> + if (read_seqcount_retry(&old_dentry->d_seq, old_seq))
> + return false;
> +
> /*
> * We don't need to re-check ->d_seq after this
> * ->d_inode read - there will be an RCU delay
>
> but the above is just whitespace-damaged random monkey-scribbling by
> yours truly.
>
> More like a "shouldn't we do something like this" than a serious
> patch, in other words.
>
> IOW, it has *NOT* had a lot of real thought behind it. Purely a
> "shouldn't we always clearly check the old sequence number after we've
> picked up the new one?"

You are checking the wrong thing here. It's really about mount_lock -
->d_seq is *not* bumped when we detach or attach in some namespace. If there's
a mismatch, RCU pathwalk is doomed anyway (it'll fail any form of unlazy)
and we might as well bugger off. If it *does* match, we know that both
mountpoint and root had been pinned since before the pathwalk, remain
pinned as of that check and had a mount connecting them all along.
IOW, if we could have arrived to this dentry at any point, we would have
gotten that dentry as the next step.

We sample mount_lock into nd->m_seq in path_init() and we want it to stay
unchanged all along. If it does, all mountpoints and roots we observe are pinned
and their association with each other is stable.

It's not dentry -> dentry, it's dentry -> mount -> dentry. The following
would've been safe:

find mountpoint
sample ->d_seq
verify whatever had led us to mountpoint

sample mount_lock
find mount
verify mountpoint's ->d_seq

find root of mounted
sample its ->d_seq
verify mount_lock

Correct? Now, note that the last step done against the value we'd sampled
in path_init() guarantees that mount hash had not changed through all of
that. Which is to say, we can pretend that we'd found mount before ->d_seq
of mountpoint might've changed, leaving us with

find mountpoint
sample ->d_seq
verify whatever had led us to mountpoint
find mount
find root of mounted
sample its ->d_seq
verify mount_lock
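
(FWIW, a rough mapping of that last list onto __follow_mount_rcu() in RCU
mode, paraphrased from the diffs elsewhere in this thread -- not a literal
quote of the final code:)

	/* "find mountpoint", "sample ->d_seq" and "verify whatever had led
	   us to mountpoint" have already happened in __d_lookup_rcu() and
	   lookup_fast() by the time we get here */
	mounted = __lookup_mnt(path->mnt, dentry);		/* find mount */
	path->mnt = &mounted->mnt;
	dentry = path->dentry = mounted->mnt.mnt_root;		/* find root of mounted */
	nd->next_seq = read_seqcount_begin(&dentry->d_seq);	/* sample its ->d_seq */
	if (read_seqretry(&mount_lock, nd->m_seq))		/* verify mount_lock */
		return false;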

2022-07-04 20:15:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Fri, Jul 01, 2022 at 04:23:09PM +0200, Alexander Potapenko wrote:
> Functions implementing the a_ops->write_end() interface accept the
> `void *fsdata` parameter that is supposed to be initialized by the
> corresponding a_ops->write_begin() (which accepts `void **fsdata`).
>
> However not all a_ops->write_begin() implementations initialize `fsdata`
> unconditionally, so it may get passed uninitialized to a_ops->write_end(),
> resulting in undefined behavior.

... wait, passing an uninitialised variable to a function *which doesn't
actually use it* is now UB? What genius came up with that rule? What
purpose does it serve?

2022-07-04 20:50:05

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Mon, Jul 04, 2022 at 09:07:43PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 01, 2022 at 04:23:09PM +0200, Alexander Potapenko wrote:
> > Functions implementing the a_ops->write_end() interface accept the
> > `void *fsdata` parameter that is supposed to be initialized by the
> > corresponding a_ops->write_begin() (which accepts `void **fsdata`).
> >
> > However not all a_ops->write_begin() implementations initialize `fsdata`
> > unconditionally, so it may get passed uninitialized to a_ops->write_end(),
> > resulting in undefined behavior.
>
> ... wait, passing an uninitialised variable to a function *which doesn't
> actually use it* is now UB? What genius came up with that rule? What
> purpose does it serve?

"The value we are passing might be utter bollocks, but that way it's
obfuscated enough to confuse anyone, compiler included".

Defensive programming, don'cha know?

I would suggest a different way to obfuscate it, though - pass const void **
and leave it for the callee to decide whether they want to dereference it.
It is still 100% dependent upon the ->write_end() being correctly matched
with ->write_begin(), with zero assistance from the compiler, but it does
look, er, safer. Or something.

Of course, a clean way to handle that would be to have
->write_begin() return a partial application of foo_write_end to
whatever it wants for fsdata, to be evaluated where we would currently
call ->write_end(). _That_ could be usefully typechecked, but... we
don't have usable partial application.
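
(FWIW, the closest C approximation would look something like the sketch
below -- the struct and its field names are invented for illustration, this
is not a proposal for the real ->write_begin()/->write_end() API:)

/* what ->write_begin() would hand back instead of filling a bare *fsdata */
struct write_end_cont {
	int (*fn)(struct write_end_cont *self, loff_t pos, unsigned copied);
	void *fsdata;		/* captured by the matching ->write_begin() */
};

static inline int call_write_end(struct write_end_cont *cont,
				 loff_t pos, unsigned copied)
{
	/* evaluated where we'd currently call ->write_end(); a mismatched
	   or unset fsdata can't even be expressed here */
	return cont->fn(cont, pos, copied);
}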

2022-07-04 20:54:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 12:55 PM Al Viro <[email protected]> wrote:
>
> You are checking the wrong thing here. It's really about mount_lock -
> ->d_seq is *not* bumped when we detach or attach in some namespace.

I think we're talking past each other.

Yes, we need to check the mount sequence lock too, because we're doing
that mount traversal.

But I think we *also* need to check the dentry sequence count, because
the dentry itself could have been moved to another parent.

The two are entirely independent, aren't they?

And the dentry sequence point check should go along with the "we're
now updating the sequence point from the old dentry to the new".

The mount point check should go around the "check dentry mount point",
but it's a separate issue from the whole "we are now jumping to a
different dentry, we should check that the previous dentry hasn't
changed".

Linus

2022-07-04 21:06:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 1:24 PM Linus Torvalds
<[email protected]> wrote:
>
> The mount point check should go around the "check dentry mount point",
> but it's a separate issue from the whole "we are now jumping to a
> different dentry, we should check that the previous dentry hasn't
> changed".

Maybe it doesn't really matter, because we never actually end up
dereferencing the previous dentry (exactly since we're following the
mount point on it).

It feels like the sequence point checks are basically tied to the
"we're looking at the inode that the dentry pointed to", and because
the mount-point traversal doesn't need to look at the inode, the
sequence point check also isn't done.

But it feels wrong to traverse a dentry under RCU - even if we don't
then look at the inode itself - without having verified that the
dentry is still valid.

Yes, the d_seq lock protects against the inode going away (aka
"unlink()") and that cannot happen when it's a mount-point.

But it _also_ ends up changing for __d_move() when the name of the
dentry changes.

And I think that name change is relevant even to "look up a mount
point", exactly because we used that name to look up the dentry in the
first place, so if the name is changing, we shouldn't traverse that
mount point.

But I may have just confused myself terminally here.

Linus

2022-07-04 21:06:52

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 01:24:48PM -0700, Linus Torvalds wrote:
> On Mon, Jul 4, 2022 at 12:55 PM Al Viro <[email protected]> wrote:
> >
> > You are checking the wrong thing here. It's really about mount_lock -
> > ->d_seq is *not* bumped when we detach or attach in some namespace.
>
> I think we're talking past each other.

We might be.

> Yes, we need to check the mount sequence lock too, because we're doing
> that mount traversal.
>
> But I think we *also* need to check the dentry sequence count, because
> the dentry itself could have been moved to another parent.

Why is that a problem? It could have been moved to another parent,
but so it could after we'd crossed to the mounted and we wouldn't have
noticed (or cared).

What the chain of seqcount checks gives us is that with some timings
it would be possible to traverse that path, not that it had remained
valid through the entire pathwalk.

What I'm suggesting is to treat transition from mountpoint to mount
as happening instantly, with transition from mount to root sealed by
mount_lock check.

If that succeeds, there had been a possible history in which refwalk
would have passed through the same dentry/mount/dentry and arrived
at the root dentry when it had the sampled ->d_seq value.

Sure, mountpoint might be moved since we'd reached it. And the mount
would move with it, so we can pretend that we'd won the race and got
into the mount before the mountpoint had been moved.

Am I missing something fundamental about the things the sequence of
sampling and verifications gives us? I'd always thought it's about
verifying that resulting history would be possible for a non-RCU
pathwalk with the right timings. What am I missing?

2022-07-04 21:07:30

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 04, 2022 at 01:51:16PM -0700, Linus Torvalds wrote:
> On Mon, Jul 4, 2022 at 1:46 PM Al Viro <[email protected]> wrote:
> >
> > Why is that a problem? It could have been moved to another parent,
> > but so it could after we'd crossed to the mounted and we wouldn't have
> > noticed (or cared).
>
> Yeah, see my other email.
>
> I agree that it might be a "we don't actually care" situation, where
> all we care about is that the name was valid at one point (when we picked
> up that sequence point). So maybe we don't care about closing it.
>
> But even if so, I think it might warrant a comment, because I still
> feel like we're basically "throwing away" our previous sequence point
> information without ever checking it.
>
> Maybe all we ever care about is basically "this sequence point
> protects the dentry inode pointer for the next lookup", and when it
> comes to mount points that ends up being immaterial.

There is a problem, actually, but it's in a different place...
OK, let me try to write something resembling a formal proof and see
what falls out.

2022-07-04 21:07:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Mon, Jul 4, 2022 at 1:46 PM Al Viro <[email protected]> wrote:
>
> Why is that a problem? It could have been moved to another parent,
> but so it could after we'd crossed to the mounted and we wouldn't have
> noticed (or cared).

Yeah, see my other email.

I agree that it might be a "we don't actually care" situation, where
all we care about is that the name was valid at one point (when we picked
up that sequence point). So maybe we don't care about closing it.

But even if so, I think it might warrant a comment, because I still
feel like we're basically "throwing away" our previous sequence point
information without ever checking it.

Maybe all we ever care about is basically "this sequence point
protects the dentry inode pointer for the next lookup", and when it
comes to mount points that ends up being immaterial.

Linus

2022-07-04 23:16:39

by Al Viro

[permalink] [raw]
Subject: [PATCH 1/7] __follow_mount_rcu(): verify that mount_lock remains unchanged

Validate mount_lock seqcount as soon as we cross into mount in RCU
mode. Sure, ->mnt_root is pinned and will remain so until we
do rcu_read_unlock() anyway, and we will eventually fail to unlazy if
the mount_lock had been touched, but we might run into a hard error
(e.g. -ENOENT) before trying to unlazy. And it's possible to end
up with RCU pathwalk racing with rename() and umount() in a way
that would fail with -ENOENT while non-RCU pathwalk would've
succeeded with any timings.

Once upon a time we hadn't needed that, but the analysis had been subtle,
brittle and went out of the window as soon as RENAME_EXCHANGE had been
added.

It's narrow, hard to hit and won't get you anything other than a
stray -ENOENT that could be arranged in a much easier way with the
same privileges, but it's a bug all the same.

Cc: [email protected]
X-sky-is-falling: unlikely
Fixes: da1ce0670c14 "vfs: add cross-rename"
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..4dbf55b37ec6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1505,6 +1505,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
* becoming unpinned.
*/
flags = dentry->d_flags;
+ if (read_seqretry(&mount_lock, nd->m_seq))
+ return false;
continue;
}
if (read_seqretry(&mount_lock, nd->m_seq))
--
2.30.2

2022-07-04 23:17:34

by Al Viro

[permalink] [raw]
Subject: [PATCH 4/7] step_into(): lose inode argument

make handle_mounts() always fetch it. This is just the first step -
the callers of step_into() will stop trying to calculate the sucker,
etc.

The passed value should be equal to dentry->d_inode in all cases;
in RCU mode - fetched after we'd sampled ->d_seq. Might as well
fetch it here. We do need to validate ->d_seq, which duplicates
the check currently done in lookup_fast(); that duplication will
go away shortly.

After that change handle_mounts() always ignores the initial value of
*inode and always sets it on success.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c7c9e88add85..dddbebf92b48 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1532,6 +1532,11 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
unsigned int seq = nd->next_seq;
+ *inode = dentry->d_inode;
+ if (read_seqcount_retry(&dentry->d_seq, seq))
+ return -ECHILD;
+ if (unlikely(!*inode))
+ return -ENOENT;
if (likely(__follow_mount_rcu(nd, path, inode)))
return 0;
// *path and nd->next_seq might've been clobbered
@@ -1842,9 +1847,10 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* NOTE: dentry must be what nd->next_seq had been sampled from.
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode)
+ struct dentry *dentry)
{
struct path path;
+ struct inode *inode;
int err = handle_mounts(nd, dentry, &path, &inode);

if (err < 0)
@@ -1970,7 +1976,7 @@ static const char *handle_dots(struct nameidata *nd, int type)
parent = follow_dotdot(nd, &inode);
if (IS_ERR(parent))
return ERR_CAST(parent);
- error = step_into(nd, WALK_NOFOLLOW, parent, inode);
+ error = step_into(nd, WALK_NOFOLLOW, parent);
if (unlikely(error))
return error;

@@ -2015,7 +2021,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode);
+ return step_into(nd, flags, dentry);
}

/*
@@ -2474,8 +2480,7 @@ static int handle_lookup_down(struct nameidata *nd)
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
nd->next_seq = nd->seq;
- return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode));
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3464,7 +3469,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode);
+ res = step_into(nd, WALK_TRAILING, dentry);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;
--
2.30.2

2022-07-04 23:18:18

by Al Viro

[permalink] [raw]
Subject: [PATCH 5/7] follow_dotdot{,_rcu}(): don't bother with inode

step_into() will fetch it, TYVM.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index dddbebf92b48..fe95fe39634c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1880,8 +1880,7 @@ static const char *step_into(struct nameidata *nd, int flags,
return pick_link(nd, &path, inode, flags);
}

-static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep)
+static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
{
struct dentry *parent, *old;

@@ -1905,7 +1904,6 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
}
old = nd->path.dentry;
parent = old->d_parent;
- *inodep = parent->d_inode;
nd->next_seq = read_seqcount_begin(&parent->d_seq);
// makes sure that non-RCU pathwalk could reach this state
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
@@ -1919,12 +1917,10 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-ECHILD);
nd->next_seq = nd->seq;
- *inodep = nd->path.dentry->d_inode;
return nd->path.dentry;
}

-static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep)
+static struct dentry *follow_dotdot(struct nameidata *nd)
{
struct dentry *parent;

@@ -1948,13 +1944,11 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
dput(parent);
return ERR_PTR(-ENOENT);
}
- *inodep = parent->d_inode;
return parent;

in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
- *inodep = nd->path.dentry->d_inode;
return dget(nd->path.dentry);
}

@@ -1963,7 +1957,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
const char *error = NULL;
struct dentry *parent;
- struct inode *inode;

if (!nd->root.mnt) {
error = ERR_PTR(set_root(nd));
@@ -1971,9 +1964,9 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode);
+ parent = follow_dotdot_rcu(nd);
else
- parent = follow_dotdot(nd, &inode);
+ parent = follow_dotdot(nd);
if (IS_ERR(parent))
return ERR_CAST(parent);
error = step_into(nd, WALK_NOFOLLOW, parent);
--
2.30.2

2022-07-04 23:18:29

by Al Viro

[permalink] [raw]
Subject: [PATCH 6/7] lookup_fast(): don't bother with inode

Note that validation of ->d_seq after ->d_inode fetch is gone, along
with fetching of ->d_inode itself.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 24 +++++-------------------
1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fe95fe39634c..cdb61d09df79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,8 +1617,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}

-static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode)
+static struct dentry *lookup_fast(struct nameidata *nd)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1636,22 +1635,11 @@ static struct dentry *lookup_fast(struct nameidata *nd,
return NULL;
}

- /*
- * This sequence count validates that the inode matches
- * the dentry name information from lookup.
- */
- *inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq)))
- return ERR_PTR(-ECHILD);
-
- /*
+ /*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
- *
- * The memory barrier in read_seqcount_begin of child is
- * enough, we can use __read_seqcount_retry here.
*/
- if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
+ if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);

status = d_revalidate(dentry, nd->flags);
@@ -1993,7 +1981,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
@@ -2004,7 +1991,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -3392,7 +3379,6 @@ static const char *open_last_lookups(struct nameidata *nd,
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
- struct inode *inode;
struct dentry *dentry;
const char *res;

@@ -3408,7 +3394,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
--
2.30.2

2022-07-04 23:30:50

by Al Viro

[permalink] [raw]
Subject: [PATCH 3/7] namei: stash the sampled ->d_seq into nameidata

New field: nd->next_seq. Set to 0 outside of RCU mode; in RCU mode it holds
the sampled seq value for the next dentry to be considered. Used instead of
an arseload of local variables, arguments, etc.

step_into() has lost its seq argument; nd->next_seq is used instead, so the
dentry passed to it must be the one ->next_seq is about.

There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.

The use of seq numbers is the way we achieve that. The invariant we want
to maintain is:
	if RCU pathwalk can reach a state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and to observe nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq.

For transitions from parent to child, we sample the child's ->d_seq
and verify that the parent's ->d_seq remains unchanged.
For transitions from child to parent, we sample the parent's ->d_seq
and verify that the child's ->d_seq has not changed.
For transitions from mountpoint to the root of the mounted filesystem,
we sample the ->d_seq of that root and verify that nobody has touched
mount_lock since the beginning of pathwalk. That guarantees that the
mount we'd found had been there all along, with that mountpoint and
root. It would be possible for a non-RCU pathwalk to reach the previous
state, find the same mount and observe its root at the moment we'd
sampled that root's ->d_seq.
For transitions from the root of a mounted filesystem to its mountpoint,
we sample the ->d_seq of the mountpoint and verify that mount_lock has
not been touched since the beginning of pathwalk. The same reasoning as
in the previous case applies.
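
In code terms, the parent-to-child rule boils down to something like the
following (an illustrative sketch, not the literal fs/namei.c code;
'parent' and 'child' stand for the two dentries involved):

	/* sample the child's seq count first... */
	nd->next_seq = read_seqcount_begin(&child->d_seq);
	/* ...then verify the parent hasn't changed under us */
	if (read_seqcount_retry(&parent->d_seq, nd->seq))
		return ERR_PTR(-ECHILD);	/* fall back to ref-walk */
	/* child and nd->next_seq are now usable for the next step */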

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 104 ++++++++++++++++++++++++++---------------------------
1 file changed, 52 insertions(+), 52 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ecdb9ac21ece..c7c9e88add85 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -567,7 +567,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags, state;
- unsigned seq, m_seq, r_seq;
+ unsigned seq, next_seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -772,6 +772,7 @@ static bool try_to_unlazy(struct nameidata *nd)
goto out;
if (unlikely(!legitimize_root(nd)))
goto out;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
BUG_ON(nd->inode != parent->d_inode);
return true;
@@ -780,6 +781,7 @@ static bool try_to_unlazy(struct nameidata *nd)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
}
@@ -788,7 +790,6 @@ static bool try_to_unlazy(struct nameidata *nd)
* try_to_unlazy_next - try to switch to ref-walk mode.
* @nd: nameidata pathwalk data
* @dentry: next dentry to step into
- * @seq: seq number to check @dentry against
* Returns: true on success, false on failure
*
* Similar to try_to_unlazy(), but here we have the next dentry already
@@ -797,7 +798,7 @@ static bool try_to_unlazy(struct nameidata *nd)
* Nothing should touch nameidata between try_to_unlazy_next() failure and
* terminate_walk().
*/
-static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsigned seq)
+static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
{
BUG_ON(!(nd->flags & LOOKUP_RCU));

@@ -818,7 +819,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!lockref_get_not_dead(&dentry->d_lockref)))
goto out;
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
+ if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq)))
goto out_dput;
/*
* Sequence counts matched. Now make sure that the root is
@@ -826,6 +827,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!legitimize_root(nd)))
goto out_dput;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return true;

@@ -834,9 +836,11 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
out1:
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
out_dput:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
dput(dentry);
return false;
@@ -1467,7 +1471,7 @@ EXPORT_SYMBOL(follow_down);
* we meet a managed dentry that would need blocking.
*/
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode, unsigned *seqp)
+ struct inode **inode)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1496,7 +1500,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
path->mnt = &mounted->mnt;
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
- *seqp = read_seqcount_begin(&dentry->d_seq);
+ nd->next_seq = read_seqcount_begin(&dentry->d_seq);
*inode = dentry->d_inode;
/*
* We don't need to re-check ->d_seq after this
@@ -1505,6 +1509,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
* becoming unpinned.
*/
flags = dentry->d_flags;
+ // makes sure that non-RCU pathwalk could reach
+ // this state.
if (read_seqretry(&mount_lock, nd->m_seq))
return false;
continue;
@@ -1517,8 +1523,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode,
- unsigned int *seqp)
+ struct path *path, struct inode **inode)
{
bool jumped;
int ret;
@@ -1526,16 +1531,15 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->mnt = nd->path.mnt;
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
- unsigned int seq = *seqp;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ unsigned int seq = nd->next_seq;
+ if (likely(__follow_mount_rcu(nd, path, inode)))
return 0;
- if (!try_to_unlazy_next(nd, dentry, seq))
- return -ECHILD;
- // *path might've been clobbered by __follow_mount_rcu()
+ // *path and nd->next_seq might've been clobbered
path->mnt = nd->path.mnt;
path->dentry = dentry;
+ nd->next_seq = seq;
+ if (!try_to_unlazy_next(nd, dentry))
+ return -ECHILD;
}
ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
if (jumped) {
@@ -1550,7 +1554,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
mntput(path->mnt);
} else {
*inode = d_backing_inode(path->dentry);
- *seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;
}
@@ -1610,8 +1613,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
}

static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode,
- unsigned *seqp)
+ struct inode **inode)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1622,8 +1624,7 @@ static struct dentry *lookup_fast(struct nameidata *nd,
* going to fall back to non-racy lookup.
*/
if (nd->flags & LOOKUP_RCU) {
- unsigned seq;
- dentry = __d_lookup_rcu(parent, &nd->last, &seq);
+ dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
if (unlikely(!dentry)) {
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
@@ -1635,7 +1636,7 @@ static struct dentry *lookup_fast(struct nameidata *nd,
* the dentry name information from lookup.
*/
*inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
+ if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq)))
return ERR_PTR(-ECHILD);

/*
@@ -1648,11 +1649,10 @@ static struct dentry *lookup_fast(struct nameidata *nd,
if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);

- *seqp = seq;
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
return dentry;
- if (!try_to_unlazy_next(nd, dentry, seq))
+ if (!try_to_unlazy_next(nd, dentry))
return ERR_PTR(-ECHILD);
if (status == -ECHILD)
/* we'd been told to redo it in non-rcu mode */
@@ -1733,7 +1733,7 @@ static inline int may_lookup(struct user_namespace *mnt_userns,
return inode_permission(mnt_userns, nd->inode, MAY_EXEC);
}

-static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
+static int reserve_stack(struct nameidata *nd, struct path *link)
{
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS))
return -ELOOP;
@@ -1748,7 +1748,7 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
if (nd->flags & LOOKUP_RCU) {
// we need to grab link before we do unlazy. And we can't skip
// unlazy even if we fail to grab the link - cleanup needs it
- bool grabbed_link = legitimize_path(nd, link, seq);
+ bool grabbed_link = legitimize_path(nd, link, nd->next_seq);

if (!try_to_unlazy(nd) || !grabbed_link)
return -ECHILD;
@@ -1762,11 +1762,11 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};

static const char *pick_link(struct nameidata *nd, struct path *link,
- struct inode *inode, unsigned seq, int flags)
+ struct inode *inode, int flags)
{
struct saved *last;
const char *res;
- int error = reserve_stack(nd, link, seq);
+ int error = reserve_stack(nd, link);

if (unlikely(error)) {
if (!(nd->flags & LOOKUP_RCU))
@@ -1776,7 +1776,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- last->seq = seq;
+ last->seq = nd->next_seq;

if (flags & WALK_TRAILING) {
error = may_follow_link(nd, inode);
@@ -1838,12 +1838,14 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* to do this check without having to look at inode->i_op,
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
+ *
+ * NOTE: dentry must be what nd->next_seq had been sampled from.
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode, unsigned seq)
+ struct dentry *dentry, struct inode *inode)
{
struct path path;
- int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ int err = handle_mounts(nd, dentry, &path, &inode);

if (err < 0)
return ERR_PTR(err);
@@ -1858,23 +1860,22 @@ static const char *step_into(struct nameidata *nd, int flags,
}
nd->path = path;
nd->inode = inode;
- nd->seq = seq;
+ nd->seq = nd->next_seq;
return NULL;
}
if (nd->flags & LOOKUP_RCU) {
/* make sure that d_is_symlink above matches inode */
- if (read_seqcount_retry(&path.dentry->d_seq, seq))
+ if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
return ERR_PTR(-ECHILD);
} else {
if (path.mnt == nd->path.mnt)
mntget(path.mnt);
}
- return pick_link(nd, &path, inode, seq, flags);
+ return pick_link(nd, &path, inode, flags);
}

static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+ struct inode **inodep)
{
struct dentry *parent, *old;

@@ -1891,6 +1892,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
nd->path = path;
nd->inode = path.dentry->d_inode;
nd->seq = seq;
+ // makes sure that non-RCU pathwalk could reach this state
if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
return ERR_PTR(-ECHILD);
/* we know that mountpoint was pinned */
@@ -1898,7 +1900,8 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
old = nd->path.dentry;
parent = old->d_parent;
*inodep = parent->d_inode;
- *seqp = read_seqcount_begin(&parent->d_seq);
+ nd->next_seq = read_seqcount_begin(&parent->d_seq);
+ // makes sure that non-RCU pathwalk could reach this state
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
if (unlikely(!path_connected(nd->path.mnt, parent)))
@@ -1909,14 +1912,13 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return ERR_PTR(-ECHILD);
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-ECHILD);
- *seqp = nd->seq;
+ nd->next_seq = nd->seq;
*inodep = nd->path.dentry->d_inode;
return nd->path.dentry;
}

static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+ struct inode **inodep)
{
struct dentry *parent;

@@ -1940,14 +1942,12 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
dput(parent);
return ERR_PTR(-ENOENT);
}
- *seqp = 0;
*inodep = parent->d_inode;
return parent;

in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
- *seqp = 0;
*inodep = nd->path.dentry->d_inode;
return dget(nd->path.dentry);
}
@@ -1958,7 +1958,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
const char *error = NULL;
struct dentry *parent;
struct inode *inode;
- unsigned seq;

if (!nd->root.mnt) {
error = ERR_PTR(set_root(nd));
@@ -1966,12 +1965,12 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode, &seq);
+ parent = follow_dotdot_rcu(nd, &inode);
else
- parent = follow_dotdot(nd, &inode, &seq);
+ parent = follow_dotdot(nd, &inode);
if (IS_ERR(parent))
return ERR_CAST(parent);
- error = step_into(nd, WALK_NOFOLLOW, parent, inode, seq);
+ error = step_into(nd, WALK_NOFOLLOW, parent, inode);
if (unlikely(error))
return error;

@@ -1996,7 +1995,6 @@ static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
struct inode *inode;
- unsigned seq;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
@@ -2007,7 +2005,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &inode);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -2017,7 +2015,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode, seq);
+ return step_into(nd, flags, dentry, inode);
}

/*
@@ -2372,6 +2370,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
flags &= ~LOOKUP_RCU;
if (flags & LOOKUP_RCU)
rcu_read_lock();
+ else
+ nd->seq = nd->next_seq = 0;

nd->flags = flags;
nd->state |= ND_JUMPED;
@@ -2473,8 +2473,9 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
+ nd->next_seq = nd->seq;
return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq));
+ nd->path.dentry, nd->inode));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3393,7 +3394,6 @@ static const char *open_last_lookups(struct nameidata *nd,
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
- unsigned seq;
struct inode *inode;
struct dentry *dentry;
const char *res;
@@ -3410,7 +3410,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &inode);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
@@ -3464,7 +3464,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
+ res = step_into(nd, WALK_TRAILING, dentry, inode);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;
--
2.30.2

2022-07-04 23:37:59

by Al Viro

[permalink] [raw]
Subject: [PATCH 7/7] step_into(): move fetching ->d_inode past handle_mounts()

... and lose messing with it in __follow_mount_rcu()

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 31 +++++++++++--------------------
1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index cdb61d09df79..f2c99e75b578 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1470,8 +1470,7 @@ EXPORT_SYMBOL(follow_down);
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
*/
-static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode)
+static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1501,13 +1500,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
nd->next_seq = read_seqcount_begin(&dentry->d_seq);
- *inode = dentry->d_inode;
- /*
- * We don't need to re-check ->d_seq after this
- * ->d_inode read - there will be an RCU delay
- * between mount hash removal and ->mnt_root
- * becoming unpinned.
- */
flags = dentry->d_flags;
// makes sure that non-RCU pathwalk could reach
// this state.
@@ -1523,7 +1515,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode)
+ struct path *path)
{
bool jumped;
int ret;
@@ -1532,12 +1524,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
unsigned int seq = nd->next_seq;
- *inode = dentry->d_inode;
- if (read_seqcount_retry(&dentry->d_seq, seq))
- return -ECHILD;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode)))
+ if (likely(__follow_mount_rcu(nd, path)))
return 0;
// *path and nd->next_seq might've been clobbered
path->mnt = nd->path.mnt;
@@ -1557,8 +1544,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
dput(path->dentry);
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
- } else {
- *inode = d_backing_inode(path->dentry);
}
return ret;
}
@@ -1839,15 +1824,21 @@ static const char *step_into(struct nameidata *nd, int flags,
{
struct path path;
struct inode *inode;
- int err = handle_mounts(nd, dentry, &path, &inode);
+ int err = handle_mounts(nd, dentry, &path);

if (err < 0)
return ERR_PTR(err);
+ inode = path.dentry->d_inode;
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
/* not a symlink or should not follow */
- if (!(nd->flags & LOOKUP_RCU)) {
+ if (nd->flags & LOOKUP_RCU) {
+ if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
+ return ERR_PTR(-ECHILD);
+ if (unlikely(!inode))
+ return ERR_PTR(-ENOENT);
+ } else {
dput(nd->path.dentry);
if (nd->path.mnt != path.mnt)
mntput(nd->path.mnt);
--
2.30.2

2022-07-04 23:48:48

by Al Viro

[permalink] [raw]
Subject: [PATCH 2/7] follow_dotdot{,_rcu}(): change calling conventions

Instead of returning NULL when we are in the root, just make it return
the current position (and set *seqp and *inodep accordingly).
That collapses the calls of step_into() in handle_dots().

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4dbf55b37ec6..ecdb9ac21ece 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1909,7 +1909,9 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return ERR_PTR(-ECHILD);
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-ECHILD);
- return NULL;
+ *seqp = nd->seq;
+ *inodep = nd->path.dentry->d_inode;
+ return nd->path.dentry;
}

static struct dentry *follow_dotdot(struct nameidata *nd,
@@ -1945,8 +1947,9 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
- dget(nd->path.dentry);
- return NULL;
+ *seqp = 0;
+ *inodep = nd->path.dentry->d_inode;
+ return dget(nd->path.dentry);
}

static const char *handle_dots(struct nameidata *nd, int type)
@@ -1968,12 +1971,7 @@ static const char *handle_dots(struct nameidata *nd, int type)
parent = follow_dotdot(nd, &inode, &seq);
if (IS_ERR(parent))
return ERR_CAST(parent);
- if (unlikely(!parent))
- error = step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
- else
- error = step_into(nd, WALK_NOFOLLOW,
- parent, inode, seq);
+ error = step_into(nd, WALK_NOFOLLOW, parent, inode, seq);
if (unlikely(error))
return error;

--
2.30.2

2022-07-04 23:56:16

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/7] __follow_mount_rcu(): verify that mount_lock remains unchanged

Just in case - this series is RFC only; it's very lightly
tested. Might be carved into too many steps, at that. Cumulative
delta follows; might or might not be more convenient for review.

diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..f2c99e75b578 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -567,7 +567,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags, state;
- unsigned seq, m_seq, r_seq;
+ unsigned seq, next_seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -772,6 +772,7 @@ static bool try_to_unlazy(struct nameidata *nd)
goto out;
if (unlikely(!legitimize_root(nd)))
goto out;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
BUG_ON(nd->inode != parent->d_inode);
return true;
@@ -780,6 +781,7 @@ static bool try_to_unlazy(struct nameidata *nd)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
}
@@ -788,7 +790,6 @@ static bool try_to_unlazy(struct nameidata *nd)
* try_to_unlazy_next - try to switch to ref-walk mode.
* @nd: nameidata pathwalk data
* @dentry: next dentry to step into
- * @seq: seq number to check @dentry against
* Returns: true on success, false on failure
*
* Similar to try_to_unlazy(), but here we have the next dentry already
@@ -797,7 +798,7 @@ static bool try_to_unlazy(struct nameidata *nd)
* Nothing should touch nameidata between try_to_unlazy_next() failure and
* terminate_walk().
*/
-static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsigned seq)
+static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
{
BUG_ON(!(nd->flags & LOOKUP_RCU));

@@ -818,7 +819,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!lockref_get_not_dead(&dentry->d_lockref)))
goto out;
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
+ if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq)))
goto out_dput;
/*
* Sequence counts matched. Now make sure that the root is
@@ -826,6 +827,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
*/
if (unlikely(!legitimize_root(nd)))
goto out_dput;
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return true;

@@ -834,9 +836,11 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
out1:
nd->path.dentry = NULL;
out:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
return false;
out_dput:
+ nd->seq = nd->next_seq = 0;
rcu_read_unlock();
dput(dentry);
return false;
@@ -1466,8 +1470,7 @@ EXPORT_SYMBOL(follow_down);
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
*/
-static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode, unsigned *seqp)
+static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1496,15 +1499,12 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
path->mnt = &mounted->mnt;
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
- *seqp = read_seqcount_begin(&dentry->d_seq);
- *inode = dentry->d_inode;
- /*
- * We don't need to re-check ->d_seq after this
- * ->d_inode read - there will be an RCU delay
- * between mount hash removal and ->mnt_root
- * becoming unpinned.
- */
+ nd->next_seq = read_seqcount_begin(&dentry->d_seq);
flags = dentry->d_flags;
+ // makes sure that non-RCU pathwalk could reach
+ // this state.
+ if (read_seqretry(&mount_lock, nd->m_seq))
+ return false;
continue;
}
if (read_seqretry(&mount_lock, nd->m_seq))
@@ -1515,8 +1515,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode,
- unsigned int *seqp)
+ struct path *path)
{
bool jumped;
int ret;
@@ -1524,16 +1523,15 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->mnt = nd->path.mnt;
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
- unsigned int seq = *seqp;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ unsigned int seq = nd->next_seq;
+ if (likely(__follow_mount_rcu(nd, path)))
return 0;
- if (!try_to_unlazy_next(nd, dentry, seq))
- return -ECHILD;
- // *path might've been clobbered by __follow_mount_rcu()
+ // *path and nd->next_seq might've been clobbered
path->mnt = nd->path.mnt;
path->dentry = dentry;
+ nd->next_seq = seq;
+ if (!try_to_unlazy_next(nd, dentry))
+ return -ECHILD;
}
ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
if (jumped) {
@@ -1546,9 +1544,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
dput(path->dentry);
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
- } else {
- *inode = d_backing_inode(path->dentry);
- *seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;
}
@@ -1607,9 +1602,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}

-static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1620,37 +1613,24 @@ static struct dentry *lookup_fast(struct nameidata *nd,
* going to fall back to non-racy lookup.
*/
if (nd->flags & LOOKUP_RCU) {
- unsigned seq;
- dentry = __d_lookup_rcu(parent, &nd->last, &seq);
+ dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
if (unlikely(!dentry)) {
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
return NULL;
}

- /*
- * This sequence count validates that the inode matches
- * the dentry name information from lookup.
- */
- *inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return ERR_PTR(-ECHILD);
-
- /*
+ /*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
- *
- * The memory barrier in read_seqcount_begin of child is
- * enough, we can use __read_seqcount_retry here.
*/
- if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
+ if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);

- *seqp = seq;
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
return dentry;
- if (!try_to_unlazy_next(nd, dentry, seq))
+ if (!try_to_unlazy_next(nd, dentry))
return ERR_PTR(-ECHILD);
if (status == -ECHILD)
/* we'd been told to redo it in non-rcu mode */
@@ -1731,7 +1711,7 @@ static inline int may_lookup(struct user_namespace *mnt_userns,
return inode_permission(mnt_userns, nd->inode, MAY_EXEC);
}

-static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
+static int reserve_stack(struct nameidata *nd, struct path *link)
{
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS))
return -ELOOP;
@@ -1746,7 +1726,7 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
if (nd->flags & LOOKUP_RCU) {
// we need to grab link before we do unlazy. And we can't skip
// unlazy even if we fail to grab the link - cleanup needs it
- bool grabbed_link = legitimize_path(nd, link, seq);
+ bool grabbed_link = legitimize_path(nd, link, nd->next_seq);

if (!try_to_unlazy(nd) || !grabbed_link)
return -ECHILD;
@@ -1760,11 +1740,11 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};

static const char *pick_link(struct nameidata *nd, struct path *link,
- struct inode *inode, unsigned seq, int flags)
+ struct inode *inode, int flags)
{
struct saved *last;
const char *res;
- int error = reserve_stack(nd, link, seq);
+ int error = reserve_stack(nd, link);

if (unlikely(error)) {
if (!(nd->flags & LOOKUP_RCU))
@@ -1774,7 +1754,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- last->seq = seq;
+ last->seq = nd->next_seq;

if (flags & WALK_TRAILING) {
error = may_follow_link(nd, inode);
@@ -1836,43 +1816,50 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* to do this check without having to look at inode->i_op,
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
+ *
+ * NOTE: dentry must be what nd->next_seq had been sampled from.
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode, unsigned seq)
+ struct dentry *dentry)
{
struct path path;
- int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ struct inode *inode;
+ int err = handle_mounts(nd, dentry, &path);

if (err < 0)
return ERR_PTR(err);
+ inode = path.dentry->d_inode;
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
/* not a symlink or should not follow */
- if (!(nd->flags & LOOKUP_RCU)) {
+ if (nd->flags & LOOKUP_RCU) {
+ if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
+ return ERR_PTR(-ECHILD);
+ if (unlikely(!inode))
+ return ERR_PTR(-ENOENT);
+ } else {
dput(nd->path.dentry);
if (nd->path.mnt != path.mnt)
mntput(nd->path.mnt);
}
nd->path = path;
nd->inode = inode;
- nd->seq = seq;
+ nd->seq = nd->next_seq;
return NULL;
}
if (nd->flags & LOOKUP_RCU) {
/* make sure that d_is_symlink above matches inode */
- if (read_seqcount_retry(&path.dentry->d_seq, seq))
+ if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
return ERR_PTR(-ECHILD);
} else {
if (path.mnt == nd->path.mnt)
mntget(path.mnt);
}
- return pick_link(nd, &path, inode, seq, flags);
+ return pick_link(nd, &path, inode, flags);
}

-static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
{
struct dentry *parent, *old;

@@ -1889,14 +1876,15 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
nd->path = path;
nd->inode = path.dentry->d_inode;
nd->seq = seq;
+ // makes sure that non-RCU pathwalk could reach this state
if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
return ERR_PTR(-ECHILD);
/* we know that mountpoint was pinned */
}
old = nd->path.dentry;
parent = old->d_parent;
- *inodep = parent->d_inode;
- *seqp = read_seqcount_begin(&parent->d_seq);
+ nd->next_seq = read_seqcount_begin(&parent->d_seq);
+ // makes sure that non-RCU pathwalk could reach this state
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
if (unlikely(!path_connected(nd->path.mnt, parent)))
@@ -1907,12 +1895,11 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return ERR_PTR(-ECHILD);
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-ECHILD);
- return NULL;
+ nd->next_seq = nd->seq;
+ return nd->path.dentry;
}

-static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot(struct nameidata *nd)
{
struct dentry *parent;

@@ -1936,15 +1923,12 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
dput(parent);
return ERR_PTR(-ENOENT);
}
- *seqp = 0;
- *inodep = parent->d_inode;
return parent;

in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
- dget(nd->path.dentry);
- return NULL;
+ return dget(nd->path.dentry);
}

static const char *handle_dots(struct nameidata *nd, int type)
@@ -1952,8 +1936,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
const char *error = NULL;
struct dentry *parent;
- struct inode *inode;
- unsigned seq;

if (!nd->root.mnt) {
error = ERR_PTR(set_root(nd));
@@ -1961,17 +1943,12 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode, &seq);
+ parent = follow_dotdot_rcu(nd);
else
- parent = follow_dotdot(nd, &inode, &seq);
+ parent = follow_dotdot(nd);
if (IS_ERR(parent))
return ERR_CAST(parent);
- if (unlikely(!parent))
- error = step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
- else
- error = step_into(nd, WALK_NOFOLLOW,
- parent, inode, seq);
+ error = step_into(nd, WALK_NOFOLLOW, parent);
if (unlikely(error))
return error;

@@ -1995,8 +1972,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
- unsigned seq;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
@@ -2007,7 +1982,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -2017,7 +1992,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode, seq);
+ return step_into(nd, flags, dentry);
}

/*
@@ -2372,6 +2347,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
flags &= ~LOOKUP_RCU;
if (flags & LOOKUP_RCU)
rcu_read_lock();
+ else
+ nd->seq = nd->next_seq = 0;

nd->flags = flags;
nd->state |= ND_JUMPED;
@@ -2473,8 +2450,8 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq));
+ nd->next_seq = nd->seq;
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3393,8 +3370,6 @@ static const char *open_last_lookups(struct nameidata *nd,
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
- unsigned seq;
- struct inode *inode;
struct dentry *dentry;
const char *res;

@@ -3410,7 +3385,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
@@ -3464,7 +3439,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
+ res = step_into(nd, WALK_TRAILING, dentry);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;

2022-07-05 00:41:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] __follow_mount_rcu(): verify that mount_lock remains unchanged

On Mon, Jul 4, 2022 at 4:19 PM Al Viro <[email protected]> wrote:
>
> - unsigned seq, m_seq, r_seq;
> + unsigned seq, next_seq, m_seq, r_seq;

So the main thing I react to here is how "next_seq" is in the "struct
nameidata", but then it always goes together with a "struct dentry"
that you end up having to pass separately (and that is *not* in that
"struct nameidata").

Now, saving the associated dentry (as "next_dentry") in the nd would
solve that, but ends up being ugly since everything then wants to
look at the dentry anyway, so while it would solve the inconsistency,
it would be ugly.

I wonder if the solution might not be to create a new structure like

struct rcu_dentry {
struct dentry *dentry;
unsigned seq;
};

and in fact then we could make __d_lookup_rcu() return one of these
things (we already rely on that "returning a two-word structure is
efficient" elsewhere).

That would then make that "this dentry goes with this sequence number"
be a very clear thing, and I actually think that it would make
__d_lookup_rcu() have a cleaner calling convention too, ie we'd go
from

dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);

to

dseq = __d_lookup_rcu(parent, &nd->last);

and it would even improve code generation because it now returns the
dentry and the sequence number in registers, instead of returning one
in a register and one in memory.
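
For illustration, the lookup_fast() call site might then read roughly
like this (a sketch of the suggestion only; the struct and variable
names are invented, not existing kernel API):

	struct rcu_dentry dseq = __d_lookup_rcu(parent, &nd->last);

	dentry = dseq.dentry;		/* NULL if the lookup failed */
	nd->next_seq = dseq.seq;	/* or keep the pair together */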

I did *not* look at how it would change some of the other places, but
I do like the notion of "keep the dentry and the sequence number that
goes with it together".

That "keep dentry as a local, keep the sequence number that goes with
it as a field in the 'nd'" really does seem an odd thing. So I'm
throwing the above out as a "maybe we could do this instead..".

Not a huge deal. That oddity or not, I think the patch series is an improvement.

I do have a minor gripe with this too:

> + nd->seq = nd->next_seq = 0;

I'm not convinced "0" is a good value.

It's not supposed to match anything, but it *could* match a valid
sequence number. Wouldn't it be better to pick something that is
explicitly invalid and has the low bit set (i.e. 1 or -1)?

We don't seem to have a SEQ_INVAL or anything like that, but it does
seem that if the intent is to make it clear it's not a real sequence
number any more at that point, then 0 isn't great.
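
A sketch of that idea (SEQ_INVAL does not exist in the tree; the name
and value here are made up):

	/* odd value: looks like a write in progress, never matches */
	#define SEQ_INVAL	((unsigned)-1)
	...
	nd->seq = nd->next_seq = SEQ_INVAL;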

But again, this is more of a stylistic detail thing than a real complaint.

Linus

2022-07-05 03:52:22

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/7] __follow_mount_rcu(): verify that mount_lock remains unchanged

On Mon, Jul 04, 2022 at 05:06:17PM -0700, Linus Torvalds wrote:

> I wonder if the solution might not be to create a new structure like
>
> struct rcu_dentry {
> struct dentry *dentry;
> unsigned seq;
> };
>
> and in fact then we could make __d_lookup_rcu() return one of these
> things (we already rely on that "returning a two-word structure is
> efficient" elsewhere).
>
> That would then make that "this dentry goes with this sequence number"
> be a very clear thing, and I actually think that it would make
> __d_lookup_rcu() have a cleaner calling convention too, ie we'd go
> from
>
> dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
>
> to
>
> dseq = __d_lookup_rcu(parent, &nd->last);
>
> and it would even improve code generation because it now returns the
> dentry and the sequence number in registers, instead of returning one
> in a register and one in memory.
>
> I did *not* look at how it would change some of the other places, but
> I do like the notion of "keep the dentry and the sequence number that
> goes with it together".
>
> That "keep dentry as a local, keep the sequence number that goes with
> it as a field in the 'nd'" really does seem an odd thing. So I'm
> throwing the above out as a "maybe we could do this instead..".

I looked into that; turns out to be quite messy, unfortunately. For one
thing, the distance between the places where we get the seq count and
the place where we consume it is large; worse, there's a bunch of paths
where we are in non-RCU mode converging to the same consumer and those
need a 0/1/-1/whatever paired with dentry. Gets very clumsy...

There might be a clever way to deal with pairs cleanly, but I don't see it
at the moment. I'll look into that some more, but...

BTW, how good are gcc and clang at figuring out that e.g.

static int foo(int n)
{
if (likely(n >= 0))
return 0;
....
}

....
if (foo(n))
whatever();

should be treated as
if (unlikely(foo(n)))
whatever();

They certainly do it just fine if the damn thing is inlined (e.g.
all those unlikely(read_seqcount_retry(....)) can and should lose
unlikely), but do they manage that for non-inlined functions in
the same compilation unit? Relatively recent gcc seems to...
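
(For reference, a self-contained version of the case in question, with
likely() spelled out and foo() explicitly kept out of line; whatever()
is just a stand-in:)

	#define likely(x)	__builtin_expect(!!(x), 1)

	extern void whatever(void);

	static __attribute__((noinline)) int foo(int n)
	{
		if (likely(n >= 0))
			return 0;
		return 1;
	}

	void caller(int n)
	{
		if (foo(n))	/* ideally laid out as the unlikely path */
			whatever();
	}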

2022-07-12 14:05:04

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v4 33/45] x86: kmsan: disable instrumentation of unsupported code

On Fri, 1 Jul 2022 at 16:24, 'Alexander Potapenko' via kasan-dev
<[email protected]> wrote:
[...]
> ---
> arch/x86/boot/Makefile | 1 +
> arch/x86/boot/compressed/Makefile | 1 +
> arch/x86/entry/vdso/Makefile | 3 +++
> arch/x86/kernel/Makefile | 2 ++
> arch/x86/kernel/cpu/Makefile | 1 +
> arch/x86/mm/Makefile | 2 ++
> arch/x86/realmode/rm/Makefile | 1 +
> lib/Makefile | 2 ++
[...]
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -272,6 +272,8 @@ obj-$(CONFIG_POLYNOMIAL) += polynomial.o
> CFLAGS_stackdepot.o += -fno-builtin
> obj-$(CONFIG_STACKDEPOT) += stackdepot.o
> KASAN_SANITIZE_stackdepot.o := n
> +# In particular, instrumenting stackdepot.c with KMSAN will result in infinite
> +# recursion.
> KMSAN_SANITIZE_stackdepot.o := n
> KCOV_INSTRUMENT_stackdepot.o := n

This is generic code and not x86; should it have been in the earlier patch?

2022-07-12 14:28:38

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v4 25/45] kmsan: add tests for KMSAN

On Fri, 1 Jul 2022 at 16:24, 'Alexander Potapenko' via kasan-dev
<[email protected]> wrote:
>
> The testing module triggers KMSAN warnings in different cases and checks
> that the errors are properly reported, using console probes to capture
> the tool's output.
>
> Signed-off-by: Alexander Potapenko <[email protected]>
> ---
> v2:
> -- add memcpy tests
>
> v4:
> -- change sizeof(type) to sizeof(*ptr)
> -- add test expectations for CONFIG_KMSAN_CHECK_PARAM_RETVAL
>
> Link: https://linux-review.googlesource.com/id/I49c3f59014cc37fd13541c80beb0b75a75244650
> ---
> lib/Kconfig.kmsan | 12 +
> mm/kmsan/Makefile | 4 +
> mm/kmsan/kmsan_test.c | 552 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 568 insertions(+)
> create mode 100644 mm/kmsan/kmsan_test.c
>
> diff --git a/lib/Kconfig.kmsan b/lib/Kconfig.kmsan
> index 8f768d4034e3c..f56ed7f7c7090 100644
> --- a/lib/Kconfig.kmsan
> +++ b/lib/Kconfig.kmsan
> @@ -47,4 +47,16 @@ config KMSAN_CHECK_PARAM_RETVAL
> may potentially report errors in corner cases when non-instrumented
> functions call instrumented ones.
>
> +config KMSAN_KUNIT_TEST
> + tristate "KMSAN integration test suite" if !KUNIT_ALL_TESTS
> + default KUNIT_ALL_TESTS
> + depends on TRACEPOINTS && KUNIT
> + help
> + Test suite for KMSAN, testing various error detection scenarios,
> + and checking that reports are correctly output to console.
> +
> + Say Y here if you want the test to be built into the kernel and run
> + during boot; say M if you want the test to build as a module; say N
> + if you are unsure.
> +
> endif
> diff --git a/mm/kmsan/Makefile b/mm/kmsan/Makefile
> index 401acb1a491ce..98eab2856626f 100644
> --- a/mm/kmsan/Makefile
> +++ b/mm/kmsan/Makefile
> @@ -22,3 +22,7 @@ CFLAGS_init.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_instrumentation.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_report.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_shadow.o := $(CC_FLAGS_KMSAN_RUNTIME)
> +
> +obj-$(CONFIG_KMSAN_KUNIT_TEST) += kmsan_test.o
> +KMSAN_SANITIZE_kmsan_test.o := y
> +CFLAGS_kmsan_test.o += $(call cc-disable-warning, uninitialized)
> diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
> new file mode 100644
> index 0000000000000..1b8da71ae0d4f
> --- /dev/null
> +++ b/mm/kmsan/kmsan_test.c
> @@ -0,0 +1,552 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Test cases for KMSAN.
> + * For each test case checks the presence (or absence) of generated reports.
> + * Relies on 'console' tracepoint to capture reports as they appear in the
> + * kernel log.
> + *
> + * Copyright (C) 2021-2022, Google LLC.
> + * Author: Alexander Potapenko <[email protected]>
> + *
> + */
> +
> +#include <kunit/test.h>
> +#include "kmsan.h"
> +
> +#include <linux/jiffies.h>
> +#include <linux/kernel.h>
> +#include <linux/kmsan.h>
> +#include <linux/mm.h>
> +#include <linux/random.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/string.h>
> +#include <linux/tracepoint.h>
> +#include <trace/events/printk.h>
> +
> +static DEFINE_PER_CPU(int, per_cpu_var);
> +
> +/* Report as observed from console. */
> +static struct {
> + spinlock_t lock;
> + bool available;
> + bool ignore; /* Stop console output collection. */
> + char header[256];
> +} observed = {
> + .lock = __SPIN_LOCK_UNLOCKED(observed.lock),
> +};
> +
> +/* Probe for console output: obtains observed lines of interest. */
> +static void probe_console(void *ignore, const char *buf, size_t len)
> +{
> + unsigned long flags;
> +
> + if (observed.ignore)
> + return;
> + spin_lock_irqsave(&observed.lock, flags);
> +
> + if (strnstr(buf, "BUG: KMSAN: ", len)) {
> + /*
> + * KMSAN report and related to the test.
> + *
> + * The provided @buf is not NUL-terminated; copy no more than
> + * @len bytes and let strscpy() add the missing NUL-terminator.
> + */
> + strscpy(observed.header, buf,
> + min(len + 1, sizeof(observed.header)));
> + WRITE_ONCE(observed.available, true);
> + observed.ignore = true;
> + }
> + spin_unlock_irqrestore(&observed.lock, flags);
> +}
> +
> +/* Check if a report related to the test exists. */
> +static bool report_available(void)
> +{
> + return READ_ONCE(observed.available);
> +}
> +
> +/* Information we expect in a report. */
> +struct expect_report {
> + const char *error_type; /* Error type. */
> + /*
> + * Kernel symbol from the error header, or NULL if no report is
> + * expected.
> + */
> + const char *symbol;
> +};
> +
> +/* Check observed report matches information in @r. */
> +static bool report_matches(const struct expect_report *r)
> +{
> + typeof(observed.header) expected_header;
> + unsigned long flags;
> + bool ret = false;
> + const char *end;
> + char *cur;
> +
> + /* Doubled-checked locking. */
> + if (!report_available() || !r->symbol)
> + return (!report_available() && !r->symbol);
> +
> + /* Generate expected report contents. */
> +
> + /* Title */
> + cur = expected_header;
> + end = &expected_header[sizeof(expected_header) - 1];
> +
> + cur += scnprintf(cur, end - cur, "BUG: KMSAN: %s", r->error_type);
> +
> + scnprintf(cur, end - cur, " in %s", r->symbol);
> + /* The exact offset won't match, remove it; also strip module name. */
> + cur = strchr(expected_header, '+');
> + if (cur)
> + *cur = '\0';
> +
> + spin_lock_irqsave(&observed.lock, flags);
> + if (!report_available())
> + goto out; /* A new report is being captured. */
> +
> + /* Finally match expected output to what we actually observed. */
> + ret = strstr(observed.header, expected_header);
> +out:
> + spin_unlock_irqrestore(&observed.lock, flags);
> +
> + return ret;
> +}
> +
> +/* ===== Test cases ===== */
> +
> +/* Prevent replacing branch with select in LLVM. */
> +static noinline void check_true(char *arg)
> +{
> + pr_info("%s is true\n", arg);
> +}
> +
> +static noinline void check_false(char *arg)
> +{
> + pr_info("%s is false\n", arg);
> +}
> +
> +#define USE(x) \
> + do { \
> + if (x) \
> + check_true(#x); \
> + else \
> + check_false(#x); \
> + } while (0)
> +
> +#define EXPECTATION_ETYPE_FN(e, reason, fn) \
> + struct expect_report e = { \
> + .error_type = reason, \
> + .symbol = fn, \
> + }
> +
> +#define EXPECTATION_NO_REPORT(e) EXPECTATION_ETYPE_FN(e, NULL, NULL)
> +#define EXPECTATION_UNINIT_VALUE_FN(e, fn) \
> + EXPECTATION_ETYPE_FN(e, "uninit-value", fn)
> +#define EXPECTATION_UNINIT_VALUE(e) EXPECTATION_UNINIT_VALUE_FN(e, __func__)
> +#define EXPECTATION_USE_AFTER_FREE(e) \
> + EXPECTATION_ETYPE_FN(e, "use-after-free", __func__)
> +
> +/* Test case: ensure that kmalloc() returns uninitialized memory. */
> +static void test_uninit_kmalloc(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE(expect);
> + int *ptr;
> +
> + kunit_info(test, "uninitialized kmalloc test (UMR report)\n");
> + ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);
> + USE(*ptr);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that kmalloc'ed memory becomes initialized after memset().
> + */
> +static void test_init_kmalloc(struct kunit *test)
> +{
> + EXPECTATION_NO_REPORT(expect);
> + int *ptr;
> +
> + kunit_info(test, "initialized kmalloc test (no reports)\n");
> + ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);
> + memset(ptr, 0, sizeof(*ptr));
> + USE(*ptr);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/* Test case: ensure that kzalloc() returns initialized memory. */
> +static void test_init_kzalloc(struct kunit *test)
> +{
> + EXPECTATION_NO_REPORT(expect);
> + int *ptr;
> +
> + kunit_info(test, "initialized kzalloc test (no reports)\n");
> + ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);
> + USE(*ptr);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/* Test case: ensure that local variables are uninitialized by default. */
> +static void test_uninit_stack_var(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE(expect);
> + volatile int cond;
> +
> + kunit_info(test, "uninitialized stack variable (UMR report)\n");
> + USE(cond);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/* Test case: ensure that local variables with initializers are initialized. */
> +static void test_init_stack_var(struct kunit *test)
> +{
> + EXPECTATION_NO_REPORT(expect);
> + volatile int cond = 1;
> +
> + kunit_info(test, "initialized stack variable (no reports)\n");
> + USE(cond);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +static noinline void two_param_fn_2(int arg1, int arg2)
> +{
> + USE(arg1);
> + USE(arg2);
> +}
> +
> +static noinline void one_param_fn(int arg)
> +{
> + two_param_fn_2(arg, arg);
> + USE(arg);
> +}
> +
> +static noinline void two_param_fn(int arg1, int arg2)
> +{
> + int init = 0;
> +
> + one_param_fn(init);
> + USE(arg1);
> + USE(arg2);
> +}
> +
> +static void test_params(struct kunit *test)
> +{
> +#ifdef CONFIG_KMSAN_CHECK_PARAM_RETVAL

if (IS_ENABLED(...))
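
A sketch of what that could look like (illustrative only), keeping the
symbol selection in a single expression:

	EXPECTATION_UNINIT_VALUE_FN(expect,
		IS_ENABLED(CONFIG_KMSAN_CHECK_PARAM_RETVAL) ?
			"test_params" : "two_param_fn");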

> + /*
> + * With eager param/retval checking enabled, KMSAN will report an error
> + * before the call to two_param_fn().
> + */
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_params");
> +#else
> + EXPECTATION_UNINIT_VALUE_FN(expect, "two_param_fn");
> +#endif
> + volatile int uninit, init = 1;
> +
> + kunit_info(test,
> + "uninit passed through a function parameter (UMR report)\n");
> + two_param_fn(uninit, init);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +static int signed_sum3(int a, int b, int c)
> +{
> + return a + b + c;
> +}
> +
> +/*
> + * Test case: ensure that uninitialized values are tracked through function
> + * arguments.
> + */
> +static void test_uninit_multiple_params(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE(expect);
> + volatile char b = 3, c;
> + volatile int a;
> +
> + kunit_info(test, "uninitialized local passed to fn (UMR report)\n");
> + USE(signed_sum3(a, b, c));
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/* Helper function to make an array uninitialized. */
> +static noinline void do_uninit_local_array(char *array, int start, int stop)
> +{
> + volatile char uninit;
> + int i;
> +
> + for (i = start; i < stop; i++)
> + array[i] = uninit;
> +}
> +
> +/*
> + * Test case: ensure kmsan_check_memory() reports an error when checking
> + * uninitialized memory.
> + */
> +static void test_uninit_kmsan_check_memory(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_uninit_kmsan_check_memory");
> + volatile char local_array[8];
> +
> + kunit_info(
> + test,
> + "kmsan_check_memory() called on uninit local (UMR report)\n");
> + do_uninit_local_array((char *)local_array, 5, 7);
> +
> + kmsan_check_memory((char *)local_array, 8);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: check that a virtual memory range created with vmap() from
> + * initialized pages is still considered as initialized.
> + */
> +static void test_init_kmsan_vmap_vunmap(struct kunit *test)
> +{
> + EXPECTATION_NO_REPORT(expect);
> + const int npages = 2;
> + struct page **pages;
> + void *vbuf;
> + int i;
> +
> + kunit_info(test, "pages initialized via vmap (no reports)\n");
> +
> + pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
> + for (i = 0; i < npages; i++)
> + pages[i] = alloc_page(GFP_KERNEL);
> + vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> + memset(vbuf, 0xfe, npages * PAGE_SIZE);
> + for (i = 0; i < npages; i++)
> + kmsan_check_memory(page_address(pages[i]), PAGE_SIZE);
> +
> + if (vbuf)
> + vunmap(vbuf);
> + for (i = 0; i < npages; i++)

add { }

> + if (pages[i])
> + __free_page(pages[i]);
> + kfree(pages);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that memset() can initialize a buffer allocated via
> + * vmalloc().
> + */
> +static void test_init_vmalloc(struct kunit *test)
> +{
> + EXPECTATION_NO_REPORT(expect);
> + int npages = 8, i;
> + char *buf;
> +
> + kunit_info(test, "vmalloc buffer can be initialized (no reports)\n");
> + buf = vmalloc(PAGE_SIZE * npages);
> + buf[0] = 1;
> + memset(buf, 0xfe, PAGE_SIZE * npages);
> + USE(buf[0]);
> + for (i = 0; i < npages; i++)
> + kmsan_check_memory(&buf[PAGE_SIZE * i], PAGE_SIZE);
> + vfree(buf);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/* Test case: ensure that use-after-free reporting works. */
> +static void test_uaf(struct kunit *test)
> +{
> + EXPECTATION_USE_AFTER_FREE(expect);
> + volatile int value;
> + volatile int *var;
> +
> + kunit_info(test, "use-after-free in kmalloc-ed buffer (UMR report)\n");
> + var = kmalloc(80, GFP_KERNEL);
> + var[3] = 0xfeedface;
> + kfree((int *)var);
> + /* Copy the invalid value before checking it. */
> + value = var[3];
> + USE(value);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that uninitialized values are propagated through per-CPU
> + * memory.
> + */
> +static void test_percpu_propagate(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE(expect);
> + volatile int uninit, check;
> +
> + kunit_info(test,
> + "uninit local stored to per_cpu memory (UMR report)\n");
> +
> + this_cpu_write(per_cpu_var, uninit);
> + check = this_cpu_read(per_cpu_var);
> + USE(check);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that passing uninitialized values to printk() leads to an
> + * error report.
> + */
> +static void test_printk(struct kunit *test)
> +{
> +#ifdef CONFIG_KMSAN_CHECK_PARAM_RETVAL

if (IS_ENABLED(CONFIG_KMSAN_CHECK_PARAM_RETVAL))

> + /*
> + * With eager param/retval checking enabled, KMSAN will report an error
> + * before the call to pr_info().
> + */
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_printk");
> +#else
> + EXPECTATION_UNINIT_VALUE_FN(expect, "number");
> +#endif
> + volatile int uninit;
> +
> + kunit_info(test, "uninit local passed to pr_info() (UMR report)\n");
> + pr_info("%px contains %d\n", &uninit, uninit);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that memcpy() correctly copies uninitialized values between
> + * aligned `src` and `dst`.
> + */
> +static void test_memcpy_aligned_to_aligned(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_aligned");
> + volatile int uninit_src;
> + volatile int dst = 0;
> +
> + kunit_info(test, "memcpy()ing aligned uninit src to aligned dst (UMR report)\n");
> + memcpy((void *)&dst, (void *)&uninit_src, sizeof(uninit_src));
> + kmsan_check_memory((void *)&dst, sizeof(dst));
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that memcpy() correctly copies uninitialized values between
> + * aligned `src` and unaligned `dst`.
> + *
> + * Copying aligned 4-byte value to an unaligned one leads to touching two
> + * aligned 4-byte values. This test case checks that KMSAN correctly reports an
> + * error on the first of the two values.
> + */
> +static void test_memcpy_aligned_to_unaligned(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_unaligned");
> + volatile int uninit_src;
> + volatile char dst[8] = {0};
> +
> + kunit_info(test, "memcpy()ing aligned uninit src to unaligned dst (UMR report)\n");
> + memcpy((void *)&dst[1], (void *)&uninit_src, sizeof(uninit_src));
> + kmsan_check_memory((void *)dst, 4);
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +/*
> + * Test case: ensure that memcpy() correctly copies uninitialized values between
> + * aligned `src` and unaligned `dst`.
> + *
> + * Copying aligned 4-byte value to an unaligned one leads to touching two
> + * aligned 4-byte values. This test case checks that KMSAN correctly reports an
> + * error on the second of the two values.
> + */
> +static void test_memcpy_aligned_to_unaligned2(struct kunit *test)
> +{
> + EXPECTATION_UNINIT_VALUE_FN(expect, "test_memcpy_aligned_to_unaligned2");
> + volatile int uninit_src;
> + volatile char dst[8] = {0};
> +
> + kunit_info(test, "memcpy()ing aligned uninit src to unaligned dst - part 2 (UMR report)\n");
> + memcpy((void *)&dst[1], (void *)&uninit_src, sizeof(uninit_src));
> + kmsan_check_memory((void *)&dst[4], sizeof(uninit_src));
> + KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> +}
> +
> +static struct kunit_case kmsan_test_cases[] = {
> + KUNIT_CASE(test_uninit_kmalloc),
> + KUNIT_CASE(test_init_kmalloc),
> + KUNIT_CASE(test_init_kzalloc),
> + KUNIT_CASE(test_uninit_stack_var),
> + KUNIT_CASE(test_init_stack_var),
> + KUNIT_CASE(test_params),
> + KUNIT_CASE(test_uninit_multiple_params),
> + KUNIT_CASE(test_uninit_kmsan_check_memory),
> + KUNIT_CASE(test_init_kmsan_vmap_vunmap),
> + KUNIT_CASE(test_init_vmalloc),
> + KUNIT_CASE(test_uaf),
> + KUNIT_CASE(test_percpu_propagate),
> + KUNIT_CASE(test_printk),
> + KUNIT_CASE(test_memcpy_aligned_to_aligned),
> + KUNIT_CASE(test_memcpy_aligned_to_unaligned),
> + KUNIT_CASE(test_memcpy_aligned_to_unaligned2),
> + {},
> +};
> +
> +/* ===== End test cases ===== */
> +
> +static int test_init(struct kunit *test)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&observed.lock, flags);
> + observed.header[0] = '\0';
> + observed.ignore = false;
> + observed.available = false;
> + spin_unlock_irqrestore(&observed.lock, flags);
> +
> + return 0;
> +}
> +
> +static void test_exit(struct kunit *test)
> +{
> +}
> +
> +static struct kunit_suite kmsan_test_suite = {
> + .name = "kmsan",
> + .test_cases = kmsan_test_cases,
> + .init = test_init,
> + .exit = test_exit,
> +};
> +static struct kunit_suite *kmsan_test_suites[] = { &kmsan_test_suite, NULL };
> +
> +static void register_tracepoints(struct tracepoint *tp, void *ignore)
> +{
> + check_trace_callback_type_console(probe_console);
> + if (!strcmp(tp->name, "console"))
> + WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
> +}
> +
> +static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
> +{
> + if (!strcmp(tp->name, "console"))
> + tracepoint_probe_unregister(tp, probe_console, NULL);
> +}
> +
> +/*
> + * We only want to do tracepoints setup and teardown once, therefore we have to
> + * customize the init and exit functions and cannot rely on kunit_test_suite().
> + */

This is no longer true. See a recent version of
mm/kfence/kfence_test.c which uses the new suite_init/exit.

> +static int __init kmsan_test_init(void)
> +{
> + /*
> + * Because we want to be able to build the test as a module, we need to
> + * iterate through all known tracepoints, since the static registration
> + * won't work here.
> + */
> + for_each_kernel_tracepoint(register_tracepoints, NULL);
> + return __kunit_test_suites_init(kmsan_test_suites);
> +}
> +
> +static void kmsan_test_exit(void)
> +{
> + __kunit_test_suites_exit(kmsan_test_suites);
> + for_each_kernel_tracepoint(unregister_tracepoints, NULL);
> + tracepoint_synchronize_unregister();
> +}
> +
> +late_initcall_sync(kmsan_test_init);
> +module_exit(kmsan_test_exit);
> +
> +MODULE_LICENSE("GPL v2");

A recent version of checkpatch should complain about this, wanting
only "GPL" instead of "GPL v2".

> +MODULE_AUTHOR("Alexander Potapenko <[email protected]>");

2022-07-12 14:54:27

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v4 17/45] init: kmsan: call KMSAN initialization routines

On Fri, 1 Jul 2022 at 16:24, Alexander Potapenko <[email protected]> wrote:
>
> kmsan_init_shadow() scans the mappings created at boot time and creates
> metadata pages for those mappings.
>
> When the memblock allocator returns pages to pagealloc, we reserve 2/3
> of those pages and use them as metadata for the remaining 1/3. Once KMSAN
> starts, every page allocated by pagealloc has its associated shadow and
> origin pages.
>
> kmsan_initialize() initializes the bookkeeping for init_task and enables
> KMSAN.
>
> Signed-off-by: Alexander Potapenko <[email protected]>
> ---
> v2:
> -- move mm/kmsan/init.c and kmsan_memblock_free_pages() to this patch
> -- print a warning that KMSAN is a debugging tool (per Greg K-H's
> request)
>
> v4:
> -- change sizeof(type) to sizeof(*ptr)
> -- replace occurrences of |var| with @var
> -- swap init: and kmsan: in the subject
> -- do not export __init functions
>
> Link: https://linux-review.googlesource.com/id/I7bc53706141275914326df2345881ffe0cdd16bd
> ---
> include/linux/kmsan.h | 48 +++++++++
> init/main.c | 3 +
> mm/kmsan/Makefile | 3 +-
> mm/kmsan/init.c | 238 ++++++++++++++++++++++++++++++++++++++++++
> mm/kmsan/kmsan.h | 3 +
> mm/kmsan/shadow.c | 36 +++++++
> mm/page_alloc.c | 3 +
> 7 files changed, 333 insertions(+), 1 deletion(-)
> create mode 100644 mm/kmsan/init.c
>
> diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
> index b71e2032222e9..82fd564cc72e7 100644
> --- a/include/linux/kmsan.h
> +++ b/include/linux/kmsan.h
> @@ -51,6 +51,40 @@ void kmsan_task_create(struct task_struct *task);
> */
> void kmsan_task_exit(struct task_struct *task);
>
> +/**
> + * kmsan_init_shadow() - Initialize KMSAN shadow at boot time.
> + *
> + * Allocate and initialize KMSAN metadata for early allocations.
> + */
> +void __init kmsan_init_shadow(void);
> +
> +/**
> + * kmsan_init_runtime() - Initialize KMSAN state and enable KMSAN.
> + */
> +void __init kmsan_init_runtime(void);
> +
> +/**
> + * kmsan_memblock_free_pages() - handle freeing of memblock pages.
> + * @page: struct page to free.
> + * @order: order of @page.
> + *
> + * Freed pages are either returned to buddy allocator or held back to be used
> + * as metadata pages.
> + */
> +bool __init kmsan_memblock_free_pages(struct page *page, unsigned int order);
> +
> +/**
> + * kmsan_task_create() - Initialize KMSAN state for the task.
> + * @task: task to initialize.
> + */
> +void kmsan_task_create(struct task_struct *task);
> +
> +/**
> + * kmsan_task_exit() - Notify KMSAN that a task has exited.
> + * @task: task about to finish.
> + */
> +void kmsan_task_exit(struct task_struct *task);

Something went wrong with patch shuffling here I think,
kmsan_task_create + kmsan_task_exit decls are duplicated by this
patch.

> /**
> * kmsan_alloc_page() - Notify KMSAN about an alloc_pages() call.
> * @page: struct page pointer returned by alloc_pages().
> @@ -172,6 +206,20 @@ void kmsan_iounmap_page_range(unsigned long start, unsigned long end);
>
> #else
>
> +static inline void kmsan_init_shadow(void)
> +{
> +}
> +
> +static inline void kmsan_init_runtime(void)
> +{
> +}
> +
> +static inline bool kmsan_memblock_free_pages(struct page *page,
> + unsigned int order)
> +{
> + return true;
> +}
> +
> static inline void kmsan_task_create(struct task_struct *task)
> {
> }
> diff --git a/init/main.c b/init/main.c
> index 0ee39cdcfcac9..7ba48a9ff1d53 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -34,6 +34,7 @@
> #include <linux/percpu.h>
> #include <linux/kmod.h>
> #include <linux/kprobes.h>
> +#include <linux/kmsan.h>
> #include <linux/vmalloc.h>
> #include <linux/kernel_stat.h>
> #include <linux/start_kernel.h>
> @@ -835,6 +836,7 @@ static void __init mm_init(void)
> init_mem_debugging_and_hardening();
> kfence_alloc_pool();
> report_meminit();
> + kmsan_init_shadow();
> stack_depot_early_init();
> mem_init();
> mem_init_print_info();
> @@ -852,6 +854,7 @@ static void __init mm_init(void)
> init_espfix_bsp();
> /* Should be run after espfix64 is set up. */
> pti_init();
> + kmsan_init_runtime();
> }
>
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> diff --git a/mm/kmsan/Makefile b/mm/kmsan/Makefile
> index 550ad8625e4f9..401acb1a491ce 100644
> --- a/mm/kmsan/Makefile
> +++ b/mm/kmsan/Makefile
> @@ -3,7 +3,7 @@
> # Makefile for KernelMemorySanitizer (KMSAN).
> #
> #
> -obj-y := core.o instrumentation.o hooks.o report.o shadow.o
> +obj-y := core.o instrumentation.o init.o hooks.o report.o shadow.o
>
> KMSAN_SANITIZE := n
> KCOV_INSTRUMENT := n
> @@ -18,6 +18,7 @@ CFLAGS_REMOVE.o = $(CC_FLAGS_FTRACE)
>
> CFLAGS_core.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_hooks.o := $(CC_FLAGS_KMSAN_RUNTIME)
> +CFLAGS_init.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_instrumentation.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_report.o := $(CC_FLAGS_KMSAN_RUNTIME)
> CFLAGS_shadow.o := $(CC_FLAGS_KMSAN_RUNTIME)
> diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
> new file mode 100644
> index 0000000000000..abbf595a1e359
> --- /dev/null
> +++ b/mm/kmsan/init.c
> @@ -0,0 +1,238 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KMSAN initialization routines.
> + *
> + * Copyright (C) 2017-2021 Google LLC
> + * Author: Alexander Potapenko <[email protected]>
> + *
> + */
> +
> +#include "kmsan.h"
> +
> +#include <asm/sections.h>
> +#include <linux/mm.h>
> +#include <linux/memblock.h>
> +
> +#include "../internal.h"
> +
> +#define NUM_FUTURE_RANGES 128
> +struct start_end_pair {
> + u64 start, end;
> +};
> +
> +static struct start_end_pair start_end_pairs[NUM_FUTURE_RANGES] __initdata;
> +static int future_index __initdata;
> +
> +/*
> + * Record a range of memory for which the metadata pages will be created once
> + * the page allocator becomes available.
> + */
> +static void __init kmsan_record_future_shadow_range(void *start, void *end)
> +{
> + u64 nstart = (u64)start, nend = (u64)end, cstart, cend;
> + bool merged = false;
> + int i;
> +
> + KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
> + KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
> + nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
> + nend = ALIGN(nend, PAGE_SIZE);
> +
> + /*
> + * Scan the existing ranges to see if any of them overlaps with
> + * [start, end). In that case, merge the two ranges instead of
> + * creating a new one.
> + * The number of ranges is less than 20, so there is no need to organize
> + * them into a more intelligent data structure.
> + */
> + for (i = 0; i < future_index; i++) {
> + cstart = start_end_pairs[i].start;
> + cend = start_end_pairs[i].end;
> + if ((cstart < nstart && cend < nstart) ||
> + (cstart > nend && cend > nend))
> + /* ranges are disjoint - do not merge */
> + continue;
> + start_end_pairs[i].start = min(nstart, cstart);
> + start_end_pairs[i].end = max(nend, cend);
> + merged = true;
> + break;
> + }
> + if (merged)
> + return;
> + start_end_pairs[future_index].start = nstart;
> + start_end_pairs[future_index].end = nend;
> + future_index++;
> +}
> +
> +/*
> + * Initialize the shadow for existing mappings during kernel initialization.
> + * These include kernel text/data sections, NODE_DATA and future ranges
> + * registered while creating other data (e.g. percpu).
> + *
> + * Allocations via memblock can be only done before slab is initialized.
> + */
> +void __init kmsan_init_shadow(void)
> +{
> + const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> + phys_addr_t p_start, p_end;
> + int nid;
> + u64 i;
> +
> + for_each_reserved_mem_range(i, &p_start, &p_end)
> + kmsan_record_future_shadow_range(phys_to_virt(p_start),
> + phys_to_virt(p_end));
> + /* Allocate shadow for .data */
> + kmsan_record_future_shadow_range(_sdata, _edata);
> +
> + for_each_online_node(nid)
> + kmsan_record_future_shadow_range(
> + NODE_DATA(nid), (char *)NODE_DATA(nid) + nd_size);
> +
> + for (i = 0; i < future_index; i++)
> + kmsan_init_alloc_meta_for_range(
> + (void *)start_end_pairs[i].start,
> + (void *)start_end_pairs[i].end);
> +}
> +
> +struct page_pair {

'struct shadow_origin_pages' for a more descriptive name?

> + struct page *shadow, *origin;
> +};
> +static struct page_pair held_back[MAX_ORDER] __initdata;
> +
> +/*
> + * Eager metadata allocation. When the memblock allocator is freeing pages to
> + * pagealloc, we use 2/3 of them as metadata for the remaining 1/3.
> + * We store the pointers to the returned blocks of pages in held_back[] grouped
> + * by their order: when kmsan_memblock_free_pages() is called for the first
> + * time with a certain order, it is reserved as a shadow block, for the second
> + * time - as an origin block. On the third time the incoming block receives its
> + * shadow and origin ranges from the previously saved shadow and origin blocks,
> + * after which held_back[order] can be used again.
> + *
> + * At the very end there may be leftover blocks in held_back[]. They are
> + * collected later by kmsan_memblock_discard().
> + */
> +bool kmsan_memblock_free_pages(struct page *page, unsigned int order)
> +{
> + struct page *shadow, *origin;

Can this just be 'struct page_pair'?

> + if (!held_back[order].shadow) {
> + held_back[order].shadow = page;
> + return false;
> + }
> + if (!held_back[order].origin) {
> + held_back[order].origin = page;
> + return false;
> + }
> + shadow = held_back[order].shadow;
> + origin = held_back[order].origin;
> + kmsan_setup_meta(page, shadow, origin, order);
> +
> + held_back[order].shadow = NULL;
> + held_back[order].origin = NULL;
> + return true;
> +}
> +
> +#define MAX_BLOCKS 8
> +struct smallstack {
> + struct page *items[MAX_BLOCKS];
> + int index;
> + int order;
> +};
> +
> +static struct smallstack collect = {
> + .index = 0,
> + .order = MAX_ORDER,
> +};
> +
> +static void smallstack_push(struct smallstack *stack, struct page *pages)
> +{
> + KMSAN_WARN_ON(stack->index == MAX_BLOCKS);
> + stack->items[stack->index] = pages;
> + stack->index++;
> +}
> +#undef MAX_BLOCKS
> +
> +static struct page *smallstack_pop(struct smallstack *stack)
> +{
> + struct page *ret;
> +
> + KMSAN_WARN_ON(stack->index == 0);
> + stack->index--;
> + ret = stack->items[stack->index];
> + stack->items[stack->index] = NULL;
> + return ret;
> +}
> +
> +static void do_collection(void)
> +{
> + struct page *page, *shadow, *origin;
> +
> + while (collect.index >= 3) {
> + page = smallstack_pop(&collect);
> + shadow = smallstack_pop(&collect);
> + origin = smallstack_pop(&collect);
> + kmsan_setup_meta(page, shadow, origin, collect.order);
> + __free_pages_core(page, collect.order);
> + }
> +}
> +
> +static void collect_split(void)
> +{
> + struct smallstack tmp = {
> + .order = collect.order - 1,
> + .index = 0,
> + };
> + struct page *page;
> +
> + if (!collect.order)
> + return;
> + while (collect.index) {
> + page = smallstack_pop(&collect);
> + smallstack_push(&tmp, &page[0]);
> + smallstack_push(&tmp, &page[1 << tmp.order]);
> + }
> + __memcpy(&collect, &tmp, sizeof(tmp));
> +}
> +
> +/*
> + * Memblock is about to go away. Split the page blocks left over in held_back[]
> + * and return 1/3 of that memory to the system.
> + */
> +static void kmsan_memblock_discard(void)
> +{
> + int i;
> +
> + /*
> + * For each order=N:
> + * - push held_back[N].shadow and .origin to @collect;
> + * - while there are >= 3 elements in @collect, do garbage collection:
> + * - pop 3 ranges from @collect;
> + * - use two of them as shadow and origin for the third one;
> + * - repeat;
> + * - split each remaining element from @collect into 2 ranges of
> + * order=N-1,
> + * - repeat.
> + */
> + collect.order = MAX_ORDER - 1;
> + for (i = MAX_ORDER - 1; i >= 0; i--) {
> + if (held_back[i].shadow)
> + smallstack_push(&collect, held_back[i].shadow);
> + if (held_back[i].origin)
> + smallstack_push(&collect, held_back[i].origin);
> + held_back[i].shadow = NULL;
> + held_back[i].origin = NULL;
> + do_collection();
> + collect_split();
> + }
> +}
> +
> +void __init kmsan_init_runtime(void)
> +{
> + /* Assuming current is init_task */
> + kmsan_internal_task_create(current);
> + kmsan_memblock_discard();
> + pr_info("Starting KernelMemorySanitizer\n");
> + pr_info("ATTENTION: KMSAN is a debugging tool! Do not use it on production machines!\n");
> + kmsan_enabled = true;
> +}
> diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
> index c7fb8666607e2..2f17912ef863f 100644
> --- a/mm/kmsan/kmsan.h
> +++ b/mm/kmsan/kmsan.h
> @@ -66,6 +66,7 @@ struct shadow_origin_ptr {
> struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size,
> bool store);
> void *kmsan_get_metadata(void *addr, bool is_origin);
> +void __init kmsan_init_alloc_meta_for_range(void *start, void *end);
>
> enum kmsan_bug_reason {
> REASON_ANY,
> @@ -188,5 +189,7 @@ bool kmsan_internal_is_module_addr(void *vaddr);
> bool kmsan_internal_is_vmalloc_addr(void *addr);
>
> struct page *kmsan_vmalloc_to_page_or_null(void *vaddr);
> +void kmsan_setup_meta(struct page *page, struct page *shadow,
> + struct page *origin, int order);
>
> #endif /* __MM_KMSAN_KMSAN_H */
> diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
> index 416cb85487a1a..7b254c30d42cc 100644
> --- a/mm/kmsan/shadow.c
> +++ b/mm/kmsan/shadow.c
> @@ -259,3 +259,39 @@ void kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
> kfree(s_pages);
> kfree(o_pages);
> }
> +
> +/* Allocate metadata for pages allocated at boot time. */
> +void __init kmsan_init_alloc_meta_for_range(void *start, void *end)
> +{
> + struct page *shadow_p, *origin_p;
> + void *shadow, *origin;
> + struct page *page;
> + u64 addr, size;
> +
> + start = (void *)ALIGN_DOWN((u64)start, PAGE_SIZE);
> + size = ALIGN((u64)end - (u64)start, PAGE_SIZE);
> + shadow = memblock_alloc(size, PAGE_SIZE);
> + origin = memblock_alloc(size, PAGE_SIZE);
> + for (addr = 0; addr < size; addr += PAGE_SIZE) {
> + page = virt_to_page_or_null((char *)start + addr);
> + shadow_p = virt_to_page_or_null((char *)shadow + addr);
> + set_no_shadow_origin_page(shadow_p);
> + shadow_page_for(page) = shadow_p;
> + origin_p = virt_to_page_or_null((char *)origin + addr);
> + set_no_shadow_origin_page(origin_p);
> + origin_page_for(page) = origin_p;
> + }
> +}
> +
> +void kmsan_setup_meta(struct page *page, struct page *shadow,
> + struct page *origin, int order)
> +{
> + int i;
> +
> + for (i = 0; i < (1 << order); i++) {

Noticed this in many places, but we can just make these "for (int i =.." now.

> + set_no_shadow_origin_page(&shadow[i]);
> + set_no_shadow_origin_page(&origin[i]);
> + shadow_page_for(&page[i]) = &shadow[i];
> + origin_page_for(&page[i]) = &origin[i];
> + }
> +}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 785459251145e..e8d5a0b2a3264 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1731,6 +1731,9 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
> {
> if (early_page_uninitialised(pfn))
> return;
> + if (!kmsan_memblock_free_pages(page, order))
> + /* KMSAN will take care of these pages. */
> + return;

Add {} because the then-statement is not right below the if.

2022-07-13 10:33:11

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v4 29/45] block: kmsan: skip bio block merging logic for KMSAN

On Fri, Jul 01, 2022 at 04:22PM +0200, 'Alexander Potapenko' via kasan-dev wrote:
[...]
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -867,6 +867,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
> return false;
>
> *same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
> + if (!*same_page && IS_ENABLED(CONFIG_KMSAN))
> + return false;
> if (*same_page)
> return true;

if (*same_page)
return true;
else if (IS_ENABLED(CONFIG_KMSAN))
return false;

> return (bv->bv_page + bv_end / PAGE_SIZE) == (page + off / PAGE_SIZE);

2022-08-02 17:38:05

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 25/45] kmsan: add tests for KMSAN

On Tue, Jul 12, 2022 at 4:17 PM Marco Elver <[email protected]> wrote:
>

> > +static void test_params(struct kunit *test)
> > +{
> > +#ifdef CONFIG_KMSAN_CHECK_PARAM_RETVAL
>
> if (IS_ENABLED(...))
>
Not sure this is valid C, given that EXPECTATION_UNINIT_VALUE_FN
introduces a variable declaration.
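
For context, a minimal sketch of why the IS_ENABLED() form does not compose
with the macro (hypothetical code, assuming EXPECTATION_UNINIT_VALUE_FN()
expands to a declaration of a local `expect`):

	if (IS_ENABLED(CONFIG_KMSAN_CHECK_PARAM_RETVAL)) {
		/* Declares 'expect', but only within this branch's scope. */
		EXPECTATION_UNINIT_VALUE_FN(expect, "test_params");
	} else {
		EXPECTATION_UNINIT_VALUE_FN(expect, "two_param_fn");
	}
	/* 'expect' is no longer in scope here, so this fails to compile: */
	KUNIT_EXPECT_TRUE(test, report_matches(&expect));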

> > + if (vbuf)
> > + vunmap(vbuf);
> > + for (i = 0; i < npages; i++)
>
> add { }
>
Done.
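
Presumably the loop now reads:

	for (i = 0; i < npages; i++) {
		if (pages[i])
			__free_page(pages[i]);
	}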


> if (IS_ENABLED(CONFIG_KMSAN_CHECK_PARAM_RETVAL))
>
Same as above.

> > +static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
> > +{
> > + if (!strcmp(tp->name, "console"))
> > + tracepoint_probe_unregister(tp, probe_console, NULL);
> > +}
> > +
> > +/*
> > + * We only want to do tracepoints setup and teardown once, therefore we have to
> > + * customize the init and exit functions and cannot rely on kunit_test_suite().
> > + */
>
> This is no longer true. See a recent version of
> mm/kfence/kfence_test.c which uses the new suite_init/exit.

Done.
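
For reference, a rough sketch of the suite_init/suite_exit form, modeled on
mm/kfence/kfence_test.c (the exact shape below is an assumption, not code
taken from this patch):

static int kmsan_suite_init(struct kunit_suite *suite)
{
	/* Register the console tracepoint probe once for the whole suite. */
	for_each_kernel_tracepoint(register_tracepoints, NULL);
	return 0;
}

static void kmsan_suite_exit(struct kunit_suite *suite)
{
	for_each_kernel_tracepoint(unregister_tracepoints, NULL);
	tracepoint_synchronize_unregister();
}

static struct kunit_suite kmsan_test_suite = {
	.name = "kmsan",
	.test_cases = kmsan_test_cases,
	.init = test_init,
	.suite_init = kmsan_suite_init,
	.suite_exit = kmsan_suite_exit,
};

kunit_test_suites(&kmsan_test_suite);

This presumably also removes the need for the custom
late_initcall_sync()/module_exit() pair.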

> > +late_initcall_sync(kmsan_test_init);
> > +module_exit(kmsan_test_exit);
> > +
> > +MODULE_LICENSE("GPL v2");
>
> A recent version of checkpatch should complain about this, wanting
> only "GPL" instead of "GPL v2".
>
Fixed.
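
i.e. the module license line presumably becomes:

	MODULE_LICENSE("GPL");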

2022-08-02 17:59:27

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 29/45] block: kmsan: skip bio block merging logic for KMSAN

On Wed, Jul 13, 2022 at 12:23 PM Marco Elver <[email protected]> wrote:
>
> On Fri, Jul 01, 2022 at 04:22PM +0200, 'Alexander Potapenko' via kasan-dev wrote:
> [...]
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -867,6 +867,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
> > return false;
> >
> > *same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
> > + if (!*same_page && IS_ENABLED(CONFIG_KMSAN))
> > + return false;
> > if (*same_page)
> > return true;
>
> if (*same_page)
> return true;
> else if (IS_ENABLED(CONFIG_KMSAN))
> return false;
>
Done.
> > return (bv->bv_page + bv_end / PAGE_SIZE) == (page + off / PAGE_SIZE);
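
So the resulting check in page_is_mergeable() presumably ends up as (a sketch
based on the suggestion above):

	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
	if (*same_page)
		return true;
	else if (IS_ENABLED(CONFIG_KMSAN))
		return false;
	return (bv->bv_page + bv_end / PAGE_SIZE) == (page + off / PAGE_SIZE);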



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2022-08-02 20:54:28

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 17/45] init: kmsan: call KMSAN initialization routines

On Tue, Jul 12, 2022 at 4:05 PM Marco Elver <[email protected]> wrote:
>

> > +/**
> > + * kmsan_task_exit() - Notify KMSAN that a task has exited.
> > + * @task: task about to finish.
> > + */
> > +void kmsan_task_exit(struct task_struct *task);
>
> Something went wrong with patch shuffling here I think,
> kmsan_task_create + kmsan_task_exit decls are duplicated by this
> patch.
Right, I've messed it up. Will fix.

> > +
> > +struct page_pair {
>
> 'struct shadow_origin_pages' for a more descriptive name?
How about "metadata_page_pair"?

> > + * At the very end there may be leftover blocks in held_back[]. They are
> > + * collected later by kmsan_memblock_discard().
> > + */
> > +bool kmsan_memblock_free_pages(struct page *page, unsigned int order)
> > +{
> > + struct page *shadow, *origin;
>
> Can this just be 'struct page_pair'?

Not sure this is worth it. We'll save one line by assigning this
struct to held_back[order], but the call to kmsan_setup_meta() will
become more verbose.
(and passing a struct page_pair to kmsan_setup_meta() looks excessive).
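
For comparison, the variant under discussion would look roughly like this
(a sketch only, illustrating the trade-off, not code from the patch):

	bool kmsan_memblock_free_pages(struct page *page, unsigned int order)
	{
		struct page_pair *held = &held_back[order];

		if (!held->shadow) {
			held->shadow = page;
			return false;
		}
		if (!held->origin) {
			held->origin = page;
			return false;
		}
		kmsan_setup_meta(page, held->shadow, held->origin, order);
		held->shadow = NULL;
		held->origin = NULL;
		return true;
	}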


> > + struct page *origin, int order)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < (1 << order); i++) {
>
> Noticed this in many places, but we can just make these "for (int i =.." now.
Fixed here and all over the runtime.
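
e.g. kmsan_setup_meta() from this patch becomes (sketch of the updated form):

	void kmsan_setup_meta(struct page *page, struct page *shadow,
			      struct page *origin, int order)
	{
		for (int i = 0; i < (1 << order); i++) {
			set_no_shadow_origin_page(&shadow[i]);
			set_no_shadow_origin_page(&origin[i]);
			shadow_page_for(&page[i]) = &shadow[i];
			origin_page_for(&page[i]) = &origin[i];
		}
	}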

> > @@ -1731,6 +1731,9 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
> > {
> > if (early_page_uninitialised(pfn))
> > return;
> > + if (!kmsan_memblock_free_pages(page, order))
> > + /* KMSAN will take care of these pages. */
> > + return;
>
> Add {} because the then-statement is not right below the if.

Done.
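
i.e. presumably:

	if (!kmsan_memblock_free_pages(page, order)) {
		/* KMSAN will take care of these pages. */
		return;
	}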

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2022-08-03 09:42:43

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v4 17/45] init: kmsan: call KMSAN initialization routines

On Tue, 2 Aug 2022 at 22:08, Alexander Potapenko <[email protected]> wrote:
>
> On Tue, Jul 12, 2022 at 4:05 PM Marco Elver <[email protected]> wrote:
> >
>
> > > +/**
> > > + * kmsan_task_exit() - Notify KMSAN that a task has exited.
> > > + * @task: task about to finish.
> > > + */
> > > +void kmsan_task_exit(struct task_struct *task);
> >
> > Something went wrong with patch shuffling here I think,
> > kmsan_task_create + kmsan_task_exit decls are duplicated by this
> > patch.
> Right, I've messed it up. Will fix.
>
> > > +
> > > +struct page_pair {
> >
> > 'struct shadow_origin_pages' for a more descriptive name?
> How about "metadata_page_pair"?

Sure - this is local anyway, but page_pair was too generic.
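
So the struct local to mm/kmsan/init.c presumably ends up as:

	struct metadata_page_pair {
		struct page *shadow, *origin;
	};
	static struct metadata_page_pair held_back[MAX_ORDER] __initdata;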

2022-08-03 11:10:32

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 33/45] x86: kmsan: disable instrumentation of unsupported code

On Tue, Jul 12, 2022 at 3:44 PM Marco Elver <[email protected]> wrote:
>
> On Fri, 1 Jul 2022 at 16:24, 'Alexander Potapenko' via kasan-dev
> <[email protected]> wrote:
> [...]
> > ---
> > arch/x86/boot/Makefile | 1 +
> > arch/x86/boot/compressed/Makefile | 1 +
> > arch/x86/entry/vdso/Makefile | 3 +++
> > arch/x86/kernel/Makefile | 2 ++
> > arch/x86/kernel/cpu/Makefile | 1 +
> > arch/x86/mm/Makefile | 2 ++
> > arch/x86/realmode/rm/Makefile | 1 +
> > lib/Makefile | 2 ++
> [...]
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -272,6 +272,8 @@ obj-$(CONFIG_POLYNOMIAL) += polynomial.o
> > CFLAGS_stackdepot.o += -fno-builtin
> > obj-$(CONFIG_STACKDEPOT) += stackdepot.o
> > KASAN_SANITIZE_stackdepot.o := n
> > +# In particular, instrumenting stackdepot.c with KMSAN will result in infinite
> > +# recursion.
> > KMSAN_SANITIZE_stackdepot.o := n
> > KCOV_INSTRUMENT_stackdepot.o := n
>
> This is generic code and not x86, should it have been in the earlier patch?
Ack.


--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2022-08-08 16:48:09

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 43/45] namei: initialize parameters passed to step_into()

On Fri, Jul 1, 2022 at 4:25 PM Alexander Potapenko <[email protected]> wrote:
>
> Under certain circumstances initialization of `unsigned seq` and
> `struct inode *inode` passed into step_into() may be skipped.
> In particular, if the call to lookup_fast() in walk_component()
> returns NULL, and lookup_slow() returns a valid dentry, then the
> `seq` and `inode` will remain uninitialized until the call to
> step_into() (see [1] for more info).
>
> Right now step_into() does not use these uninitialized values,
> yet passing uninitialized values to functions is considered undefined
> behavior (see [2]). To fix that, we initialize `seq` and `inode` at
> definition.

Given that Al Viro has a patch series in flight to address the
problem, I am going to drop this patch from KMSAN v5 series.

2022-08-25 15:46:50

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Mon, Jul 4, 2022 at 10:07 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jul 01, 2022 at 04:23:09PM +0200, Alexander Potapenko wrote:
> > Functions implementing the a_ops->write_end() interface accept the
> > `void *fsdata` parameter that is supposed to be initialized by the
> > corresponding a_ops->write_begin() (which accepts `void **fsdata`).
> >
> > However not all a_ops->write_begin() implementations initialize `fsdata`
> > unconditionally, so it may get passed uninitialized to a_ops->write_end(),
> > resulting in undefined behavior.
>
> ... wait, passing an uninitialised variable to a function *which doesn't
> actually use it* is now UB? What genius came up with that rule? What
> purpose does it serve?
>

Hi Matthew,

There is a discussion at [1], with Segher pointing out a reason for
this rule [2] and Linus requesting that we should be warning about the
cases where uninitialized variables are passed by value [3].

Right now there are only a handful of cases in the kernel where such
passing is performed (we just need one more patch in addition to this
one for KMSAN to boot cleanly). So we are in a good position to start
enforcing this rule, unless there's a reason not to.

I am not sure standard compliance alone is a convincing argument, but
from the KMSAN standpoint, performing the parameter check at call sites
noticeably eases handling of values passed between instrumented and
non-instrumented code. This lets us avoid some low-level hacks around
instrumentation_begin()/instrumentation_end() (some context available
at [4]).

Let me know what you think,
Alex

[1] - https://lore.kernel.org/lkml/CAFKCwrjBjHMquj-adTf0_1QLYq3Et=gJ0rq6HS-qrAEmVA7Ujw@mail.gmail.com/T/
[2] - https://lore.kernel.org/lkml/[email protected]/
[3] - https://lore.kernel.org/lkml/CAHk-=whjz3wO8zD+itoerphWem+JZz4uS3myf6u1Wd6epGRgmQ@mail.gmail.com/
[4] - https://lore.kernel.org/lkml/[email protected]/

2022-08-25 17:17:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Thu, Aug 25, 2022 at 8:40 AM Alexander Potapenko <[email protected]> wrote:
>
> On Mon, Jul 4, 2022 at 10:07 PM Matthew Wilcox <[email protected]> wrote:
> >
> > ... wait, passing an uninitialised variable to a function *which doesn't
> > actually use it* is now UB? What genius came up with that rule? What
> > purpose does it serve?
> >
>
> There is a discussion at [1], with Segher pointing out a reason for
> this rule [2] and Linus requesting that we should be warning about the
> cases where uninitialized variables are passed by value.

I think Matthew was actually more wondering how that UB rule came to be.

Personally, I pretty much despise *all* cases of "undefined behavior",
but "uninitialized argument" across a function call is one of the more
understandable ones.

For one, it's a static sanity checking issue: if function call
arguments can be uninitialized random garbage on the assumption that
the callee doesn't necessarily _use_ them, then any static checker is
going to be unhappy because it means that it can never assume that
incoming arguments have been initialized either.

Of course, that's always true for any pointer passing, but hey, at
least then it's pretty much explicit. You're passing a pointer to some
memory to another function, it's always going to be a bit ambiguous
who is supposed to initialize it - the caller or the callee.

Because one very important "static checker" is the person reading the
code. When I read a function definition, I most definitely have the
expectation that the caller has initialized all the arguments.

So I actually think that "human static checker" is a really important
case. I do not think I'm the only one who expects incoming function
arguments to have values.

But I think the immediate cause of it on a compiler side was basically
things like poison bits. Which are a nice debugging feature, even
though (sadly) I don't think they are usually added for debugging.
It's always for some other much more nefarious reason (eg ia64 and
speculative loads weren't for "hey, this will help people find bugs",
but for "hey, our architecture depends on static scheduling tricks
that aren't really valid, so we have to take faults late").

Now, imagine you're a compiler, and you see a random incoming integer
argument, and you can't even schedule simple arithmetic expressions on
it early because you don't know if the caller initialized it or not,
and it might cause some poison bit fault...

So you'd most certainly want to know that all incoming arguments are
actually valid, because otherwise you can't do even some really simple
and obvious optimizations.

Of course, on normal architectures, this only ever happens with FP
values, and it's often hard to trigger there too. But you most
definitely *could* see it.

I personally was actually surprised compilers didn't warn for "you are
using an uninitialized value" for a function call argument, because I
mentally consider function call arguments to *be* a use of a value.

Except when the function is inlined, and then it's all different - the
call itself goes away, and I *expect* the compiler to DTRT and not
"use" the argument except when it's used inside the inlined function.

Because hey, that's literally the whole point of inlining, and it
makes the "static checking" problem go away at least for a compiler.

Linus

2022-08-25 22:44:12

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Thu, Aug 25, 2022 at 09:33:18AM -0700, Linus Torvalds wrote:
> So you'd most certainly want to know that all incoming arguments are
> actually valid, because otherwise you can't do even some really simple
> and obvious optimizations.

The C11 change was via DR 338, see
<https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_338.htm>
for more info.


Segher

2022-08-25 23:01:57

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Thu, Aug 25, 2022 at 09:33:18AM -0700, Linus Torvalds wrote:
> On Thu, Aug 25, 2022 at 8:40 AM Alexander Potapenko <[email protected]> wrote:
> >
> > On Mon, Jul 4, 2022 at 10:07 PM Matthew Wilcox <[email protected]> wrote:
> > >
> > > ... wait, passing an uninitialised variable to a function *which doesn't
> > > actually use it* is now UB? What genius came up with that rule? What
> > > purpose does it serve?
> > >
> >
> > There is a discussion at [1], with Segher pointing out a reason for
> > this rule [2] and Linus requesting that we should be warning about the
> > cases where uninitialized variables are passed by value.
>
> I think Matthew was actually more wondering how that UB rule came to be.
>
> Personally, I pretty much despise *all* cases of "undefined behavior",

Let me start by saying you're not alone. But some UB *cannot* be worked
around by compilers (we cannot solve the halting problem), and some is
very expensive to work around (initialising huge structures is a
typical example).

Many (if not most) instances of undefined behaviour are unavoidable with
a language like C. A very big part of this is separate compilation,
that is, compiling translation units separately from each other, so that
the compiler does not see all the ways that something is used when it is
compiling it. There only is UB if something is *used*.

> but "uninitialized argument" across a function call is one of the more
> understandable ones.

Allowing this essentially never allows generating better machine code,
so there are no real arguments for ever allowing it, other than just
inertia: uninitialised everything else is allowed just fine, and only
actually *using* such data is UB. Passing it around is not! That is
how everything used to work (with static data, automatic data, function
parameters, the lot).

But it now is clarified that passing data to a function as function
argument is a use of that data by itself, even if the function will not
even look at it ever.

> I personally was actually surprised compilers didn't warn for "you are
> using an uninitialized value" for a function call argument, because I
> mentally consider function call arguments to *be* a use of a value.

The function call is a use of all passed arguments.

> Except when the function is inlined, and then it's all different - the
> call itself goes away, and I *expect* the compiler to DTRT and not
> "use" the argument except when it's used inside the inlined function.

> Because hey, that's literally the whole point of inlining, and it
> makes the "static checking" problem go away at least for a compiler.

But UB is defined in terms of the abstract machine (like *all* of C),
not in terms of the generated machine code. Typically things will work
fine if they "become invisible" by inlining, but this does not make the
program a correct program ever. Sorry :-(


Segher

2022-08-26 20:20:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Thu, Aug 25, 2022 at 3:10 PM Segher Boessenkool
<[email protected]> wrote:
>
> But UB is defined in terms of the abstract machine (like *all* of C),
> not in terms of the generated machine code. Typically things will work
> fine if they "become invisible" by inlining, but this does not make the
> program a correct program ever. Sorry :-(

Yeah, and the abstract machine model based on "abstract syntax" is
just wrong, wrong, wrong.

I really wish the C standard people had the guts to just fix it. At
some point, relying on tradition when the tradition is bad is not a
great thing.

It's the same problem that made all the memory ordering discussions
completely untenable. The language to allow the whole data dependency
was completely ridiculous, because it became about the C language
syntax and theory, not about the actual code generation and actual
*meaning* that the whole thing was *about*.

Java may be a horrible language that a lot of people hate, but it
avoided a lot of problems by just making things about an actual
virtual machine and describing things within a more concrete model of
a virtual machine.

Then you can just say "this code sequence generates this set of
operations, and the compiler can optimize it any which way it likes as
long as the end result is equivalent".

Oh well.

I will repeat: a paper standard that doesn't take reality into account
is less useful than toilet paper. It's scratchy and not very
absorbent.

And the kernel will continue to care more about reality than about a C
standard that does bad things.

Inlining makes the use of the argument go away at the call site and
moves the code of the function into the body. That's how things
*work*. That's literally the meaning of inlining.

And inlining in C is so important because macros are weak, and other
facilities like templates don't exist.

But in the kernel, we also often use it because the actual semantics
of "not a function call" in terms of code generation is also important
(ie we have literal cases where "not generating the 'call'
instruction" is a correctness issue).

If the C standard thinks "undefined argument even for inlining use is
UB", then it's a case of that paperwork that doesn't reflect reality,
and we'll treat it with the deference it deserves - which is less than
toilet paper.

We have decades of history of doing that in the kernel. Sometimes the
standards are just wrong, sometimes they are just too far removed from
reality to be relevant, and then it's just not worth worrying about
them.

Linus

2022-08-31 14:29:41

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 44/45] mm: fs: initialize fsdata passed to write_begin/write_end interface

On Fri, Aug 26, 2022 at 9:41 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Aug 25, 2022 at 3:10 PM Segher Boessenkool
> <[email protected]> wrote:
> >
> > But UB is defined in terms of the abstract machine (like *all* of C),
> > not in terms of the generated machine code. Typically things will work
> > fine if they "become invisible" by inlining, but this does not make the
> > program a correct program ever. Sorry :-(
>
> Yeah, and the abstract machine model based on "abstract syntax" is
> just wrong, wrong, wrong.
>
> I really wish the C standard people had the guts to just fix it. At
> some point, relying on tradition when the tradition is bad is not a
> great thing.
>
> It's the same problem that made all the memory ordering discussions
> completely untenable. The language to allow the whole data dependency
> was completely ridiculous, because it became about the C language
> syntax and theory, not about the actual code generation and actual
> *meaning* that the whole thing was *about*.
>
> Java may be a horrible language that a lot of people hate, but it
> avoided a lot of problems by just making things about an actual
> virtual machine and describing things within a more concrete model of
> a virtual machine.
>
> Then you can just say "this code sequence generates this set of
> operations, and the compiler can optimize it any which way it likes as
> long as the end result is equivalent".
>
> Oh well.
>
> I will repeat: a paper standard that doesn't take reality into account
> is less useful than toilet paper. It's scratchy and not very
> absorbent.
>
> And the kernel will continue to care more about reality than about a C
> standard that does bad things.
>
> Inlining makes the use of the argument go away at the call site and
> moves the code of the function into the body. That's how things
> *work*. That's literally the meaning of inlining.
>
> And inlining in C is so important because macros are weak, and other
> facilities like templates don't exist.
>
> But in the kernel, we also often use it because the actual semantics
> of "not a function call" in terms of code generation is also important
> (ie we have literal cases where "not generating the 'call'
> instruction" is a correctness issue).
>
> If the C standard thinks "undefined argument even for inlining use is
> UB", then it's a case of that paperwork that doesn't reflect reality,
> and we'll treat it with the deference it deserves - which is less than
> toilet paper.

Just for posterity, in the case of KMSAN we are only dealing with
cases where the function call survived inlining and dead code
elimination.


> We have decades of history of doing that in the kernel. Sometimes the
> standards are just wrong, sometimes they are just too far removed from
> reality to be relevant, and then it's just not worth worrying about
> them.
>
> Linus



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg