2023-01-01 16:29:56

by Jason A. Donenfeld

Subject: [PATCH v14 0/7] implement getrandom() in vDSO

Changes v13->v14:
----------------
- Fix incorrect comments in chacha assembly, and augment comments in
random.c

- Get chacha standalone selftest building again after recent upstream
changes.

- Switch to using `.set` instead of `#define` in asm.

- Rebased on e4cf7c25bae5.

Changes v12->v13:
-----------------
- Use helpers in insn-eval.h instead of open coding the same thing via
insn.h.

- Due to some enum constants in insn-eval.h clashing with enum constants
in an unrelated header that happens to be included by a file this
series needs insn-eval.h from, the first patch of this series now
renames some enum constants. This is a standalone patch that should
probably be applied regardless of the rest of this series.

- Rebase atop the latest random.git master branch, which is itself
pretty similar to Linus' tree at the moment.

Changes v11->v12:
-----------------
- In order to avoid mlock()ing pages, and the related rlimit and fork
inheritance issues there, introduce VM_DROPPABLE to prevent swapping
while meeting the cache-like requirements of vDSO getrandom().

This has some tentacles in mm/ and arch/x86/ code, so I've marked the
two patches for that as still RFC, while the rest of the series is not
RFC.

- Mandate that opaque state blobs don't straddle page boundaries, so
that VM_DROPPABLE can work on page-level granularity rather than
allocation-level granularity.

- Add compiler barriers to vDSO getrandom() to prevent theoretical
reordering potential.

- Initialize the trials loop counter in the chacha test.

Changes v10->v11:
-----------------
- Adhemerval's latest glibc patch currently lives here:
https://github.com/zatrazz/glibc/commits/azanella/getrandom-vdso
It's still in flux (he hasn't posted it to glibc-alpha yet), but that
should still give some idea.

- Substantially increase documentation detail throughout.

- Numerous fixes to style and clarity.

- Add two kselftests - one for the chacha20 assembly implementation, which
compares its output to that of libsodium, and one for the actual vDSO
function and vgetrandom_alloc() syscall, which also has some built-in quick &
dirty benchmarking functionality.

- Add reserved `unsigned long addr` argument to syscall, for future use.

- Enforce that `flags`, `addr`, and `*size_per_each` are all zero on input, so
that they can actually be used in the future.

- Use VM_LOCKONFAULT|VM_NORESERVE so that memory isn't wasted until it's used.

- Set VM_DONTDUMP to prevent state from being written out to core dumps.

- Do mapping and flag setting all in one operation, so that it's not racy, and
so that dropping the mm lock between operations isn't required.

- Due to the above, there's no longer any purpose in selecting
VGETRANDOM_ALLOC_SYSCALL, so remove that.

- Don't update *num or *size_per_each until the mapping has succeeded.

- Introduce vdso_kernel_ulong type for representing the long type that
the kernel uses, even when in 32-bit compat code on a 64-bit kernel.

- Don't use 'bool' type in the vDSO page, in case of compiler quirks resulting
in a different interpretation of the type in different compilation contexts.

--------------

Two statements:

1) Userspace wants faster cryptographically secure random numbers of
arbitrary size, big or small.

2) Userspace is currently unable to safely roll its own RNG with the
same security profile as getrandom().

Statement (1) has been debated for years, with arguments ranging from
"we need faster cryptographically secure card shuffling!" to "the only
things that actually need good randomness are keys, which are few and
far between" to "actually, TLS CBC nonces are frequent" and so on. I
don't intend to wade into that debate substantially, except to note that
recently glibc added arc4random(), whose goal is to return a
cryptographically secure uint32_t, and there are real user reports of it
being too slow. So here we are.

Statement (2) is more interesting. The kernel is the nexus of all
entropic inputs that influence the RNG. It is in the best position, and
probably the only position, to decide anything at all about the current
state of the RNG and of its entropy. One of the things it uniquely knows
about is when reseeding is necessary.

For example, when a virtual machine is forked, restored, or duplicated,
it's imperative that the RNG doesn't generate the same outputs. For this
reason, there's a small protocol between hypervisors and the kernel that
indicates this has happened, alongside some ID, which the RNG uses to
immediately reseed, so as not to return the same numbers. Were userspace
to expand a getrandom() seed taken at time T1 for the next hour, and the
virtual machine forked at some point T2 within that hour, userspace would
continue to provide the same numbers to two (or more) different virtual machines,
resulting in potential cryptographic catastrophe. Something similar
happens on resuming from hibernation (or even suspend), with various
compromise scenarios there in mind.

There's a more general reason why userspace rolling its own RNG from a
getrandom() seed is fraught. There's a lot of attention paid to this
particular Linuxism we have of the RNG being initialized and thus
non-blocking or uninitialized and thus blocking until it is initialized.
These are our Two Big States that many hold to be the holy
differentiating factor between safe and not safe, between
cryptographically secure and garbage. The fact is, however, that the
distinction between these two states is a hand-wavy wishy-washy inexact
approximation. Outside of a few exceptional cases (e.g. a HW RNG is
available), we actually don't really ever know with any rigor at all
when the RNG is safe and ready (nor when it's compromised). We do the
best we can to "estimate" it, but entropy estimation is fundamentally
impossible in the general case. So really, we're just doing guess work,
and hoping it's good and conservative enough. Let's then assume that
there's always some potential error involved in this differentiator.

In fact, under the surface, the RNG is engineered around a different
principle, and that is trying to *use* new entropic inputs regularly and
at the right specific moments in time. For example, close to boot time,
the RNG reseeds itself more often than later. At certain events, like VM
fork, the RNG reseeds itself immediately. The various heuristics for
when the RNG will use new entropy, and how often, are really a core aspect
of what the RNG has some potential to do decently enough (and something
that will probably continue to improve in the future from random.c's
present set of algorithms). So in your mind, put away the mental
attachment to the Two Big States, which represent an approximation with
a potential margin of error. Instead keep in mind that the RNG's primary
operating heuristic is how often and exactly when it's going to reseed.

So, if userspace takes a seed from getrandom() at point T1, and uses it
for the next hour (or N megabytes or some other meaningless metric),
during that time, potential errors in the Two Big States approximation
are amplified. During that time potential reseeds are being lost,
forgotten, not reflected in the output stream. That's not good.

The simplest statement you could make is that userspace RNGs that expand
a getrandom() seed at some point T1 are nearly always *worse*, in some
way, than just calling getrandom() every time a random number is
desired.

For those reasons, after some discussion on libc-alpha, glibc's
arc4random() now just calls getrandom() on each invocation. That's
trivially safe, and gives us latitude to then make the safe thing faster,
at our leisure, without becoming unsafe. Card shuffling isn't
particularly fast, however.

How do we rectify this? By putting a safe implementation of getrandom()
in the vDSO, which has access to whatever information a
particular iteration of random.c is using to make its decisions. I use
that careful language of "particular iteration of random.c", because the
set of things that a vDSO getrandom() implementation might need for making
decisions as good as the kernel's will likely change over time. This
isn't just a matter of exporting certain *data* to userspace. We're not
going to commit to a "data API" where the various heuristics used are
exposed, locking in how the kernel works for decades to come, and then
leave it to various userspaces to roll something on top and shoot
themselves in the foot and have all sorts of complexity disasters.
Rather, vDSO getrandom() is supposed to be the *same exact algorithm*
that runs in the kernel, except it's been hoisted into userspace as
much as possible. And so vDSO getrandom() and kernel getrandom() will
always mirror each other hermetically.

API-wise, the vDSO gains this function:

ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state);

The return value and the first 3 arguments are the same as ordinary
getrandom(), while the last argument is a pointer to some state
allocated with vgetrandom_alloc(), explained below. Were all four
arguments passed to the getrandom syscall, nothing different would
happen, and the functions would have the exact same behavior.
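
To make the calling convention concrete, here is a hedged sketch (not code
from this series; the fill() helper is purely illustrative). The symbol is
resolved the same way the selftests do it, via a vdso_sym()-style helper:

typedef ssize_t (*vgetrandom_fn)(void *buffer, size_t len, unsigned int flags,
				 void *opaque_state);

/* Resolved once, e.g.:
 *   vgetrandom_fn vgetrandom = vdso_sym("LINUX_2.6", "__vdso_getrandom");
 */
static ssize_t fill(vgetrandom_fn vgetrandom, void *opaque_state,
		    void *buf, size_t len)
{
	/* Same return value and first three arguments as getrandom(2),
	 * plus the per-thread opaque state as the fourth. */
	return vgetrandom(buf, len, 0, opaque_state);
}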

Then, we introduce a new syscall:

void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
unsigned long addr, unsigned int flags);

This takes a hinted number of opaque states in `num`, and returns a
pointer to an array of opaque states, the number actually allocated back
in `num`, and the size in bytes of each one in `size_per_each`, enabling
a libc to slice up the returned array into one state per thread. (The
`flags` and `addr` arguments, as well as the `*size_per_each` input
value, are reserved for the future and are forced to be zero for now.)

Libc is expected to allocate a chunk of these on first use, and then
dole them out to threads as they're created, allocating more when
needed. The returned address of the first state may be passed to
munmap(2) with a length of `num * size_per_each`, in order to deallocate
the memory.
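
To illustrate that slicing, here's a minimal hedged sketch (the
vgetrandom_alloc() commit message later in this series carries a fuller,
allocator-style example; the alloc_states() helper, the hint of 16, and the
per-page indexing below are illustrative assumptions, not anything mandated
by the ABI):

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void **alloc_states(unsigned int *ret_num)
{
	unsigned int num = 16;		/* just a hint; the kernel writes back the real count */
	unsigned int size_per_each = 0;	/* must be zero on input */
	long page_size = sysconf(_SC_PAGESIZE);
	void *block, **states;

	block = (void *)syscall(__NR_vgetrandom_alloc, &num, &size_per_each,
				0 /* reserved addr */, 0 /* reserved flags */);
	if (block == MAP_FAILED)
		return NULL;

	states = calloc(num, sizeof(*states));
	if (!states) {
		munmap(block, (size_t)num * size_per_each);
		return NULL;
	}

	/* No state may straddle a page boundary, so pack page_size/size_per_each
	 * states into each page; the kernel sized the mapping so exactly num fit. */
	unsigned int per_page = page_size / size_per_each;
	for (unsigned int i = 0; i < num; ++i)
		states[i] = (char *)block + (size_t)(i / per_page) * page_size +
			    (size_t)(i % per_page) * size_per_each;

	*ret_num = num;
	return states;
}

Each states[i] is then handed to one thread, and the whole mapping can later
be released with munmap(states[0], (size_t)num * size_per_each).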

We very intentionally do *not* leave state allocation up to the caller
of vgetrandom, but provide vgetrandom_alloc for that allocation. There
are too many weird things that can go wrong, and it's important that
vDSO does not provide too generic of a mechanism. It's not going to
store its state in just any old memory address. It'll do it only in ones
it allocates.

Right now this means it uses a new mm flag called VM_DROPPABLE, along
with VM_WIPEONFORK. In the future maybe there will be other interesting
page flags or anti-heartbleed measures, or other platform-specific
kernel-specific things that can be set from the syscall. Again, it's
important that the kernel has a say in how this works rather than
agreeing to operate on any old address; memory isn't neutral.

The interesting meat of the implementation is in lib/vdso/getrandom.c,
as generic C code, and it aims to mainly follow random.c's buffered fast
key erasure logic. Before the RNG is initialized, it falls back to the
syscall. Right now it uses a simple generation counter to make its decisions
on reseeding (though this could be made more extensive over time).
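
In very rough strokes, the reseeding decision looks like the following hedged
sketch (not the actual lib/vdso/getrandom.c, which appears later in this
series; the struct layouts and sketch_* names are illustrative stand-ins, and
the real code reads the shared vDSO data page with the appropriate barriers
rather than plain loads):

#include <stddef.h>
#include <stdint.h>
#include <sys/random.h>

struct sketch_rng_data { uint64_t generation; uint8_t is_ready; };	/* shared with kernel */
struct sketch_state { uint8_t key[32]; uint64_t generation; /* batch, pos, ... */ };

/* The buffered fast key erasure expansion (stack-less ChaCha20) is elided. */
extern ssize_t sketch_expand(struct sketch_state *state, void *buffer, size_t len);

static ssize_t sketch_vgetrandom(const struct sketch_rng_data *rng, struct sketch_state *state,
				 void *buffer, size_t len, unsigned int flags)
{
	/* Pre-init behavior can't be emulated in userspace, so defer to the syscall. */
	if (!rng->is_ready)
		return getrandom(buffer, len, flags);

	/* The kernel bumps rng->generation on every reseed, and 0 is never a
	 * valid value, so a freshly wiped (forked) state also fails this check. */
	if (state->generation != rng->generation) {
		if (getrandom(state->key, sizeof(state->key), 0) != sizeof(state->key))
			return getrandom(buffer, len, flags);
		state->generation = rng->generation;
	}

	/* From here it's random.c-style fast key erasure over the buffered batch. */
	return sketch_expand(state, buffer, len);
}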

The actual place that has the most work to do is in all of the other
files. Most of the vDSO shared page infrastructure is centered around
gettimeofday, and so the main structs are all in arrays for different
timestamp types, and attached to time namespaces, and so forth. I've
done the best I could to add onto this in an unintrusive way.

In my test results, performance is pretty stellar (around 15x for uint32_t
generation), and it seems to be working. There's an extended example in the
commit messages for the vgetrandom_alloc() syscall and the generic vDSO
implementation, showing how the syscall and the vDSO function are meant to be
used together.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Adhemerval Zanella Netto <[email protected]>
Cc: Carlos O'Donell <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Christian Brauner <[email protected]>

Jason A. Donenfeld (7):
x86: lib: Separate instruction decoder MMIO type from MMIO trace
mm: add VM_DROPPABLE for designating always lazily freeable mappings
x86: mm: Skip faulting instruction for VM_DROPPABLE faults
random: add vgetrandom_alloc() syscall
arch: allocate vgetrandom_alloc() syscall number
random: introduce generic vDSO getrandom() implementation
x86: vdso: Wire up getrandom() vDSO implementation

MAINTAINERS | 2 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/Kconfig | 1 +
arch/x86/coco/tdx/tdx.c | 26 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/entry/vdso/Makefile | 3 +-
arch/x86/entry/vdso/vdso.lds.S | 2 +
arch/x86/entry/vdso/vgetrandom-chacha.S | 178 +++++++++++
arch/x86/entry/vdso/vgetrandom.c | 17 ++
arch/x86/include/asm/insn-eval.h | 18 +-
arch/x86/include/asm/vdso/getrandom.h | 55 ++++
arch/x86/include/asm/vdso/vsyscall.h | 2 +
arch/x86/include/asm/vvar.h | 16 +
arch/x86/kernel/sev.c | 18 +-
arch/x86/lib/insn-eval.c | 20 +-
arch/x86/mm/fault.c | 19 ++
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
drivers/char/random.c | 147 +++++++++
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 8 +
include/linux/mm_types.h | 5 +-
include/linux/syscalls.h | 3 +
include/trace/events/mmflags.h | 7 +
include/uapi/asm-generic/unistd.h | 5 +-
include/vdso/datapage.h | 12 +
include/vdso/getrandom.h | 44 +++
include/vdso/types.h | 35 +++
kernel/sys_ni.c | 3 +
lib/vdso/Kconfig | 6 +
lib/vdso/getrandom.c | 224 ++++++++++++++
mm/Kconfig | 3 +
mm/memory.c | 6 +
mm/mempolicy.c | 3 +
mm/mprotect.c | 2 +-
mm/rmap.c | 5 +-
tools/include/uapi/asm-generic/unistd.h | 5 +-
.../arch/mips/entry/syscalls/syscall_n64.tbl | 1 +
.../arch/powerpc/entry/syscalls/syscall.tbl | 1 +
.../perf/arch/s390/entry/syscalls/syscall.tbl | 1 +
.../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
tools/testing/selftests/vDSO/.gitignore | 2 +
tools/testing/selftests/vDSO/Makefile | 11 +
.../testing/selftests/vDSO/vdso_test_chacha.c | 43 +++
.../selftests/vDSO/vdso_test_getrandom.c | 283 ++++++++++++++++++
59 files changed, 1217 insertions(+), 49 deletions(-)
create mode 100644 arch/x86/entry/vdso/vgetrandom-chacha.S
create mode 100644 arch/x86/entry/vdso/vgetrandom.c
create mode 100644 arch/x86/include/asm/vdso/getrandom.h
create mode 100644 include/vdso/getrandom.h
create mode 100644 include/vdso/types.h
create mode 100644 lib/vdso/getrandom.c
create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha.c
create mode 100644 tools/testing/selftests/vDSO/vdso_test_getrandom.c

--
2.39.0


2023-01-01 16:30:03

by Jason A. Donenfeld

Subject: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
* Easy: VM_DONTDUMP.
- It should be zeroed on fork.
* Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
* Uh-oh: mlock is rlimited.
* Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
page faulting in memory if none is available
* Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
* Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
having the contents of the pages it reads zeroed out midway through
the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
we have the option to fall back to the getrandom() syscall, and
everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal,
and no signal is sent. Instead, writes are simply lost.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

At the moment, rather than skipping writes on OOM, the fault handler
just returns to userspace, and the instruction is retried. This isn't
terrible, but it's not quite what is intended. The actual instruction
skipping has to be implemented arch-by-arch, but so does this whole
vDSO series, so that's fine. The following commit addresses it for x86.

Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
include/trace/events/mmflags.h | 7 +++++++
mm/Kconfig | 3 +++
mm/memory.c | 4 ++++
mm/mempolicy.c | 3 +++
mm/mprotect.c | 2 +-
mm/rmap.c | 5 +++--
8 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..47c7c046f2be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_NEED_VM_DROPPABLE
+ [ilog2(VM_DROPPABLE)] = "dp",
+#endif
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..fba3f1e8616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE VM_HIGH_ARCH_5
+#else
+# define VM_DROPPABLE 0
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..82b2fb811d06 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
# define IF_HAVE_UFFD_MINOR(flag, name)
#endif

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
#define __def_vmaflag_names \
{VM_READ, "read" }, \
{VM_WRITE, "write" }, \
@@ -195,6 +201,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
{VM_MIXEDMAP, "mixedmap" }, \
{VM_HUGEPAGE, "hugepage" }, \
{VM_NOHUGEPAGE, "nohugepage" }, \
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE, "droppable" ) \
{VM_MERGEABLE, "mergeable" } \

#define show_vma_flags(flags) \
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..91fd0be96ca4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1030,6 +1030,9 @@ config ARCH_USES_HIGH_VMA_FLAGS
bool
config ARCH_HAS_PKEYS
bool
+config NEED_VM_DROPPABLE
+ select ARCH_USES_HIGH_VMA_FLAGS
+ bool

config ARCH_USES_PG_ARCH_X
bool
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..1ade407ccbf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,

lru_gen_exit_fault();

+ /* If the mapping is droppable, then errors due to OOM aren't fatal. */
+ if (vma->vm_flags & VM_DROPPABLE)
+ ret &= ~VM_FAULT_OOM;
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..ebf2e3694a0a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2173,6 +2173,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
int preferred_nid;
nodemask_t *nmask;

+ if (vma->vm_flags & VM_DROPPABLE)
+ gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
pol = get_vma_policy(vma, addr);

if (pol->mode == MPOL_INTERLEAVE) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..a679cc5d1c75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -593,7 +593,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
may_expand_vm(mm, oldflags, nrpages))
return -ENOMEM;
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
- VM_SHARED|VM_NORESERVE))) {
+ VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
charged = nrpages;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..5ed46e59dfcd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,7 +1294,8 @@ void page_add_new_anon_rmap(struct page *page,
int nr;

VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
- __SetPageSwapBacked(page);
+ if (!(vma->vm_flags & VM_DROPPABLE))
+ __SetPageSwapBacked(page);

if (likely(!PageCompound(page))) {
/* increment count (starts at -1) */
@@ -1683,7 +1684,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
- !folio_test_dirty(folio)) {
+ (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
/* Invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm,
address, address + PAGE_SIZE);
--
2.39.0

2023-01-01 16:30:07

by Jason A. Donenfeld

Subject: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

Both mmiotrace.h and insn-eval.h define various MMIO_ enum constants.
Rename the insn ones to have a INSN_ prefix, so that the headers can be
used from the same source file.

Signed-off-by: Jason A. Donenfeld <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 26 +++++++++++++-------------
arch/x86/include/asm/insn-eval.h | 18 +++++++++---------
arch/x86/kernel/sev.c | 18 +++++++++---------
arch/x86/lib/insn-eval.c | 20 ++++++++++----------
4 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cfd4c95b9f04..669d9e4f2901 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -386,8 +386,8 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
unsigned long *reg, val, vaddr;
char buffer[MAX_INSN_SIZE];
+ enum insn_mmio_type mmio;
struct insn insn = {};
- enum mmio_type mmio;
int size, extend_size;
u8 extend_val = 0;

@@ -402,10 +402,10 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return -EINVAL;

mmio = insn_decode_mmio(&insn, &size);
- if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+ if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;

- if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
reg = insn_get_modrm_reg_ptr(&insn, regs);
if (!reg)
return -EINVAL;
@@ -426,23 +426,23 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)

/* Handle writes first */
switch (mmio) {
- case MMIO_WRITE:
+ case INSN_MMIO_WRITE:
memcpy(&val, reg, size);
if (!mmio_write(size, ve->gpa, val))
return -EIO;
return insn.length;
- case MMIO_WRITE_IMM:
+ case INSN_MMIO_WRITE_IMM:
val = insn.immediate.value;
if (!mmio_write(size, ve->gpa, val))
return -EIO;
return insn.length;
- case MMIO_READ:
- case MMIO_READ_ZERO_EXTEND:
- case MMIO_READ_SIGN_EXTEND:
+ case INSN_MMIO_READ:
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
/* Reads are handled below */
break;
- case MMIO_MOVS:
- case MMIO_DECODE_FAILED:
+ case INSN_MMIO_MOVS:
+ case INSN_MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
* decoded or handled properly. It was likely not using io.h
@@ -459,15 +459,15 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return -EIO;

switch (mmio) {
- case MMIO_READ:
+ case INSN_MMIO_READ:
/* Zero-extend for 32-bit operation */
extend_size = size == 4 ? sizeof(*reg) : 0;
break;
- case MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_ZERO_EXTEND:
/* Zero extend based on operand size */
extend_size = insn.opnd_bytes;
break;
- case MMIO_READ_SIGN_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
/* Sign extend based on operand size */
extend_size = insn.opnd_bytes;
if (size == 1 && val & BIT(7))
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index f07faa61c7f3..54368a43abf6 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -32,16 +32,16 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs,
bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
unsigned char buf[MAX_INSN_SIZE], int buf_size);

-enum mmio_type {
- MMIO_DECODE_FAILED,
- MMIO_WRITE,
- MMIO_WRITE_IMM,
- MMIO_READ,
- MMIO_READ_ZERO_EXTEND,
- MMIO_READ_SIGN_EXTEND,
- MMIO_MOVS,
+enum insn_mmio_type {
+ INSN_MMIO_DECODE_FAILED,
+ INSN_MMIO_WRITE,
+ INSN_MMIO_WRITE_IMM,
+ INSN_MMIO_READ,
+ INSN_MMIO_READ_ZERO_EXTEND,
+ INSN_MMIO_READ_SIGN_EXTEND,
+ INSN_MMIO_MOVS,
};

-enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes);
+enum insn_mmio_type insn_decode_mmio(struct insn *insn, int *bytes);

#endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a428c62330d3..679026a640ef 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -1536,32 +1536,32 @@ static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
struct insn *insn = &ctxt->insn;
+ enum insn_mmio_type mmio;
unsigned int bytes = 0;
- enum mmio_type mmio;
enum es_result ret;
u8 sign_byte;
long *reg_data;

mmio = insn_decode_mmio(insn, &bytes);
- if (mmio == MMIO_DECODE_FAILED)
+ if (mmio == INSN_MMIO_DECODE_FAILED)
return ES_DECODE_FAILED;

- if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
if (!reg_data)
return ES_DECODE_FAILED;
}

switch (mmio) {
- case MMIO_WRITE:
+ case INSN_MMIO_WRITE:
memcpy(ghcb->shared_buffer, reg_data, bytes);
ret = vc_do_mmio(ghcb, ctxt, bytes, false);
break;
- case MMIO_WRITE_IMM:
+ case INSN_MMIO_WRITE_IMM:
memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
ret = vc_do_mmio(ghcb, ctxt, bytes, false);
break;
- case MMIO_READ:
+ case INSN_MMIO_READ:
ret = vc_do_mmio(ghcb, ctxt, bytes, true);
if (ret)
break;
@@ -1572,7 +1572,7 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)

memcpy(reg_data, ghcb->shared_buffer, bytes);
break;
- case MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_ZERO_EXTEND:
ret = vc_do_mmio(ghcb, ctxt, bytes, true);
if (ret)
break;
@@ -1581,7 +1581,7 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
memset(reg_data, 0, insn->opnd_bytes);
memcpy(reg_data, ghcb->shared_buffer, bytes);
break;
- case MMIO_READ_SIGN_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
ret = vc_do_mmio(ghcb, ctxt, bytes, true);
if (ret)
break;
@@ -1600,7 +1600,7 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
memset(reg_data, sign_byte, insn->opnd_bytes);
memcpy(reg_data, ghcb->shared_buffer, bytes);
break;
- case MMIO_MOVS:
+ case INSN_MMIO_MOVS:
ret = vc_handle_mmio_movs(ctxt, bytes);
break;
default:
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 21104c41cba0..558a605929db 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -1595,16 +1595,16 @@ bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
* Returns:
*
* Type of the instruction. Size of the memory operand is stored in
- * @bytes. If decode failed, MMIO_DECODE_FAILED returned.
+ * @bytes. If decode failed, INSN_MMIO_DECODE_FAILED returned.
*/
-enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
+enum insn_mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
{
- enum mmio_type type = MMIO_DECODE_FAILED;
+ enum insn_mmio_type type = INSN_MMIO_DECODE_FAILED;

*bytes = 0;

if (insn_get_opcode(insn))
- return MMIO_DECODE_FAILED;
+ return INSN_MMIO_DECODE_FAILED;

switch (insn->opcode.bytes[0]) {
case 0x88: /* MOV m8,r8 */
@@ -1613,7 +1613,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0x89: /* MOV m16/m32/m64, r16/m32/m64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
- type = MMIO_WRITE;
+ type = INSN_MMIO_WRITE;
break;

case 0xc6: /* MOV m8, imm8 */
@@ -1622,7 +1622,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0xc7: /* MOV m16/m32/m64, imm16/imm32/imm64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
- type = MMIO_WRITE_IMM;
+ type = INSN_MMIO_WRITE_IMM;
break;

case 0x8a: /* MOV r8, m8 */
@@ -1631,7 +1631,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
- type = MMIO_READ;
+ type = INSN_MMIO_READ;
break;

case 0xa4: /* MOVS m8, m8 */
@@ -1640,7 +1640,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0xa5: /* MOVS m16/m32/m64, m16/m32/m64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
- type = MMIO_MOVS;
+ type = INSN_MMIO_MOVS;
break;

case 0x0f: /* Two-byte instruction */
@@ -1651,7 +1651,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0xb7: /* MOVZX r32/r64, m16 */
if (!*bytes)
*bytes = 2;
- type = MMIO_READ_ZERO_EXTEND;
+ type = INSN_MMIO_READ_ZERO_EXTEND;
break;

case 0xbe: /* MOVSX r16/r32/r64, m8 */
@@ -1660,7 +1660,7 @@ enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
case 0xbf: /* MOVSX r32/r64, m16 */
if (!*bytes)
*bytes = 2;
- type = MMIO_READ_SIGN_EXTEND;
+ type = INSN_MMIO_READ_SIGN_EXTEND;
break;
}
break;
--
2.39.0

2023-01-01 16:30:28

by Jason A. Donenfeld

Subject: [PATCH v14 3/7] x86: mm: Skip faulting instruction for VM_DROPPABLE faults

The prior commit introduced VM_DROPPABLE, but in a limited form where
the faulting instruction was retried instead of skipped. Finish that up
with the platform-specific aspect of skipping the actual instruction.

This works by copying userspace's %rip to a stack buffer of size
MAX_INSN_SIZE, decoding it, and then adding the length of the decoded
instruction to userspace's %rip. In the event any of these fail, just
fall back to not advancing %rip and trying again.

Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
arch/x86/mm/fault.c | 19 +++++++++++++++++++
include/linux/mm_types.h | 5 ++++-
mm/memory.c | 4 +++-
3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..76ca99ab6eb7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,8 @@
#include <asm/kvm_para.h> /* kvm_handle_async_pf */
#include <asm/vdso.h> /* fixup_vdso_exception() */
#include <asm/irq_stack.h>
+#include <asm/insn.h> /* struct insn */
+#include <asm/insn-eval.h> /* insn_fetch_from_user(), ... */

#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
@@ -1454,6 +1456,23 @@ void do_user_addr_fault(struct pt_regs *regs,
}

mmap_read_unlock(mm);
+
+ if (fault & VM_FAULT_SKIP_INSN) {
+ u8 buf[MAX_INSN_SIZE];
+ struct insn insn;
+ int nr_copied;
+
+ nr_copied = insn_fetch_from_user(regs, buf);
+ if (nr_copied <= 0)
+ return;
+
+ if (!insn_decode_from_regs(&insn, regs, buf, nr_copied))
+ return;
+
+ regs->ip += insn.length;
+ return;
+ }
+
if (likely(!(fault & VM_FAULT_ERROR)))
return;

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..e76ab9ad555c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -945,6 +945,7 @@ typedef __bitwise unsigned int vm_fault_t;
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_SKIP_INSN: ->handle the fault by skipping faulting instruction
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -962,6 +963,7 @@ enum vm_fault_reason {
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+ VM_FAULT_SKIP_INSN = (__force vm_fault_t)0x008000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};

@@ -985,7 +987,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
- { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_SKIP_INSN, "SKIP_INSN" }

struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 1ade407ccbf9..62ba9b7b713e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5221,8 +5221,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
lru_gen_exit_fault();

/* If the mapping is droppable, then errors due to OOM aren't fatal. */
- if (vma->vm_flags & VM_DROPPABLE)
+ if ((ret & VM_FAULT_OOM) && (vma->vm_flags & VM_DROPPABLE)) {
ret &= ~VM_FAULT_OOM;
+ ret |= VM_FAULT_SKIP_INSN;
+ }

if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
--
2.39.0

2023-01-01 16:30:35

by Jason A. Donenfeld

Subject: [PATCH v14 4/7] random: add vgetrandom_alloc() syscall

The vDSO getrandom() works over an opaque per-thread state of an
unexported size, which must be marked VM_WIPEONFORK, VM_DONTDUMP,
VM_NORESERVE, and VM_DROPPABLE for proper operation. Over time, the
nuances of these allocations may change or grow or even differ based on
architectural features.

The syscall has the signature:

void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
unsigned long addr, unsigned int flags);

This takes a hinted number of opaque states in `num`, and returns a
pointer to an array of opaque states, the number actually allocated back
in `num`, and the size in bytes of each one in `size_per_each`, enabling
a libc to slice up the returned array into one state per thread,
while ensuring that no single state straddles a page boundary. (The
`flags` and `addr` arguments, as well as the `*size_per_each` input
value, are reserved for the future and are forced to be zero for
now.)

Libc is expected to allocate a chunk of these on first use, and then
dole them out to threads as they're created, allocating more when
needed. The returned address of the first state may be passed to
munmap(2) with a length of `num * size_per_each`, in order to deallocate
the memory.

We very intentionally do *not* leave state allocation for vDSO
getrandom() up to userspace itself, but rather provide this new syscall
for such allocations. vDSO getrandom() must not store its state in just
any old memory address, but rather just ones that the kernel specially
allocates for it, leaving the particularities of those allocations up to
the kernel.

The allocation of states is intended to be integrated into libc's thread
management. As an illustrative example, the following code might be used
to do the same outside of libc. Though, vgetrandom_alloc() is not
expected to be exposed outside of libc, and the pthread usage here is
expected to be elided into libc internals. This allocation scheme is
very naive and does not shrink; other implementations may choose to be
more complex.

static void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each)
{
*size_per_each = 0; /* Must be zero on input. */
return (void *)syscall(__NR_vgetrandom_alloc, num, size_per_each,
0 /* reserved @addr */, 0 /* reserved @flags */);
}

static struct {
pthread_mutex_t lock;
void **states;
size_t len, cap;
} grnd_allocator = {
.lock = PTHREAD_MUTEX_INITIALIZER
};

static void *vgetrandom_get_state(void)
{
void *state = NULL;

pthread_mutex_lock(&grnd_allocator.lock);
if (!grnd_allocator.len) {
size_t new_cap;
size_t page_size = getpagesize();
unsigned int num = sysconf(_SC_NPROCESSORS_ONLN); /* Could be arbitrary, just a hint. */
unsigned int size_per_each;
void *new_block = vgetrandom_alloc(&num, &size_per_each);
void *new_states;

if (new_block == MAP_FAILED)
goto out;
new_cap = grnd_allocator.cap + num;
new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
if (!new_states) {
munmap(new_block, num * size_per_each);
goto out;
}
grnd_allocator.cap = new_cap;
grnd_allocator.states = new_states;

for (size_t i = 0; i < num; ++i) {
grnd_allocator.states[i] = new_block;
if (((uintptr_t)new_block & (page_size - 1)) + size_per_each > page_size)
new_block = (void *)(((uintptr_t)new_block + page_size) & ~(page_size - 1));
else
new_block += size_per_each;
}
grnd_allocator.len = num;
}
state = grnd_allocator.states[--grnd_allocator.len];

out:
pthread_mutex_unlock(&grnd_allocator.lock);
return state;
}

static void vgetrandom_put_state(void *state)
{
if (!state)
return;
pthread_mutex_lock(&grnd_allocator.lock);
grnd_allocator.states[grnd_allocator.len++] = state;
pthread_mutex_unlock(&grnd_allocator.lock);
}

Signed-off-by: Jason A. Donenfeld <[email protected]>
---
MAINTAINERS | 1 +
drivers/char/random.c | 136 +++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 3 +
include/vdso/getrandom.h | 16 +++++
kernel/sys_ni.c | 3 +
lib/vdso/Kconfig | 6 ++
6 files changed, 165 insertions(+)
create mode 100644 include/vdso/getrandom.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 7f86d02cb427..e2a1c0b26fc8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17521,6 +17521,7 @@ T: git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
S: Maintained
F: drivers/char/random.c
F: drivers/virt/vmgenid.c
+F: include/vdso/getrandom.h

RAPIDIO SUBSYSTEM
M: Matt Porter <[email protected]>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index ce3ccd172cc8..6425f5f838e0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -8,6 +8,7 @@
* into roughly six sections, each with a section header:
*
* - Initialization and readiness waiting.
+ * - vDSO support helpers.
* - Fast key erasure RNG, the "crng".
* - Entropy accumulation and extraction routines.
* - Entropy collection routines.
@@ -39,6 +40,7 @@
#include <linux/blkdev.h>
#include <linux/interrupt.h>
#include <linux/mm.h>
+#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <linux/kthread.h>
@@ -56,6 +58,9 @@
#include <linux/sched/isolation.h>
#include <crypto/chacha.h>
#include <crypto/blake2s.h>
+#ifdef CONFIG_VDSO_GETRANDOM
+#include <vdso/getrandom.h>
+#endif
#include <asm/archrandom.h>
#include <asm/processor.h>
#include <asm/irq.h>
@@ -169,6 +174,137 @@ int __cold execute_with_initialized_rng(struct notifier_block *nb)
__func__, (void *)_RET_IP_, crng_init)


+
+/********************************************************************
+ *
+ * vDSO support helpers.
+ *
+ * The actual vDSO function is defined over in lib/vdso/getrandom.c,
+ * but this section contains the kernel-mode helpers to support that.
+ *
+ ********************************************************************/
+
+#ifdef CONFIG_VDSO_GETRANDOM
+/**
+ * sys_vgetrandom_alloc - Allocate opaque states for use with vDSO getrandom().
+ *
+ * @num: On input, a pointer to a suggested hint of how many states to
+ * allocate, and on return the number of states actually allocated.
+ *
+ * @size_per_each: On input, must be zero. On return, the size of each state allocated,
+ * so that the caller can split up the returned allocation into
+ * individual states.
+ *
+ * @addr: Reserved, must be zero.
+ *
+ * @flags: Reserved, must be zero.
+ *
+ * The getrandom() vDSO function in userspace requires an opaque state, which
+ * this function allocates by mapping a certain number of special pages into
+ * the calling process. It takes a hint as to the number of opaque states
+ * desired, and provides the caller with the number of opaque states actually
+ * allocated, the size of each one in bytes, and the address of the first
+ * state, which may be split up into @num states of @size_per_each bytes each,
+ * by adding @size_per_each to the returned first state @num times, while
+ * ensuring that no single state straddles a page boundary.
+ *
+ * Returns the address of the first state in the allocation on success, or a
+ * negative error value on failure.
+ *
+ * The returned address of the first state may be passed to munmap(2) with a
+ * length of `(size_t)num * (size_t)size_per_each`, in order to deallocate the
+ * memory, after which it is invalid to pass it to vDSO getrandom().
+ *
+ * States allocated by this function must not be dereferenced, written, read,
+ * or otherwise manipulated. The *only* supported operations are:
+ * - Splitting up the states in intervals of @size_per_each, no more than
+ * @num times from the first state, while ensuring that no single state
+ * straddles a page boundary.
+ * - Passing a state to the getrandom() vDSO function's @opaque_state
+ * parameter, but not passing the same state at the same time to two such
+ * calls.
+ * - Passing the first state and the total length to munmap(2), as described
+ * above.
+ * All other uses are undefined behavior, which is subject to change or removal.
+ */
+SYSCALL_DEFINE4(vgetrandom_alloc, unsigned int __user *, num,
+ unsigned int __user *, size_per_each, unsigned long, addr,
+ unsigned int, flags)
+{
+ struct mm_struct *mm = current->mm;
+ size_t state_size, alloc_size, num_states;
+ unsigned long pages_addr, populate, mm_flags;
+ unsigned int num_hint;
+ int ret;
+
+ /*
+ * @flags and @addr are currently unused, so in order to reserve them
+ * for the future, force them to be set to zero by current callers.
+ */
+ if (flags || addr)
+ return -EINVAL;
+
+ /*
+ * Also enforce that *size_per_each is zero on input, in case this becomes
+ * useful later on.
+ */
+ if (get_user(num_hint, size_per_each))
+ return -EFAULT;
+ if (num_hint)
+ return -EINVAL;
+
+ if (get_user(num_hint, num))
+ return -EFAULT;
+
+ state_size = sizeof(struct vgetrandom_state);
+ num_states = clamp_t(size_t, num_hint, 1, (SIZE_MAX & PAGE_MASK) / state_size);
+ alloc_size = PAGE_ALIGN(num_states * state_size);
+ /*
+ * States cannot straddle page boundaries, so calculate the number of
+ * states that can fit inside of a page without being split, and then
+ * multiply that out by the number of pages allocated.
+ */
+ num_states = (PAGE_SIZE / state_size) * (alloc_size / PAGE_SIZE);
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ mm_flags = mm->def_flags;
+
+ mm->def_flags |=
+ /*
+ * Don't allow state to be written to swap, to preserve forward secrecy.
+ * But also don't mlock it or pre-reserve it, and allow it to
+ * be discarded under memory pressure. If no memory is available, returns
+ * zeros rather than segfaulting.
+ */
+ VM_DROPPABLE | VM_NORESERVE |
+
+ /* Don't allow the state to survive forks, to prevent random number re-use. */
+ VM_WIPEONFORK |
+
+ /* Don't write random state into coredumps. */
+ VM_DONTDUMP;
+
+ pages_addr = do_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, &populate, NULL);
+
+ mm->def_flags = mm_flags;
+ mmap_write_unlock(mm);
+ if (IS_ERR_VALUE(pages_addr))
+ return pages_addr;
+
+ ret = -EFAULT;
+ if (put_user(num_states, num) || put_user(state_size, size_per_each))
+ goto err_unmap;
+
+ return pages_addr;
+
+err_unmap:
+ vm_munmap(pages_addr, alloc_size);
+ return ret;
+}
+#endif
+
/*********************************************************************
*
* Fast key erasure RNG, the "crng".
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..00f2b590b834 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1008,6 +1008,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
void __user *uargs);
asmlinkage long sys_getrandom(char __user *buf, size_t count,
unsigned int flags);
+asmlinkage long sys_vgetrandom_alloc(unsigned int __user *num,
+ unsigned int __user *size_per_each,
+ unsigned long addr, unsigned int flags);
asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
asmlinkage long sys_execveat(int dfd, const char __user *filename,
diff --git a/include/vdso/getrandom.h b/include/vdso/getrandom.h
new file mode 100644
index 000000000000..e3ceb1976386
--- /dev/null
+++ b/include/vdso/getrandom.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#ifndef _VDSO_GETRANDOM_H
+#define _VDSO_GETRANDOM_H
+
+/**
+ * struct vgetrandom_state - State used by vDSO getrandom() and allocated by vgetrandom_alloc().
+ *
+ * Currently empty, as the vDSO getrandom() function has not yet been implemented.
+ */
+struct vgetrandom_state { int placeholder; };
+
+#endif /* _VDSO_GETRANDOM_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..f28196cb919b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
/* memfd_secret */
COND_SYSCALL(memfd_secret);

+/* random */
+COND_SYSCALL(vgetrandom_alloc);
+
/*
* Architecture specific weak syscall entries.
*/
diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig
index d883ac299508..f88580960182 100644
--- a/lib/vdso/Kconfig
+++ b/lib/vdso/Kconfig
@@ -31,3 +31,9 @@ config GENERIC_VDSO_TIME_NS
VDSO

endif
+
+config VDSO_GETRANDOM
+ bool
+ select NEED_VM_DROPPABLE
+ help
+ Selected by architectures that support vDSO getrandom().
--
2.39.0

2023-01-01 16:31:28

by Jason A. Donenfeld

Subject: [PATCH v14 6/7] random: introduce generic vDSO getrandom() implementation

Provide a generic C vDSO getrandom() implementation, which operates on
an opaque state returned by vgetrandom_alloc() and produces random bytes
the same way as getrandom(). This has the API signature:

ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state);

The return value and the first 3 arguments are the same as ordinary
getrandom(), while the last argument is a pointer to the opaque
allocated state. Were all four arguments passed to the getrandom()
syscall, nothing different would happen, and the functions would have
the exact same behavior.

The actual vDSO RNG algorithm implemented is the same one implemented by
drivers/char/random.c, using the same fast-erasure techniques as that.
Should the in-kernel implementation change, so too will the vDSO one.

It requires an implementation of ChaCha20 that does not use any stack,
in order to maintain forward secrecy if a multi-threaded program forks
(though this does not account for a similar issue with SA_SIGINFO
copying registers to the stack), so this is left as an
architecture-specific fill-in. Stack-less ChaCha20 is an easy algorithm
to implement on a variety of architectures, so this shouldn't be too
onerous.

Initially, the state is keyless, and so the first call makes a
getrandom() syscall to generate that key, and then uses it for
subsequent calls. By keeping track of a generation counter, it knows
when its key is invalidated and it should fetch a new one using the
syscall. Later, more than just a generation counter might be used.

Since VM_WIPEONFORK is set on the opaque state's mapping, the key and related
state is wiped during a fork(), so secrets don't roll over into new
processes, and the same state doesn't accidentally generate the same
random stream. The generation counter, as well, is always >0, so that
the 0 counter is a useful indication of a fork() or otherwise
uninitialized state.

If the kernel RNG is not yet initialized, then the vDSO always calls the
syscall, because that behavior cannot be emulated in userspace, but
fortunately that state is short lived and only during early boot. If it
has been initialized, then there is no need to inspect the `flags`
argument, because the behavior does not change post-initialization
regardless of the `flags` value.

Since the opaque state passed to it is mutated, vDSO getrandom() is not
reentrant, when used with the same opaque state, which libc should be
mindful of.

vgetrandom_alloc() and vDSO getrandom() provide the ability for
userspace to generate random bytes quickly and safely, and are intended
to be integrated into libc's thread management. As an illustrative
example, together with the example code from "random: add
vgetrandom_alloc() syscall", the following code might be used to do the
same outside of libc. In a libc, only the non-static vgetrandom()
function at the end would be exported as part of a getrandom()
implementation, and the various pthread-isms are expected to be elided
into libc internals.

static struct {
ssize_t(*fn)(void *buf, size_t len, unsigned long flags, void *state);
pthread_key_t key;
pthread_once_t initialized;
} grnd_ctx = {
.initialized = PTHREAD_ONCE_INIT
};

static void vgetrandom_init(void)
{
if (pthread_key_create(&grnd_ctx.key, vgetrandom_put_state) != 0)
return;
grnd_ctx.fn = vdso_sym("LINUX_2.6", "__vdso_getrandom");
}

ssize_t vgetrandom(void *buf, size_t len, unsigned long flags)
{
void *state;

pthread_once(&grnd_ctx.initialized, vgetrandom_init);
if (!grnd_ctx.fn)
return getrandom(buf, len, flags);
state = pthread_getspecific(grnd_ctx.key);
if (!state) {
state = vgetrandom_get_state();
if (pthread_setspecific(grnd_ctx.key, state) != 0) {
vgetrandom_put_state(state);
state = NULL;
}
if (!state)
return getrandom(buf, len, flags);
}
return grnd_ctx.fn(buf, len, flags, state);
}

Signed-off-by: Jason A. Donenfeld <[email protected]>
---
MAINTAINERS | 1 +
drivers/char/random.c | 11 +
include/vdso/datapage.h | 12 +
include/vdso/getrandom.h | 32 +-
include/vdso/types.h | 35 +++
lib/vdso/getrandom.c | 224 ++++++++++++++
tools/testing/selftests/vDSO/.gitignore | 1 +
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_test_getrandom.c | 283 ++++++++++++++++++
9 files changed, 599 insertions(+), 2 deletions(-)
create mode 100644 include/vdso/types.h
create mode 100644 lib/vdso/getrandom.c
create mode 100644 tools/testing/selftests/vDSO/vdso_test_getrandom.c

diff --git a/MAINTAINERS b/MAINTAINERS
index e2a1c0b26fc8..045232eba6ce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17522,6 +17522,7 @@ S: Maintained
F: drivers/char/random.c
F: drivers/virt/vmgenid.c
F: include/vdso/getrandom.h
+F: lib/vdso/getrandom.c

RAPIDIO SUBSYSTEM
M: Matt Porter <[email protected]>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 6425f5f838e0..f3a45711166c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -60,6 +60,7 @@
#include <crypto/blake2s.h>
#ifdef CONFIG_VDSO_GETRANDOM
#include <vdso/getrandom.h>
+#include <vdso/datapage.h>
#endif
#include <asm/archrandom.h>
#include <asm/processor.h>
@@ -407,6 +408,13 @@ static void crng_reseed(struct work_struct *work)
if (next_gen == ULONG_MAX)
++next_gen;
WRITE_ONCE(base_crng.generation, next_gen);
+#ifdef CONFIG_VDSO_GETRANDOM
+ /* base_crng.generation's invalid value is ULONG_MAX, while
+ * _vdso_rng_data.generation's invalid value is 0, so add one to the
+ * former to arrive at the latter.
+ */
+ smp_store_release(&_vdso_rng_data.generation, next_gen + 1);
+#endif
if (!static_branch_likely(&crng_is_ready))
crng_init = CRNG_READY;
spin_unlock_irqrestore(&base_crng.lock, flags);
@@ -857,6 +865,9 @@ static void __cold _credit_init_bits(size_t bits)
if (static_key_initialized)
execute_in_process_context(crng_set_ready, &set_ready);
atomic_notifier_call_chain(&random_ready_notifier, 0, NULL);
+#ifdef CONFIG_VDSO_GETRANDOM
+ smp_store_release(&_vdso_rng_data.is_ready, true);
+#endif
wake_up_interruptible(&crng_init_wait);
kill_fasync(&fasync, SIGIO, POLL_IN);
pr_notice("crng init done\n");
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 73eb622e7663..d1f800c1c718 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -18,6 +18,7 @@
#include <vdso/time.h>
#include <vdso/time32.h>
#include <vdso/time64.h>
+#include <vdso/types.h>

#ifdef CONFIG_ARCH_HAS_VDSO_DATA
#include <asm/vdso/data.h>
@@ -109,6 +110,16 @@ struct vdso_data {
struct arch_vdso_data arch_data;
};

+/**
+ * struct vdso_rng_data - vdso RNG state information
+ * @generation: counter representing the number of RNG reseeds
+ * @is_ready: boolean signaling whether the RNG is initialized
+ */
+struct vdso_rng_data {
+ vdso_kernel_ulong generation;
+ u8 is_ready;
+};
+
/*
* We use the hidden visibility to prevent the compiler from generating a GOT
* relocation. Not only is going through a GOT useless (the entry couldn't and
@@ -120,6 +131,7 @@ struct vdso_data {
*/
extern struct vdso_data _vdso_data[CS_BASES] __attribute__((visibility("hidden")));
extern struct vdso_data _timens_data[CS_BASES] __attribute__((visibility("hidden")));
+extern struct vdso_rng_data _vdso_rng_data __attribute__((visibility("hidden")));

/*
* The generic vDSO implementation requires that gettimeofday.h
diff --git a/include/vdso/getrandom.h b/include/vdso/getrandom.h
index e3ceb1976386..7dc93d5f72dc 100644
--- a/include/vdso/getrandom.h
+++ b/include/vdso/getrandom.h
@@ -6,11 +6,39 @@
#ifndef _VDSO_GETRANDOM_H
#define _VDSO_GETRANDOM_H

+#include <crypto/chacha.h>
+#include <vdso/types.h>
+
/**
* struct vgetrandom_state - State used by vDSO getrandom() and allocated by vgetrandom_alloc().
*
- * Currently empty, as the vDSO getrandom() function has not yet been implemented.
+ * @batch: One and a half ChaCha20 blocks of buffered RNG output.
+ *
+ * @key: Key to be used for generating next batch.
+ *
+ * @batch_key: Union of the prior two members, which is exactly two full
+ * ChaCha20 blocks in size, so that @batch and @key can be filled
+ * together.
+ *
+ * @generation: Snapshot of @rng_info->generation in the vDSO data page at
+ * the time @key was generated.
+ *
+ * @pos: Offset into @batch of the next available random byte.
+ *
+ * @in_use: Reentrancy guard for reusing a state within the same thread
+ * due to signal handlers.
*/
-struct vgetrandom_state { int placeholder; };
+struct vgetrandom_state {
+ union {
+ struct {
+ u8 batch[CHACHA_BLOCK_SIZE * 3 / 2];
+ u32 key[CHACHA_KEY_SIZE / sizeof(u32)];
+ };
+ u8 batch_key[CHACHA_BLOCK_SIZE * 2];
+ };
+ vdso_kernel_ulong generation;
+ u8 pos;
+ bool in_use;
+};

#endif /* _VDSO_GETRANDOM_H */
diff --git a/include/vdso/types.h b/include/vdso/types.h
new file mode 100644
index 000000000000..ce131463aeff
--- /dev/null
+++ b/include/vdso/types.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+#ifndef __VDSO_TYPES_H
+#define __VDSO_TYPES_H
+
+#include <linux/types.h>
+
+/**
+ * type vdso_kernel_ulong - unsigned long type that matches kernel's unsigned long
+ *
+ * Data shared between userspace and the kernel must operate the same way in both 64-bit code and in
+ * 32-bit compat code, over the same potentially 64-bit kernel. This type represents the size of an
+ * unsigned long as used by kernel code. This isn't necessarily the same as an unsigned long as used
+ * by userspace, however.
+ *
+ *                 +-------------------+-------------------+------------------+-------------------+
+ *                 | 32-bit userspace  | 32-bit userspace  | 64-bit userspace | 64-bit userspace  |
+ *                 |   unsigned long   | vdso_kernel_ulong |  unsigned long   | vdso_kernel_ulong |
+ * +---------------+-------------------+-------------------+------------------+-------------------+
+ * | 32-bit kernel |              ✓ same size              |             ✓ same size              |
+ * | unsigned long |                                       |                                      |
+ * +---------------+-------------------+-------------------+------------------+-------------------+
+ * | 64-bit kernel | ✘ different size! |    ✓ same size    |   ✓ same size    |    ✓ same size    |
+ * | unsigned long |                   |                   |                  |                   |
+ * +---------------+-------------------+-------------------+------------------+-------------------+
+ */
+#ifdef CONFIG_64BIT
+typedef u64 vdso_kernel_ulong;
+#else
+typedef u32 vdso_kernel_ulong;
+#endif
+
+#endif /* __VDSO_TYPES_H */
diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
new file mode 100644
index 000000000000..64ab28913a23
--- /dev/null
+++ b/lib/vdso/getrandom.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#include <linux/cache.h>
+#include <linux/kernel.h>
+#include <linux/time64.h>
+#include <vdso/datapage.h>
+#include <vdso/getrandom.h>
+#include <asm/vdso/getrandom.h>
+#include <asm/vdso/vsyscall.h>
+
+#define MEMCPY_AND_ZERO_SRC(type, dst, src, len) do { \
+ while (len >= sizeof(type)) { \
+ __put_unaligned_t(type, __get_unaligned_t(type, src), dst); \
+ __put_unaligned_t(type, 0, src); \
+ dst += sizeof(type); \
+ src += sizeof(type); \
+ len -= sizeof(type); \
+ } \
+} while (0)
+
+static void memcpy_and_zero_src(void *dst, void *src, size_t len)
+{
+ if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+ if (IS_ENABLED(CONFIG_64BIT))
+ MEMCPY_AND_ZERO_SRC(u64, dst, src, len);
+ MEMCPY_AND_ZERO_SRC(u32, dst, src, len);
+ MEMCPY_AND_ZERO_SRC(u16, dst, src, len);
+ }
+ MEMCPY_AND_ZERO_SRC(u8, dst, src, len);
+}
+
+/**
+ * __cvdso_getrandom_data - Generic vDSO implementation of getrandom() syscall.
+ * @rng_info: Describes state of kernel RNG, memory shared with kernel.
+ * @buffer: Destination buffer to fill with random bytes.
+ * @len: Size of @buffer in bytes.
+ * @flags: Zero or more GRND_* flags.
+ * @opaque_state: Pointer to an opaque state area.
+ *
+ * This implements a "fast key erasure" RNG using ChaCha20, in the same way that the kernel's
+ * getrandom() syscall does. It periodically reseeds its key from the kernel's RNG, at the same
+ * schedule that the kernel's RNG is reseeded. If the kernel's RNG is not ready, then this always
+ * calls into the syscall.
+ *
+ * @opaque_state *must* be allocated using the vgetrandom_alloc() syscall. Unless external locking
+ * is used, one state must be allocated per thread, as it is not safe to call this function
+ * concurrently with the same @opaque_state. However, it is safe to call this using the same
+ * @opaque_state that is shared between main code and signal handling code, within the same thread.
+ *
+ * Returns the number of random bytes written to @buffer, or a negative value indicating an error.
+ */
+static __always_inline ssize_t
+__cvdso_getrandom_data(const struct vdso_rng_data *rng_info, void *buffer, size_t len,
+ unsigned int flags, void *opaque_state)
+{
+ ssize_t ret = min_t(size_t, INT_MAX & PAGE_MASK /* = MAX_RW_COUNT */, len);
+ struct vgetrandom_state *state = opaque_state;
+ size_t batch_len, nblocks, orig_len = len;
+ unsigned long current_generation;
+ void *orig_buffer = buffer;
+ u32 counter[2] = { 0 };
+ bool in_use, have_retried = false;
+
+ /* The state must not straddle a page, since pages can be zeroed at any time. */
+ if (unlikely(((unsigned long)opaque_state & ~PAGE_MASK) + sizeof(*state) > PAGE_SIZE))
+ goto fallback_syscall;
+
+ /*
+ * If the kernel's RNG is not yet ready, then it's not possible to provide random bytes from
+ * userspace, because A) the various @flags require this to block, or not, depending on
+ * various factors unavailable to userspace, and B) the kernel's behavior before the RNG is
+ * ready is to reseed from the entropy pool at every invocation.
+ */
+ if (unlikely(!READ_ONCE(rng_info->is_ready)))
+ goto fallback_syscall;
+
+ /*
+ * This condition is checked after @rng_info->is_ready, because before the kernel's RNG is
+ * initialized, the @flags parameter may require this to block or return an error, even when
+ * len is zero.
+ */
+ if (unlikely(!len))
+ return 0;
+
+ /*
+ * @state->in_use is basic reentrancy protection against this running in a signal handler
+ * with the same @opaque_state, but obviously not atomic wrt multiple CPUs or more than one
+ * level of reentrancy. If a signal interrupts this after reading @state->in_use, but before
+ * writing @state->in_use, there is still no race, because the signal handler will run to
+ * its completion before returning execution.
+ */
+ in_use = READ_ONCE(state->in_use);
+ if (unlikely(in_use))
+ goto fallback_syscall;
+ WRITE_ONCE(state->in_use, true);
+
+retry_generation:
+ /*
+ * @rng_info->generation must always be read here, as it serializes @state->key with the
+ * kernel's RNG reseeding schedule.
+ */
+ current_generation = READ_ONCE(rng_info->generation);
+
+ /*
+ * If @state->generation doesn't match the kernel RNG's generation, then it means the
+ * kernel's RNG has reseeded, and so @state->key is reseeded as well.
+ */
+ if (unlikely(state->generation != current_generation)) {
+ /*
+ * Write the generation before filling the key, in case of fork. If there is a fork
+ * just after this line, the two forks will get different random bytes from the
+ * syscall, which is good. However, were this line to occur after the getrandom
+ * syscall, then both child and parent could have the same bytes and the same
+ * generation counter, so the fork would not be detected. Therefore, write
+ * @state->generation before the call to the getrandom syscall.
+ */
+ WRITE_ONCE(state->generation, current_generation);
+
+ /* Prevent the syscall from being reordered wrt current_generation. */
+ barrier();
+
+ /* Reseed @state->key using fresh bytes from the kernel. */
+ if (getrandom_syscall(state->key, sizeof(state->key), 0) != sizeof(state->key)) {
+ /*
+ * If the syscall failed to refresh the key, then @state->key is now
+ * invalid, so invalidate the generation so that it is not used again, and
+ * fallback to using the syscall entirely.
+ */
+ WRITE_ONCE(state->generation, 0);
+
+ /*
+ * Set @state->in_use to false only after the last write to @state in the
+ * line above.
+ */
+ WRITE_ONCE(state->in_use, false);
+
+ goto fallback_syscall;
+ }
+
+ /*
+ * Set @state->pos to beyond the end of the batch, so that the batch is refilled
+ * using the new key.
+ */
+ state->pos = sizeof(state->batch);
+ }
+
+ /* Set @len to the total number of bytes this function is allowed to produce, which is @ret. */
+ len = ret;
+more_batch:
+ /*
+ * First use bytes out of @state->batch, which may have been filled by the last call to this
+ * function.
+ */
+ batch_len = min_t(size_t, sizeof(state->batch) - state->pos, len);
+ if (batch_len) {
+ /* Zeroing at the same time as memcpying helps preserve forward secrecy. */
+ memcpy_and_zero_src(buffer, state->batch + state->pos, batch_len);
+ state->pos += batch_len;
+ buffer += batch_len;
+ len -= batch_len;
+ }
+
+ if (!len) {
+ /* Prevent the loop from being reordered wrt ->generation. */
+ barrier();
+
+ /*
+ * Since @rng_info->generation will never be 0, re-read @state->generation, rather
+ * than using the local current_generation variable, to learn whether a fork
+ * occurred or if @state was zeroed due to memory pressure. Primarily, though, this
+ * indicates whether the kernel's RNG has reseeded, in which case generate a new key
+ * and start over.
+ */
+ if (unlikely(READ_ONCE(state->generation) != READ_ONCE(rng_info->generation))) {
+ /*
+ * Prevent this from looping forever in case of low memory or racing with a
+ * user force-reseeding the kernel's RNG using the ioctl.
+ */
+ if (have_retried)
+ goto fallback_syscall;
+
+ have_retried = true;
+ buffer = orig_buffer;
+ goto retry_generation;
+ }
+
+ /*
+ * Set @state->in_use to false only when there will be no more reads or writes of
+ * @state.
+ */
+ WRITE_ONCE(state->in_use, false);
+ return ret;
+ }
+
+ /* Generate blocks of RNG output directly into @buffer while there's enough room left. */
+ nblocks = len / CHACHA_BLOCK_SIZE;
+ if (nblocks) {
+ __arch_chacha20_blocks_nostack(buffer, state->key, counter, nblocks);
+ buffer += nblocks * CHACHA_BLOCK_SIZE;
+ len -= nblocks * CHACHA_BLOCK_SIZE;
+ }
+
+ BUILD_BUG_ON(sizeof(state->batch_key) % CHACHA_BLOCK_SIZE != 0);
+
+ /* Refill the batch and then overwrite the key, in order to preserve forward secrecy. */
+ __arch_chacha20_blocks_nostack(state->batch_key, state->key, counter,
+ sizeof(state->batch_key) / CHACHA_BLOCK_SIZE);
+
+ /* Since the batch was just refilled, set the position back to 0 to indicate a full batch. */
+ state->pos = 0;
+ goto more_batch;
+
+fallback_syscall:
+ return getrandom_syscall(orig_buffer, orig_len, flags);
+}
+
+static __always_inline ssize_t
+__cvdso_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state)
+{
+ return __cvdso_getrandom_data(__arch_get_vdso_rng_data(), buffer, len, flags, opaque_state);
+}
diff --git a/tools/testing/selftests/vDSO/.gitignore b/tools/testing/selftests/vDSO/.gitignore
index a8dc51af5a9c..7dbfdec53f3d 100644
--- a/tools/testing/selftests/vDSO/.gitignore
+++ b/tools/testing/selftests/vDSO/.gitignore
@@ -6,3 +6,4 @@ vdso_test_correctness
vdso_test_gettimeofday
vdso_test_getcpu
vdso_standalone_test_x86
+vdso_test_getrandom
diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index d53a4d8008f9..a33b4d200a32 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -11,6 +11,7 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64))
TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
endif
TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_getrandom

CFLAGS := -std=gnu99
CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
@@ -33,3 +34,4 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
vdso_test_correctness.c \
-o $@ \
$(LDFLAGS_vdso_test_correctness)
+$(OUTPUT)/vdso_test_getrandom: parse_vdso.c
diff --git a/tools/testing/selftests/vDSO/vdso_test_getrandom.c b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
new file mode 100644
index 000000000000..7184e633cec7
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#include <assert.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <sys/auxv.h>
+#include <sys/mman.h>
+#include <sys/random.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "parse_vdso.h"
+
+#ifndef timespecsub
+#define timespecsub(tsp, usp, vsp) \
+ do { \
+ (vsp)->tv_sec = (tsp)->tv_sec - (usp)->tv_sec; \
+ (vsp)->tv_nsec = (tsp)->tv_nsec - (usp)->tv_nsec; \
+ if ((vsp)->tv_nsec < 0) { \
+ (vsp)->tv_sec--; \
+ (vsp)->tv_nsec += 1000000000L; \
+ } \
+ } while (0)
+#endif
+
+static void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each)
+{
+ enum { __NR_vgetrandom_alloc = 451 };
+ *size_per_each = 0;
+ return (void *)syscall(__NR_vgetrandom_alloc, num, size_per_each, 0, 0);
+}
+
+static struct {
+ pthread_mutex_t lock;
+ void **states;
+ size_t len, cap;
+} grnd_allocator = {
+ .lock = PTHREAD_MUTEX_INITIALIZER
+};
+
+static void *vgetrandom_get_state(void)
+{
+ void *state = NULL;
+
+ pthread_mutex_lock(&grnd_allocator.lock);
+ if (!grnd_allocator.len) {
+ size_t new_cap;
+ size_t page_size = getpagesize();
+ unsigned int num = sysconf(_SC_NPROCESSORS_ONLN); /* Could be arbitrary, just a hint. */
+ unsigned int size_per_each;
+ void *new_block = vgetrandom_alloc(&num, &size_per_each);
+ void *new_states;
+
+ if (new_block == MAP_FAILED)
+ goto out;
+ new_cap = grnd_allocator.cap + num;
+ new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
+ if (!new_states) {
+ munmap(new_block, num * size_per_each);
+ goto out;
+ }
+ grnd_allocator.cap = new_cap;
+ grnd_allocator.states = new_states;
+
+ for (size_t i = 0; i < num; ++i) {
+ grnd_allocator.states[i] = new_block;
+ if (((uintptr_t)new_block & (page_size - 1)) + size_per_each > page_size)
+ new_block = (void *)(((uintptr_t)new_block + page_size) & ~(page_size - 1));
+ else
+ new_block += size_per_each;
+ }
+ grnd_allocator.len = num;
+ }
+ state = grnd_allocator.states[--grnd_allocator.len];
+
+out:
+ pthread_mutex_unlock(&grnd_allocator.lock);
+ return state;
+}
+
+static void vgetrandom_put_state(void *state)
+{
+ if (!state)
+ return;
+ pthread_mutex_lock(&grnd_allocator.lock);
+ grnd_allocator.states[grnd_allocator.len++] = state;
+ pthread_mutex_unlock(&grnd_allocator.lock);
+}
+
+static struct {
+ ssize_t(*fn)(void *buf, size_t len, unsigned long flags, void *state);
+ pthread_key_t key;
+ pthread_once_t initialized;
+} grnd_ctx = {
+ .initialized = PTHREAD_ONCE_INIT
+};
+
+static void vgetrandom_init(void)
+{
+ if (pthread_key_create(&grnd_ctx.key, vgetrandom_put_state) != 0)
+ return;
+ unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
+ if (!sysinfo_ehdr) {
+ printf("AT_SYSINFO_EHDR is not present!\n");
+ exit(KSFT_SKIP);
+ }
+ vdso_init_from_sysinfo_ehdr(sysinfo_ehdr);
+ grnd_ctx.fn = (__typeof__(grnd_ctx.fn))vdso_sym("LINUX_2.6", "__vdso_getrandom");
+ if (!grnd_ctx.fn) {
+ printf("__vdso_getrandom is missing!\n");
+ exit(KSFT_FAIL);
+ }
+}
+
+static ssize_t vgetrandom(void *buf, size_t len, unsigned long flags)
+{
+ void *state;
+
+ pthread_once(&grnd_ctx.initialized, vgetrandom_init);
+ state = pthread_getspecific(grnd_ctx.key);
+ if (!state) {
+ state = vgetrandom_get_state();
+ if (pthread_setspecific(grnd_ctx.key, state) != 0) {
+ vgetrandom_put_state(state);
+ state = NULL;
+ }
+ if (!state) {
+ printf("vgetrandom_get_state failed!\n");
+ exit(KSFT_FAIL);
+ }
+ }
+ return grnd_ctx.fn(buf, len, flags, state);
+}
+
+enum { TRIALS = 25000000, THREADS = 256 };
+
+static void *test_vdso_getrandom(void *ctx)
+{
+ for (size_t i = 0; i < TRIALS; ++i) {
+ unsigned int val;
+ ssize_t ret = vgetrandom(&val, sizeof(val), 0);
+ assert(ret == sizeof(val));
+ }
+ return NULL;
+}
+
+static void *test_libc_getrandom(void *ctx)
+{
+ for (size_t i = 0; i < TRIALS; ++i) {
+ unsigned int val;
+ ssize_t ret = getrandom(&val, sizeof(val), 0);
+ assert(ret == sizeof(val));
+ }
+ return NULL;
+}
+
+static void *test_syscall_getrandom(void *ctx)
+{
+ for (size_t i = 0; i < TRIALS; ++i) {
+ unsigned int val;
+ ssize_t ret = syscall(SYS_getrandom, &val, sizeof(val), 0);
+ assert(ret == sizeof(val));
+ }
+ return NULL;
+}
+
+static void bench_single(void)
+{
+ struct timespec start, end, diff;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ test_vdso_getrandom(NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf(" vdso: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ test_libc_getrandom(NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf(" libc: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ test_syscall_getrandom(NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf("syscall: %u times in %lu.%09lu seconds\n", TRIALS, diff.tv_sec, diff.tv_nsec);
+}
+
+static void bench_multi(void)
+{
+ struct timespec start, end, diff;
+ pthread_t threads[THREADS];
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (size_t i = 0; i < THREADS; ++i)
+ assert(pthread_create(&threads[i], NULL, test_vdso_getrandom, NULL) == 0);
+ for (size_t i = 0; i < THREADS; ++i)
+ pthread_join(threads[i], NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf(" vdso: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (size_t i = 0; i < THREADS; ++i)
+ assert(pthread_create(&threads[i], NULL, test_libc_getrandom, NULL) == 0);
+ for (size_t i = 0; i < THREADS; ++i)
+ pthread_join(threads[i], NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf(" libc: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (size_t i = 0; i < THREADS; ++i)
+ assert(pthread_create(&threads[i], NULL, test_syscall_getrandom, NULL) == 0);
+ for (size_t i = 0; i < THREADS; ++i)
+ pthread_join(threads[i], NULL);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ timespecsub(&end, &start, &diff);
+ printf(" syscall: %u x %u times in %lu.%09lu seconds\n", TRIALS, THREADS, diff.tv_sec, diff.tv_nsec);
+}
+
+static void fill(void)
+{
+ uint8_t weird_size[323929];
+ for (;;)
+ vgetrandom(weird_size, sizeof(weird_size), 0);
+}
+
+static void kselftest(void)
+{
+ uint8_t weird_size[1263];
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ for (size_t i = 0; i < 1000; ++i) {
+ ssize_t ret = vgetrandom(weird_size, sizeof(weird_size), 0);
+ if (ret != sizeof(weird_size))
+ exit(KSFT_FAIL);
+ }
+
+ ksft_test_result_pass("getrandom: PASS\n");
+ exit(KSFT_PASS);
+}
+
+static void usage(const char *argv0)
+{
+ fprintf(stderr, "Usage: %s [bench-single|bench-multi|fill]\n", argv0);
+}
+
+int main(int argc, char *argv[])
+{
+ if (argc == 1) {
+ kselftest();
+ return 0;
+ }
+
+ if (argc != 2) {
+ usage(argv[0]);
+ return 1;
+ }
+ if (!strcmp(argv[1], "bench-single"))
+ bench_single();
+ else if (!strcmp(argv[1], "bench-multi"))
+ bench_multi();
+ else if (!strcmp(argv[1], "fill"))
+ fill();
+ else {
+ usage(argv[0]);
+ return 1;
+ }
+ return 0;
+}
--
2.39.0

2023-01-01 16:31:33

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH v14 5/7] arch: allocate vgetrandom_alloc() syscall number

Add vgetrandom_alloc() as syscall 451 (or 561 on alpha) by adding it to
all of the various syscall.tbl and unistd.h files.
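
For readers who want to exercise the new number directly, here is a minimal
userspace sketch. The prototype mirrors the wrapper used in the
vdso_test_getrandom selftest elsewhere in this series; the error handling and
printf are purely illustrative and not part of this patch:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_vgetrandom_alloc
#define __NR_vgetrandom_alloc 451 /* 561 on alpha */
#endif

int main(void)
{
	unsigned int num = 1, size_per_each = 0;

	/* The reserved addr and flags arguments must currently be zero. */
	void *states = (void *)syscall(__NR_vgetrandom_alloc, &num, &size_per_each, 0UL, 0U);

	if (states == MAP_FAILED) {
		perror("vgetrandom_alloc");
		return 1;
	}
	printf("allocated %u opaque states of %u bytes each\n", num, size_per_each);
	return 0;
}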

Acked-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/uapi/asm-generic/unistd.h | 5 ++++-
tools/include/uapi/asm-generic/unistd.h | 5 ++++-
tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 1 +
tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 1 +
tools/perf/arch/s390/entry/syscalls/syscall.tbl | 1 +
tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 1 +
24 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..a4bfd7b53d6f 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@
558 common process_mrelease sys_process_mrelease
559 common futex_waitv sys_futex_waitv
560 common set_mempolicy_home_node sys_ni_syscall
+561 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..e10319cc6c3e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 037feba03a51..64a514f90131 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 451
+#define __NR_compat_syscalls 452
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 604a2053d006..7285b5a830cc 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -907,6 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..5ed9667051fc 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..d9e7ea26dd26 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..c109e307a37b 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 253ff994ed2e..6d47d8231f7d 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -389,3 +389,4 @@
448 n32 process_mrelease sys_process_mrelease
449 n32 futex_waitv sys_futex_waitv
450 n32 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 n32 vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 3f1886ad9d80..890e5b51e1fc 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -365,3 +365,4 @@
448 n64 process_mrelease sys_process_mrelease
449 n64 futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 n64 vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 8f243e35a7b2..de512de148f5 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -438,3 +438,4 @@
448 o32 process_mrelease sys_process_mrelease
449 o32 futex_waitv sys_futex_waitv
450 o32 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 o32 vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 0e42fceb2d5e..d2928da55d5e 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -448,3 +448,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index a0be127475b1..e6c04eda2363 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -537,3 +537,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 799147658dee..5b0b2bea46da 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
448 common process_mrelease sys_process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 2de85c977f54..631f0bac0e9a 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4398cc6fb68d..b4925978adea 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..f5f863a33824 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 i386 vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..0186f173f0e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 52c94ab5c205..14d63a119cc2 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -421,3 +421,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..9d2e299f3e8a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..9d2e299f3e8a 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 3f1886ad9d80..890e5b51e1fc 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -365,3 +365,4 @@
448 n64 process_mrelease sys_process_mrelease
449 n64 futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 n64 vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index a0be127475b1..e6c04eda2363 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -537,3 +537,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 799147658dee..5b0b2bea46da 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
448 common process_mrelease sys_process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc sys_vgetrandom_alloc
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..0186f173f0e8 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc

#
# Due to a historical design error, certain syscalls are numbered differently
--
2.39.0

2023-01-01 16:32:21

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH v14 7/7] x86: vdso: Wire up getrandom() vDSO implementation

Hook up the generic vDSO implementation to the x86 vDSO data page. Since
the existing vDSO infrastructure is heavily based on the timekeeping
functionality, which works over arrays of bases, a new macro is
introduced for vvars that are not arrays.

The vDSO function requires a ChaCha20 implementation that does not write
to the stack, yet can still do an entire ChaCha20 permutation, so
provide this using SSE2, since this is userland code that must work on
all x86-64 processors. There's a simple test for this code as well.
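
For context, a caller is expected to resolve and invoke the exported symbol
roughly as in the sketch below. This is only an illustration, not part of the
patch: the vdso_init_from_sysinfo_ehdr() and vdso_sym() helpers are the ones
from tools/testing/selftests/vDSO/parse_vdso.c, and the per-thread state
passed at call time must come from vgetrandom_alloc():

#include <stddef.h>
#include <sys/auxv.h>
#include <sys/types.h>
#include "parse_vdso.h"

typedef ssize_t (*vgetrandom_fn)(void *buf, size_t len, unsigned int flags, void *state);

static vgetrandom_fn resolve_vdso_getrandom(void)
{
	unsigned long ehdr = getauxval(AT_SYSINFO_EHDR);

	if (!ehdr)
		return NULL;
	vdso_init_from_sysinfo_ehdr(ehdr);
	/* Exported under the LINUX_2.6 version node added to vdso.lds.S below. */
	return (vgetrandom_fn)vdso_sym("LINUX_2.6", "__vdso_getrandom");
}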

Reviewed-by: Samuel Neves <[email protected]> # for vgetrandom-chacha.S
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/Makefile | 3 +-
arch/x86/entry/vdso/vdso.lds.S | 2 +
arch/x86/entry/vdso/vgetrandom-chacha.S | 178 ++++++++++++++++++
arch/x86/entry/vdso/vgetrandom.c | 17 ++
arch/x86/include/asm/vdso/getrandom.h | 55 ++++++
arch/x86/include/asm/vdso/vsyscall.h | 2 +
arch/x86/include/asm/vvar.h | 16 ++
tools/testing/selftests/vDSO/.gitignore | 1 +
tools/testing/selftests/vDSO/Makefile | 9 +
.../testing/selftests/vDSO/vdso_test_chacha.c | 43 +++++
11 files changed, 326 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/vdso/vgetrandom-chacha.S
create mode 100644 arch/x86/entry/vdso/vgetrandom.c
create mode 100644 arch/x86/include/asm/vdso/getrandom.h
create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..ed689d831362 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -272,6 +272,7 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
+ select VDSO_GETRANDOM if X86_64
select HOTPLUG_SMT if SMP
select IRQ_FORCED_THREADING
select NEED_PER_CPU_EMBED_FIRST_CHUNK
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 838613ac15b8..3979bb4a61ae 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -27,7 +27,7 @@ VDSO32-$(CONFIG_X86_32) := y
VDSO32-$(CONFIG_IA32_EMULATION) := y

# files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vgetrandom.o vgetrandom-chacha.o
vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
vobjs32-y += vdso32/vclock_gettime.o
vobjs-$(CONFIG_X86_SGX) += vsgx.o
@@ -105,6 +105,7 @@ CFLAGS_REMOVE_vclock_gettime.o = -pg
CFLAGS_REMOVE_vdso32/vclock_gettime.o = -pg
CFLAGS_REMOVE_vgetcpu.o = -pg
CFLAGS_REMOVE_vsgx.o = -pg
+CFLAGS_REMOVE_vgetrandom.o = -pg

#
# X32 processes use x32 vDSO to access 64bit kernel data.
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index e8c60ae7a7c8..0bab5f4af6d1 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -30,6 +30,8 @@ VERSION {
#ifdef CONFIG_X86_SGX
__vdso_sgx_enter_enclave;
#endif
+ getrandom;
+ __vdso_getrandom;
local: *;
};
}
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
new file mode 100644
index 000000000000..d79e2bd97598
--- /dev/null
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+.section .rodata, "a"
+.align 16
+CONSTANTS: .octa 0x6b20657479622d323320646e61707865
+.text
+
+/*
+ * Very basic SSE2 implementation of ChaCha20. Produces a given positive number
+ * of blocks of output with a nonce of 0, taking an input key and 8-byte
+ * counter. Importantly does not spill to the stack. Its arguments are:
+ *
+ * rdi: output bytes
+ * rsi: 32-byte key input
+ * rdx: 8-byte counter input/output
+ * rcx: number of 64-byte blocks to write to output
+ */
+SYM_FUNC_START(__arch_chacha20_blocks_nostack)
+
+.set output, %rdi
+.set key, %rsi
+.set counter, %rdx
+.set nblocks, %rcx
+.set i, %al
+/* xmm registers are *not* callee-save. */
+.set state0, %xmm0
+.set state1, %xmm1
+.set state2, %xmm2
+.set state3, %xmm3
+.set copy0, %xmm4
+.set copy1, %xmm5
+.set copy2, %xmm6
+.set copy3, %xmm7
+.set temp, %xmm8
+.set one, %xmm9
+
+ /* copy0 = "expand 32-byte k" */
+ movaps CONSTANTS(%rip),copy0
+ /* copy1,copy2 = key */
+ movups 0x00(key),copy1
+ movups 0x10(key),copy2
+ /* copy3 = counter || zero nonce */
+ movq 0x00(counter),copy3
+ /* one = 1 || 0 */
+ movq $1,%rax
+ movq %rax,one
+
+.Lblock:
+ /* state0,state1,state2,state3 = copy0,copy1,copy2,copy3 */
+ movdqa copy0,state0
+ movdqa copy1,state1
+ movdqa copy2,state2
+ movdqa copy3,state3
+
+ movb $10,i
+.Lpermute:
+ /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+ paddd state1,state0
+ pxor state0,state3
+ movdqa state3,temp
+ pslld $16,temp
+ psrld $16,state3
+ por temp,state3
+
+ /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+ paddd state3,state2
+ pxor state2,state1
+ movdqa state1,temp
+ pslld $12,temp
+ psrld $20,state1
+ por temp,state1
+
+ /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+ paddd state1,state0
+ pxor state0,state3
+ movdqa state3,temp
+ pslld $8,temp
+ psrld $24,state3
+ por temp,state3
+
+ /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+ paddd state3,state2
+ pxor state2,state1
+ movdqa state1,temp
+ pslld $7,temp
+ psrld $25,state1
+ por temp,state1
+
+ /* state1[0,1,2,3] = state1[1,2,3,0] */
+ pshufd $0x39,state1,state1
+ /* state2[0,1,2,3] = state2[2,3,0,1] */
+ pshufd $0x4e,state2,state2
+ /* state3[0,1,2,3] = state3[3,0,1,2] */
+ pshufd $0x93,state3,state3
+
+ /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
+ paddd state1,state0
+ pxor state0,state3
+ movdqa state3,temp
+ pslld $16,temp
+ psrld $16,state3
+ por temp,state3
+
+ /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
+ paddd state3,state2
+ pxor state2,state1
+ movdqa state1,temp
+ pslld $12,temp
+ psrld $20,state1
+ por temp,state1
+
+ /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
+ paddd state1,state0
+ pxor state0,state3
+ movdqa state3,temp
+ pslld $8,temp
+ psrld $24,state3
+ por temp,state3
+
+ /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
+ paddd state3,state2
+ pxor state2,state1
+ movdqa state1,temp
+ pslld $7,temp
+ psrld $25,state1
+ por temp,state1
+
+ /* state1[0,1,2,3] = state1[3,0,1,2] */
+ pshufd $0x93,state1,state1
+ /* state2[0,1,2,3] = state2[2,3,0,1] */
+ pshufd $0x4e,state2,state2
+ /* state3[0,1,2,3] = state3[1,2,3,0] */
+ pshufd $0x39,state3,state3
+
+ decb i
+ jnz .Lpermute
+
+ /* output0 = state0 + copy0 */
+ paddd copy0,state0
+ movups state0,0x00(output)
+ /* output1 = state1 + copy1 */
+ paddd copy1,state1
+ movups state1,0x10(output)
+ /* output2 = state2 + copy2 */
+ paddd copy2,state2
+ movups state2,0x20(output)
+ /* output3 = state3 + copy3 */
+ paddd copy3,state3
+ movups state3,0x30(output)
+
+ /* ++copy3.counter */
+ paddq one,copy3
+
+ /* output += 64, --nblocks */
+ addq $64,output
+ decq nblocks
+ jnz .Lblock
+
+ /* counter = copy3.counter */
+ movq copy3,0x00(counter)
+
+ /* Zero out the potentially sensitive regs, in case nothing uses these again. */
+ pxor state0,state0
+ pxor state1,state1
+ pxor state2,state2
+ pxor state3,state3
+ pxor copy1,copy1
+ pxor copy2,copy2
+ pxor temp,temp
+
+ ret
+SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vgetrandom.c b/arch/x86/entry/vdso/vgetrandom.c
new file mode 100644
index 000000000000..6045ded5da90
--- /dev/null
+++ b/arch/x86/entry/vdso/vgetrandom.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+#include <linux/types.h>
+
+#include "../../../../lib/vdso/getrandom.c"
+
+ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *state);
+
+ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *state)
+{
+ return __cvdso_getrandom(buffer, len, flags, state);
+}
+
+ssize_t getrandom(void *, size_t, unsigned int, void *)
+ __attribute__((weak, alias("__vdso_getrandom")));
diff --git a/arch/x86/include/asm/vdso/getrandom.h b/arch/x86/include/asm/vdso/getrandom.h
new file mode 100644
index 000000000000..46f99d735ae6
--- /dev/null
+++ b/arch/x86/include/asm/vdso/getrandom.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+#ifndef __ASM_VDSO_GETRANDOM_H
+#define __ASM_VDSO_GETRANDOM_H
+
+#ifndef __ASSEMBLY__
+
+#include <asm/unistd.h>
+#include <asm/vvar.h>
+
+/**
+ * getrandom_syscall - Invoke the getrandom() syscall.
+ * @buffer: Destination buffer to fill with random bytes.
+ * @len: Size of @buffer in bytes.
+ * @flags: Zero or more GRND_* flags.
+ * Returns the number of random bytes written to @buffer, or a negative value indicating an error.
+ */
+static __always_inline ssize_t getrandom_syscall(void *buffer, size_t len, unsigned int flags)
+{
+ long ret;
+
+ asm ("syscall" : "=a" (ret) :
+ "0" (__NR_getrandom), "D" (buffer), "S" (len), "d" (flags) :
+ "rcx", "r11", "memory");
+
+ return ret;
+}
+
+#define __vdso_rng_data (VVAR(_vdso_rng_data))
+
+static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(void)
+{
+ if (__vdso_data->clock_mode == VDSO_CLOCKMODE_TIMENS)
+ return (void *)&__vdso_rng_data + ((void *)&__timens_vdso_data - (void *)&__vdso_data);
+ return &__vdso_rng_data;
+}
+
+/**
+ * __arch_chacha20_blocks_nostack - Generate ChaCha20 stream without using the stack.
+ * @dst_bytes: Destination buffer to hold @nblocks * 64 bytes of output.
+ * @key: 32-byte input key.
+ * @counter: 8-byte counter, read on input and updated on return.
+ * @nblocks: Number of blocks to generate.
+ *
+ * Generates a given positive number of blocks of ChaCha20 output with nonce=0, and does not write
+ * to any stack or memory outside of the parameters passed to it, in order to mitigate stack data
+ * leaking into forked child processes.
+ */
+extern void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key, u32 *counter, size_t nblocks);
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __ASM_VDSO_GETRANDOM_H */
diff --git a/arch/x86/include/asm/vdso/vsyscall.h b/arch/x86/include/asm/vdso/vsyscall.h
index be199a9b2676..71c56586a22f 100644
--- a/arch/x86/include/asm/vdso/vsyscall.h
+++ b/arch/x86/include/asm/vdso/vsyscall.h
@@ -11,6 +11,8 @@
#include <asm/vvar.h>

DEFINE_VVAR(struct vdso_data, _vdso_data);
+DEFINE_VVAR_SINGLE(struct vdso_rng_data, _vdso_rng_data);
+
/*
* Update the vDSO data page to keep in sync with kernel timekeeping.
*/
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 183e98e49ab9..9d9af37f7cab 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -26,6 +26,8 @@
*/
#define DECLARE_VVAR(offset, type, name) \
EMIT_VVAR(name, offset)
+#define DECLARE_VVAR_SINGLE(offset, type, name) \
+ EMIT_VVAR(name, offset)

#else

@@ -37,6 +39,10 @@ extern char __vvar_page;
extern type timens_ ## name[CS_BASES] \
__attribute__((visibility("hidden"))); \

+#define DECLARE_VVAR_SINGLE(offset, type, name) \
+ extern type vvar_ ## name \
+ __attribute__((visibility("hidden"))); \
+
#define VVAR(name) (vvar_ ## name)
#define TIMENS(name) (timens_ ## name)

@@ -44,12 +50,22 @@ extern char __vvar_page;
type name[CS_BASES] \
__attribute__((section(".vvar_" #name), aligned(16))) __visible

+#define DEFINE_VVAR_SINGLE(type, name) \
+ type name \
+ __attribute__((section(".vvar_" #name), aligned(16))) __visible
+
#endif

/* DECLARE_VVAR(offset, type, name) */

DECLARE_VVAR(128, struct vdso_data, _vdso_data)

+#if !defined(_SINGLE_DATA)
+#define _SINGLE_DATA
+DECLARE_VVAR_SINGLE(640, struct vdso_rng_data, _vdso_rng_data)
+#endif
+
#undef DECLARE_VVAR
+#undef DECLARE_VVAR_SINGLE

#endif
diff --git a/tools/testing/selftests/vDSO/.gitignore b/tools/testing/selftests/vDSO/.gitignore
index 7dbfdec53f3d..30d5c8f0e5c7 100644
--- a/tools/testing/selftests/vDSO/.gitignore
+++ b/tools/testing/selftests/vDSO/.gitignore
@@ -7,3 +7,4 @@ vdso_test_gettimeofday
vdso_test_getcpu
vdso_standalone_test_x86
vdso_test_getrandom
+vdso_test_chacha
diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index a33b4d200a32..54a015afe60c 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -3,6 +3,7 @@ include ../lib.mk

uname_M := $(shell uname -m 2>/dev/null || echo not)
ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
+SODIUM := $(shell pkg-config --libs libsodium 2>/dev/null)

TEST_GEN_PROGS := $(OUTPUT)/vdso_test_gettimeofday $(OUTPUT)/vdso_test_getcpu
TEST_GEN_PROGS += $(OUTPUT)/vdso_test_abi
@@ -12,9 +13,15 @@ TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
endif
TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
TEST_GEN_PROGS += $(OUTPUT)/vdso_test_getrandom
+ifeq ($(uname_M),x86_64)
+ifneq ($(SODIUM),)
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_chacha
+endif
+endif

CFLAGS := -std=gnu99
CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
+CFLAGS_vdso_test_chacha := $(SODIUM) -idirafter $(top_srcdir)/include -idirafter $(top_srcdir)/arch/$(ARCH)/include -D__ASSEMBLY__ -DBUILD_VDSO -DCONFIG_FUNCTION_ALIGNMENT=0 -Wa,--noexecstack
LDFLAGS_vdso_test_correctness := -ldl
ifeq ($(CONFIG_X86_32),y)
LDLIBS += -lgcc_s
@@ -35,3 +42,5 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
-o $@ \
$(LDFLAGS_vdso_test_correctness)
$(OUTPUT)/vdso_test_getrandom: parse_vdso.c
+$(OUTPUT)/vdso_test_chacha: CFLAGS += $(CFLAGS_vdso_test_chacha)
+$(OUTPUT)/vdso_test_chacha: $(top_srcdir)/arch/$(ARCH)/entry/vdso/vgetrandom-chacha.S
diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha.c b/tools/testing/selftests/vDSO/vdso_test_chacha.c
new file mode 100644
index 000000000000..bce7a7752b11
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_chacha.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#include <sodium/crypto_stream_chacha20.h>
+#include <sys/random.h>
+#include <string.h>
+#include <stdint.h>
+#include "../kselftest.h"
+
+extern void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes, const uint8_t *key, uint32_t *counter, size_t nblocks);
+
+int main(int argc, char *argv[])
+{
+ enum { TRIALS = 1000, BLOCKS = 128, BLOCK_SIZE = 64 };
+ static const uint8_t nonce[8] = { 0 };
+ uint32_t counter[2];
+ uint8_t key[32];
+ uint8_t output1[BLOCK_SIZE * BLOCKS], output2[BLOCK_SIZE * BLOCKS];
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ for (unsigned int trial = 0; trial < TRIALS; ++trial) {
+ if (getrandom(key, sizeof(key), 0) != sizeof(key)) {
+ printf("getrandom() failed!\n");
+ return KSFT_SKIP;
+ }
+ crypto_stream_chacha20(output1, sizeof(output1), nonce, key);
+ for (unsigned int split = 0; split < BLOCKS; ++split) {
+ memset(output2, 'X', sizeof(output2));
+ memset(counter, 0, sizeof(counter));
+ if (split)
+ __arch_chacha20_blocks_nostack(output2, key, counter, split);
+ __arch_chacha20_blocks_nostack(output2 + split * BLOCK_SIZE, key, counter, BLOCKS - split);
+ if (memcmp(output1, output2, sizeof(output1)))
+ return KSFT_FAIL;
+ }
+ }
+ ksft_test_result_pass("chacha: PASS\n");
+ return KSFT_PASS;
+}
--
2.39.0

2023-01-03 10:35:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace


* Jason A. Donenfeld <[email protected]> wrote:

> Both mmiotrace.h and insn-eval.h define various MMIO_ enum constants.
> Rename the insn ones to have a INSN_ prefix, so that the headers can be
> used from the same source file.
>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> arch/x86/coco/tdx/tdx.c | 26 +++++++++++++-------------
> arch/x86/include/asm/insn-eval.h | 18 +++++++++---------
> arch/x86/kernel/sev.c | 18 +++++++++---------
> arch/x86/lib/insn-eval.c | 20 ++++++++++----------
> 4 files changed, 41 insertions(+), 41 deletions(-)

FYI, I've applied this fix to tip:x86/asm, as the fix for the namespace
clash makes sense independently of the vDSO getrandom feature.

Thanks,

Ingo

2023-01-03 10:52:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings


* Jason A. Donenfeld <[email protected]> wrote:

> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
>
> - It shouldn't be written to core dumps.
> * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
> * Easy: VM_WIPEONFORK.
>
> - It shouldn't be written to swap.
> * Uh-oh: mlock is rlimited.
> * Uh-oh: mlock isn't inherited by forks.
>
> - It shouldn't reserve actual memory, but it also shouldn't crash when
> page faulting in memory if none is available
> * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> * Uh-oh: VM_NORESERVE means segfaults.
>
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
>
> 1) Due to being wiped during fork(), the vDSO code is already robust to
> having the contents of the pages it reads zeroed out midway through
> the function's execution.
>
> 2) In the absolute worst case of whatever contingency we're coding for,
> we have the option to fallback to the getrandom() syscall, and
> everything is fine.
>
> 3) The buffers the function uses are only ever useful for a maximum of
> 60 seconds -- a sort of cache, rather than a long term allocation.
>
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
>
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
> zero when read back again).
> c) If there's not enough memory to service a page fault, it's not fatal,
> and no signal is sent. Instead, writes are simply lost.
> d) It is inherited by fork.
> e) It doesn't count against the mlock budget, since nothing is locked.
>
> This is fairly simple to implement, with the one snag that we have to
> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> consumers will probably be 64-bit anyway.
>
> This way, allocations used by vDSO getrandom() can use:
>
> VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
>
> And there will be no problem with OOMing, crashing on overcommitment,
> using memory when not in use, not wiping on fork(), coredumps, or
> writing out to swap.
>
> At the moment, rather than skipping writes on OOM, the fault handler
> just returns to userspace, and the instruction is retried. This isn't
> terrible, but it's not quite what is intended. The actual instruction
> skipping has to be implemented arch-by-arch, but so does this whole
> vDSO series, so that's fine. The following commit addresses it for x86.

Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per
arch low level work and rarely tested functionality (seriously, whose
desktop system touches swap these days?), just so we can add a few pages of
per thread vDSO data of a quirky type that in 99.999% of cases won't ever
be 'dropped' from under the functionality that is using it and will thus
bitrot fast?

The maintainability baby is being thrown out with the bath water IMO ...

And we want to add more complexity to a facility people desperately want to
trust *more*? [RNG]

What's wrong with making mlock() more usable? Or just saying that "yeah,
the vDSO can allocate a few more permanent pages outside of existing
rlimits & mlock budgets"?

The rest of the series looks fine to me, but this special one of a kind
VM_DROPPABLE is just way over-engineered cornercase functionality that
pushes us well past the maintenance_overhead<->complexity trade-off sweet spot ...

Thanks,

Ingo

2023-01-03 14:56:28

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

On Tue, Jan 03, 2023 at 11:32:14AM +0100, Ingo Molnar wrote:
>
> * Jason A. Donenfeld <[email protected]> wrote:
>
> > Both mmiotrace.h and insn-eval.h define various MMIO_ enum constants.
> > Rename the insn ones to have a INSN_ prefix, so that the headers can be
> > used from the same source file.
> >
> > Signed-off-by: Jason A. Donenfeld <[email protected]>
> > ---
> > arch/x86/coco/tdx/tdx.c | 26 +++++++++++++-------------
> > arch/x86/include/asm/insn-eval.h | 18 +++++++++---------
> > arch/x86/kernel/sev.c | 18 +++++++++---------
> > arch/x86/lib/insn-eval.c | 20 ++++++++++----------
> > 4 files changed, 41 insertions(+), 41 deletions(-)
>
> FYI, I've applied this fix to tip:x86/asm, as the fix for the namespace
> clash makes sense independently of the vDSO getrandom feature.

I guess you missed the conversation with Borislav yesterday about that.
He mentioned that I'd just take it through random.git when this whole
series goes in.

(Unless of course you wanted to put this into 6.2? That'd be easiest for
me.)

Jason

2023-01-03 15:04:36

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
>
> * Jason A. Donenfeld <[email protected]> wrote:
>
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> >
> > - It shouldn't be written to core dumps.
> > * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> > * Easy: VM_WIPEONFORK.
> >
> > - It shouldn't be written to swap.
> > * Uh-oh: mlock is rlimited.
> > * Uh-oh: mlock isn't inherited by forks.
> >
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > page faulting in memory if none is available
> > * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > * Uh-oh: VM_NORESERVE means segfaults.
> >
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> >
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> > having the contents of the pages it reads zeroed out midway through
> > the function's execution.
> >
> > 2) In the absolute worst case of whatever contingency we're coding for,
> > we have the option to fallback to the getrandom() syscall, and
> > everything is fine.
> >
> > 3) The buffers the function uses are only ever useful for a maximum of
> > 60 seconds -- a sort of cache, rather than a long term allocation.
> >
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:
> >
> > a) It never is written out to swap.
> > b) Under memory pressure, mm can just drop the pages (so that they're
> > zero when read back again).
> > c) If there's not enough memory to service a page fault, it's not fatal,
> > and no signal is sent. Instead, writes are simply lost.
> > d) It is inherited by fork.
> > e) It doesn't count against the mlock budget, since nothing is locked.
> >
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> >
> > This way, allocations used by vDSO getrandom() can use:
> >
> > VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> >
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> >
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.
>
> Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per
> arch low level work and rarely tested functionality (seriously, whose
> desktop system touches swap these days?), just so we can add a few pages of
> per thread vDSO data of a quirky type that in 99.999% of cases won't ever
> be 'dropped' from under the functionality that is using it and will thus
> bitrot fast?

It sounds like you've misunderstood the issue.

Firstly, the arch work is +19 lines (in the vdso branch of random.git).
That's very small and basic. Don't misrepresent it just to make a point.

Secondly, and more importantly, swapping this data is *not* permissible.
It doesn't matter if you think that certain systems don't use swap
anyway (you're wrong though regardless -- desktops will swap things
pretty regularly). It simply must *not* be swapped under any
circumstances. It's not that swapping is simply undesirable; it's
absolutely catastrophic. Do you understand that distinction?

If so, then if you don't like this solution, perhaps you could mention
something that accomplishes the goals in the commit message.

> The maintainability baby is being thrown out with the bath water IMO ...

Really? Are you sure this isn't just a matter of you not having written
it yourself or something? The commit is pretty short and basic. All the
actual internals for this kind of thing are already present. The
commit just pipes it through.

> And we want to add more complexity to a facility people desperately want to
> trust *more*? [RNG]

That seems like a ridiculous rhetorical leap.

> What's wrong with making mlock() more usable? Or just saying that "yeah,
> the vDSO can allocate a few more permanent pages outside of existing
> rlimits & mlock budgets"?

Did you actually read the commit message?

And now you're suggesting a rlimit backdoor? And adding more cases for
fork() to fail or block? That sounds terrible.

> The rest of the series looks fine to me, but this special one of a kind
> VM_DROPPABLE is just way over-engineered cornercase functionality that
> pushes us well past the maintenance_overhead<->complexity trade-off sweet spot ...

I think you should reevaluate that judgement. Consider the actual
problem. I think if you look at it carefully you'll see that this is a
pretty straightforward solution. It's deliberately not over-engineered.
It uses existing facilities and just does the plumbing in the right way
to make it work. It's really not that complicated.

The commit is +38, -4. The commit message has more lines than that.

Jason

2023-01-03 17:04:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace


* Jason A. Donenfeld <[email protected]> wrote:

> On Tue, Jan 03, 2023 at 11:32:14AM +0100, Ingo Molnar wrote:
> >
> > * Jason A. Donenfeld <[email protected]> wrote:
> >
> > > Both mmiotrace.h and insn-eval.h define various MMIO_ enum constants.
> > > Rename the insn ones to have a INSN_ prefix, so that the headers can be
> > > used from the same source file.
> > >
> > > Signed-off-by: Jason A. Donenfeld <[email protected]>
> > > ---
> > > arch/x86/coco/tdx/tdx.c | 26 +++++++++++++-------------
> > > arch/x86/include/asm/insn-eval.h | 18 +++++++++---------
> > > arch/x86/kernel/sev.c | 18 +++++++++---------
> > > arch/x86/lib/insn-eval.c | 20 ++++++++++----------
> > > 4 files changed, 41 insertions(+), 41 deletions(-)
> >
> > FYI, I've applied this fix to tip:x86/asm, as the fix for the namespace
> > clash makes sense independently of the vDSO getrandom feature.
>
> I guess you missed the conversation with Borislav yesterday about that.
> He mentioned that I'd just take it through random.git when this whole
> series goes in.

Please base your tree off of tip:x86/asm then (or pull it in) - it only
includes this patch and another trivial patch at:

a0e3aa8fe6cb ("x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type")

We often modify these files - roughly ~4 commits to arch/x86/kernel/sev.c
alone per cycle on average - and it would be better to avoid conflicts
here.

Thanks,

Ingo

2023-01-03 17:31:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > I guess you missed the conversation with Borislav yesterday about that.
> > He mentioned that I'd just take it through random.git when this whole
> > series goes in.
>
> Please base your tree off on tip:x86/asm then (or pull it in) - it only

My idea was a lot simpler: avoid the tree inter-dependency by us acking this
patch so that it can go through the random.git tree.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-03 17:32:40

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

On Tue, Jan 3, 2023 at 6:29 PM Borislav Petkov <[email protected]> wrote:
>
> On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > > I guess you missed the conversation with Borislav yesterday about that.
> > > He mentioned that I'd just take it through random.git when this whole
> > > series goes in.
> >
> > Please base your tree off on tip:x86/asm then (or pull it in) - it only
>
> My idea was a lot simpler: avoid the tree inter-dependency by us acking this
> patch so that it can go through the random.git tree.

Indeed I would prefer this.

Or... just put this in 6.2 because it's trivial anyway? Heck, even
mark it as stable@ to make future backporting easier. Then it'll meet
tip's urgent criteria.

Jason

2023-01-03 17:57:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace


* Jason A. Donenfeld <[email protected]> wrote:

> On Tue, Jan 3, 2023 at 6:29 PM Borislav Petkov <[email protected]> wrote:
> >
> > On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > > > I guess you missed the conversation with Borislav yesterday about that.
> > > > He mentioned that I'd just take it through random.git when this whole
> > > > series goes in.
> > >
> > > Please base your tree off on tip:x86/asm then (or pull it in) - it only
> >
> > My idea was a lot simpler: avoid the tree inter-dependency by us acking this
> > patch so that it can go through the random.git tree.
>
> Indeed I would prefer this.
>
> Or... just put this in 6.2 because it's trivial anyway? Heck, even mark
> it as stable@ so make future backporting easier. Then it'll meet tip's
> urgent criteria.

Yeah - that's sensible too, it does fix a header namespace bug - I've put
it into tip:x86/urgent.

Thanks,

Ingo

2023-01-03 17:57:55

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

On Tue, Jan 3, 2023 at 6:47 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Jason A. Donenfeld <[email protected]> wrote:
>
> > On Tue, Jan 3, 2023 at 6:29 PM Borislav Petkov <[email protected]> wrote:
> > >
> > > On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > > > > I guess you missed the conversation with Borislav yesterday about that.
> > > > > He mentioned that I'd just take it through random.git when this whole
> > > > > series goes in.
> > > >
> > > > Please base your tree off on tip:x86/asm then (or pull it in) - it only
> > >
> > > My idea was a lot simpler: avoid the tree inter-dependency by us acking this
> > > patch so that it can go through the random.git tree.
> >
> > Indeed I would prefer this.
> >
> > Or... just put this in 6.2 because it's trivial anyway? Heck, even mark
> > it as stable@ so make future backporting easier. Then it'll meet tip's
> > urgent criteria.
>
> Yeah - that's sensible too, it does fix a header namespace bug - I've put
> it into tip:x86/urgent.

Excellent, thanks. I'll rebase on that when it hits Linus' tree.

2023-01-03 18:36:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Jason A. Donenfeld <[email protected]> wrote:
>
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> >
> > - It shouldn't be written to core dumps.
> > * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> > * Easy: VM_WIPEONFORK.

I have a rather different suggestion: make a special mapping. Jason,
you're trying to shoehorn all kinds of bizarre behavior into the core
mm, and none of that seems to me to belong to the core mm. Instead,
have an actual special mapping with callbacks that does the right
thing. No fancy VM flags.

Memory pressure: have it free and unmap it self. Gets accessed again?
->fault can handle it.

Want to mlock it? No, don't do that -- that's absurd. Just arrange
so that, if it gets evicted, it's not written out anywhere. And when
it gets faulted back in it does the right thing -- see above.

Zero on fork? I'm sure that's manageable with a special mapping. If
not, you can add a new vm operation or similar to make it work. (Kind
of like how we extended special mappings to get mremap right a couple
years ago.) But maybe you don't want to *zero* it on fork and you want
to do something more intelligent. Fine -- you control ->fault!

> >
> > - It shouldn't be written to swap.
> > * Uh-oh: mlock is rlimited.
> > * Uh-oh: mlock isn't inherited by forks.

No mlock, no problems.

> >
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > page faulting in memory if none is available
> > * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > * Uh-oh: VM_NORESERVE means segfaults.

->fault can do whatever you want.

And there is no shortage of user memory that *must* be made available
on fault in order to resume the faulting process. ->fault can handle
this.

> >
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> >
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> > having the contents of the pages it reads zeroed out midway through
> > the function's execution.
> >
> > 2) In the absolute worst case of whatever contingency we're coding for,
> > we have the option to fallback to the getrandom() syscall, and
> > everything is fine.
> >
> > 3) The buffers the function uses are only ever useful for a maximum of
> > 60 seconds -- a sort of cache, rather than a long term allocation.
> >
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:

No need for another vm flag.

> >
> > a) It never is written out to swap.

No need to complicate the swap logic for this.

> > b) Under memory pressure, mm can just drop the pages (so that they're
> > zero when read back again).

Or ->fault could even repopulate it without needing to ever read zeros.

> > c) If there's not enough memory to service a page fault, it's not fatal,
> > and no signal is sent. Instead, writes are simply lost.

This just seems massively overcomplicated to me. If there isn't
enough memory to fault in a page of code, we don't have some magic
instruction emulator in the kernel. We either OOM or we wait for
memory to show up.

> > d) It is inherited by fork.

If you have a special mapping and you fork, it doesn't magically turn
into normal memory.

> > e) It doesn't count against the mlock budget, since nothing is locked.

Special mapping -> no mlock.

> >
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> >
> > This way, allocations used by vDSO getrandom() can use:
> >
> > VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> >
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> >
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.

I really dislike this. I'm with Ingo.

2023-01-03 19:07:38

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 03, 2023 at 07:15:50PM +0100, Ingo Molnar wrote:
> Frankly, I don't appreciate your condescending discussion style that
> borders on the toxic, and to save time I'm nacking this technical approach
> until both the patch-set and your reaction to constructive review feedback
> improves:
>
> NAcked-by: Ingo Molnar <[email protected]>

Your initial email to me did not strike me as constructive at all. All I
gotta say is that you really seem to live up to your reputation here...

But trying to steer things back to the technical realm:

> For a single architecture: x86.
>
> And it's only 19 lines because x86 already happens to have a bunch of
> complexity implemented, such as a safe instruction decoder that allows the
> skipping of an instruction - which relies on thousands of lines of
> complexity.
>
> On an architecture where this isn't present, it would have to be
> implemented to support the instruction-skipping aspect of VM_DROPPABLE.

My assumption is actually the opposite: that x86 (CISC) is basically the
most complex, and that the RISC architectures will all be a good deal
more trivial -- e.g. just adding 4 to IP on some. It looks like some
architectures also already have mature decoders where required.
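
For illustration only -- the hook point and the field names below are made up,
not lifted from the series -- the fixed-width case could plausibly be as small
as:

#include <linux/types.h>
#include <linux/mm_types.h>
#include <asm/ptrace.h>

/*
 * Hypothetical sketch: skip a faulting store to a VM_DROPPABLE mapping on an
 * architecture with fixed 4-byte instructions.  "regs->pc" stands in for
 * whatever that architecture's pt_regs calls its program counter.
 */
static bool maybe_skip_droppable_write(struct pt_regs *regs,
                                       struct vm_area_struct *vma)
{
        if (!(vma->vm_flags & VM_DROPPABLE))
                return false;

        regs->pc += 4;  /* losing the write is permitted by VM_DROPPABLE */
        return true;
}

x86, by contrast, has to run the instruction decoder just to learn how long
the faulting instruction is before it can step over it.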

> Even on x86, it's not common today for the software-decoder to be used in
> unprivileged code - primary use was debugging & instrumentation code. So
> your patches bring this piece of complexity to a much larger scope of
> untrusted user-space functionality.

As far as I can tell, this decoder *is* used with userspace already.
It's used by SEV and by UMIP, in a totally userspace accessible way. Am
I misunderstanding its use there? It looks to me like that operates on
untrusted code.

*However* - if your big objection to this patch is that the instruction
skipping is problematic, we could actually punt that part. The result
will be that userspace just retries the memory write and the fault
happens again, and eventually it succeeds. From a perspective of
vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
immediately fallback to the syscall under memory pressure -- but you
could argue that nobody really cares about performance at that point
anyway, and so just retrying the fault until it succeeds is a less
complex behavior that would be just fine.

Let me know if you think that'd be an acceptable compromise, and I'll
roll it into v15. As a preview, it pretty much amounts to dropping 3/7
and editing the commit message in this 2/7 patch.

> I did not suggest to swap it: my suggestion is to just pin these vDSO data
> pages. The per thread memory overhead is infinitesimal on the vast majority
> of the target systems, and the complexity trade-off you are proposing is
> poorly reasoned IMO.
>
> I think my core point that it would be much simpler to simply pin those
> pages and not introduce rarely-excercised 'discardable memory' semantics in
> Linux is a fair one - so it's straightforward to lift this NAK.

Okay so this is where I think we're really not lined up and is a large
part of why I wondered whether you'd read the commit messages before
dismissing this. This VM_DROPPABLE mapping comes as a result of a
vgetrandom_alloc() syscall, which (g)libc makes at some point, and then
the result of that is passed to the vDSO getrandom() function. The
memory from vgetrandom_alloc() is then carved up, one slice per thread, by
(g)libc's various internal pthread creation/exit functions.

So that means this isn't a thing that's trivially limited to just one
per thread. Userspace can call vgetrandom_alloc() all it wants.

Thus, I'm having a hard time seeing how relaxing rlimits here as you
suggested doesn't amount to an rlimit backdoor. I'm also not seeing
other fancy options for "pinning pages" as you mentioned in this email.
Something about having the kernel allocate them on clone()? That seems
terrible and complex. And if you do want this all to go through mlock(),
somehow, there's still the fork() inheritability issue. (This was all
discussed on the thread a few versions ago that surfaced these issues,
by the way.)

So I'm not really seeing any good alternatives, no matter how hard I
squint at your suggestions. Maybe you can elaborate a bit?
Alternatively, perhaps the compromise I suggested above where we ditch
the instruction decoder stuff is actually fine with you?

> rarely-excercised 'discardable memory' semantics in
> Linux is a fair one - so it's straightforward to lift this NAK.

I still don't think calling this "rarely-exercised" is true. Desktop
machines regularly OOM with lots of Chrome tabs, and this is
functionality that'll be in glibc, so it'll be exercised quite often.
Even on servers, many operators work with the philosophy that unused RAM
is wasted RAM, and so servers are run pretty close to capacity. Not to
mention Android, where lots of handsets have way too little RAM.
Swapping and memory pressure and so forth is very real. So claiming that
this is somehow obscure or rarely used or what have you isn't very
accurate. These are code paths that will certainly get exercised.

Jason

2023-01-03 19:13:04

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Hi Andy,

Thanks for your constructive suggestions.

On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > > and no signal is sent. Instead, writes are simply lost.
>
> This just seems massively overcomplicated to me. If there isn't
> enough memory to fault in a page of code, we don't have some magic
> instruction emulator in the kernel. We either OOM or we wait for
> memory to show up.

Before addressing the other parts of your email, I thought I'd touch on
this. Quoting from the email I just wrote Ingo:

| *However* - if your big objection to this patch is that the instruction
| skipping is problematic, we could actually punt that part. The result
| will be that userspace just retries the memory write and the fault
| happens again, and eventually it succeeds. From a perspective of
| vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
| immediately fallback to the syscall under memory pressure -- but you
| could argue that nobody really cares about performance at that point
| anyway, and so just retrying the fault until it succeeds is a less
| complex behavior that would be just fine.
|
| Let me know if you think that'd be an acceptable compromise, and I'll
| roll it into v15. As a preview, it pretty much amounts to dropping 3/7
| and editing the commit message in this 2/7 patch.

IOW, I think the main ideas of the patch work just fine without "point
c" with the instruction skipping. Instead, waiting/retrying could
potentially work. So, okay, it seems like the two of you both hate the
instruction decoder stuff, so I'll plan on working that part in, in one
way or another, for v15.

> On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <[email protected]> wrote:
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > > * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > > * Easy: VM_WIPEONFORK.
>
> I have a rather different suggestion: make a special mapping. Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm. Instead,
> have an actual special mapping with callbacks that does the right
> thing. No fancy VM flags.

Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
I had seen things that I thought looked in this direction with the shmem
API, but when I got into the details, it looked like this was meant for
something else and couldn't address most of what I wanted here.

If you say this is possible, I'll look again to see if I can figure it
out. Though, if you have some API name at the top of your head, you
might save me some code squinting time.

> Memory pressure: have it free and unmap it self. Gets accessed again?
> ->fault can handle it.

Right.

> Want to mlock it? No, don't do that -- that's absurd. Just arrange
> so that, if it gets evicted, it's not written out anywhere. And when
> it gets faulted back in it does the right thing -- see above.

Presumably mlock calls are redirected to some function pointer so I can
just return EINTR?

> Zero on fork? I'm sure that's manageable with a special mapping. If
> not, you can add a new vm operation or similar to make it work. (Kind
> of like how we extended special mappings to get mremap right a couple
> years go.) But maybe you don't want to *zero* it on fork and you want
> to do something more intelligent. Fine -- you control ->fault!

Doing something more intelligent would be an interesting development, I
guess... But, before I think about that, all mappings have flags;
couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
API you have in mind not work that way? (Side note: I also want
VM_DONTDUMP to work.)

> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > > page faulting in memory if none is available
> > > * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > > * Uh-oh: VM_NORESERVE means segfaults.
>
> ->fault can do whatever you want.
>
> And there is no shortage of user memory that *must* be made available
> on fault in order to resume the faulting process. ->fault can handle
> this.

I'll look to see how other things handle this...

Anyway, thanks for the suggestion. That seems like a good future
direction for this.

Jason

2023-01-03 19:29:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 10:36 AM Andy Lutomirski <[email protected]> wrote:
>
> I have a rather different suggestion: make a special mapping. Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm. Instead,
> have an actual special mapping with callbacks that does the right
> thing. No fancy VM flags.

I don't disagree, but honestly, my deeper reaction is "this is not worth it".

I think the real issue here is that Jason has to prove that this is
such a big deal that the VM has to be modified *at*all* for this.

Honestly, "getrandom()" performance simply is not important enough to
design VM changes for.

Do some people care? Debatable. Jason cares, sure. But people have
gotten used to doing their own random number implementations if they
*really* care, and yes, they've gotten them wrong, or they've not
performed as well as they could, but on the whole this is still a
really tiny thing, and Jason is trying to micro-optimize something
that THE KERNEL SHOULD NOT CARE ABOUT.

This should all be in libc. Not in the kernel with special magic vdso
support and special buffer allocations. The kernel should give good
enough support that libc can do a good job, but the kernel should
simply *not* take the approach of "libc will get this wrong, so let's
just do all the work for it".

Let user space do their own percpu areas if they really care. And most
(as in 99.9%) of all user space won't care at all, and is perfectly
happy just doing a read() from /dev/urandom or using our existing
"getrandom()" without any super-clever vdso support with magic
temporary buffer handling.

Now, if the magic buffers were something cool that were a generic
concept that a lot of *other* cases would also kill for, that's one
thing. But this is such a small special case that absolutely *nobody*
has asked for, and that nothing else wants.

Linus

2023-01-03 19:55:15

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Hi Linus,

On Tue, Jan 03, 2023 at 11:19:36AM -0800, Linus Torvalds wrote:
> performed as well as they could, but on the whole this is still a
> really tiny thing, and Jason is trying to micro-optimize something
> that THE KERNEL SHOULD NOT CARE ABOUT.

I don't think this is about micro-optimization. Rather, userspace RNGs
aren't really possible in a safe way at the moment. This patchset aims
to make that possible, by providing things that libc will use. The cover
letter of this series makes that case.

> This should all be in libc. Not in the kernel with special magic vdso
> support and special buffer allocations. The kernel should give good
> enough support that libc can do a good job, but the kernel should
> simply *not* take the approach of "libc will get this wrong, so let's
> just do all the work for it".

That's not what this patchset does. libc still needs to handle
per-thread semantics itself and slice up buffers and so forth. The vDSO
doesn't allocate any memory. I suspect this was Ingo's presumption too,
and you extrapolated from that. But that's not what's happening.

> Now, if the magic buffers were something cool that were a generic
> concept that a lot of *other* cases would also kill for, that's one

Actually, I was thinking VM_DROPPABLE might be a somewhat interesting
thing to introduce for database caches and so forth, where dropping
things under memory pressure is actually useful. Obviously that's the
result of a thought process involving a solution looking for a problem,
but I considered this a month or so ago when I first sent this in, and
decided that if I was to expose this via a MAP_* flag in mmap(), that
should come later, so I didn't here. Anyway, that is all to say it's not
like this is the only use for it. But either way, I don't actually have
my sights set on it as a general solution -- after all, I am not in the
process of authoring a database cache or something -- and if I can make
Andy's vm_ops suggestion work, that sounds perfectly fine to me.

> thing. But this is such a small special case that absolutely *nobody*
> has asked for, and that nothing else wants.

Okay so that's where I think you're really quite mistaken. If you recall
the original discussion on this, I was initially a bit hesitant to do it
and didn't really want to do it that much. And then I looked into it,
and talked to a bunch of library and program authors, and saw that
there's actually quite a bit of demand for this, and generally an
unhealthy ecosystem of bad solutions that have cropped up in lieu of a
good one.

(I talked about this a bit with tglx at Plumbers, and I had hoped to
discuss with you as well, but you weren't available.)

Jason

2023-01-03 20:03:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <[email protected]> wrote:
>
> I don't think this is about micro-optimization. Rather, userspace RNGs
> aren't really possible in a safe way at the moment.

"Bah, humbug", to quote a modern-time philosopher.

It's humbug simply because it makes two assumptions that aren't even valid:

(a) that you have to do it in user space in the first place

(b) that you care about the particular semantics that you are looking for

The thing is, you can just do getrandom(). It's what people already
do. Problem solved.

The system call interface just isn't that expensive. Sure, the various
spectre mitigations have screwed us over in a big way on various
hardware, but even with that in mind system calls aren't some kind of
insurmountable hazard. There are absolutely tons of system calls that
are way more performance-critical than getrandom() has ever been.

So 99% of the time, the solution really is just "getrandom()",
generally with the usual buffering ("read more than you need, so that
you don't do it all the time").

Which is not at all different from all the other system calls like
"read()" and friends.

And that's entirely ignoring the fact that something like "read()"
basically *has* to be a system call, because there are no alternatives
(ok, mmap, but that's actually much slower, unless you're in it for
the reuse-and/or-sparse-use situation).

With random numbers, you have tons of other alternatives, including
just hardware giving you the random numbers almost for free and you
just using your own rng in user space entirely.

And yes, some users might want to do things like periodic re-seeding,
but we actually do have fast ways to check for periodic events in user
space, ranging from entirely non-kernel things like rdtsc to actual
vdso's for gettimeofday().

So when you say that this isn't about micro-optimizations, I really
say "humbug". It's *clearly* about micro-optimization of an area that
very few people care about, since the alternative is just our existing
"getrandom()" that is not at all horribly slow.

Let me guess: the people you talked to who were excited about this are
mainly just library people?

And they are excited about this not because they actually need the
random numbers themselves, but because they just don't want to do the
work. They want to get nice benchmark numbers, and push the onus of
complexity onto somebody else?

Because the people who actually *use* the random numbers and are *so*
performance-critical about them already have their own per-thread
buffers or what-not, and need to have them anyway because they write
real applications that possibly work on other systems than Linux, but
that *definitely* work on enterprise systems that won't even have this
kind of new support for another several years even if we implemented
it today.

In fact, they probably are people inside of cloud vendors that are
still running 4.x kernels on a large portion of their machines.

Linus

2023-01-03 20:03:32

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Hi Linus,

On Tue, Jan 03, 2023 at 11:54:35AM -0800, Linus Torvalds wrote:
> So 99% of the time, the solution really is just "getrandom()",
> generally with the usual buffering ("read more than you need, so that
> you don't do it all the time").\

That buffering cannot be done safely currently -- VM forks, reseeding
semantics, and so forth. Again, discussed in the cover letter of the
patch if you'd like to engage with those ideas.

> just using your own rng in user space entirely.

This is the thing that isn't quite safe.

> Let me guess: the people you talked to who were excited about this are
> mainly just library people?

No, actually. Mainly people deploying production network-facing things
that need a lot of randomness often. e.g. CBC nonces in TLS, or random
sequence numbers in UDP-based protocols.

> So when you say that this isn't about micro-optimizations, I really
> say "humbug". It's *clearly* about micro-optimization of an area that
> very few people care about, since the alternative is just our existing
> "getrandom()" that is not at all horribly slow.

The alternative is what people currently do, which is attempt to
implement a userspace RNG, which cannot be done safely. Again, -->
cover letter.

> Because the people who actually *use* the random numbers and are *so*
> performance-critical about them already have their own per-thread
> buffers or what-not

...which are not safe.

Anyway, if you're NACK'ing the whole getrandom() vDSO project, just
please outright say so, so I don't spend another 14 revisions in vain.

Jason

2023-01-03 20:26:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <[email protected]> wrote:
>
> That buffering cannot be done safely currently

.. again, this is "your semantics" (the (b) in my humbug list), not
necessarily reality for anybody else.

I'm NAK'ing making invasive changes to the VM for something this
specialized. I really believe that the people who have this issue are
*so* few and far between that they can deal with the VM forking and
reseeding issues quite well on their own.

Linus

2023-01-03 20:28:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 12:15 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <[email protected]> wrote:
> >
> > That buffering cannot be done safely currently
>
> .. again, this is "your semantics" (the (b) in my humbug list), not
> necessarily reality for anybody else.

Just to make an example: fork() is already problematic for something
as fundamental as <stdio.h>.

That doesn't mean that we do a special fork-safe stdio.h
infrastructure in the kernel. It just means that people have to do
things like fflush(NULL) (or use variations of setbuf() and friends)
when they deal with fork() and stdio interactions.

The random number generator really isn't that different. Periodic
reseeding isn't something stdio has to deal with, but having a
timestamp in a user space library and forcing a re-seed isn't unheard of
in other contexts.

You don't even need anything as fancy as an actual timer, because it
doesn't need *active* flushing, just an "oh, it's been too long since
we read the random data, let's do it again".
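
A sketch of that check, purely for illustration (the refill interval is
arbitrary, and the only kernel facility used is the existing vDSO-backed
clock_gettime()):

#include <time.h>

static struct timespec last_refill;

/* Returns 1 when it has been "too long" since the last refill, and resets the
 * timestamp so the caller can go re-read random data. */
static int should_reseed(void)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);   /* vDSO fast path, no syscall */
        if (now.tv_sec - last_refill.tv_sec < 60)
                return 0;
        last_refill = now;
        return 1;
}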

And yes, I bet it would be a good idea to have an actual library for
this that handles (and *documents*) these kinds of issues - exactly
like a <stdio.h>, just for randomness.

I just don't think it should be involved in the kernel - again exactly
like <stdio.h>.

Linus

2023-01-03 20:45:36

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <[email protected]> wrote:
> >
> > That buffering cannot be done safely currently
>
> .. again, this is "your semantics" (the (b) in my humbug list), not
> necessarily reality for anybody else.

Yea that's fair. Except, of course, I maintain that my semantics are
important ones. :)

> I'm NAK'ing making invasive changes to the VM for something this
> specialized. I really believe that the people who have this issue are
> *so* few and far between that they can deal with the VM forking and
> reseeding issues quite well on their own.

Okay, that's fine. I'll see if I can make this work without having to do
surgery on mm and introduce a new VM_* flag and such. Hopefully I'll
succeed there.

Jason

2023-01-03 21:03:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, Jan 3, 2023 at 11:06 AM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Andy,
>
> Thanks for your constructive suggestions.
>
> On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > > c) If there's not enough memory to service a page fault, it's not fatal,
> > > > and no signal is sent. Instead, writes are simply lost.
> >
> > This just seems massively overcomplicated to me. If there isn't
> > enough memory to fault in a page of code, we don't have some magic
> > instruction emulator in the kernel. We either OOM or we wait for
> > memory to show up.
>
> Before addressing the other parts of your email, I thought I'd touch on
> this. Quoting from the email I just wrote Ingo:
>
> | *However* - if your big objection to this patch is that the instruction
> | skipping is problematic, we could actually punt that part. The result
> | will be that userspace just retries the memory write and the fault
> | happens again, and eventually it succeeds. From a perspective of
> | vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
> | immediately fallback to the syscall under memory pressure -- but you
> | could argue that nobody really cares about performance at that point
> | anyway, and so just retrying the fault until it succeeds is a less
> | complex behavior that would be just fine.
> |
> | Let me know if you think that'd be an acceptable compromise, and I'll
> | roll it into v15. As a preview, it pretty much amounts to dropping 3/7
> | and editing the commit message in this 2/7 patch.
>
> IOW, I think the main ideas of the patch work just fine without "point
> c" with the instruction skipping. Instead, waiting/retrying could
> potentially work. So, okay, it seems like the two of you both hate the
> instruction decoder stuff, so I'll plan on working that part in, in one
> way or another, for v15.
>
> > On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <[email protected]> wrote:
> > > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > > new system call that has certain requirements:
> > > >
> > > > - It shouldn't be written to core dumps.
> > > > * Easy: VM_DONTDUMP.
> > > > - It should be zeroed on fork.
> > > > * Easy: VM_WIPEONFORK.
> >
> > I have a rather different suggestion: make a special mapping. Jason,
> > you're trying to shoehorn all kinds of bizarre behavior into the core
> > mm, and none of that seems to me to belong to the core mm. Instead,
> > have an actual special mapping with callbacks that does the right
> > thing. No fancy VM flags.
>
> Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
> I had seen things that I thought looked in this direction with the shmem
> API, but when I got into the details, it looked like this was meant for
> something else and couldn't address most of what I wanted here.
>
> If you say this is possible, I'll look again to see if I can figure it
> out. Though, if you have some API name at the top of your head, you
> might save me some code squinting time.

Look for _install_special_mapping().

--Andy

> > Want to mlock it? No, don't do that -- that's absurd. Just arrange
> > so that, if it gets evicted, it's not written out anywhere. And when
> > it gets faulted back in it does the right thing -- see above.
>
> Presumably mlock calls are redirected to some function pointer so I can
> just return EINTR?

Or just don't worry about it. If someone mlocks() it, that's their
problem. The point is that no one needs to.

>
> > Zero on fork? I'm sure that's manageable with a special mapping. If
> > not, you can add a new vm operation or similar to make it work. (Kind
> > of like how we extended special mappings to get mremap right a couple
> > years go.) But maybe you don't want to *zero* it on fork and you want
> > to do something more intelligent. Fine -- you control ->fault!
>
> Doing something more intelligent would be an interesting development, I
> guess... But, before I think about that, all mapping have flags;
> couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
> API you have in mind not work that way? (Side note: I also want
> VM_DONTDUMP to work.)

You really want unmap (the pages, not the vma) on fork, not wipe on
fork. It'll be VM_SHARED, and I'm not sure what VM_WIPEONFORK |
VM_SHARED does.
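
For reference, a rough sketch of the _install_special_mapping() shape being
suggested here -- the names, flags, and fault policy are illustrative
placeholders rather than the eventual implementation, and this version simply
leaks the page instead of tracking it as real code would have to:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mm_types.h>

static vm_fault_t vgetrandom_fault(const struct vm_special_mapping *sm,
                                   struct vm_area_struct *vma,
                                   struct vm_fault *vmf)
{
        /*
         * All of the policy lives here: what happens when a dropped page is
         * touched again, what happens under memory pressure, and so on.  As
         * a placeholder, map a fresh zeroed page, i.e. "dropped contents
         * read back as zeroes".
         */
        struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

        if (!page)
                return VM_FAULT_OOM;
        return vmf_insert_pfn(vma, vmf->address, page_to_pfn(page));
}

static const struct vm_special_mapping vgetrandom_mapping = {
        .name  = "[vgetrandom]",        /* hypothetical */
        .fault = vgetrandom_fault,
};

/*
 * Installed from the (hypothetical) allocation path with something like:
 *
 *      vma = _install_special_mapping(mm, addr, len,
 *                                     VM_READ | VM_WRITE | VM_MAYREAD |
 *                                     VM_MAYWRITE | VM_PFNMAP | VM_DONTDUMP,
 *                                     &vgetrandom_mapping);
 */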

2023-01-04 20:30:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace


* Ingo Molnar <[email protected]> wrote:

>
> * Jason A. Donenfeld <[email protected]> wrote:
>
> > On Tue, Jan 3, 2023 at 6:29 PM Borislav Petkov <[email protected]> wrote:
> > >
> > > On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > > > > I guess you missed the conversation with Borislav yesterday about that.
> > > > > He mentioned that I'd just take it through random.git when this whole
> > > > > series goes in.
> > > >
> > > > Please base your tree off on tip:x86/asm then (or pull it in) - it only
> > >
> > > My idea was a lot simpler: avoid the tree inter-dependency by us acking this
> > > patch so that it can go through the random.git tree.
> >
> > Indeed I would prefer this.
> >
> > Or... just put this in 6.2 because it's trivial anyway? Heck, even mark
> > it as stable@ so make future backporting easier. Then it'll meet tip's
> > urgent criteria.
>
> Yeah - that's sensible too, it does fix a header namespace bug - I've put
> it into tip:x86/urgent.

This namespace clash fix is now upstream as of 512dee0c00ad & later kernels.

Thanks,

Ingo

2023-01-04 20:33:34

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 1/7] x86: lib: Separate instruction decoder MMIO type from MMIO trace

On Wed, Jan 4, 2023 at 9:26 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Jason A. Donenfeld <[email protected]> wrote:
> >
> > > On Tue, Jan 3, 2023 at 6:29 PM Borislav Petkov <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 03, 2023 at 06:00:46PM +0100, Ingo Molnar wrote:
> > > > > > I guess you missed the conversation with Borislav yesterday about that.
> > > > > > He mentioned that I'd just take it through random.git when this whole
> > > > > > series goes in.
> > > > >
> > > > > Please base your tree off on tip:x86/asm then (or pull it in) - it only
> > > >
> > > > My idea was a lot simpler: avoid the tree inter-dependency by us acking this
> > > > patch so that it can go through the random.git tree.
> > >
> > > Indeed I would prefer this.
> > >
> > > Or... just put this in 6.2 because it's trivial anyway? Heck, even mark
> > > it as stable@ so make future backporting easier. Then it'll meet tip's
> > > urgent criteria.
> >
> > Yeah - that's sensible too, it does fix a header namespace bug - I've put
> > it into tip:x86/urgent.
>
> This namespace clash fix is now upstream as of 512dee0c00ad & later kernels.

Thanks. I've rebased my vdso branch on that now. I guess at this point
it probably doesn't matter that much since everyone hates my use of
the instruction decoder anyway, so I'll see what else I can do for
v15, but still, it'll at least make it easier to experiment.

Jason

2023-01-05 22:05:02

by Yann Droneaud

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Hi,

Le 03/01/2023 à 21:44, Jason A. Donenfeld a écrit :
> On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
>> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <[email protected]> wrote:
>>> That buffering cannot be done safely currently
>> .. again, this is "your semantics" (the (b) in my humbug list), not
>> necessarily reality for anybody else.
> Yea that's fair. Except, of course, I maintain that my semantics are
> important ones. :)


I concur.

To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
And I want mlock() without paying the price.

Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a means to hold an /unlimited/ amount of secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
The only constraint on userspace is to handle the case where pages are wiped, which userspace arc4random() implementations already handle.
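
As a toy illustration of that constraint -- the names here are invented, and
it ignores the window where a wipe can land between the check and the copy,
which a real implementation has to reason about:

#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/random.h>

/*
 * The per-thread state lives in a page the kernel may wipe to zero at any
 * time; "primed" doubles as the wipe detector because a wipe zeroes it.
 * Assumes len <= sizeof(b->bytes).
 */
struct wipeable_batch {
        unsigned char bytes[256];
        size_t pos;
        unsigned char primed;
};

static int draw(struct wipeable_batch *b, void *out, size_t len)
{
        if (!b->primed || b->pos + len > sizeof(b->bytes)) {
                if (getrandom(b->bytes, sizeof(b->bytes), 0) !=
                    (ssize_t)sizeof(b->bytes))
                        return -1;
                b->pos = 0;
                b->primed = 1;
        }
        memcpy(out, b->bytes + b->pos, len);
        memset(b->bytes + b->pos, 0, len);      /* erase what was handed out */
        b->pos += len;
        return 0;
}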

Regards.

--
Yann Droneaud
OPTEYA

2023-01-05 23:04:17

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> Hi,
>
> Le 03/01/2023 à 21:44, Jason A. Donenfeld a écrit :
> > On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
> >> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <[email protected]> wrote:
> >>> That buffering cannot be done safely currently
> >> .. again, this is "your semantics" (the (b) in my humbug list), not
> >> necessarily reality for anybody else.
> > Yea that's fair. Except, of course, I maintain that my semantics are
> > important ones. :)
>
>
> I concur.
>
> To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> And I want mlock() without paying the price.
>
> Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.

If you're actually serious about wanting a generic mechanism for
userspace, I think the moral of yesterday's poo-poo'ing all over this
cool new idea is that the Linux inner circle doesn't really care for
"security things" as a motivator and just takes the shortest and easiest
route toward swatting it away like a gadfly, assuming that the concerns
are unreal or niche or paranoid or whatever. This is obviously nothing
new - it's an old complaint beaten to death for years, with people who
are diehard it about eventually getting burnt out and leaving. So,
practically speaking, if you want this to exist, I think you have to
find some other cool use cases. Like, see if the database cache people
would actually love this. Or if it could be used as an opportunistic
renderer cache in Chrome that wouldn't result in OOMing with lots of
tabs. Or if shared process compiler servers could benefit from it.
"Droppable cache" is likely useful lots of places. So just find
SOMETHING that doesn't mean having to convince folks of a new security
model that justifies tickling mm/ innards in uncomfortable ways.

Jason

2023-01-06 01:08:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 5, 2023 at 2:57 PM Jason A. Donenfeld <[email protected]> wrote:
>
> On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> >
> > To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> > And I want mlock() without paying the price.
> >
> > Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> > The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.
>
> If you're actually serious about wanting a generic mechanism for
> userspace, I think the moral of yesterday's poo-poo'ing all over this
> cool new idea is that the Linux innercircle doesn't really care for
> "security things" as a motivator

No.

We don't take stupid statements as a motivator.

Stop with the histrionics and silly security theater BS.

There is *no* security in "MADV_WIPEONFORK". You claiming that that
is "security" is just making you less believable and me ignoring your
arguments more.

It's a complete make-believe fairy tale.

Why would it be "security" to dump random state data? In most
situations it's a complete non-issue, and nobody cares.

And those situations that want to be extra careful, and are actually
generating keys, those situations can do all of this very carefully on
their own using existing machinery.

If you don't want a core-dump because you have sensitive information,
you do "ulimit -c 0". Or you use MADV_DONTDUMP that we've had forever.

And you don't want to have wipe-on-fork, because

(a) if you want things to be wiped on fork, you just wipe it before
the fork (duh!)

(b) but more likely, and more relevantly, you want to make *DAMN
SURE* you wiped things much earlier than that if you are really
security-conscious and just generated a secret key, because you don't
want to leak things accidentally other ways.

(c) and you can use MADV_DONTFORK to not copy it at all, which again
we've had for a long time.

And if you don't want to have it written to swap, you're just making
sh*t up at that point.

First off, it's a purely theoretical thing in the first place. See (b)
above. Don't keep those random things around long enough (and
untouched enough) to hit the disk.

Secondly, anybody who can read swap space can already ptrace you and
read things much more easily that way.

Thirdly, you can just use mlock, and make sure you never have so much
super-sikret stuff pending for long times and in big buffers.

Fourth, if your keys are *that* sensitive, and *that* secret, just use
/dev/random or getrandom(), because you're not generating that kind of
volume of long-term keys, so the whole "I have a huge random buffer
that is super-secret" is a complete non-issue.
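
For what it's worth, the existing machinery referred to above amounts to a few
lines of userspace.  An illustration only, not an endorsement of any
particular threat model:

#include <stddef.h>
#include <sys/mman.h>

/* A locked, non-dumped, non-inherited buffer for key material, using only
 * long-standing interfaces: mlock(2) plus MADV_DONTDUMP and MADV_DONTFORK. */
static void *alloc_secret_buf(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        if (mlock(p, len) ||                    /* never written to swap */
            madvise(p, len, MADV_DONTDUMP) ||   /* kept out of core dumps */
            madvise(p, len, MADV_DONTFORK)) {   /* not mapped into children */
                munmap(p, len);
                return NULL;
        }
        return p;
}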

So stop making stupid arguments. The kernel is not supposed to
baby-sit programs that do things wrong on purpose, and that are
literally trying to do things wrong, and leaving secret stuff around
while they do a lot of other things.

You guys have literally MADE UP bad examples of so-called "security",
and then you use those as arguments for bad coding, and for
bad-mouthing kernel developers who just don't happen to believe in
that bad model.

None of what you ask for is for any kind of real security, it's all
just crazy "but I want to feel the warm and fuzzies and take shortcuts
elsewhere, and push my pain onto other people".

Linus

2023-01-06 02:09:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
<[email protected]> wrote:
>
> None of what you ask for is for any kind of real security, it's all
> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> elsewhere, and push my pain onto other people".

Actually, let me maybe soften that a bit and say that it's
"convenience features". It might make some things more _convenient_ to
do, exactly because it might allow other parts to do short-cuts.

But because it's a convenience-feature, it had also better be
(a) really easy and clear to do in the kernel and (b) have
sufficiently *wide* convenience so that it doesn't end up being one of
those "corner case things we have to maintain forever and nobody
uses".

And I think VM_DROPPABLE matches (a), and would be fine if it had some
other non-made-up use (although honestly, we should solve the 32-bit
problem first - ignoring it isn't fine for anything that is supposed
to be widely useful).

We *have* talked about features kind of like it before, for people
doing basically caches in user space that they can re-create on demand
and are ok with just going away under memory pressure.

But those people almost invariably want dropped pages to cause a
SIGSEGV or SIGBUS, not to come back as zeroes.

So you were insulting when you said kernel people don't care about
security issues. And I'm just telling you that's not true, but it
*is* 100% true that kernel people are often really fed up with
security people who have their blinders on, focus on some small thing,
and think nothing else ever matters.

So yes, the way to get something like VM_DROPPABLE accepted is to
remove the blinders, and have it be something more widely useful, and
not be a "for made up bad code".

Side note: making the 32-bit issue go away is likely trivial. We can
make 'vm_flags' be 64-bit, and a patch for that has been floating
around for over a decade:

https://lore.kernel.org/all/[email protected]/

but there was enough push-back on that patch that I didn't want to
take it, and some of the arguments for it were not that convincing (at
the time).

But see commit ca16d140af91 ("mm: don't access vm_flags as 'int'"),
which happened as a result, and which I (obviously very naively)
believed would be a way to get the conversion to happen in a more
controlled manner. Sadly, it never actually took off, and we have very
few "vm_flags_t" users in the kernel, and a lot of "unsigned long
flags". We even started out with a "__nocast" annotation to try to
make sparse trigger on people who didn't use vm_flags_t properly. That
was removed due to it just never happening.

But converting things to vm_flags_t with a coccinelle script
(hand-wave: look for variables of "unsigned long" that use the
VM_xyz constants), and then just making vm_flags_t be a "u64" instead
sounds like a way forward.

But again: this is all about new flags like VM_DROPPABLE not being
some corner-case that nobody is expected to use other than some
special code that is relegated to 64-bit only because it is *so*
special.

Linus

2023-01-06 02:17:13

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 05, 2023 at 05:02:05PM -0800, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 2:57 PM Jason A. Donenfeld <[email protected]> wrote:
> >
> > On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> > >
> > > To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> > > And I want mlock() without paying the price.
> > >
> > > Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> > > The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.
> >
> > If you're actually serious about wanting a generic mechanism for
> > userspace, I think the moral of yesterday's poo-poo'ing all over this
> > cool new idea is that the Linux innercircle doesn't really care for
> > "security things" as a motivator
>
> No.
>
> We don't take stupid statements as a motivator.
>
> Stop with the histrionics and silly security theater BS.
>
> There is *nop* security in "MADV_WIPEONFORK". You claiming that that
> is "security" is just making you less believable and me ignoring your
> arguments more.
>
> It's a complete make-believe fairy tale.
>
> Why would it be "security" to dump random state data? In most
> situations it's a complete non-issue, and nobody cares.
>
> And those situations that want to be extra careful, and are actually
> generating keys, those situations can do all of this very carefully on
> their own using existing machinery.
>
> If you don't want a core-dump because you have sensitive information,
> you do "ulimit -c 0". Or you use MADV_DONTDUMP that we've had forever.
>
> And you don't want to have wipe-on-fork, because
>
> (a) if you want things to be wiped on fork, you just wipe it before
> the fork (duh!)
>
> (b) but more likely, and more relevantly, you want to make *DAMN
> SURE* you wiped things much earlier than that if you are really
> security-conscious and just generated a secret key, because you don't
> want to leak things accidentally other ways.
>
> (c) and you can use MADV_DONTFORK to not copy it at all, which again
> we've had for a long time.

I never made any of these arguments. This just reads like angry-strawman
to me.

In fact there's actually an intelligent argument against using
VM_DONTDUMP, which is that a core dump will only ever contain state that
hasn't yet been used by anything, since it'll either be a random value
that hasn't yet been returned to the caller, or the overwritten/ratcheted
state that can't be wound backwards to recover the previous state. So it
doesn't really matter that much if it gets written into a core dump.

But in fact, I didn't make any argument one way or another for it.

> And if you don't want to have it written to swap, you're just making
> sh*t up at that point.

I'm really not making stuff up. I've reconstructed and recovered keys
and secrets from pilfered swap partitions before on more than one
occasion. The whole idea is to prevent a state that's going to be used
to generate secrets in the future from being written to a place that
can't be easily and deterministically wiped after use.

I get that you're pretty convinced this is all bologna, and that
"forward secrecy" is really just for security LARPers. But there are
plenty of people whose threat models disagree, including my own
experience reconstructing keys in swap from pulled drives.

> anybody who can read swap space can already ptrace you and
> read things much more easily that way.

As you probably inferred from the above, this isn't a ptrace question.
If you have code execution on the same machine before random bytes are
generated, all bets are off and none of this matters anyway. (And more
generally, if you have code exec prior to use, you can probably not even
care about cryptographic stuff, since in many cases you can just access
whatever sensitive data you were after in the first place.)

> Thirdly, you can just use mlock, and make sure you never have so much
> super-sikret stuff pending for long times and in big buffers.

Yea this is what I initially tried. Then people (Jann and Florian, IIRC)
pointed out all sorts of thorny issues with inheriting that on fork()
and segfaulting if non-reserved and such. So eventually I came up with
this monster as a way to avoid the pitfalls -- as just one particular
technical solution.

It turns out that twiddling with mm/ internals isn't your idea of a fun
time, so I'll keep futzing with things and see if I can come up with a
less intrusive technical solution. IOW, I'm not really wedded at all to
this VM_DROPPABLE thing, and if I find another way to do it, great.

> Fourth, if your keys are *that* sensitive, and *that* secret, just use
> /dev/random or getrandom(), because you're not generating that kind of
> volume of long-term keys, so the whole "I have a huge random buffer
> that is super-secret" is a complete non-issue.

I don't think that's really so accurate, actually. Firstly, this whole
swap discussion is about forward secrecy, so we're most likely talking
about ephemeral keys, not long-term keys. Secondly, many protocols that
have ephemeral keys generate them in every handshake, and that might
mean volume. Thirdly, I'm not sure I'd give greater importance to the
secrecy of long-term keys over ephemeral keys, as your point seems to
do; if you're leaking ephemeral keys for some reason, that's pretty bad.

> You guys have literally MADE UP bad examples of so-called "security",
> and then you use those as arguments for bad coding, and for
> bad-mouthing kernel developers who just don't happen to believe in
> that bad model.

I think you might have made up a bunch of arguments I never made. And if
that "you guys" includes me in it, I also object to being summarily
grouped in with any other folks here.

Most importantly, though, I'm not bad-mouthing anyone. Your whole email
strikes me as an indication that my characterization of the matter is a
pretty accurate one.

That characterization from my last email could be summarized as: you
think the model is bogus, that these sorts of security arguments aren't
typically compelling, and so if this VM_DROPPABLE business is to have
any future, it'll need to be motivated by something non-security related
(if there even is one; I made a few guesses as to other applications,
but they're at best guesses, as I obviously have no credibility or
experience in, e.g., claiming what a database program actually might
need). You even said so yourself a few days ago:

| Now, if the magic buffers were something cool that were a generic
| concept that a lot of *other* cases would also kill for, that's one
| thing.

So I would describe my perspective as basically a realistic and
practical one. I don't really mind that you're highly skeptical of
security things and security people, as you have been for as long as I
can remember. I'm not a diehard purist or something. And it's not
bad-mouthing for me to point out the obvious there. Rather, I'll keep
poking around and seeing if I can come up with better technical
solutions to what I think are worthy goals, and you and others will swat
away things that are distasteful (in this case, mm/ tickling), and
perhaps eventually something will emerge. I don't think there's really
anything wrong with that.

> None of what you ask for is for any kind of real security, it's all
> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> elsewhere, and push my pain onto other people".

No. As discussed, it's simply not toward a security model you care for.
But that's fine, you don't need to care about it; I will do that part.

It's also certainly not at all about wanting to feel the warm and
fuzzies or take shortcuts or push pain elsewhere. If anything, this
pushes the pain onto me, since I'm the one implementing and maintaining
this! I tried to make vanilla mlock() work for as long as I could, until
I couldn't figure out how to do it, groaned, and then tried to implement
something else. I am generally pretty reluctant to code anything at all.

Jason

2023-01-06 02:44:31

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 05, 2023 at 06:08:28PM -0800, Linus Torvalds wrote:
> (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).

Yeah, the 64-bit aspect of that patch is pretty gnarly. I would hope
that, if the field were converted to 64-bit everywhere, the compiler
would still generate good code in most cases when testing against
common-case constant bit masks whose upper half is unset, but I didn't
look into it. Indeed, if this is to become something real, that'd be a
worthwhile prerequisite project.

> And I think VM_DROPPABLE matches (a), and would be fine if it had some
> other non-made-up use (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).
>
> We *have* talked about features kind of like it before, for people
> doing basically caches in user space that they can re-create on demand
> and are ok with just going away under memory pressure.
>
> But those people almost invariably want dropped pages to cause a
> SIGSEGV or SIGBUS, not to come back as zeroes.

Yeah, it seems intuitively like that would be a useful thing. Rather
than trying to heuristically monitor memory pressure from inside
uncoordinated userspace apps, just have the kernel nuke pages when it
makes sense to do so.

The thing is -- and this is what I was trying to express to Yann --
while I have some theoretical notion of how it /might/ be useful, that's
mainly just a guess from me. But one could run around discussing the
idea with actual potential users of it, and see if anyone's eyes light
up. And then once, for example, the Chrome renderer team is clamoring
for it, then hey, there'd be a good use that wouldn't be bound up with
security models not everybody cares about.

> So you were insulting when you said kernel people don't care about
> security issues. And I'm just telling you that's not true, but it

Maybe you read that without looking at the end of the sentence, which
read: ", assuming that the concerns are unreal or niche or paranoid or
whatever." Because the email you sent in response pretty much described
the concerns as either unreal or niche. So I didn't mean to be
insulting. I actually think we agree with each other on the nature of
your position.

> *is* 100% true that kernel people are often really fed up with
> security people who have their blinders on, focus on some small thing,
> and think nothing else ever matters.
>
> So yes, the way to get something like VM_DROPPABLE accepted is to
> remove the blinders, and have it be something more widely useful, and
> not be a "for made up bad code".

I get this part. Either the implementation is trivial/convenient/
unintrusive/unnoticeable, or it really needs to be motivated by
something you find compelling and preferably has wide application. No
objection from me there.

If you ignore the instruction decoder part, though -- which I mentioned
to Ingo could be nixed anyway -- my initial estimation of this patch was
that it wasn't /that/ bad, being +32/-3. And I didn't buy the arguments
that these would be rarely used code paths and that nobody OOMs and
nobody swaps, because that doesn't match my own experience of seeing
OOMing regularly, nor the fact that if this is in glibc, it'll wind up
being used. So the initial arguments against it were not compelling.

But then you and others elaborated that VM_* flags really ought to be
for general things, and that as a matter of taste and cleanliness,
sticking things into the *core* of mm/ is just really a sensitive thing
not to be done lightly and without a really compelling reason. And so I
also get that.

So, I'm playing around with other technical solutions to the mlock()
limitation thing that started all of this (amidst other things in my
end-of-the-year backlog), and I'll post results if I can get something
working.

Jason

2023-01-06 21:00:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings



On Thu, Jan 5, 2023, at 6:08 PM, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> None of what you ask for is for any kind of real security, it's all
>> just crazy "but I want to feel the warm and fuzzies and take shortcuts
>> elsewhere, and push my pain onto other people".
>
> Actually, let me maybe soften that a bit and say that it's
> "convenience features". It might make some things more _convenient_ to
> do, exactly because it might allow other parts to do short-cuts.
>
> But because it's a convenience-feature, it had also better either be
> (a) really easy and clear to do in the kernel and (b) have
> sufficiently *wide* convenience so that it doesn't end up being one of
> those "corner case things we have to maintain forever and nobody
> uses".
>
> And I think VM_DROPPABLE matches (a), and would be fine if it had some
> other non-made-up use (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).
>
> We *have* talked about features kind of like it before, for people
> doing basically caches in user space that they can re-create on demand
> and are ok with just going away under memory pressure.
>
> But those people almost invariably want dropped pages to cause a
> SIGSEGV or SIGBUS, not to come back as zeroes.
>
> So you were insulting when you said kernel people don't care about
> security issues. And I'm just telling you that's not true, but it
> *is* 100% true that kernel people are often really fed up with
> security people who have their blinders on, focus on some small thing,
> and think nothing else ever matters.
>
> So yes, the way to get something like VM_DROPPABLE accepted is to
> remove the blinders, and have it be something more widely useful, and
> not be a "for made up bad code".

I'm going to suggest a very very different approach: fix secret storage
in memory for real. That is, don't lock "super secret sensitive stuff"
into memory, and don't wipe it either. *Encrypt* it.

This boils down to implementing proper encrypted swap support, which is
not conceptually that difficult. The kernel already has identifiers
(mapping, offset, etc) for every page in swap and already stores some
metadata. Using that as part of a cryptographic operation seems
conceptually fairly straightforward.

And a nice implementation of this could be on by default, and the
kernel could even tell userspace that it's on, and then userspace could
just stop worrying about this particular issue.
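
As a purely illustrative userspace toy of that idea -- not kernel code,
no integrity protection, and a real design would also have to avoid
nonce reuse when a swap slot is recycled -- using the libsodium stream
cipher that also appears in the selftest later in this thread:

/* Userspace toy of encrypted swap; link with -lsodium. */
#include <sodium.h>
#include <sys/random.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

static uint8_t swap_master_key[32];     /* generated once per "boot" */

/* The same call encrypts on write-out and decrypts on swap-in. */
static void swap_crypt(uint8_t page[PAGE_SIZE], uint64_t slot)
{
        uint8_t nonce[8];

        /* The slot/offset the kernel already tracks doubles as the nonce;
         * a real design would also fold in a generation so a recycled slot
         * never reuses a (key, nonce) pair. */
        memcpy(nonce, &slot, sizeof(nonce));
        crypto_stream_chacha20_xor(page, page, PAGE_SIZE, nonce,
                                   swap_master_key);
}

int main(void)
{
        uint8_t page[PAGE_SIZE] = "secret key material lives here";
        uint8_t orig[PAGE_SIZE];

        memcpy(orig, page, sizeof(orig));
        if (getrandom(swap_master_key, 32, 0) != 32)
                return 1;
        swap_crypt(page, 1234);         /* "write the page out to swap" */
        assert(memcmp(page, orig, PAGE_SIZE) != 0);
        swap_crypt(page, 1234);         /* "fault it back in" */
        assert(memcmp(page, orig, PAGE_SIZE) == 0);
        return 0;
}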

2023-01-06 21:18:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, Jan 6, 2023 at 12:54 PM Andy Lutomirski <[email protected]> wrote:
>
> I'm going to suggest a very very different approach: fix secret
> storage in memory for real. That is, don't lock "super secret
> sensitive stuff" into memory, and don't wipe it either. *Encrypt* it.

I don't think you're wrong, but people will complain about key
management, and worry about that part instead.

Honestly, this is what SGX and CPU enclaves are _supposed_ to do for
you, but then nobody uses them for various reasons.

Linus

2023-01-06 21:55:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Thu, Jan 05, 2023 at 06:08:28PM -0800, Linus Torvalds wrote:
> Side note: making the 32-bit issue go away is likely trivial. We can
> make 'vm_flags' be 64-bit, and a patch for that has been floating
> around for over a decade:
>
> https://lore.kernel.org/all/[email protected]/
>
> but there was enough push-back on that patch that I didn't want to
> take it, and some of the arguments for it were not that convincing (at
> the time).
>
> But see commit ca16d140af91 ("mm: don't access vm_flags as 'int'"),
> which happened as a result, and which I (obviously very naively)
> believed would be a way to get the conversion to happen in a more
> controlled manner. Sadly, it never actually took off, and we have very
> few "vm_flags_t" users in the kernel, and a lot of "unsigned long
> flags". We even started out with a "__nocast" annotation to try to
> make sparse trigger on people who didn't use vm_flags_t properly. That
> was removed due to it just never happening.
>
> But converting things to vm_flags_t with a coccinelle script
> (hand-wave: look for variables of "unsigned long" that use the
> VM_xyz constants), and then just making vm_flags_t be a "u64" instead
> sounds like a way forward.

I'd be more inclined to do:

typedef unsigned int vm_flags_t[2];

and deal with all the fallout. That'll find all the problems (although
it leaves us vulnerable to people forgetting which half of the flags
they want to be looking at).

Hm. We never actually converted vma->vm_flags to be vm_flags_t. Only
vm_region, which aiui is only used on nommu.

2023-01-06 22:10:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, Jan 6, 2023 at 1:42 PM Matthew Wilcox <[email protected]> wrote:
>
>
> I'd be more inclined to do:
>
> typedef unsigned int vm_flags_t[2];

No, that's entirely invalid.

Never *ever* use arrays in C for type safety. Arrays are not type
safe. They can't be assigned sanely, and they silently become pointers
(which also aren't type-safe, since they end up converting silently to
'void *').

If you want to use the type system to enforce things, and you don't
want to rely on sparse, you absolutely have to use a struct (or union)
type.

So something like

typedef struct { u64 val; } vm_flags_t;

would be an option.
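
For illustration, that wrapper fleshed out a little as a
userspace-style sketch; the accessor names below are invented here, not
existing kernel API:

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t val; } vm_flags_t;

#define VM_FLAGS_INIT(x)        ((vm_flags_t){ .val = (x) })

static inline bool vm_flags_test(vm_flags_t flags, uint64_t mask)
{
        return flags.val & mask;
}

static inline vm_flags_t vm_flags_or(vm_flags_t flags, uint64_t mask)
{
        return VM_FLAGS_INIT(flags.val | mask);
}

/* Usage -- and the mistakes the wrapper is meant to catch no longer
 * compile:
 *
 *      vm_flags_t f = VM_FLAGS_INIT(0);
 *      f = vm_flags_or(f, 1ULL << 40);
 *      if (vm_flags_test(f, 1ULL << 40))
 *              ...;
 *      unsigned int x = f;     // error: no silent conversion/truncation
 *      f |= 0x20;              // error: must go through the helpers
 */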

Linus

2023-01-09 10:50:55

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

* Linus Torvalds:

> On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <[email protected]> wrote:
>>
>> I don't think this is about micro-optimization. Rather, userspace RNGs
>> aren't really possible in a safe way at the moment.
>
> "Bah, humbug", to quote a modern-time philosopher.
>
> It's humbug simply because it makes two assumptions that aren't even valid:
>
> (a) that you have to do it in user space in the first place
>
> (b) that you care about the particular semantics that you are looking for
>
> The thing is, you can just do getrandom(). It's what people already
> do. Problem solved.

We are currently doing this in glibc for our arc4random implementation,
after Jason opposed userspace buffering. If chrony is recompiled
against the glibc version of arc4random (instead of its OpenBSD compat
version, which uses userspace buffering), the result is a 25% drop in
NTP packet response rate:

| The new arc4random using getrandom() seems to have a significant
| impact on performance of chronyd operating as an NTP server. On an
| Intel E3-1220 CPU, I see that the maximum number of requests per
| second dropped by about 25%. That would be an issue for some public
| NTP servers.

arc4random is too slow
<https://sourceware.org/bugzilla/show_bug.cgi?id=29437>

This is *not* “arc4random is 25% slower”, it is the measured overall
impact on server performance.

Historically, the solution spaces for getrandom and arc4random are
slightly different. The backronym is “A Replacement Call For random”,
i.e., you should be able to use arc4random without worrying about
performance. I don't expect cryptographic libraries to turn to
arc4random to implement their random number generators, and I expect
that programmers who use low-level OpenSSL primitives (for example) will
keep calling RAND_bytes instead of arc4random because it is available to
them.

We did these changes on the glibc side because Jason sounded very
confident that he's able to deliver vDSO acceleration for getrandom. If
that fails to materialize, we'll just have to add back userspace
buffering in glibc. At least we can get process fork protection via
MADV_WIPEONFORK, solving a real problem with the usual arc4random compat
implementation. (The OpenBSD mechanism for this is slightly different.)
We won't get VM fork protection or forward secrecy against ptrace. But
the latter is rather speculative anyway because if you can do ptrace
once, you can likely do ptrace twice, the first time patching the
process to remove forward secrecy. There is a real gap for VM forks,
but I'm not sure how much that matters in practice. Live migration has
to be implemented in such a way that this isn't observable (otherwise
TCP connections etc. would break), and long-term keys probably shouldn't
be generated under virtualization anyway.
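
To spell out the MADV_WIPEONFORK pattern referred to above, here is a
rough userspace sketch -- illustrative only, not glibc's actual
arc4random code:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/random.h>
#include <stdint.h>
#include <stddef.h>

struct rng_state {
        uint8_t key[32];
        uint64_t have_key;      /* zeroed, along with key, in the child */
};

static struct rng_state *rng_state_alloc(void)
{
        /* Map a whole page so the madvise() covers exactly this state. */
        struct rng_state *s = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (s == MAP_FAILED)
                return NULL;
        /* After fork(), the child's view of this mapping reads back as
         * zeroes, so have_key is 0 there and the child reseeds itself. */
        if (madvise(s, 4096, MADV_WIPEONFORK))
                return NULL;
        return s;
}

static int rng_ensure_seeded(struct rng_state *s)
{
        if (s->have_key)
                return 0;
        if (getrandom(s->key, sizeof(s->key), 0) != sizeof(s->key))
                return -1;
        s->have_key = 1;
        return 0;
}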

Thanks,
Florian

2023-01-09 14:33:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <[email protected]> wrote:
>
> We did these changes on the glibc side because Jason sounded very
> confident that he's able to deliver vDSO acceleration for getrandom. If
> that fails to materialize, we'll just have to add back userspace
> buffering in glibc.

My whole argument has been that user-space buffering is the sane thing
to do. Most definitely for something like glibc.

The number of people who go "oh, no, my buffer or randomness could be
exposed by insert-odd-situation-here" is approximately zero, and then
the onus should be on *them* to do something special.

Because *they* are special. Precious little snowflake special.

Linus

2023-01-11 07:32:11

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Mon, Jan 09, 2023 at 08:28:58AM -0600, Linus Torvalds wrote:
> On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <[email protected]> wrote:
> >
> > We did these changes on the glibc side because Jason sounded very
> > confident that he's able to deliver vDSO acceleration for getrandom. If
> > that fails to materialize, we'll just have to add back userspace
> > buffering in glibc.
>
> My whole argument has been that user-space buffering is the sane thing
> to do. Most definitely for something like glibc.
>
> The number of people who go "oh, no, my buffer or randomness could be
> exposed by insert-odd-situation-here" is approximately zero, and then
> the onus should be on *them* to do something special.
>
> Because *they* are special. Precious little snowflake special.
>
> Linus

How would userspace decide when to reseed its CRNGs, then?

IMO, the main benefit of the VDSO getrandom over a traditional userspace CRNG is
that it makes reseeds of the kernel's CRNG take effect immediately. See the
cover letter, where Jason explains this.
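
That mechanism boils down to a generation counter shared by the kernel:
bump it on every reseed, and have each thread compare its cached
generation before producing output. A toy model -- the structure and
field names below are illustrative, not the actual ABI:

#include <sys/random.h>
#include <stdint.h>
#include <stdbool.h>

struct shared_rng_data {                /* stand-in for the vDSO data page */
        volatile uint64_t generation;   /* bumped by the kernel on reseed */
};

struct thread_rng_state {
        uint8_t key[32];
        uint64_t generation;            /* generation the key was derived under */
};

static bool thread_rng_refresh(const struct shared_rng_data *shared,
                               struct thread_rng_state *state)
{
        uint64_t gen = shared->generation;

        if (state->generation == gen && gen != 0)
                return true;    /* key is still current, no syscall needed */
        /* The kernel reseeded (or this is first use): pull fresh key
         * material right away, rather than waiting for some time- or
         * byte-count-based heuristic to expire. */
        if (getrandom(state->key, sizeof(state->key), 0) != sizeof(state->key))
                return false;
        state->generation = gen;
        return true;
}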

It's definitely important to make the memory used by userspace CRNGs have
appropriate semantics, but my understanding is that's not the main point.

- Eric

2023-01-11 12:17:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Wed, Jan 11, 2023 at 1:27 AM Eric Biggers <[email protected]> wrote:
>
> How would userspace decide when to reseed its CRNGs, then?

.. and that is relevant to all the VM discussions exactly why?

Linus

2023-01-11 22:25:36

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v14 0/7] implement getrandom() in vDSO

On 01-Jan-2023 05:29:03 PM, Jason A. Donenfeld wrote:
[...]
> Two statements:
>
> 1) Userspace wants faster cryptographically secure random numbers of
> arbitrary size, big or small.
>
> 2) Userspace is currently unable to safely roll its own RNG with the
> same security profile as getrandom().
>
[...]
> API-wise, the vDSO gains this function:
>
> ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state);
>
> The return value and the first 3 arguments are the same as ordinary
> getrandom(), while the last argument is a pointer to some state
> allocated with vgetrandom_alloc(), explained below. Were all four
> arguments passed to the getrandom syscall, nothing different would
> happen, and the functions would have the exact same behavior.
>
> Then, we introduce a new syscall:
>
> void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
> unsigned long addr, unsigned int flags);
>
> This takes a hinted number of opaque states in `num`, and returns a
> pointer to an array of opaque states, the number actually allocated back
> in `num`, and the size in bytes of each one in `size_per_each`, enabling
> a libc to slice up the returned array into a state per each thread. (The
> `flags` and `addr` arguments, as well as the `*size_per_each` input
> value, are reserved for the future and are forced to be zero for now.)
>
> Libc is expected to allocate a chunk of these on first use, and then
> dole them out to threads as they're created, allocating more when
> needed. The returned address of the first state may be passed to
> munmap(2) with a length of `num * size_per_each`, in order to deallocate
> the memory.
>
> We very intentionally do *not* leave state allocation up to the caller
> of vgetrandom, but provide vgetrandom_alloc for that allocation. There
> are too many weird things that can go wrong, and it's important that
> vDSO does not provide too generic of a mechanism. It's not going to
> store its state in just any old memory address. It'll do it only in ones
> it allocates.

[...]
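
For concreteness, a libc might consume the interface quoted above
roughly as follows. This is only a sketch of the proposed calling
convention -- nothing below exists in a released kernel, and error
handling and locking are glossed over:

#include <stddef.h>
#include <sys/types.h>

/* Prototypes as quoted above; these do not exist in a released kernel. */
ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                   void *opaque_state);
void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
                       unsigned long addr, unsigned int flags);

static void *states;                    /* base of the allocated array */
static unsigned int nr_states, state_size, next_state;

static void *state_for_new_thread(void)
{
        if (!states) {
                nr_states = 16;         /* a hint; the kernel may change it */
                states = vgetrandom_alloc(&nr_states, &state_size, 0, 0);
                if (!states)
                        return NULL;
        }
        if (next_state == nr_states)
                return NULL;            /* a real libc would allocate more */
        /* Slice the returned array into one opaque state per thread. */
        return (char *)states + (size_t)next_state++ * state_size;
}

/* Then, per thread:
 *
 *      unsigned char buf[256];
 *      vgetrandom(buf, sizeof(buf), 0, my_state);
 */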

Have you considered extending rseq(2) per-thread "struct rseq" with an
additional "prng_seed" pointer field, which would point to a per-thread
memory area accessible both from userspace (at address
__builtin_thread_pointer() + __rseq_offset) and from the kernel's
return-to-userspace rseq notification code (which can handle page
faults)?

This way, the kernel can update its content when returning to userspace
if an update is needed since the last update.

Would that be sufficient as a prng seed for your security requirements?

Implementation-wise, the semantics of the prng_seed could be entirely
internal to a vgetrandom vDSO implementation, but the allocation of the
memory holding this seed would be the responsibility of libc.

libc could query the size required by the kernel for this prng seed with
a new getauxval(3) entry, e.g. AT_RSEQ_PRNG_SIZE. By doing so, libc
would only allocate as much memory as needed by the kernel vDSO
implementation.

This would remove the need for any kind of vgetrandom_alloc system call
and all its associated complexity.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2023-01-12 18:05:31

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v14 7/7] x86: vdso: Wire up getrandom() vDSO implementation



On 01/01/2023 at 17:29, Jason A. Donenfeld wrote:
> Hook up the generic vDSO implementation to the x86 vDSO data page. Since
> the existing vDSO infrastructure is heavily based on the timekeeping
> functionality, which works over arrays of bases, a new macro is
> introduced for vvars that are not arrays.
>
> The vDSO function requires a ChaCha20 implementation that does not write
> to the stack, yet can still do an entire ChaCha20 permutation, so
> provide this using SSE2, since this is userland code that must work on
> all x86-64 processors. There's a simple test for this code as well.

As far as I understand, the test is not dependent on the architecture;
can it be a separate patch?

Also, as the chacha implementation is standalone and can be tested by
the above-mentioned simple test, can it be a separate patch as well?

Then the last patch only has the glue to wire up the getrandom vDSO to
the architecture, and can be used as a reference for other architectures.

>
> Reviewed-by: Samuel Neves <[email protected]> # for vgetrandom-chacha.S
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vdso/Makefile | 3 +-
> arch/x86/entry/vdso/vdso.lds.S | 2 +
> arch/x86/entry/vdso/vgetrandom-chacha.S | 178 ++++++++++++++++++
> arch/x86/entry/vdso/vgetrandom.c | 17 ++
> arch/x86/include/asm/vdso/getrandom.h | 55 ++++++
> arch/x86/include/asm/vdso/vsyscall.h | 2 +
> arch/x86/include/asm/vvar.h | 16 ++
> tools/testing/selftests/vDSO/.gitignore | 1 +
> tools/testing/selftests/vDSO/Makefile | 9 +
> .../testing/selftests/vDSO/vdso_test_chacha.c | 43 +++++
> 11 files changed, 326 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/entry/vdso/vgetrandom-chacha.S
> create mode 100644 arch/x86/entry/vdso/vgetrandom.c
> create mode 100644 arch/x86/include/asm/vdso/getrandom.h
> create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3604074a878b..ed689d831362 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -272,6 +272,7 @@ config X86
> select HAVE_UNSTABLE_SCHED_CLOCK
> select HAVE_USER_RETURN_NOTIFIER
> select HAVE_GENERIC_VDSO
> + select VDSO_GETRANDOM if X86_64
> select HOTPLUG_SMT if SMP
> select IRQ_FORCED_THREADING
> select NEED_PER_CPU_EMBED_FIRST_CHUNK
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index 838613ac15b8..3979bb4a61ae 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -27,7 +27,7 @@ VDSO32-$(CONFIG_X86_32) := y
> VDSO32-$(CONFIG_IA32_EMULATION) := y
>
> # files to link into the vdso
> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vgetrandom.o vgetrandom-chacha.o
> vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
> vobjs32-y += vdso32/vclock_gettime.o
> vobjs-$(CONFIG_X86_SGX) += vsgx.o
> @@ -105,6 +105,7 @@ CFLAGS_REMOVE_vclock_gettime.o = -pg
> CFLAGS_REMOVE_vdso32/vclock_gettime.o = -pg
> CFLAGS_REMOVE_vgetcpu.o = -pg
> CFLAGS_REMOVE_vsgx.o = -pg
> +CFLAGS_REMOVE_vgetrandom.o = -pg
>
> #
> # X32 processes use x32 vDSO to access 64bit kernel data.
> diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
> index e8c60ae7a7c8..0bab5f4af6d1 100644
> --- a/arch/x86/entry/vdso/vdso.lds.S
> +++ b/arch/x86/entry/vdso/vdso.lds.S
> @@ -30,6 +30,8 @@ VERSION {
> #ifdef CONFIG_X86_SGX
> __vdso_sgx_enter_enclave;
> #endif
> + getrandom;
> + __vdso_getrandom;
> local: *;
> };
> }
> diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
> new file mode 100644
> index 000000000000..d79e2bd97598
> --- /dev/null
> +++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
> @@ -0,0 +1,178 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +.section .rodata, "a"
> +.align 16
> +CONSTANTS: .octa 0x6b20657479622d323320646e61707865
> +.text
> +
> +/*
> + * Very basic SSE2 implementation of ChaCha20. Produces a given positive number
> + * of blocks of output with a nonce of 0, taking an input key and 8-byte
> + * counter. Importantly does not spill to the stack. Its arguments are:
> + *
> + * rdi: output bytes
> + * rsi: 32-byte key input
> + * rdx: 8-byte counter input/output

Why an 8-byte counter? According to RFC 7539, chacha20 takes:

The inputs to ChaCha20 are:

o A 256-bit key, treated as a concatenation of eight 32-bit little-
endian integers.

o A 96-bit nonce, treated as a concatenation of three 32-bit little-
endian integers.

o A 32-bit block count parameter, treated as a 32-bit little-endian
integer.


Are you mixing up the upper part of the counter with the nonce? In that
case you can't say you use a 0 nonce, can you?


> + * rcx: number of 64-byte blocks to write to output
> + */
> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> +
> +.set output, %rdi
> +.set key, %rsi
> +.set counter, %rdx
> +.set nblocks, %rcx
> +.set i, %al
> +/* xmm registers are *not* callee-save. */
> +.set state0, %xmm0
> +.set state1, %xmm1
> +.set state2, %xmm2
> +.set state3, %xmm3
> +.set copy0, %xmm4
> +.set copy1, %xmm5
> +.set copy2, %xmm6
> +.set copy3, %xmm7
> +.set temp, %xmm8
> +.set one, %xmm9
> +
> + /* copy0 = "expand 32-byte k" */
> + movaps CONSTANTS(%rip),copy0
> + /* copy1,copy2 = key */
> + movups 0x00(key),copy1
> + movups 0x10(key),copy2
> + /* copy3 = counter || zero nonce */
> + movq 0x00(counter),copy3
> + /* one = 1 || 0 */
> + movq $1,%rax
> + movq %rax,one
> +
> +.Lblock:
> + /* state0,state1,state2,state3 = copy0,copy1,copy2,copy3 */
> + movdqa copy0,state0
> + movdqa copy1,state1
> + movdqa copy2,state2
> + movdqa copy3,state3
> +
> + movb $10,i
> +.Lpermute:
> + /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> + paddd state1,state0
> + pxor state0,state3
> + movdqa state3,temp
> + pslld $16,temp
> + psrld $16,state3
> + por temp,state3

There are a lot of similarities between all the blocks; can you use GAS
macros to avoid the repetition?


> +
> + /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> + paddd state3,state2
> + pxor state2,state1
> + movdqa state1,temp
> + pslld $12,temp
> + psrld $20,state1
> + por temp,state1
> +
> + /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> + paddd state1,state0
> + pxor state0,state3
> + movdqa state3,temp
> + pslld $8,temp
> + psrld $24,state3
> + por temp,state3
> +
> + /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> + paddd state3,state2
> + pxor state2,state1
> + movdqa state1,temp
> + pslld $7,temp
> + psrld $25,state1
> + por temp,state1
> +
> + /* state1[0,1,2,3] = state1[1,2,3,0] */
> + pshufd $0x39,state1,state1
> + /* state2[0,1,2,3] = state2[2,3,0,1] */
> + pshufd $0x4e,state2,state2
> + /* state3[0,1,2,3] = state3[3,0,1,2] */
> + pshufd $0x93,state3,state3
> +
> + /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> + paddd state1,state0
> + pxor state0,state3
> + movdqa state3,temp
> + pslld $16,temp
> + psrld $16,state3
> + por temp,state3
> +
> + /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> + paddd state3,state2
> + pxor state2,state1
> + movdqa state1,temp
> + pslld $12,temp
> + psrld $20,state1
> + por temp,state1
> +
> + /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> + paddd state1,state0
> + pxor state0,state3
> + movdqa state3,temp
> + pslld $8,temp
> + psrld $24,state3
> + por temp,state3
> +
> + /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> + paddd state3,state2
> + pxor state2,state1
> + movdqa state1,temp
> + pslld $7,temp
> + psrld $25,state1
> + por temp,state1
> +
> + /* state1[0,1,2,3] = state1[3,0,1,2] */
> + pshufd $0x93,state1,state1
> + /* state2[0,1,2,3] = state2[2,3,0,1] */
> + pshufd $0x4e,state2,state2
> + /* state3[0,1,2,3] = state3[1,2,3,0] */
> + pshufd $0x39,state3,state3
> +
> + decb i
> + jnz .Lpermute
> +
> + /* output0 = state0 + copy0 */
> + paddd copy0,state0
> + movups state0,0x00(output)
> + /* output1 = state1 + copy1 */
> + paddd copy1,state1
> + movups state1,0x10(output)
> + /* output2 = state2 + copy2 */
> + paddd copy2,state2
> + movups state2,0x20(output)
> + /* output3 = state3 + copy3 */
> + paddd copy3,state3
> + movups state3,0x30(output)
> +
> + /* ++copy3.counter */
> + paddq one,copy3
> +
> + /* output += 64, --nblocks */
> + addq $64,output
> + decq nblocks
> + jnz .Lblock
> +
> + /* counter = copy3.counter */
> + movq copy3,0x00(counter)
> +
> + /* Zero out the potentially sensitive regs, in case nothing uses these again. */
> + pxor state0,state0
> + pxor state1,state1
> + pxor state2,state2
> + pxor state3,state3
> + pxor copy1,copy1
> + pxor copy2,copy2
> + pxor temp,temp
> +
> + ret
> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
> diff --git a/arch/x86/entry/vdso/vgetrandom.c b/arch/x86/entry/vdso/vgetrandom.c
> new file mode 100644
> index 000000000000..6045ded5da90
> --- /dev/null
> +++ b/arch/x86/entry/vdso/vgetrandom.c
> @@ -0,0 +1,17 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> + */
> +#include <linux/types.h>
> +
> +#include "../../../../lib/vdso/getrandom.c"
> +
> +ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *state);
> +
> +ssize_t __vdso_getrandom(void *buffer, size_t len, unsigned int flags, void *state)
> +{
> + return __cvdso_getrandom(buffer, len, flags, state);
> +}
> +
> +ssize_t getrandom(void *, size_t, unsigned int, void *)
> + __attribute__((weak, alias("__vdso_getrandom")));
> diff --git a/arch/x86/include/asm/vdso/getrandom.h b/arch/x86/include/asm/vdso/getrandom.h
> new file mode 100644
> index 000000000000..46f99d735ae6
> --- /dev/null
> +++ b/arch/x86/include/asm/vdso/getrandom.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> + */
> +#ifndef __ASM_VDSO_GETRANDOM_H
> +#define __ASM_VDSO_GETRANDOM_H
> +
> +#ifndef __ASSEMBLY__
> +
> +#include <asm/unistd.h>
> +#include <asm/vvar.h>
> +
> +/**
> + * getrandom_syscall - Invoke the getrandom() syscall.
> + * @buffer: Destination buffer to fill with random bytes.
> + * @len: Size of @buffer in bytes.
> + * @flags: Zero or more GRND_* flags.
> + * Returns the number of random bytes written to @buffer, or a negative value indicating an error.
> + */
> +static __always_inline ssize_t getrandom_syscall(void *buffer, size_t len, unsigned int flags)
> +{
> + long ret;
> +
> + asm ("syscall" : "=a" (ret) :
> + "0" (__NR_getrandom), "D" (buffer), "S" (len), "d" (flags) :
> + "rcx", "r11", "memory");
> +
> + return ret;
> +}
> +
> +#define __vdso_rng_data (VVAR(_vdso_rng_data))
> +
> +static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(void)
> +{
> + if (__vdso_data->clock_mode == VDSO_CLOCKMODE_TIMENS)
> + return (void *)&__vdso_rng_data + ((void *)&__timens_vdso_data - (void *)&__vdso_data);
> + return &__vdso_rng_data;
> +}
> +
> +/**
> + * __arch_chacha20_blocks_nostack - Generate ChaCha20 stream without using the stack.
> + * @dst_bytes: Destination buffer to hold @nblocks * 64 bytes of output.
> + * @key: 32-byte input key.
> + * @counter: 8-byte counter, read on input and updated on return.
> + * @nblocks: Number of blocks to generate.
> + *
> + * Generates a given positive number of blocks of ChaCha20 output with nonce=0, and does not write
> + * to any stack or memory outside of the parameters passed to it, in order to mitigate stack data
> + * leaking into forked child processes.
> + */
> +extern void __arch_chacha20_blocks_nostack(u8 *dst_bytes, const u32 *key, u32 *counter, size_t nblocks);
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +#endif /* __ASM_VDSO_GETRANDOM_H */
> diff --git a/arch/x86/include/asm/vdso/vsyscall.h b/arch/x86/include/asm/vdso/vsyscall.h
> index be199a9b2676..71c56586a22f 100644
> --- a/arch/x86/include/asm/vdso/vsyscall.h
> +++ b/arch/x86/include/asm/vdso/vsyscall.h
> @@ -11,6 +11,8 @@
> #include <asm/vvar.h>
>
> DEFINE_VVAR(struct vdso_data, _vdso_data);
> +DEFINE_VVAR_SINGLE(struct vdso_rng_data, _vdso_rng_data);
> +
> /*
> * Update the vDSO data page to keep in sync with kernel timekeeping.
> */
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index 183e98e49ab9..9d9af37f7cab 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -26,6 +26,8 @@
> */
> #define DECLARE_VVAR(offset, type, name) \
> EMIT_VVAR(name, offset)
> +#define DECLARE_VVAR_SINGLE(offset, type, name) \
> + EMIT_VVAR(name, offset)
>
> #else
>
> @@ -37,6 +39,10 @@ extern char __vvar_page;
> extern type timens_ ## name[CS_BASES] \
> __attribute__((visibility("hidden"))); \
>
> +#define DECLARE_VVAR_SINGLE(offset, type, name) \
> + extern type vvar_ ## name \
> + __attribute__((visibility("hidden"))); \
> +
> #define VVAR(name) (vvar_ ## name)
> #define TIMENS(name) (timens_ ## name)
>
> @@ -44,12 +50,22 @@ extern char __vvar_page;
> type name[CS_BASES] \
> __attribute__((section(".vvar_" #name), aligned(16))) __visible
>
> +#define DEFINE_VVAR_SINGLE(type, name) \
> + type name \
> + __attribute__((section(".vvar_" #name), aligned(16))) __visible
> +
> #endif
>
> /* DECLARE_VVAR(offset, type, name) */
>
> DECLARE_VVAR(128, struct vdso_data, _vdso_data)
>
> +#if !defined(_SINGLE_DATA)
> +#define _SINGLE_DATA
> +DECLARE_VVAR_SINGLE(640, struct vdso_rng_data, _vdso_rng_data)
> +#endif
> +
> #undef DECLARE_VVAR
> +#undef DECLARE_VVAR_SINGLE
>
> #endif
> diff --git a/tools/testing/selftests/vDSO/.gitignore b/tools/testing/selftests/vDSO/.gitignore
> index 7dbfdec53f3d..30d5c8f0e5c7 100644
> --- a/tools/testing/selftests/vDSO/.gitignore
> +++ b/tools/testing/selftests/vDSO/.gitignore
> @@ -7,3 +7,4 @@ vdso_test_gettimeofday
> vdso_test_getcpu
> vdso_standalone_test_x86
> vdso_test_getrandom
> +vdso_test_chacha
> diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
> index a33b4d200a32..54a015afe60c 100644
> --- a/tools/testing/selftests/vDSO/Makefile
> +++ b/tools/testing/selftests/vDSO/Makefile
> @@ -3,6 +3,7 @@ include ../lib.mk
>
> uname_M := $(shell uname -m 2>/dev/null || echo not)
> ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
> +SODIUM := $(shell pkg-config --libs libsodium 2>/dev/null)
>
> TEST_GEN_PROGS := $(OUTPUT)/vdso_test_gettimeofday $(OUTPUT)/vdso_test_getcpu
> TEST_GEN_PROGS += $(OUTPUT)/vdso_test_abi
> @@ -12,9 +13,15 @@ TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
> endif
> TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
> TEST_GEN_PROGS += $(OUTPUT)/vdso_test_getrandom
> +ifeq ($(uname_M),x86_64)
> +ifneq ($(SODIUM),)
> +TEST_GEN_PROGS += $(OUTPUT)/vdso_test_chacha
> +endif
> +endif
>
> CFLAGS := -std=gnu99
> CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
> +CFLAGS_vdso_test_chacha := $(SODIUM) -idirafter $(top_srcdir)/include -idirafter $(top_srcdir)/arch/$(ARCH)/include -D__ASSEMBLY__ -DBULID_VDSO -DCONFIG_FUNCTION_ALIGNMENT=0 -Wa,--noexecstack
> LDFLAGS_vdso_test_correctness := -ldl
> ifeq ($(CONFIG_X86_32),y)
> LDLIBS += -lgcc_s
> @@ -35,3 +42,5 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
> -o $@ \
> $(LDFLAGS_vdso_test_correctness)
> $(OUTPUT)/vdso_test_getrandom: parse_vdso.c
> +$(OUTPUT)/vdso_test_chacha: CFLAGS += $(CFLAGS_vdso_test_chacha)
> +$(OUTPUT)/vdso_test_chacha: $(top_srcdir)/arch/$(ARCH)/entry/vdso/vgetrandom-chacha.S
> diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha.c b/tools/testing/selftests/vDSO/vdso_test_chacha.c
> new file mode 100644
> index 000000000000..bce7a7752b11
> --- /dev/null
> +++ b/tools/testing/selftests/vDSO/vdso_test_chacha.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> + */
> +
> +#include <sodium/crypto_stream_chacha20.h>

Is that standard? Does every distribution have sodium?

> +#include <sys/random.h>
> +#include <string.h>
> +#include <stdint.h>
> +#include "../kselftest.h"
> +
> +extern void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes, const uint8_t *key, uint32_t *counter, size_t nblocks);
> +
> +int main(int argc, char *argv[])
> +{
> + enum { TRIALS = 1000, BLOCKS = 128, BLOCK_SIZE = 64 };
> + static const uint8_t nonce[8] = { 0 };
> + uint32_t counter[2];
> + uint8_t key[32];
> + uint8_t output1[BLOCK_SIZE * BLOCKS], output2[BLOCK_SIZE * BLOCKS];
> +
> + ksft_print_header();
> + ksft_set_plan(1);
> +
> + for (unsigned int trial = 0; trial < TRIALS; ++trial) {
> + if (getrandom(key, sizeof(key), 0) != sizeof(key)) {
> + printf("getrandom() failed!\n");
> + return KSFT_SKIP;
> + }
> + crypto_stream_chacha20(output1, sizeof(output1), nonce, key);
> + for (unsigned int split = 0; split < BLOCKS; ++split) {
> + memset(output2, 'X', sizeof(output2));
> + memset(counter, 0, sizeof(counter));
> + if (split)
> + __arch_chacha20_blocks_nostack(output2, key, counter, split);
> + __arch_chacha20_blocks_nostack(output2 + split * BLOCK_SIZE, key, counter, BLOCKS - split);
> + if (memcmp(output1, output2, sizeof(output1)))
> + return KSFT_FAIL;
> + }
> + }
> + ksft_test_result_pass("chacha: PASS\n");
> + return KSFT_PASS;
> +}

Christophe

2023-01-12 18:22:56

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v14 7/7] x86: vdso: Wire up getrandom() vDSO implementation

Hi Christophe,

Thanks for the review.

On Thu, Jan 12, 2023 at 6:27 PM Christophe Leroy
<[email protected]> wrote:
> On 01/01/2023 at 17:29, Jason A. Donenfeld wrote:
> > Hook up the generic vDSO implementation to the x86 vDSO data page. Since
> > the existing vDSO infrastructure is heavily based on the timekeeping
> > functionality, which works over arrays of bases, a new macro is
> > introduced for vvars that are not arrays.
> >
> > The vDSO function requires a ChaCha20 implementation that does not write
> > to the stack, yet can still do an entire ChaCha20 permutation, so
> > provide this using SSE2, since this is userland code that must work on
> > all x86-64 processors. There's a simple test for this code as well.
>
> As far as I understand, the test is not dependent on the architecture;
> can it be a separate patch?

The test is dependent on architectures for which there's a vDSO
implementation. I could move it to a patch before or after this one,
though, if you think it'd be better to keep this commit as a template
commit for other architectures.

> Also, as the chacha implementation is standalone and can be tested by
> the above-mentioned simple test, can it be a separate patch as well?

No, that actually needs to be part of this one, as it's part and
parcel of the x86 implementation.

> Then the last patch only has the glue to wire up the getrandom vDSO to
> the architecture, and can be used as a reference for other architectures.

This is part of the required glue, so no, it belongs in this one.
> > + * rdx: 8-byte counter input/output
>
> Why an 8-byte counter? According to RFC 7539, chacha20 takes:
> Are you mixing up the upper part of the counter with the nonce? In that
> case you can't say you use a 0 nonce, can you?

No, I'm not mixing anything up. This is the same algorithm that
random.c uses. And WireGuard, for that matter. 8-byte nonce, 8-byte
block counter. In this case, the nonce is 0.
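
Concretely, the only difference is how the last four words of the
16-word state are split between counter and nonce. An illustrative
comparison in plain C (not code from the patch):

/*
 * RFC 7539:  words 0-3 constants, 4-11 key, 12 counter (32-bit),
 *            13-15 nonce (96-bit)
 * djb/original (as in random.c and here):
 *            words 0-3 constants, 4-11 key, 12-13 counter (64-bit),
 *            14-15 nonce (64-bit, fixed at 0 in this code)
 */
#include <stdint.h>
#include <string.h>

static const uint32_t chacha_const[4] = {
        0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 /* "expand 32-byte k" */
};

static void chacha_init_djb(uint32_t state[16], const uint32_t key[8],
                            uint64_t counter)
{
        memcpy(state, chacha_const, sizeof(chacha_const));
        memcpy(state + 4, key, 32);
        state[12] = (uint32_t)counter;
        state[13] = (uint32_t)(counter >> 32);
        state[14] = 0;          /* 64-bit nonce ... */
        state[15] = 0;          /* ... fixed at zero */
}

static void chacha_init_rfc7539(uint32_t state[16], const uint32_t key[8],
                                uint32_t counter, const uint32_t nonce[3])
{
        memcpy(state, chacha_const, sizeof(chacha_const));
        memcpy(state + 4, key, 32);
        state[12] = counter;
        memcpy(state + 13, nonce, 12);
}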

> > +#include <sodium/crypto_stream_chacha20.h>
>
> Is that standard? Does every distribution have sodium?

As far as I can tell, yes.

Jason