> On May 28, 2024, at 5:25 AM, Jason A. Donenfeld <[email protected]> wrote:
>
> Provide a generic C vDSO getrandom() implementation, which operates on
> an opaque state returned by vgetrandom_alloc() and produces random bytes
> the same way as getrandom(). This has a the API signature:
>
> ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state);
> +/**
> + * type vdso_kernel_ulong - unsigned long type that matches kernel's unsigned long
> + *
> + * Data shared between userspace and the kernel must operate the same way in both 64-bit code and in
> + * 32-bit compat code, over the same potentially 64-bit kernel. This type represents the size of an
> + * unsigned long as used by kernel code. This isn't necessarily the same as an unsigned long as used
> + * by userspace, however.
Why is this better than using plain u64? It’s certainly more
complicated. It also rather fundamentally breaks CRIU on 32-bit
userspace (although CRIU may well be unable to keep vgetrandom working
after a restore onto a different kernel anyway). Admittedly 32-bit
userspace is a slowly dying breed, but still.
> + *
> + * +-------------------+-------------------+------------------+-------------------+
> + * | 32-bit userspace | 32-bit userspace | 64-bit userspace | 64-bit userspace |
> + * | unsigned long | vdso_kernel_ulong | unsigned long | vdso_kernel_ulong |
> + * +---------------+-------------------+-------------------+------------------+-------------------+
> + * | 32-bit kernel | ✓ same size | ✓ same size |
> + * | unsigned long | | |
> + * +---------------+-------------------+-------------------+------------------+-------------------+
> + * | 64-bit kernel | ✘ different size! | ✓ same size | ✓ same size | ✓ same size |
> + * | unsigned long | | | | |
> + * +---------------+-------------------+-------------------+------------------+-------------------+
> + */
> +#ifdef CONFIG_64BIT
> +typedef u64 vdso_kernel_ulong;
> +#else
> +typedef u32 vdso_kernel_ulong;
> +#endif
> +
> +#endif /* __VDSO_TYPES_H */
> diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
> new file mode 100644
> index 000000000000..4d9bb59985f8
> --- /dev/null
> +++ b/lib/vdso/getrandom.c
> @@ -0,0 +1,226 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> + */
> +
> +#include <linux/cache.h>
> +#include <linux/kernel.h>
> +#include <linux/time64.h>
> +#include <vdso/datapage.h>
> +#include <vdso/getrandom.h>
> +#include <asm/vdso/getrandom.h>
> +#include <asm/vdso/vsyscall.h>
> +
> +#define MEMCPY_AND_ZERO_SRC(type, dst, src, len) do { \
> + while (len >= sizeof(type)) { \
> + __put_unaligned_t(type, __get_unaligned_t(type, src), dst); \
> + __put_unaligned_t(type, 0, src); \
> + dst += sizeof(type); \
> + src += sizeof(type); \
> + len -= sizeof(type); \
> + } \
> +} while (0)
> +
> +static void memcpy_and_zero_src(void *dst, void *src, size_t len)
> +{
> + if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> + if (IS_ENABLED(CONFIG_64BIT))
> + MEMCPY_AND_ZERO_SRC(u64, dst, src, len);
> + MEMCPY_AND_ZERO_SRC(u32, dst, src, len);
> + MEMCPY_AND_ZERO_SRC(u16, dst, src, len);
> + }
> + MEMCPY_AND_ZERO_SRC(u8, dst, src, len);
> +}
> +
> +/**
> + * __cvdso_getrandom_data - Generic vDSO implementation of getrandom() syscall.
> + * @rng_info: Describes state of kernel RNG, memory shared with kernel.
> + * @buffer: Destination buffer to fill with random bytes.
> + * @len: Size of @buffer in bytes.
> + * @flags: Zero or more GRND_* flags.
> + * @opaque_state: Pointer to an opaque state area.
> + *
> + * This implements a "fast key erasure" RNG using ChaCha20, in the same way that the kernel's
> + * getrandom() syscall does. It periodically reseeds its key from the kernel's RNG, at the same
> + * schedule that the kernel's RNG is reseeded. If the kernel's RNG is not ready, then this always
> + * calls into the syscall.
> + *
> + * @opaque_state *must* be allocated using the vgetrandom_alloc() syscall. Unless external locking
> + * is used, one state must be allocated per thread, as it is not safe to call this function
> + * concurrently with the same @opaque_state. However, it is safe to call this using the same
> + * @opaque_state that is shared between main code and signal handling code, within the same thread.
> + *
> + * Returns the number of random bytes written to @buffer, or a negative value indicating an error.
> + */
> +static __always_inline ssize_t
> +__cvdso_getrandom_data(const struct vdso_rng_data *rng_info, void *buffer, size_t len,
> + unsigned int flags, void *opaque_state)
I don’t love this function signature. I generally think that, if
you’re going to have user code pass a pointer to kernel code, either
make the buffer have a well defined, constant size or pass a length.
As it stands, one cannot locally prove that user code that calls it is
memory-safe. In fact, any caller that has the misfortune of running
under CRIU is *not* memory safe if CRIU allows the vDSO to be
preserved. Ouch. (CRIU has some special code for this. I'm not 100%
clear on all the details.) One could maybe sort of get away with
treating the provided opaque_state as a completely opaque value and
not a pointer, but then the mechanism for allocating these states
should be adjusted accordingly.
One thing that occurs to me is that, if this thing were to be made
CRIU-safe, the buffer could have a magic number that changes any time
the data structure changes, and the vDSO could check, at the beginning
and end of the call, that the magic number is correct. Doing this
would require using a special VMA type instead of just a wipe-on-fork
mapping, which could plausibly be a good thing anyway. (Hmm, we don't
just want WIPEONFORK. We should probably also wipe on swap-out and,
more importantly, we should absolutely wipe on any sort of CRIU-style
checkpointing. Perhaps a special VMA would be a good thing for
multiple reasons.
> +{
> + ssize_t ret = min_t(size_t, INT_MAX & PAGE_MASK /* = MAX_RW_COUNT */, len);
> + struct vgetrandom_state *state = opaque_state;
> + size_t batch_len, nblocks, orig_len = len;
> + unsigned long current_generation;
> + void *orig_buffer = buffer;
> + u32 counter[2] = { 0 };
> + bool in_use, have_retried = false;
> +
> + /* The state must not straddle a page, since pages can be zeroed at any time. */
> + if (unlikely(((unsigned long)opaque_state & ~PAGE_MASK) + sizeof(*state) > PAGE_SIZE))
> + goto fallback_syscall;
This is weird. Either the provided pointer is valid or it isn’t.
Reasonable outcomes are a segfault if the pointer is bad or success
(or fallback if needed for some reason) if the pointer is good. Why
is there specific code to catch a specific sort of pointer screwup
here?
On Fri, May 31, 2024 at 04:06:37PM -0700, Andy Lutomirski wrote:
> > On May 28, 2024, at 5:25 AM, Jason A. Donenfeld <[email protected]> wrote:
> >
> > Provide a generic C vDSO getrandom() implementation, which operates on
> > an opaque state returned by vgetrandom_alloc() and produces random bytes
> > the same way as getrandom(). This has a the API signature:
> >
> > ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state);
>
> > +/**
> > + * type vdso_kernel_ulong - unsigned long type that matches kernel's unsigned long
> > + *
> > + * Data shared between userspace and the kernel must operate the same way in both 64-bit code and in
> > + * 32-bit compat code, over the same potentially 64-bit kernel. This type represents the size of an
> > + * unsigned long as used by kernel code. This isn't necessarily the same as an unsigned long as used
> > + * by userspace, however.
>
> Why is this better than using plain u64? It’s certainly more
> complicated. It also rather fundamentally breaks CRIU on 32-bit
> userspace (although CRIU may well be unable to keep vgetrandom working
> after a restore onto a different kernel anyway). Admittedly 32-bit
> userspace is a slowly dying breed, but still.
That came out of this conversation: https://lore.kernel.org/all/[email protected]/
(And I'd like single instruction increments, which means long, not u64
on 32-bit machines.)
> > +{
> > + ssize_t ret = min_t(size_t, INT_MAX & PAGE_MASK /* = MAX_RW_COUNT */, len);
> > + struct vgetrandom_state *state = opaque_state;
> > + size_t batch_len, nblocks, orig_len = len;
> > + unsigned long current_generation;
> > + void *orig_buffer = buffer;
> > + u32 counter[2] = { 0 };
> > + bool in_use, have_retried = false;
> > +
> > + /* The state must not straddle a page, since pages can be zeroed at any time. */
> > + if (unlikely(((unsigned long)opaque_state & ~PAGE_MASK) + sizeof(*state) > PAGE_SIZE))
> > + goto fallback_syscall;
>
> This is weird. Either the provided pointer is valid or it isn’t.
> Reasonable outcomes are a segfault if the pointer is bad or success
> (or fallback if needed for some reason) if the pointer is good. Why
> is there specific code to catch a specific sort of pointer screwup
> here?
I guess I could make it return -EFAULT in this case, rather than
silently succeeding.
Jason