2022-10-03 22:37:44

by Ali Raza

Subject: [RFC UKL 00/10] Unikernel Linux (UKL)

Unikernel Linux (UKL) is a research project aimed at integrating
application-specific optimizations into the Linux kernel. This RFC aims to
introduce this research to the community. Any feedback regarding the idea,
goals, implementation and research is highly appreciated.

Unikernels are specialized operating systems where an application is linked
directly with the kernel and runs in supervisor mode. This allows
developers to implement application-specific optimizations in the kernel,
which the application can invoke directly (without going through the
syscall path). An application can control scheduling and resource
management and directly access the hardware. The application and the kernel
can be co-optimized, e.g., through LTO and PGO. All of these optimizations,
and others, provide applications with huge performance benefits over
general purpose operating systems.

Linux is the de facto operating system of today. Applications depend on its
battle-tested code base, large developer community, support for legacy
code, a huge ecosystem of tools and utilities, and a wide range of
compatible hardware and device drivers. Linux also allows some degree of
application-specific optimization through build-time config options,
runtime configuration, and, more recently, eBPF. Still, there is a need
for even more fine-grained application-specific optimizations, and some
developers resort to kernel bypass techniques.

Unikernel Linux (UKL) aims to get the best of both worlds by bringing
application specific optimizations to the Linux ecosystem. This way,
unmodified applications can keep getting the benefits of Linux while taking
advantage of the unikernel-style optimizations. Optionally, applications
can be modified to invoke deeper optimizations.

There are two steps to unikernel-izing Linux: first, equip Linux with
a unikernel model, and second, actually use that model to implement
application-specific optimizations. This patch series focuses on the first
part. With it, unmodified applications can be built as Linux
unikernels, albeit with only modest performance advantages. Like
unikernels, UKL allows an application to be statically linked into the
kernel and executed in supervisor mode. However, UKL preserves most of the
invariants and design of Linux, including a separate pageable application
portion of the address space and a pinned kernel portion, the ability to
run multiple processes, and distinct execution modes for application and
kernel code. Kernel execution mode and application execution mode are
different, e.g., the application execution mode allows application threads
to be scheduled, handle signals, etc., which do not apply to kernel
threads. An application built as a Linux unikernel has its text and data
loaded with the kernel at boot time, while the rest of the address space
remains unchanged. These applications invoke system call functionality
through a function call into the kernel system call entry point instead of
through the syscall assembly instruction. UKL supports a normal userspace,
so the UKL application can be started, managed, profiled, etc., using
normal command line utilities.

Once Linux has a unikernel model, different application-specific
optimizations are possible. We have tried a few, e.g., fast system call
transitions, shared stacks to allow LTO, and invoking kernel functions
directly. We have seen huge performance benefits; the details are outside
the scope of this patch series and can be found in our paper
(https://arxiv.org/pdf/2206.00789.pdf).

UKL differs significantly from previous projects such as UML, KML and LKL.
User Mode Linux (UML) is a virtual machine monitor implemented on top of
the syscall interface, a very different goal from UKL. Kernel Mode Linux
(KML) allows applications to run in kernel mode and replaces syscalls with
function calls. While KML stops there, UKL goes further: it links
applications and the kernel together, which allows further optimizations,
e.g., fast system call transitions, shared stacks to allow LTO, and
invoking kernel functions directly. Linux Kernel Library (LKL) harvests
arch-independent code from Linux and takes it to userspace as a library to
be linked with applications; a host needs to provide the arch-dependent
functionality. This model is very different from UKL. A detailed
discussion of related work is present in the paper linked above.

See samples/ukl for a simple TCP echo server example which can be built as
a normal user space application and also as a UKL application. In the Linux
config options, a path to the compiled and partially linked application
binary can be specified. A kernel built with UKL enabled will look for the
binary at this location and link it with the kernel. Applications and
required libraries need to be compiled with the -mno-red-zone and
-mcmodel=kernel flags: the former because kernel mode execution can
trample on application red zones, the latter because the application needs
the correct memory model in order to link with the kernel and be loaded in
the high end of the address space. Examples of other applications, such as
Redis and Memcached, along with glibc, libgcc, etc., can be found at
https://github.com/unikernelLinux/ukl

List of authors and contributors:
=================================

Ali Raza - [email protected]
Thomas Unger - [email protected]
Matthew Boyd - [email protected]
Eric Munson - [email protected]
Parul Sohal - [email protected]
Ulrich Drepper - [email protected]
Richard W.M. Jones - [email protected]
Daniel Bristot de Oliveira - [email protected]
Larry Woodman - [email protected]
Renato Mancuso - [email protected]
Jonathan Appavoo - [email protected]
Orran Krieger - [email protected]

Ali Raza (9):
kbuild: Add sections and symbols to linker script for UKL support
x86/boot: Load the PT_TLS segment for Unikernel configs
sched: Add task_struct tracking of kernel or application execution
x86/entry: Create alternate entry path for system calls
x86/uaccess: Make access_ok UKL aware
x86/fault: Skip checking kernel mode access to user address space for
UKL
x86/signal: Adjust signal handler register values and return frame
exec: Make exec path for starting UKL application
Kconfig: Add config option for enabling and sample for testing UKL

Eric B Munson (1):
exec: Give userspace a method for starting UKL process

Documentation/index.rst | 1 +
Documentation/ukl/ukl.rst | 104 +++++++++++++++++++++++
Kconfig | 2 +
Makefile | 4 +
arch/x86/boot/compressed/misc.c | 3 +
arch/x86/entry/entry_64.S | 133 ++++++++++++++++++++++++++++++
arch/x86/include/asm/elf.h | 9 +-
arch/x86/include/asm/uaccess.h | 8 ++
arch/x86/kernel/process.c | 13 +++
arch/x86/kernel/process_64.c | 49 ++++++++---
arch/x86/kernel/signal.c | 22 +++--
arch/x86/kernel/vmlinux.lds.S | 98 ++++++++++++++++++++++
arch/x86/mm/fault.c | 7 +-
fs/binfmt_elf.c | 28 +++++++
fs/exec.c | 75 +++++++++++++----
include/asm-generic/sections.h | 4 +
include/asm-generic/vmlinux.lds.h | 32 ++++++-
include/linux/sched.h | 26 ++++++
kernel/Kconfig.ukl | 41 +++++++++
samples/ukl/Makefile | 16 ++++
samples/ukl/README | 17 ++++
samples/ukl/syscall.S | 28 +++++++
samples/ukl/tcp_server.c | 99 ++++++++++++++++++++++
scripts/mod/modpost.c | 4 +
24 files changed, 785 insertions(+), 38 deletions(-)
create mode 100644 Documentation/ukl/ukl.rst
create mode 100644 kernel/Kconfig.ukl
create mode 100644 samples/ukl/Makefile
create mode 100644 samples/ukl/README
create mode 100644 samples/ukl/syscall.S
create mode 100644 samples/ukl/tcp_server.c


base-commit: 4fe89d07dcc2804c8b562f6c7896a45643d34b2f
--
2.21.3


2022-10-03 22:50:14

by Ali Raza

Subject: [RFC UKL 08/10] exec: Make exec path for starting UKL application

The UKL application still relies on much of the setup done to start a
standard user space process, so we still need to use much of that path.
There are several steps that the UKL application doesn't need or want, so
we bypass them in the case of UKL. These are: ELF loading, because the
application is part of the kernel image; and segment register value
initialization. We need to record a starting location for the application
heap; this is normally the end of the ELF binary, once loaded. We choose an
arbitrary low address because there is no binary to load. We also hardcode
the entry point for the application to ukl__start, which is glibc's entry
point (_start) with the 'ukl_' prefix.
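
To illustrate the segment register point, here is a minimal userspace
sketch of the selector choice made in the start_thread_common() hunk below.
All MODEL_* names and the model_* helpers are made up for illustration, and
the selector values are assumed to match the usual x86-64 definitions in
arch/x86/include/asm/segment.h; this is not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 selector values (illustrative stand-ins, not kernel headers). */
#define MODEL_KERNEL_CS 0x10u
#define MODEL_KERNEL_DS 0x18u
#define MODEL_USER_CS   0x33u
#define MODEL_USER_DS   0x2bu

/* Only the two fields the hunk treats differently for UKL. */
struct model_regs {
	uint16_t cs;
	uint16_t ss;
};

/* Mirrors the branch in the hunk: a UKL thread starts with kernel selectors. */
static void model_start_thread(struct model_regs *regs, bool ukl_thread)
{
	if (ukl_thread) {
		regs->cs = MODEL_KERNEL_CS;
		regs->ss = MODEL_KERNEL_DS;
	} else {
		regs->cs = MODEL_USER_CS;
		regs->ss = MODEL_USER_DS;
	}
}

/* The low two bits of a selector hold the requested privilege level. */
static unsigned int model_rpl(uint16_t selector)
{
	return selector & 0x3u;
}
```

With these assumed values, model_rpl() gives 0 for the UKL case and 3 for
the normal case, which is the "executed in supervisor mode" property
described in the cover letter.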

Cc: Jonathan Corbet <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>

Suggested-by: Thomas Unger <[email protected]>
Signed-off-by: Ali Raza <[email protected]>
---
arch/x86/include/asm/elf.h | 9 ++++--
arch/x86/kernel/process.c | 13 +++++++++
arch/x86/kernel/process_64.c | 27 ++++++++++--------
fs/binfmt_elf.c | 28 ++++++++++++++++++
fs/exec.c | 55 ++++++++++++++++++++++++++----------
5 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index cb0ff1055ab1..91b6efafb46f 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -6,6 +6,7 @@
* ELF register definitions..
*/
#include <linux/thread_info.h>
+#include <linux/sched.h>

#include <asm/ptrace.h>
#include <asm/user.h>
@@ -164,9 +165,11 @@ static inline void elf_common_init(struct thread_struct *t,
regs->si = regs->di = regs->bp = 0;
regs->r8 = regs->r9 = regs->r10 = regs->r11 = 0;
regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
- t->fsbase = t->gsbase = 0;
- t->fsindex = t->gsindex = 0;
- t->ds = t->es = ds;
+ if (!is_ukl_thread()) {
+ t->fsbase = t->gsbase = 0;
+ t->fsindex = t->gsindex = 0;
+ t->ds = t->es = ds;
+ }
}

#define ELF_PLAT_INIT(_r, load_addr) \
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58a6ea472db9..8395fc0c3398 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -192,6 +192,19 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->bx = 0;
*childregs = *current_pt_regs();
childregs->ax = 0;
+
+#ifdef CONFIG_UNIKERNEL_LINUX
+ /*
+ * UKL leaves return address and flags on user stack. This works
+ * fine for clone (i.e., VM shared) but not for 'fork' style
+ * clone (i.e., VM not shared). This is where we clean those extra
+ * elements from user stack.
+ */
+ if (is_ukl_thread() && !(clone_flags & CLONE_VM)) {
+ childregs->sp += 2*(sizeof(long));
+ }
+#endif
+
if (sp)
childregs->sp = sp;

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e9e4a2946452..cf007b95d684 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -530,21 +530,26 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
{
WARN_ON_ONCE(regs != current_pt_regs());

- if (static_cpu_has(X86_BUG_NULL_SEG)) {
- /* Loading zero below won't clear the base. */
- loadsegment(fs, __USER_DS);
- load_gs_index(__USER_DS);
- }
+ if (!is_ukl_thread()) {
+ if (static_cpu_has(X86_BUG_NULL_SEG)) {
+ /* Loading zero below won't clear the base. */
+ loadsegment(fs, __USER_DS);
+ load_gs_index(__USER_DS);
+ }

- loadsegment(fs, 0);
- loadsegment(es, _ds);
- loadsegment(ds, _ds);
- load_gs_index(0);
+ loadsegment(fs, 0);
+ loadsegment(es, _ds);
+ loadsegment(ds, _ds);
+ load_gs_index(0);

+ regs->cs = _cs;
+ regs->ss = _ss;
+ } else {
+ regs->cs = __KERNEL_CS;
+ regs->ss = __KERNEL_DS;
+ }
regs->ip = new_ip;
regs->sp = new_sp;
- regs->cs = _cs;
- regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
}

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 63c7ebb0da89..1c91f1179398 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -845,6 +845,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
struct pt_regs *regs;

retval = -ENOEXEC;
+
+ if (is_ukl_thread())
+ goto UKL_SKIP_READING_ELF;
+
/* First of all, some simple consistency checks */
if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
goto out;
@@ -998,6 +1002,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval)
goto out_free_dentry;

+UKL_SKIP_READING_ELF:
/* Flush all traces of the currently running executable */
retval = begin_new_exec(bprm);
if (retval)
@@ -1029,6 +1034,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
start_data = 0;
end_data = 0;

+ if (is_ukl_thread()) {
+ /*
+ * load_bias needs to ensure that we push the heap start
+ * past the end of the executable, but in this case, it is
+ * already mapped with the kernel text. So we select an
+ * address that is "high enough"
+ */
+ load_bias = 0x405000;
+ goto UKL_SKIP_LOADING_ELF;
+ }
+
/* Now we do a little grungy work by mmapping the ELF image into
the correct location in memory. */
for(i = 0, elf_ppnt = elf_phdata;
@@ -1224,6 +1240,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
}

+UKL_SKIP_LOADING_ELF:
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
elf_bss += load_bias;
@@ -1246,6 +1263,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
goto out_free_dentry;
}

+ if (is_ukl_thread()) {
+ /*
+ * We know that this symbol exists and that it is the entry
+ * point for the linked application.
+ */
+ extern void ukl__start(void);
+ elf_entry = (unsigned long) ukl__start;
+ goto UKL_SKIP_FINDING_ELF_ENTRY;
+ }
+
if (interpreter) {
elf_entry = load_elf_interp(interp_elf_ex,
interpreter,
@@ -1283,6 +1310,7 @@ static int load_elf_binary(struct linux_binprm *bprm)

set_binfmt(&elf_format);

+UKL_SKIP_FINDING_ELF_ENTRY:
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
retval = ARCH_SETUP_ADDITIONAL_PAGES(bprm, elf_ex, !!interpreter);
if (retval < 0)
diff --git a/fs/exec.c b/fs/exec.c
index d046dbb9cbd0..4ae06fcf7436 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1246,9 +1246,11 @@ int begin_new_exec(struct linux_binprm * bprm)
int retval;

/* Once we are committed compute the creds */
- retval = bprm_creds_from_file(bprm);
- if (retval)
- return retval;
+ if (!is_ukl_thread()) {
+ retval = bprm_creds_from_file(bprm);
+ if (retval)
+ return retval;
+ }

/*
* Ensure all future errors are fatal.
@@ -1282,9 +1284,11 @@ int begin_new_exec(struct linux_binprm * bprm)
goto out;

/* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
- would_dump(bprm, bprm->executable);
+ if (!is_ukl_thread()) {
+ would_dump(bprm, bprm->file);
+ if (bprm->have_execfd)
+ would_dump(bprm, bprm->executable);
+ }

/*
* Release all of the old mmap stuff
@@ -1509,6 +1513,11 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
if (!bprm)
goto out;

+ if (is_ukl_thread()) {
+ bprm->filename = "UKL";
+ goto out_ukl;
+ }
+
if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
@@ -1522,6 +1531,8 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)

bprm->filename = bprm->fdpath;
}
+
+out_ukl:
bprm->interp = bprm->filename;

retval = bprm_mm_init(bprm);
@@ -1708,6 +1719,15 @@ static int search_binary_handler(struct linux_binprm *bprm)
struct linux_binfmt *fmt;
int retval;

+ if (is_ukl_thread()) {
+ list_for_each_entry(fmt, &formats, lh) {
+ retval = fmt->load_binary(bprm);
+ if (retval == 0)
+ return retval;
+ }
+ goto out_ukl;
+ }
+
retval = prepare_binprm(bprm);
if (retval < 0)
return retval;
@@ -1717,7 +1737,7 @@ static int search_binary_handler(struct linux_binprm *bprm)
return retval;

retval = -ENOENT;
- retry:
+retry:
read_lock(&binfmt_lock);
list_for_each_entry(fmt, &formats, lh) {
if (!try_module_get(fmt->module))
@@ -1745,6 +1765,7 @@ static int search_binary_handler(struct linux_binprm *bprm)
goto retry;
}

+out_ukl:
return retval;
}

@@ -1799,7 +1820,7 @@ static int exec_binprm(struct linux_binprm *bprm)
static int bprm_execve(struct linux_binprm *bprm,
int fd, struct filename *filename, int flags)
{
- struct file *file;
+ struct file *file = NULL;
int retval;

retval = prepare_bprm_creds(bprm);
@@ -1809,10 +1830,12 @@ static int bprm_execve(struct linux_binprm *bprm,
check_unsafe_exec(bprm);
current->in_execve = 1;

- file = do_open_execat(fd, filename, flags);
- retval = PTR_ERR(file);
- if (IS_ERR(file))
- goto out_unmark;
+ if (!is_ukl_thread()) {
+ file = do_open_execat(fd, filename, flags);
+ retval = PTR_ERR(file);
+ if (IS_ERR(file))
+ goto out_unmark;
+ }

sched_exec();

@@ -1830,9 +1853,11 @@ static int bprm_execve(struct linux_binprm *bprm,
bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;

/* Set the unchanging part of bprm->cred */
- retval = security_bprm_creds_for_exec(bprm);
- if (retval)
- goto out;
+ if (!is_ukl_thread()) {
+ retval = security_bprm_creds_for_exec(bprm);
+ if (retval)
+ goto out;
+ }

retval = exec_binprm(bprm);
if (retval < 0)
--
2.21.3
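
For reference, the stack fix-up described in the copy_thread() comment
above can be modelled in userspace roughly as follows. This is only a
sketch: model_child_sp() and MODEL_CLONE_VM are made-up names, and it
assumes, as the comment says, that the two extra words on the user stack
are the return address and the flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same numeric value as CLONE_VM in the uapi headers; renamed to avoid clashes. */
#define MODEL_CLONE_VM 0x00000100UL

/*
 * Compute the child's initial stack pointer.
 * parent_sp: stack pointer copied from the parent's registers.
 * new_sp:    stack requested by the caller of clone(), or 0 for none.
 */
static uint64_t model_child_sp(bool ukl_thread, unsigned long clone_flags,
			       uint64_t parent_sp, uint64_t new_sp)
{
	uint64_t sp = parent_sp;

	/* fork-style clone of a UKL thread: drop return address and flags. */
	if (ukl_thread && !(clone_flags & MODEL_CLONE_VM))
		sp += 2 * sizeof(long);

	/* An explicitly supplied stack overrides the copied value. */
	if (new_sp)
		sp = new_sp;

	return sp;
}
```

Only the fork-style case (VM not shared, no new stack supplied) ends up
with an adjusted stack pointer; a thread-style clone or a non-UKL task
keeps the copied or caller-supplied value.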

2022-10-03 22:51:47

by Ali Raza

Subject: [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware

When configured for UKL, access_ok needs to account for the unified address
space that is used by the kernel and the process being run. To do this,
it needs to check the task struct field added earlier to determine where
the code making the check is running. For a zero value, the normal
boundary definitions apply, but a non-zero value indicates a UKL thread,
and a shared address space should be assumed.

Cc: Jonathan Corbet <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>

Signed-off-by: Ali Raza <[email protected]>
---
arch/x86/include/asm/uaccess.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..adef521b2e59 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
* Return: true (nonzero) if the memory block may be valid, false (zero)
* if it is definitely invalid.
*/
+#ifdef CONFIG_UNIKERNEL_LINUX
+#define access_ok(addr, size) \
+({ \
+ WARN_ON_IN_IRQ(); \
+ (is_ukl_thread() ? 1 : likely(__access_ok(addr, size))); \
+})
+#else
#define access_ok(addr, size) \
({ \
WARN_ON_IN_IRQ(); \
likely(__access_ok(addr, size)); \
})
+#endif

#include <asm-generic/access_ok.h>

--
2.21.3

2022-10-04 17:50:00

by Andy Lutomirski

Subject: Re: [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware



On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> When configured for UKL, access_ok needs to account for the unified address
> space that is used by the kernel and the process being run. To do this,
> they need to check the task struct field added earlier to determine where
> the execution that is making the check is running. For a zero value, the
> normal boundary definitions apply, but non-zero value indicates a UKL
> thread and a shared address space should be assumed.

I think this is just wrong. Why should a UKL process be able to read() to kernel (high-half) memory?

set_fs() is gone. Please keep it gone.

>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Masahiro Yamada <[email protected]>
> Cc: Michal Marek <[email protected]>
> Cc: Nick Desaulniers <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
>
> Signed-off-by: Ali Raza <[email protected]>
> ---
> arch/x86/include/asm/uaccess.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 913e593a3b45..adef521b2e59 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
> * Return: true (nonzero) if the memory block may be valid, false (zero)
> * if it is definitely invalid.
> */
> +#ifdef CONFIG_UNIKERNEL_LINUX
> +#define access_ok(addr, size) \
> +({ \
> + WARN_ON_IN_IRQ(); \
> + (is_ukl_thread() ? 1 : likely(__access_ok(addr, size))); \
> +})
> +#else
> #define access_ok(addr, size) \
> ({ \
> WARN_ON_IN_IRQ(); \
> likely(__access_ok(addr, size)); \
> })
> +#endif
>
> #include <asm-generic/access_ok.h>
>
> --
> 2.21.3

2022-10-06 22:05:06

by Ali Raza

Subject: Re: [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware

On 10/4/22 13:36, Andy Lutomirski wrote:
>
>
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> When configured for UKL, access_ok needs to account for the unified address
>> space that is used by the kernel and the process being run. To do this,
>> they need to check the task struct field added earlier to determine where
>> the execution that is making the check is running. For a zero value, the
>> normal boundary definitions apply, but non-zero value indicates a UKL
>> thread and a shared address space should be assumed.
>
> I think this is just wrong. Why should a UKL process be able to read() to kernel (high-half) memory?
>
> set_fs() is gone. Please keep it gone.

UKL needs access to kernel memory because the UKL application is linked
with the kernel, so its data lives along with kernel data in the kernel
half of memory. So anything that checks whether a user pointer indeed
lives in the user part of memory would fail. For example, anything which
invokes copy_to_user or copy_from_user involves a call to access_ok. This
would fail because the UKL user pointer will have a kernel address.
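
A small userspace model of the two checks may make this concrete. It is a
sketch, not the kernel macro: the model_* names are made up, and
MODEL_TASK_SIZE_MAX assumes the x86-64 user address limit with 4-level
paging.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed top of the user half (x86-64, 4-level paging). */
#define MODEL_TASK_SIZE_MAX 0x00007ffffffff000ULL

/* Stand-in for the stock check: the whole block must sit below the limit. */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
	return size <= MODEL_TASK_SIZE_MAX &&
	       addr <= MODEL_TASK_SIZE_MAX - size;
}

/* Stand-in for the patched macro: a UKL thread skips the range check. */
static bool model_access_ok_ukl(bool ukl_thread, uint64_t addr, uint64_t size)
{
	return ukl_thread ? true : model_access_ok(addr, size);
}
```

For a buffer at a kernel-half address such as 0xffffffff81000000, the stock
check rejects the access, while the patched variant accepts it for a UKL
thread and still rejects it for any other task.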

>
>>
>> Cc: Jonathan Corbet <[email protected]>
>> Cc: Masahiro Yamada <[email protected]>
>> Cc: Michal Marek <[email protected]>
>> Cc: Nick Desaulniers <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Eric Biederman <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Dietmar Eggemann <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Ben Segall <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Valentin Schneider <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Josh Poimboeuf <[email protected]>
>>
>> Signed-off-by: Ali Raza <[email protected]>
>> ---
>> arch/x86/include/asm/uaccess.h | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
>> index 913e593a3b45..adef521b2e59 100644
>> --- a/arch/x86/include/asm/uaccess.h
>> +++ b/arch/x86/include/asm/uaccess.h
>> @@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
>> * Return: true (nonzero) if the memory block may be valid, false (zero)
>> * if it is definitely invalid.
>> */
>> +#ifdef CONFIG_UNIKERNEL_LINUX
>> +#define access_ok(addr, size) \
>> +({ \
>> + WARN_ON_IN_IRQ(); \
>> + (is_ukl_thread() ? 1 : likely(__access_ok(addr, size))); \
>> +})
>> +#else
>> #define access_ok(addr, size) \
>> ({ \
>> WARN_ON_IN_IRQ(); \
>> likely(__access_ok(addr, size)); \
>> })
>> +#endif
>>
>> #include <asm-generic/access_ok.h>
>>
>> --
>> 2.21.3

2022-10-06 23:25:14

by H. Peter Anvin

Subject: Re: [RFC UKL 00/10] Unikernel Linux (UKL)

On October 3, 2022 3:21:23 PM PDT, Ali Raza <[email protected]> wrote:
>Unikernel Linux (UKL) is a research project aimed at integrating
>application specific optimizations to the Linux kernel. This RFC aims to
>introduce this research to the community. Any feedback regarding the idea,
>goals, implementation and research is highly appreciated.
>
>Unikernels are specialized operating systems where an application is linked
>directly with the kernel and runs in supervisor mode. This allows the
>developers to implement application specific optimizations to the kernel,
>which can be directly invoked by the application (without going through the
>syscall path). An application can control scheduling and resource
>management and directly access the hardware. Application and the kernel can
>be co-optimized, e.g., through LTO, PGO, etc. All of these optimizations,
>and others, provide applications with huge performance benefits over
>general purpose operating systems.
>
>Linux is the de-facto operating system of today. Applications depend on its
>battle tested code base, large developer community, support for legacy
>code, a huge ecosystem of tools and utilities, and a wide range of
>compatible hardware and device drivers. Linux also allows some degree of
>application specific optimizations through build time config options,
>runtime configuration, and recently through eBPF. But still, there is a
>need for even more fine-grained application specific optimizations, and
>some developers resort to kernel bypass techniques.
>
>Unikernel Linux (UKL) aims to get the best of both worlds by bringing
>application specific optimizations to the Linux ecosystem. This way,
>unmodified applications can keep getting the benefits of Linux while taking
>advantage of the unikernel-style optimizations. Optionally, applications
>can be modified to invoke deeper optimizations.
>
>There are two steps to unikernel-izing Linux, i.e., first, equip Linux with
>a unikernel model, and second, actually use that model to implement
>application specific optimizations. This patch focuses on the first part.
>Through this patch, unmodified applications can be built as Linux
>unikernels, albeit with only modest performance advantages. Like
>unikernels, UKL would allow an application to be statically linked into the
>kernel and executed in supervisor mode. However, UKL preserves most of the
>invariants and design of Linux, including a separate page-able application
>portion of the address space and a pinned kernel portion, the ability to
>run multiple processes, and distinct execution modes for application and
>kernel code. Kernel execution mode and application execution mode are
>different, e.g., the application execution mode allows application threads
>to be scheduled, handle signals, etc., which do not apply to kernel
>threads. Application built as a Linux unikernel will have its text and data
>loaded with the kernel at boot time, while the rest of the address space
>would remain unchanged. These applications invoke the system call
>functionality through a function call into the kernel system call entry
>point instead of through the syscall assembly instruction. UKL would
>support a normal userspace so the UKL application can be started, managed,
>profiled, etc., using normal command line utilities.
>
>Once Linux has a unikernel model, different application specific
>optimizations are possible. We have tried a few, e.g., fast system call
>transitions, shared stacks to allow LTO, invoking kernel functions
>directly, etc. We have seen huge performance benefits, details of which are
>not relevant to this patch and can be found in our paper.
>(https://arxiv.org/pdf/2206.00789.pdf)
>
>UKL differs significantly from previous projects, e.g., UML, KML and LKL.
>User Mode Linux (UML) is a virtual machine monitor implemented on syscall
>interface, a very different goal from UKL. Kernel Mode Linux (KML) allows
>applications to run in kernel mode and replaces syscalls with function
>calls. While KML stops there, UKL goes further. UKL links applications and
>kernel together which allows further optimizations e.g., fast system call
>transitions, shared stacks to allow LTO, invoking kernel functions directly
>etc. Details can be found in the paper linked above. Linux Kernel Library
>(LKL) harvests arch independent code from Linux, takes it to userspace as a
>library to be linked with applications. A host needs to provide arch
>dependent functionality. This model is very different from UKL. A detailed
>discussion of related work is present in the paper linked above.
>
>See samples/ukl for a simple TCP echo server example which can be built as
>a normal user space application and also as a UKL application. In the Linux
>config options, a path to the compiled and partially linked application
>binary can be specified. Kernel built with UKL enabled will search this
>location for the binary and link with the kernel. Applications and required
>libraries need to be compiled with -mno-red-zone -mcmodel=kernel flags
>because kernel mode execution can trample on application red zones and in
>order to link with the kernel and be loaded in the high end of the address
>space, application should have the correct memory model. Examples of other
>applications like Redis, Memcached etc along with glibc and libgcc etc.,
>can be found at https://github.com/unikernelLinux/ukl
>
>List of authors and contributors:
>=================================
>
>Ali Raza - [email protected]
>Thomas Unger - [email protected]
>Matthew Boyd - [email protected]
>Eric Munson - [email protected]
>Parul Sohal - [email protected]
>Ulrich Drepper - [email protected]
>Richard W.M. Jones - [email protected]
>Daniel Bristot de Oliveira - [email protected]
>Larry Woodman - [email protected]
>Renato Mancuso - [email protected]
>Jonathan Appavoo - [email protected]
>Orran Krieger - [email protected]
>
>Ali Raza (9):
> kbuild: Add sections and symbols to linker script for UKL support
> x86/boot: Load the PT_TLS segment for Unikernel configs
> sched: Add task_struct tracking of kernel or application execution
> x86/entry: Create alternate entry path for system calls
> x86/uaccess: Make access_ok UKL aware
> x86/fault: Skip checking kernel mode access to user address space for
> UKL
> x86/signal: Adjust signal handler register values and return frame
> exec: Make exec path for starting UKL application
> Kconfig: Add config option for enabling and sample for testing UKL
>
>Eric B Munson (1):
> exec: Give userspace a method for starting UKL process
>
> Documentation/index.rst | 1 +
> Documentation/ukl/ukl.rst | 104 +++++++++++++++++++++++
> Kconfig | 2 +
> Makefile | 4 +
> arch/x86/boot/compressed/misc.c | 3 +
> arch/x86/entry/entry_64.S | 133 ++++++++++++++++++++++++++++++
> arch/x86/include/asm/elf.h | 9 +-
> arch/x86/include/asm/uaccess.h | 8 ++
> arch/x86/kernel/process.c | 13 +++
> arch/x86/kernel/process_64.c | 49 ++++++++---
> arch/x86/kernel/signal.c | 22 +++--
> arch/x86/kernel/vmlinux.lds.S | 98 ++++++++++++++++++++++
> arch/x86/mm/fault.c | 7 +-
> fs/binfmt_elf.c | 28 +++++++
> fs/exec.c | 75 +++++++++++++----
> include/asm-generic/sections.h | 4 +
> include/asm-generic/vmlinux.lds.h | 32 ++++++-
> include/linux/sched.h | 26 ++++++
> kernel/Kconfig.ukl | 41 +++++++++
> samples/ukl/Makefile | 16 ++++
> samples/ukl/README | 17 ++++
> samples/ukl/syscall.S | 28 +++++++
> samples/ukl/tcp_server.c | 99 ++++++++++++++++++++++
> scripts/mod/modpost.c | 4 +
> 24 files changed, 785 insertions(+), 38 deletions(-)
> create mode 100644 Documentation/ukl/ukl.rst
> create mode 100644 kernel/Kconfig.ukl
> create mode 100644 samples/ukl/Makefile
> create mode 100644 samples/ukl/README
> create mode 100644 samples/ukl/syscall.S
> create mode 100644 samples/ukl/tcp_server.c
>
>
>base-commit: 4fe89d07dcc2804c8b562f6c7896a45643d34b2f

This is basically taking Linux and turning it into a whole new operating system, while expecting the Linux kernel community to carry the support burden thereof.

We have seen this before, notably with Xen. It is *expensive* and *painful* for the maintenance of the mainstream kernel.

Linux already has a notion of "kernel mode applications", they are called kernel modules and kernel threads. It seems to me that you are trying to introduce a user space compatibility layer into the kernel, with the only benefit being avoiding the syscall overhead. The latter is bigger than we would like, which is why we are changing the x86 hardware architecture to improve it.

In my opinion, this would require *enormous* justification to put it into mainline.