Short explanation:
This patch implements a feature called "x86 multiring", which is a
shorthand for x86 multiple user-mode privilege rings support.
It allows user-mode programs to create DPL 1 and 2 segments and get a
modifiable per-process copy of the IDT.
User Mode Linux can use these features to implement a syscall mechanism
identical to the one used by the native (kernel-mode) kernel, and thus
much faster than the current one, with memory protection for free and
zero context switches.
Wine could also use it to achieve fast syscall-level emulation of
Windows NT (and, to a lesser extent, Windows 3.1 and 9x).
Obviously there is some risk of the patch creating security holes.
System calls:
All operations are performed through the new sys_multiring syscall. The
API is documented in include/asm-i386/multiring.h, which multiring
applications should include; a raw invocation sketch follows.
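For illustration, here is a minimal sketch of invoking the syscall
directly. It assumes the syscall number is exported as __NR_multiring by
the patched unistd.h and uses the MULTIRING_CHECK operation code from
multiring.h; raw_multiring() is a hypothetical helper, real programs would
rather use the inline wrappers such as multiring_check():

/* Hedged sketch: call sys_multiring(op, arg1, arg2, arg3) via int $0x80.
 * On i386 the four arguments go in ebx, ecx, edx, esi; the return value
 * is in eax (a negative errno on failure, no errno conversion here). */
#include "asm/unistd.h"     /* patched tree: assumed to define __NR_multiring */
#include "asm/multiring.h"  /* MULTIRING_* operation codes */

static inline int raw_multiring(unsigned op, unsigned a1, unsigned a2, unsigned a3)
{
	int ret;
	__asm__ __volatile__(
		"pushl %%ebx\n\t"
		"movl %2, %%ebx\n\t"
		"int $0x80\n\t"
		"popl %%ebx"
		: "=a" (ret)
		: "0" (__NR_multiring), "r" (op), "c" (a1), "d" (a2), "S" (a3));
	return ret;
}

/* raw_multiring(MULTIRING_CHECK, 0, 0, 0) returns 0 when already in
 * multiring mode, 1 when elevation is still possible, -EPERM otherwise. */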
Supervisor problems:
The most serious issue caused by the use of rings 1 and 2 is that they
are intended for kernel code, which means that they count as supervisor
wherever a "user/supervisor" bit is present.
This results in:
- Unavailability of multiring on 386 processors, since they do not honor
write protection (WP) in supervisor mode.
- Page protection no longer working.
While this may seem catastrophic, it isn't, because segment-level
protection can be used instead.
To enforce it, the patch modifies the GDT so that the default CS and DS
have their limit at __PAGE_OFFSET - 1 (a sketch of the limit encoding
follows at the end of this section).
The LDT and TLS interfaces are also changed to alter segment limits so
that segments cannot overlap the kernel area. If this is impossible,
multiring mode is inhibited using the bad_segments mm_context_t field;
if the process is already in multiring mode, the operation fails.
vsyscalls will need to be put before the rest of the kernel with this
scheme.
- The supervisor bit in the page fault error code is not reliable
(regs->xcs is used instead; the only drawback is not being able to tell
f00f invalid opcodes from page faults).
- Potential minor problems for profilers, since to get DPL 1/2 events,
DPL 0 events also have to be enabled.
Based on my reading of Intel manuals, there should not be any other
problem, but I might have missed something.
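To make the segment-limit scheme concrete, here is how the patch encodes
the default user CS/DS descriptors (this mirrors the USER_CSDS_A/B macros
in include/asm-i386/desc.h below; __PAGE_OFFSET = 0xc0000000 is only the
usual value, spelled out here for illustration):

/* Sketch of the default user CS/DS encoding: the limit is
 * (__PAGE_OFFSET - 1) expressed in 4 KB pages, so the segments simply
 * end where the kernel area begins. */
#define PAGE_SHIFT    12
#define __PAGE_OFFSET 0xc0000000UL

#define USER_CSDS_A (((__PAGE_OFFSET - 1) >> PAGE_SHIFT) & 0xffff)   /* limit[15:0] */
#define USER_CSDS_B(data, ring) \
	(0x00c09200                                          /* G=1, D=1, P=1, S=1, RW data */ \
	 | (((__PAGE_OFFSET - 1) >> PAGE_SHIFT) & 0xf0000)   /* limit[19:16] */ \
	 | ((!(data)) << 11)                                 /* code instead of data */ \
	 | ((ring) << 13))                                   /* DPL */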
IDT functionality:
When multiring mode is entered, the default IDT is copied to a newly
allocated page and a pointer to the copy is stored in mm_context_t.
The initial multiring IDT is identical to the default one, except that
the DPL of SYSCALL_VECTOR is set to 1.
The code includes a config option to put IDTs in high memory; it is
however untested and not very useful anyway.
IDTs are loaded in two different ways: if CONFIG_X86_HIGHIDT is set or
the processor has the f00f bug, each CPU gets a fixmap entry that is
remapped to point at the IDT to load; otherwise, a simple lidt
instruction is used (sketched below).
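For reference, the plain load path boils down to pointing lidt at the
per-mm IDT copy. A minimal sketch: the Xgt_desc_struct layout is the one
the kernel already uses, while lidt_for() is a hypothetical helper for
illustration (load_IDT_nolock_inline() in the patch does the real work):

/* Sketch: load an arbitrary IDT with lidt. */
struct Xgt_desc_struct {
	unsigned short size;
	unsigned long address __attribute__((packed));
};

static inline void lidt_for(void *idt, unsigned bytes)
{
	struct Xgt_desc_struct d = { .size = bytes - 1, .address = (unsigned long)idt };
	__asm__ __volatile__("lidt %0" : : "m" (d));
}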
sys_multiring allows reading, copying and setting gates in the IDT.
The vectors that can currently be set are 0x20-0x2f (because DOS and
Windows use them), 0x80 and 0xf1-0xfa.
Set operations fail if the user tries to point a gate at a kernel-mode
address other than the syscall entry, or tries to install a task or
interrupt gate; the gate encoding is sketched below.
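A gate is passed to MULTIRING_SET as the two raw descriptor words; here
is a minimal sketch of encoding a 32-bit trap gate at a given DPL
(encode_trap_gate() is hypothetical, but the layout is the same one the
idt_gate() macro in the test program below uses):

/* Sketch: build the (a, b) words of a 32-bit trap gate.
 * 0x8f00 = present | type 0xF (32-bit trap gate); DPL sits in bits 13-14 of b. */
static void encode_trap_gate(unsigned sel, unsigned long handler, unsigned dpl,
                             unsigned *a, unsigned *b)
{
	*a = (sel << 16) | (handler & 0xffff);              /* selector | offset[15:0] */
	*b = (handler & 0xffff0000) | 0x8f00 | (dpl << 13); /* offset[31:16] | P | type | DPL */
}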
The i8259 is remapped to 0x30-0x3f to accommodate this.
The multiring_mode field is added to the thread structure; it is 1 if
the thread has entered multiring mode (i.e. its selectors have RPL 1).
GDT functionality:
When switching to a multiring mm, the DPL in the default user CS and DS
is set to 1, to prevent ring 2 and 3 code from loading them and thus
bypassing any protection the DPL 1 code might be enforcing.
LDT/TLS functionality:
When in multiring mode, the LDT/TLS functions honor the new dpl field in
struct user_desc, which allows setting a custom DPL in the descriptors
(an example follows below).
They are also changed to support segment-level protection as outlined
above.
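A minimal sketch of requesting a DPL 2 data segment through the extended
struct user_desc, mirroring the test program below (struct user_desc and
MODIFY_LDT_CONTENTS_DATA come from the patched asm-i386/ldt.h; in
multiring mode DPL 0 is rejected, outside multiring mode the field is
ignored on write but still returned by the get functions):

struct user_desc ldt = {
	.entry_number    = 1,
	.base_addr       = 0,
	.limit           = 0xaffff,
	.seg_32bit       = 1,
	.contents        = MODIFY_LDT_CONTENTS_DATA,
	.read_exec_only  = 0,
	.limit_in_pages  = 1,
	.seg_not_present = 0,
	.useable         = 0,
	.dpl             = 2,    /* the new field added by this patch */
};
/* modify_ldt(0x11, &ldt, sizeof(ldt));   0x11 = write using the new format */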
TSS functionality:
sys_multiring allows modifying the ring 1 and 2 ESP and SS fields of the
TSS, which are loaded on inter-privilege calls.
The values are kept in the thread structure and are loaded on task
switch.
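A minimal sketch, using the multiring_set_espss() wrapper the way the
test program below does (the kernel only accepts ring 1 or 2 and requires
the SS selector's RPL to equal the ring):

#include <stdlib.h>

/* Give ring 1 its own stack for inter-privilege transitions. */
static int setup_ring1_stack(unsigned ss_rpl1)
{
	char *stack = malloc(65536);
	if (!stack)
		return -1;
	return multiring_set_espss(1, (unsigned long)(stack + 65536), ss_rpl1);
}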
clone functionality:
The CLONE_IDT flag is added, and does the obvious thing. Note that if
the task is not in multiring mode, it is silently ignored.
The CLONE_CLEAR_IDT flag is also added; it likewise does the obvious
thing and takes precedence over CLONE_IDT. A usage sketch follows.
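A minimal sketch of sharing the IDT with a child, using the CLONE_IDT
value from the test program below (CLONE_CLEAR_IDT is defined by the
patch in linux/sched.h and its value is not repeated here). Note that
CLONE_VM already shares the IDT implicitly; CLONE_IDT shares it even when
the memory map is copied:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>

#define CLONE_IDT 0x00800000

static pid_t clone_sharing_idt(int (*fn)(void *), void *stack_top, void *arg)
{
	return clone(fn, stack_top, CLONE_IDT | SIGCHLD, arg);
}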
Entering multiring mode:
Multiring mode can be entered using sys_multiring(MULTIRING_ELEVATE).
This allocates a new IDT, fixes up the GDT and puts RPL 1 selectors in
cs/ds/es/ss/fs/gs. Note that all other threads will also get RPL 1
selectors.
RPL 1 selectors are loaded only if the selector points to the default CS
or DS.
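A minimal sketch of entering multiring mode and confirming the demotion
to ring 1, using the multiring_elevate() wrapper as in the test program
below (assuming the wrapper reports failure with a negative return):

#include <stdio.h>

static int enter_multiring(void)
{
	unsigned cs;
	if (multiring_elevate() < 0)
		return -1;                 /* e.g. bad_segments set, or a 386 without WP */
	__asm__ __volatile__("movl %%cs, %0" : "=r" (cs));
	printf("running with CS RPL %u\n", cs & 3);
	return (cs & 3) == 1 ? 0 : -1;     /* RPL 1 expected after elevation */
}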
Limitations:
- Since there are only 4 privilege rings, at most 2 UMLs can be nested
- Does not work on the 386
Potential improvements:
- Copy on write IDTs
- Reduction of IDT allocation size from 4 KB to 2 KB
- x86-64?
Diffstat:
arch/i386/Config.help | 7
arch/i386/config.in | 1
arch/i386/kernel/Makefile | 2
arch/i386/kernel/cpu/common.c | 9
arch/i386/kernel/cpu/intel.c | 8
arch/i386/kernel/entry.S | 1
arch/i386/kernel/i8259.c | 6
arch/i386/kernel/ldt.c | 88 +++++++++
arch/i386/kernel/multiring.c | 316 +++++++++++++++++++++++++++++++++++
arch/i386/kernel/process.c | 52 ++++-
arch/i386/kernel/ptrace.c | 8
arch/i386/kernel/signal.c | 24 +-
arch/i386/kernel/traps.c | 44 +++-
arch/i386/mach-generic/irq_vectors.h | 24 +-
arch/i386/mach-visws/irq_vectors.h | 22 +-
arch/i386/math-emu/fpu_entry.c | 2
arch/i386/mm/fault.c | 5
fs/exec.c | 4
include/asm-i386/desc.h | 48 +++++
include/asm-i386/fixmap.h | 14 +
include/asm-i386/idt.h | 236 ++++++++++++++++++++++++++
include/asm-i386/ldt.h | 3
include/asm-i386/mmu.h | 10 +
include/asm-i386/mmu_context.h | 8
include/asm-i386/multiring.h | 194 +++++++++++++++++++++
include/asm-i386/processor.h | 26 +-
include/asm-i386/segment.h | 12 +
include/asm-i386/system.h | 6
include/asm-i386/unistd.h | 1
include/linux/sched.h | 8
kernel/fork.c | 6
31 files changed, 1103 insertions(+), 92 deletions(-)
Test program (apologies for the horrible coding style):
/*
multiring-test.c: example program for Linux multiring support
Copyright (C) 2002 Luca Barbieri <[email protected]>
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with this program; see the file COPYING. If not, write to the Free
Software Foundation, 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.
*/
#include <errno.h>
#include "/home/ldb/src/linux-2.5.44_multiring/include/asm-i386/unistd.h"
#include "/home/ldb/src/linux-2.5.44_multiring/include/asm-i386/multiring.h"
#include "/home/ldb/src/linux-2.5.44_multiring/include/asm-i386/ldt.h"
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>
#include <string.h>	/* strlen */
#include <unistd.h>	/* write, fork, _exit, execlp */
#include <sys/types.h>
#include <sys/wait.h>	/* waitpid */
#define CLONE_IDT 0x00800000
unsigned ring;
unsigned xcs;
typedef char intbuf_t[4];
intbuf_t test_entry_intbuf;
typedef int (*test_entry_int_t)(unsigned param) __attribute__((regparm(3)));
#define test_entry_int ((test_entry_int_t)test_entry_intbuf)
intbuf_t kernel_syscall_intbuf;
char lower_ring_msg[] = "from lower ring! [should be XXXm...]\n";
void make_int(intbuf_t p, unsigned vec)
{
p[0] = 0xcd;
p[1] = vec;
p[2] = 0xc3;
}
void test_entry();
void test_entry2();
void syscall_entry();
int kernel_syscall(unsigned num, unsigned arg1, unsigned arg2, unsigned arg3)
{
unsigned ret;
__asm__ __volatile__ (
"pushl %%ebx\n\t"
"movl %2, %%ebx\n\t"
"call kernel_syscall_intbuf\n\t"
"popl %%ebx" : "=a" (ret) : "0" (num), "r" (arg1), "c" (arg2), "d" (arg3));
return ret;
}
int do_test_entry(unsigned param)
{
printf("test_entry called with: %u\n", param);
if(param == 99)
{
unsigned vec;
int ret;
int testvec;
vec = multiring_copy_free(0x80);
printf("copy_free(0x80) ret: %i errno: %i\n", vec, errno);
printf("alloc_all ");
for(;;)
{
ret = multiring_copy_free(0x80);
if(ret >= 0)
printf("%x ", testvec = ret);
else
{
printf("\nret: %i errno: %i [should be -1/28]\n", ret, errno);
break;
}
}
ret = multiring_free(testvec);
printf("free testvec: %i ret: %i errno: %i\n", testvec, ret, errno);
ret = multiring_copy_free(0x80);
printf("copy_free ret: %i errno: %i\n", ret, errno);
printf("about to multiring_set %x %x\n", (xcs << 16) | ((unsigned long)syscall_entry & 0xffff), ((unsigned long)syscall_entry & 0xffff0000) | 0x8f00 | ((ring + 1) << 13));
multiring_set(0x80, (xcs << 16) | ((unsigned long)syscall_entry & 0xffff), ((unsigned long)syscall_entry & 0xffff0000) | 0x8f00 | ((ring + 1) << 13));
make_int(kernel_syscall_intbuf, vec);
}
return param * 2;
}
int do_syscall_entry(unsigned num, unsigned arg1, unsigned arg2, unsigned arg3)
{
if((num == __NR_write) && (arg1 == 1))
{
char* ptr = (char*)arg2;
ptr[0] = 'X';
ptr[1] = 'X';
ptr[2] = 'X';
return kernel_syscall(num, 1, arg2, arg3);
}
else if(num == __NR_exit)
{
kernel_syscall(__NR_write, 1, (unsigned long)"quitting\n", strlen("quitting\n"));
return kernel_syscall(__NR_exit, 0, 0, 0);
}
else
{
kernel_syscall(__NR_write, 1, (unsigned long)"syscall not allowed\n", strlen("syscall not allowed\n"));
return -ENOSYS;
}
}
asm("
test_entry:
pushl %eax
call do_test_entry
addl $4, %esp
iret
test_entry2:
cld
pushl %es
pushl %ds
pushl %eax
movw %ss, %ax
movw %ax, %ds
movw %ax, %es
call do_test_entry
addl $4, %esp
popl %ds
popl %es
iret
syscall_entry:
cld
pushl %es
pushl %ds
pushl %ebp
pushl %edi
pushl %esi
pushl %edx
pushl %ecx
pushl %ebx
pushl %eax
movw %ss, %ax
movw %ax, %ds
movw %ax, %es
call do_syscall_entry
addl $4, %esp
popl %ebx
popl %ecx
popl %edx
popl %esi
popl %edi
popl %ebp
popl %ds
popl %es
iret
");
char* entry_stack;
char* lower_stack;
char* clone_stack;
#define ldt_sel(num, ring) (((num) << 3) | (1 << 2) | (ring))
static void interprivilege_jump(unsigned ss, unsigned esp, unsigned cs, unsigned eip) __attribute__((noreturn));
static void interprivilege_jump(unsigned ss, unsigned esp, unsigned cs, unsigned eip)
{
__asm__ __volatile__(
"movl %0, %%ds\n\t"
"movl %0, %%es\n\t"
"pushl %0\n\t"
"pushl %1\n\t"
"pushfl\n\t"
"pushl %2\n\t"
"pushl %3\n\t"
"iret" : : "r" (ss), "r" (esp), "r" (cs), "r" (eip));
abort();
}
void lower_ring_start()
{
test_entry_int(10);
#ifdef CRASH_INT
/* int 0x80 has DPL 1, so this should crash */
write(1, lower_ring_msg, strlen(lower_ring_msg));
#endif
test_entry_int(99);
write(1, lower_ring_msg, strlen(lower_ring_msg));
#ifdef CRASH_SEGV
*(unsigned long*)0 = 0;
#endif
_exit(0);
}
/* this should reboot the processor if we have access to kernel mode memory */
void triple_fault(void)
{
struct
{
unsigned short a;
struct
{
unsigned short lim;
unsigned long addr;
} __attribute__((packed)) m48; /* packed so sidt's 6-byte store lines up with lim/addr */
} desc;
/* find the IDT address */
__asm__ __volatile__("sidt %0" : "=m" (desc.m48));
/* clear the IDT (or crash, if the kernel works properly) */
memset((void*)desc.m48.addr, 0, desc.m48.lim + 1);
/* triple fault */
*(unsigned long*)0 = 0;
}
#define idt_gate(sel, addr, ring) (((sel) << 16) | ((unsigned long)addr & 0xffff)), (((unsigned long)addr & 0xffff0000) | 0x8f00 | ((ring) << 13))
static inline pid_t syscall_clone(unsigned long flags)
{
pid_t pid;
asm volatile("pushl %%ebx\n\tmovl %1, %%ebx\n\tmovl %%esp, %%ecx\n\tint $0x80\n\tpopl %%ebx" : "=a" (pid) : "r" (flags | SIGCHLD), "0" (__NR_clone) : "ecx", "edx", "memory");
return pid;
}
unsigned test_vec;
void clone_vm_child(void)
{
int ret;
unsigned ccs;
unsigned css;
unsigned cring;
struct multiring_gate gate;
ret = multiring_check();
printf("vm check ret: %i errno: %i\n", ret, errno);
__asm__ __volatile__("movl %%cs, %0" : "=r" (ccs));
__asm__ __volatile__("movl %%ss, %0" : "=r" (css));
printf("vm cs: %x ss: %x\n", ccs, css);
cring = xcs & 3;
ret = multiring_set(test_vec, idt_gate(ccs, 0x33333333, cring));
printf("vm set 33333333 ret: %i errno: %i\n", ret, errno);
ret = multiring_get(test_vec, &gate);
printf("vm get ret: %i errno: %i a: %x b: %x\n", ret, errno, gate.a, gate.b);
_exit(0);
}
int main(int argc, char** argv)
{
int ret;
unsigned xss;
unsigned test_entry_vec;
struct multiring_gate gate;
pid_t pid;
struct user_desc ldt;
ret = multiring_elevate();
printf("elevate ret: %i errno: %i\n", ret, errno);
#ifdef ELEVATE
return 0;
#endif
__asm__ __volatile__("movl %%cs, %0" : "=r" (xcs));
ret = multiring_check();
printf("check ret: %i errno: %i\n", ret, errno);
printf("cs: %x\n", xcs);
__asm__ __volatile__("movl %%ss, %0" : "=r" (xss));
printf("ss: %x\n", xss);
ring = xcs & 3;
if(ring == 3)
{
printf("we are still at ring 3 :( - exiting\n");
return 0;
}
test_vec = multiring_set_free(idt_gate(xcs, 0x11111111, ring));
printf("set_free 11111111 ret: %i errno: %i\n", test_vec, errno);
ret = multiring_get(test_vec, &gate);
printf("get ret: %i errno: %i a: %x b: %x\n", ret, errno, gate.a, gate.b);
pid = fork();
printf("fork pid: %i errno: %i\n", pid, errno);
if(!pid)
{
//for(;;) {}
ret = multiring_check();
printf("fork check ret: %i errno: %i\n", ret, errno);
__asm__ __volatile__("movl %%cs, %0" : "=r" (xcs));
__asm__ __volatile__("movl %%ss, %0" : "=r" (xss));
printf("fork cs: %x ss: %x\n", xcs, xss);
ring = xcs & 3;
ret = multiring_set(test_vec, idt_gate(xcs, 0x22222222, ring));
printf("fork set 22222222 ret: %i errno: %i\n", ret, errno);
ret = multiring_get(test_vec, &gate);
printf("fork get ret: %i errno: %i a: %x b: %x\n", ret, errno, gate.a, gate.b);
_exit(0);
}
else
{
int status;
waitpid(pid, &status, 0);
printf("status: %x\n", status);
}
ret = multiring_get(test_vec, &gate);
printf("get ret: %i errno: %i a: %x b: %x [should be desc for 11111111]\n", ret, errno, gate.a, gate.b);
pid = clone((int (*)(void *))clone_vm_child, (char*)malloc(65536) + 65536, CLONE_VM|SIGCHLD, 0);
printf("CLONE_VM pid: %i errno: %i\n", pid, errno);
{
int status;
waitpid(pid, &status, 0);
printf("status: %x\n", status);
}
ret = multiring_get(test_vec, &gate);
printf("get ret: %i errno: %i a: %x b: %x [should be desc for 33333333]\n", ret, errno, gate.a, gate.b);
pid = syscall_clone(CLONE_IDT);
printf("CLONE_IDT pid: %i errno: %i\n", pid, errno);
if(!pid)
{
//for(;;) {}
ret = multiring_check();
printf("idt check ret: %i errno: %i\n", ret, errno);
__asm__ __volatile__("movl %%cs, %0" : "=r" (xcs));
__asm__ __volatile__("movl %%ss, %0" : "=r" (xss));
printf("idt cs: %x ss: %x\n", xcs, xss);
ring = xcs & 3;
ret = multiring_set(test_vec, idt_gate(xcs, 0x44444444, ring));
printf("idt set 44444444 ret: %i errno: %i\n", ret, errno);
ret = multiring_get(test_vec, &gate);
printf("idt get ret: %i errno: %i a: %x b: %x\n", ret, errno, gate.a, gate.b);
_exit(0);
}
else
{
int status;
waitpid(pid, &status, 0);
printf("status: %x\n", status);
}
ret = multiring_get(test_vec, &gate);
printf("get ret: %i errno: %i a: %x b: %x [should be desc for 44444444]\n", ret, errno, gate.a, gate.b);
#ifdef DO_EXEC
execlp("ls", "ls", 0);
#endif
#ifdef EXPLOIT_GDT
printf("about to exploit with sel = %x\n", xss);
triple_fault();
#endif
ret = multiring_set_free(idt_gate(xcs, 0xeeeeeeee, 0));
printf("multiring_set(gate_to_kernel) ret: %i errno: %i\n", ret, errno);
#ifdef EXPLOIT_IDT
test_entry_vec = multiring_set_free(idt_gate(xcs, 0xeeeeeeee, 1));
printf("set_free(to_karea) ret: %i errno: %i\n", test_entry_vec, errno);
#else
test_entry_vec = multiring_set_free(idt_gate(xcs, test_entry, ring));
printf("set_free ret: %i errno: %i\n", test_entry_vec, errno);
#endif
make_int(test_entry_intbuf, test_entry_vec);
ret = test_entry_int(42);
printf("test_entry_int: %u\n", ret);
/* we could use the initial stack, but it's more difficult */
entry_stack = malloc(65536);
ret = multiring_set_espss(ring, (unsigned long)(entry_stack + 65536), xss);
printf("set_espss ret: %i errno %i esp: %x ss: %x\n", ret, errno, (unsigned long)(entry_stack + 65536), xss);
{
unsigned tesp, tss;
ret = multiring_get_espss(ring, &tesp, &tss);
printf("get_espss ret: %i errno %i esp: %x ss: %x\n", ret, errno, tesp, tss);
}
ret = multiring_set(test_entry_vec, idt_gate(xcs, test_entry2, ring + 1));
printf("multiring_set(test_entry_vec) ret: %i errno: %i\n", ret, errno);
/* addresses should not be hardcoded in real programs */
ldt.entry_number = 1;
ldt.base_addr = 0;
ldt.limit = 0xaffff;
ldt.seg_32bit = 1;
ldt.contents = MODIFY_LDT_CONTENTS_DATA;
ldt.read_exec_only = 0;
ldt.limit_in_pages = 1;
ldt.seg_not_present = 0;
ldt.useable = 0;
#ifdef EXPLOIT_LDT
ldt.dpl = ring;
ldt.limit = 0xfffff;
ret = modify_ldt(0x11, &ldt, sizeof(struct user_desc));
printf("modify_ldt(exploit) ret: %i errno: %i\n", ret, errno);
__asm__ __volatile__("movl %0, %%ds\n\tmovl %0, %%es" : : "r" (ldt_sel(1, ring)));
printf("about to exploit with sel = %x\n", ldt_sel(1, ring));
triple_fault();
#endif
ldt.dpl = ring + 1;
ret = modify_ldt(0x11, &ldt, sizeof(struct user_desc));
printf("modify_ldt(data) ret: %i errno: %i\n", ret, errno);
ldt.entry_number = 2;
ldt.contents = MODIFY_LDT_CONTENTS_CODE;
ret = modify_ldt(0x11, &ldt, sizeof(struct user_desc));
printf("modify_ldt(code) ret: %i errno: %i\n", ret, errno);
lower_stack = malloc(65536);
interprivilege_jump(ldt_sel(1, ring + 1), (unsigned long)(lower_stack + 65536), ldt_sel(2, ring + 1), (unsigned long)lower_ring_start);
return 0;
}
Patch:
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/Config.help linux-2.5.44_multiring/arch/i386/Config.help
--- linux-2.5.44/arch/i386/Config.help 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/Config.help 2002-10-27 00:44:02.000000000 +0200
@@ -165,6 +165,13 @@ CONFIG_HIGHPTE
low memory. Setting this option will put user-space page table
entries in high memory.
+CONFIG_HIGHIDT
+ The kernel uses 4KB for each multiring process (e.g. User Mode Linux).
+ Say Y to allocate those 4KB in high memory. This is only useful if you
+ plan to run hundreds of thousands of multiring processes.
+
+ If unsure, say N.
+
CONFIG_HIGHMEM4G
Select this if you have a 32-bit processor and between 1 and 4
gigabytes of physical RAM.
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/config.in linux-2.5.44_multiring/arch/i386/config.in
--- linux-2.5.44/arch/i386/config.in 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/config.in 2002-10-27 00:44:02.000000000 +0200
@@ -236,6 +236,7 @@ fi
if [ "$CONFIG_HIGHMEM4G" = "y" -o "$CONFIG_HIGHMEM64G" = "y" ]; then
bool 'Allocate 3rd-level pagetables from highmem' CONFIG_HIGHPTE
+ bool 'Allocate multiring IDTs from highmem (EXPERIMENTAL)' CONFIG_X86_HIGHIDT
fi
bool 'Math emulation' CONFIG_MATH_EMULATION
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/cpu/common.c linux-2.5.44_multiring/arch/i386/kernel/cpu/common.c
--- linux-2.5.44/arch/i386/kernel/cpu/common.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/cpu/common.c 2002-10-27 00:44:02.000000000 +0200
@@ -243,6 +243,8 @@ void __init generic_identify(struct cpui
}
}
+extern void load_idt_table_init(unsigned cpu, pgprot_t prot);
+
/*
* This does the hard work of actually picking apart the CPU stuff...
*/
@@ -257,6 +259,7 @@ void __init identify_cpu(struct cpuinfo_
c->x86_model = c->x86_mask = 0; /* So far unknown... */
c->x86_vendor_id[0] = '\0'; /* Unset */
c->x86_model_id[0] = '\0'; /* Unset */
+ c->f00f_bug = 0;
memset(&c->x86_capability, 0, sizeof c->x86_capability);
if (!have_cpuid_p()) {
@@ -348,6 +351,9 @@ void __init identify_cpu(struct cpuinfo_
/* AND the already accumulated flags with these */
for ( i = 0 ; i < NCAPINTS ; i++ )
boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
+
+ if(c->f00f_bug)
+ boot_cpu_data.f00f_bug = 1;
}
printk(KERN_DEBUG "CPU: Common caps: %08lx %08lx %08lx %08lx\n",
@@ -355,6 +361,9 @@ void __init identify_cpu(struct cpuinfo_
boot_cpu_data.x86_capability[1],
boot_cpu_data.x86_capability[2],
boot_cpu_data.x86_capability[3]);
+
+ /* load the IDT */
+ load_idt_table_init((c == &boot_cpu_data) ? 0 : (c - cpu_data), boot_cpu_data.f00f_bug ? PAGE_KERNEL_RO : PAGE_KERNEL);
}
/*
* Perform early boot up checks for a valid TSC. See arch/i386/kernel/time.c
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/cpu/intel.c linux-2.5.44_multiring/arch/i386/kernel/cpu/intel.c
--- linux-2.5.44/arch/i386/kernel/cpu/intel.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/cpu/intel.c 2002-10-27 00:44:02.000000000 +0200
@@ -169,15 +169,13 @@ static void __init init_intel(struct cpu
* have the F0 0F bug, which lets nonpriviledged users lock up the system.
* Note that the workaround only should be initialized once...
*/
- c->f00f_bug = 0;
if ( c->x86 == 5 ) {
- static int f00f_workaround_enabled = 0;
+ static int f00f_workaround_message = 0;
c->f00f_bug = 1;
- if ( !f00f_workaround_enabled ) {
- trap_init_f00f_bug();
+ if ( !f00f_workaround_message ) {
printk(KERN_NOTICE "Intel Pentium with F0 0F bug - workaround enabled.\n");
- f00f_workaround_enabled = 1;
+ f00f_workaround_message = 1;
}
}
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/entry.S linux-2.5.44_multiring/arch/i386/kernel/entry.S
--- linux-2.5.44/arch/i386/kernel/entry.S 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/entry.S 2002-10-27 00:44:02.000000000 +0200
@@ -737,6 +737,7 @@ ENTRY(sys_call_table)
.long sys_free_hugepages
.long sys_exit_group
.long sys_lookup_dcookie
+ .long sys_multiring
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/i8259.c linux-2.5.44_multiring/arch/i386/kernel/i8259.c
--- linux-2.5.44/arch/i386/kernel/i8259.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/i8259.c 2002-10-27 00:44:02.000000000 +0200
@@ -282,7 +282,8 @@ void init_8259A(int auto_eoi)
* outb_p - this has to work on a wide range of PC hardware.
*/
outb_p(0x11, 0x20); /* ICW1: select 8259A-1 init */
- outb_p(0x20 + 0, 0x21); /* ICW2: 8259A-1 IR0-7 mapped to 0x20-0x27 */
+ /* ICW2: 8259A-1 IR0-7 mapped to FIRST_EXTERNAL_VECTOR-FIRST_EXTERNAL_VECTOR+7 */
+ outb_p(FIRST_EXTERNAL_VECTOR + 0, 0x21);
outb_p(0x04, 0x21); /* 8259A-1 (the master) has a slave on IR2 */
if (auto_eoi)
outb_p(0x03, 0x21); /* master does Auto EOI */
@@ -290,7 +291,8 @@ void init_8259A(int auto_eoi)
outb_p(0x01, 0x21); /* master expects normal EOI */
outb_p(0x11, 0xA0); /* ICW1: select 8259A-2 init */
- outb_p(0x20 + 8, 0xA1); /* ICW2: 8259A-2 IR0-7 mapped to 0x28-0x2f */
+ /* ICW2: 8259A-2 IR0-7 mapped to FIRST_EXTERNAL_VECTOR+8-FIRST_EXTERNAL_VECTOR+f */
+ outb_p(FIRST_EXTERNAL_VECTOR + 8, 0xA1);
outb_p(0x02, 0xA1); /* 8259A-2 is a slave on master's IR2 */
outb_p(0x01, 0xA1); /* (slave's support for AEOI in flat mode
is to be investigated) */
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/ldt.c linux-2.5.44_multiring/arch/i386/kernel/ldt.c
--- linux-2.5.44/arch/i386/kernel/ldt.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/ldt.c 2002-10-27 02:18:48.000000000 +0200
@@ -18,6 +18,7 @@
#include <asm/system.h>
#include <asm/ldt.h>
#include <asm/desc.h>
+#include <asm/idt.h>
#ifdef CONFIG_SMP /* avoids "defined but not used" warnig */
static void flush_ldt(void *null)
@@ -81,22 +82,62 @@ static inline int copy_ldt(mm_context_t
return 0;
}
+static inline int copy_idt(mm_context_t* new, mm_context_t* old, unsigned flags)
+{
+ struct desc_struct* oldidt;
+ if(flags & (CLONE_VM | CLONE_IDT))
+ {
+ oldidt = kmap_idt(old);
+ atomic_inc(idt_refcnt(oldidt));
+ kunmap_idt(old, oldidt);
+ new->idt = old->idt;
+ }
+ else
+ {
+ struct desc_struct* newidt;
+ union idt idtu;
+
+ newidt = alloc_idt(&idtu);
+ if(!newidt)
+ return -ENOMEM;
+ oldidt = kmap_read_idt(old);
+ memcpy(newidt, oldidt, IDT_SIZE);
+ kunmap_read_idt(old, oldidt);
+ kunmap_new_idt(&idtu, newidt);
+ wmb();
+ new->idt = idtu;
+ }
+ return 0;
+}
+
/*
* we do not have to muck with descriptors here, that is
* done in switch_mm() as needed.
*/
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+int init_new_context_flags(struct task_struct *tsk, struct mm_struct *mm, unsigned flags)
{
struct mm_struct * old_mm;
int retval = 0;
init_MUTEX(&mm->context.sem);
mm->context.size = 0;
+ mm->context.idt.opaque = 0;
+ mm->context.bad_segments = 0;
old_mm = current->mm;
- if (old_mm && old_mm->context.size > 0) {
- down(&old_mm->context.sem);
- retval = copy_ldt(&mm->context, &old_mm->context);
- up(&old_mm->context.sem);
+ if (old_mm)
+ {
+ mm->context.bad_segments = old_mm->context.bad_segments;
+ if(old_mm->context.size > 0 || (old_mm->context.idt.opaque && !(flags & CLONE_CLEAR_IDT)))
+ {
+ down(&old_mm->context.sem);
+ mm->context.bad_segments = old_mm->context.bad_segments;
+ retval = 0;
+ if(old_mm->context.size > 0)
+ retval = copy_ldt(&mm->context, &old_mm->context);
+ if(!retval && old_mm->context.idt.opaque && !(flags & CLONE_CLEAR_IDT))
+ retval = copy_idt(&mm->context, &old_mm->context, flags);
+ up(&old_mm->context.sem);
+ }
}
return retval;
}
@@ -115,6 +156,8 @@ void release_segments(struct mm_struct *
kfree(mm->context.ldt);
mm->context.size = 0;
}
+ if(mm->context.idt.opaque)
+ free_idt(&mm->context);
}
static int read_ldt(void * ptr, unsigned long bytecount)
@@ -189,6 +232,10 @@ static int write_ldt(void * ptr, unsigne
goto out;
}
+ error = LDT_handle_perm(&ldt_info, &mm->context);
+ if(error)
+ goto out;
+
down(&mm->context.sem);
if (ldt_info.entry_number >= mm->context.size) {
error = alloc_ldt(&current->mm->context, ldt_info.entry_number+1, 1);
@@ -211,10 +258,22 @@ static int write_ldt(void * ptr, unsigne
entry_2 = LDT_entry_b(&ldt_info);
if (oldmode)
entry_2 &= ~(1 << 20);
+ if(mm->context.idt.opaque)
+ {
+ error = -EINVAL;
+ if(ldt_info.dpl == 0)
+ goto out_unlock;
+ entry_2 = (entry_2 & ~(3 << 13)) | (ldt_info.dpl << 13);
+ }
/* Install the new entry ... */
install:
+ *(lp+1) = 0;
+ wmb();
+
*lp = entry_1;
+ wmb();
+
*(lp+1) = entry_2;
error = 0;
@@ -244,3 +303,22 @@ asmlinkage int sys_modify_ldt(int func,
}
return ret;
}
+
+int LDT_handle_over_page_offset(mm_context_t* ctx)
+{
+ if(ctx->idt.opaque)
+ return -EPERM;
+ else if(ctx->bad_segments)
+ return 0;
+ else
+ {
+ int ret = 0;
+ down(&ctx->sem);
+ if(ctx->idt.opaque)
+ ret = -EPERM;
+ else
+ ctx->bad_segments = 1;
+ up(&ctx->sem);
+ return ret;
+ }
+}
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/Makefile linux-2.5.44_multiring/arch/i386/kernel/Makefile
--- linux-2.5.44/arch/i386/kernel/Makefile 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/Makefile 2002-10-27 00:44:03.000000000 +0200
@@ -9,7 +9,7 @@ export-objs := mca.o i386_ksyms.o ti
obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \
pci-dma.o i386_ksyms.o i387.o bluesmoke.o dmi_scan.o \
- bootflag.o
+ bootflag.o multiring.o
obj-y += cpu/
obj-y += timers/
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/multiring.c linux-2.5.44_multiring/arch/i386/kernel/multiring.c
--- linux-2.5.44/arch/i386/kernel/multiring.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/multiring.c 2002-10-27 03:18:02.000000000 +0100
@@ -0,0 +1,316 @@
+/*
+ * linux/kernel/multiring.c: support for multiple privilege rings
+ *
+ * Copyright (C) 2002 Luca Barbieri <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/sched.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/smp_lock.h>
+#include <linux/init.h>
+
+#include <asm/uaccess.h>
+#include <asm/idt.h>
+#include <asm/irq.h>
+#include <asm/multiring.h>
+
+#define MULTIRING_NR_VECTORS ((MULTIRING_AUTO_LAST_VECTOR - MULTIRING_AUTO_FIRST_VECTOR + 1) + (MULTIRING_SPECIAL_LAST_VECTOR - MULTIRING_SPECIAL_FIRST_VECTOR + 1) + 1)
+
+static unsigned char vec2idx[256];
+static unsigned char idx2vec[MULTIRING_NR_VECTORS];
+
+#ifdef CONFIG_SMP /* avoids "defined but not used" warning */
+static void flush_idt(void *null)
+{
+ if (current->active_mm)
+ load_IDT(&current->active_mm->context);
+}
+#endif
+
+static inline unsigned multiring_fix_selector(unsigned sel)
+{
+ unsigned tsel = sel | 3;
+ if(tsel == USER_DS_RPL(3))
+ return USER_DS_RPL(MULTIRING_USER_RING);
+ else if(tsel == USER_CS_RPL(3))
+ return USER_CS_RPL(MULTIRING_USER_RING);
+ else
+ return sel;
+}
+
+static inline void multiring_fix_selector_ptr(unsigned* sel)
+{
+ *sel = multiring_fix_selector(*sel);
+}
+
+static inline struct pt_regs* get_pt_regs(task_t* tsk)
+{
+ return (struct pt_regs*)tsk->thread.esp0 - 1;
+}
+
+void multiring_init_task(task_t* tsk)
+{
+ /* One-time initialization for multiring mode */
+ unsigned oldsel, newsel;
+ struct pt_regs* regs;
+
+ regs = get_pt_regs(tsk);
+
+#define multiring_fix_loaded_selector(selname) \
+ savesegment(selname, oldsel); \
+ newsel = multiring_fix_selector(oldsel); \
+ if(oldsel != newsel) \
+ loadsegment(selname, newsel);
+
+ multiring_fix_selector_ptr((unsigned*)&regs->xds);
+ multiring_fix_selector_ptr((unsigned*)&regs->xes);
+ multiring_fix_selector_ptr((unsigned*)&regs->xcs);
+ multiring_fix_selector_ptr((unsigned*)&regs->xss);
+ if(tsk != current)
+ {
+ multiring_fix_selector_ptr((unsigned*)&tsk->thread.fs);
+ multiring_fix_selector_ptr((unsigned*)&tsk->thread.gs);
+ }
+ else
+ {
+ multiring_fix_loaded_selector(fs);
+ multiring_fix_loaded_selector(gs);
+ }
+#undef multiring_fix_loaded_selector
+}
+
+void load_IDT_nolock(mm_context_t* ctx, unsigned cpu)
+{
+ load_IDT_nolock_inline(current, ctx, cpu);
+}
+
+asmlinkage int sys_multiring(unsigned op, unsigned arg1, unsigned arg2, unsigned arg3)
+{
+ mm_context_t* ctx = &current->mm->context;
+ int ret;
+ unsigned i;
+ int cpu;
+ struct desc_struct gate;
+ struct desc_struct* idt;
+
+ /* Without WP, DPL 1 and 2 will bypass copy-on-write */
+ if(!boot_cpu_data.wp_works_ok)
+ return -ENOSYS;
+
+ if(op > MULTIRING_LAST_OP)
+ return -EOPNOTSUPP;
+
+ switch(op)
+ {
+ case MULTIRING_CHECK:
+ return current->thread.multiring_mode ? 0 : (ctx->bad_segments ? -EPERM : 1);
+
+ case MULTIRING_SET_ESPSS:
+ {
+ struct tss_struct *tss;
+ unsigned ring = arg1;
+ unsigned esp = arg2;
+ unsigned ss = arg3;
+ if(ring == 0)
+ return -EPERM;
+ if(ring >= 3 || ring != (ss & 3))
+ return -EINVAL;
+
+ current->thread.espss12[ring - 1].esp = esp;
+ current->thread.espss12[ring - 1].ss = ss;
+
+ cpu = get_cpu();
+ tss = init_tss + cpu;
+ tss->espss12[ring - 1].esp = esp;
+ tss->espss12[ring - 1].ss = ss;
+ put_cpu();
+
+ return 0;
+ }
+
+ case MULTIRING_GET_ESPSS:
+ {
+ unsigned ring = arg1;
+ unsigned* esp = (unsigned*)arg2;
+ unsigned* ss = (unsigned*)arg3;
+ if(ring == 0)
+ return -EPERM;
+ if(ring >= 3)
+ return -EINVAL;
+
+ if(put_user(current->thread.espss12[ring - 1].esp, esp)
+ || put_user(current->thread.espss12[ring - 1].ss, ss))
+ return -EFAULT;
+ return 0;
+ }
+
+ case MULTIRING_GET_RANGE:
+ {
+ unsigned first, last;
+ char* ptr = (char*)arg3;
+ if(arg1 <= arg2)
+ {
+ first = arg1;
+ last = arg2;
+ }
+ else
+ {
+ first = arg2;
+ last = arg1;
+ }
+
+ ret = (last - first + 1) * 8;
+ idt = kmap_read_idt_or_table(ctx);
+ if(copy_to_user(ptr, &idt[first], ret))
+ ret = -EFAULT;
+ kunmap_read_idt_or_table(ctx, idt);
+ return ret;
+ }
+
+ case MULTIRING_GET:
+ {
+ unsigned vec = arg1;
+ unsigned* ptr = (unsigned*)arg2;
+ ret = 0;
+ idt = kmap_read_idt_or_table(ctx);
+ if(copy_to_user(ptr, &idt[vec], 8))
+ ret = -EFAULT;
+ kunmap_read_idt_or_table(ctx, idt);
+ return ret;
+ }
+ }
+
+ if(!current->thread.multiring_mode)
+ {
+ if(op != MULTIRING_ELEVATE)
+ return -ENXIO;
+
+ if(ctx->bad_segments)
+ return -EPERM;
+
+ down(&ctx->sem);
+ ret = -EPERM;
+ if(ctx->bad_segments)
+ goto out_up;
+
+ if(!ctx->idt.opaque)
+ {
+ union idt idtu;
+ idt = alloc_idt(&idtu);
+ ret = -ENOMEM;
+ if(!idt)
+ goto out_up;
+ memcpy(idt, idt_table, IDT_SIZE);
+ idt[SYSCALL_VECTOR].b = (idt[SYSCALL_VECTOR].b &~ 0x6000) | (MULTIRING_USER_RING << 13);
+ kunmap_new_idt(&idtu, idt);
+
+ wmb();
+ ctx->idt = idtu;
+ wmb();
+
+#ifdef CONFIG_SMP
+ cpu = get_cpu();
+ if (current->mm->cpu_vm_mask != (1 << cpu))
+ smp_call_function(flush_idt, 0, 1, 1);
+ put_cpu();
+#endif
+ }
+ up(&ctx->sem);
+
+ cpu = get_cpu();
+ if(!current->thread.multiring_mode)
+ load_IDT_nolock(ctx, cpu);
+ put_cpu();
+ return 0;
+
+ out_up:
+ up(&ctx->sem);
+ return ret;
+ }
+
+ /* MULTIRING_SET or MULTIRING_COPY */
+ {
+ unsigned vec = arg1;
+ idt = kmap_write_idt(ctx);
+ if(vec == MULTIRING_VEC_FREE)
+ {
+ for(i = 0; i < MULTIRING_NR_VECTORS; ++i)
+ {
+ vec = idx2vec[i];
+ if(!idt[vec].a && !idt[vec].b)
+ goto found_free;
+ }
+ ret = -ENOSPC;
+ goto out_put_idt;
+ found_free:
+ }
+ else
+ {
+ ret = -EPERM;
+ if(vec2idx[vec] == (unsigned char)~0)
+ goto out_put_idt;
+ }
+
+ if(op == MULTIRING_COPY)
+ {
+ unsigned from = arg2;
+ ret = -EPERM;
+ if(vec2idx[from] == (unsigned char)~0)
+ goto out_put_idt;
+ gate = idt[from];
+ }
+ else
+ {
+ gate.a = arg2;
+ gate.b = arg3;
+ if(gate.b & 0x8000)
+ {
+ ret = -EINVAL;
+ if((gate.b & 0x10ff) || (~gate.b & 0x700) || !(gate.b & 0x6000)) /* reserved_is_bad || !trap_gate || dpl0 */
+ goto out_put_idt;
+ ret = -EPERM;
+ if(!(gate.a & 0x30000) && (gate.a >> 16) && (gate.a != idt_table[SYSCALL_VECTOR].a || (gate.b | 0xe000) != idt_table[SYSCALL_VECTOR].b)) /* seg_RPL == 0 && not_int80_syscall */
+ goto out_put_idt;
+ }
+ }
+
+ idt[vec].b = 0;
+ wmb();
+ idt[vec].a = gate.a;
+ wmb();
+ idt[vec].b = gate.b;
+ ret = vec;
+ }
+ out_put_idt:
+ kunmap_write_idt(ctx, idt);
+ return ret;
+}
+
+int __init init_multiring(void)
+{
+ unsigned idx;
+ unsigned vec;
+
+ memset(vec2idx, 0xff, sizeof(vec2idx));
+ idx = 0;
+ for(vec = MULTIRING_AUTO_FIRST_VECTOR; vec <= MULTIRING_AUTO_LAST_VECTOR; ++vec, ++idx)
+ idx2vec[idx] = vec;
+ for(vec = MULTIRING_SPECIAL_FIRST_VECTOR; vec <= MULTIRING_SPECIAL_LAST_VECTOR; ++vec, ++idx)
+ idx2vec[idx] = vec;
+ idx2vec[idx] = SYSCALL_VECTOR;
+ vec2idx[SYSCALL_VECTOR] = idx;
+
+ for(idx = 0; idx < (MULTIRING_NR_VECTORS - 1); ++idx)
+ {
+ vec = idx2vec[idx];
+ vec2idx[vec] = idx;
+ idt_table[vec].a = 0;
+ idt_table[vec].b = 0;
+ }
+ return 0;
+}
+
+__initcall(init_multiring);
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/process.c linux-2.5.44_multiring/arch/i386/kernel/process.c
--- linux-2.5.44/arch/i386/kernel/process.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/process.c 2002-10-27 02:34:41.000000000 +0200
@@ -40,6 +40,7 @@
#include <asm/system.h>
#include <asm/io.h>
#include <asm/ldt.h>
+#include <asm/idt.h>
#include <asm/processor.h>
#include <asm/i387.h>
#include <asm/desc.h>
@@ -252,6 +253,8 @@ void flush_thread(void)
*/
clear_fpu(tsk);
tsk->used_math = 0;
+ tsk->thread.multiring_mode = 0;
+ load_IDT(&tsk->mm->context);
}
void release_thread(struct task_struct *dead_task)
@@ -268,18 +271,13 @@ void release_thread(struct task_struct *
}
}
-/*
- * Save a segment.
- */
-#define savesegment(seg,value) \
- asm volatile("movl %%" #seg ",%0":"=m" (*(int *)&(value)))
-
int copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
unsigned long unused,
struct task_struct * p, struct pt_regs * regs)
{
struct pt_regs * childregs;
struct task_struct *tsk;
+ tsk = current;
childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p->thread_info)) - 1;
struct_cpy(childregs, regs);
@@ -289,13 +287,14 @@ int copy_thread(int nr, unsigned long cl
p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
+ p->thread.espss12[0] = tsk->thread.espss12[0];
+ p->thread.espss12[1] = tsk->thread.espss12[1];
p->thread.eip = (unsigned long) ret_from_fork;
savesegment(fs,p->thread.fs);
savesegment(gs,p->thread.gs);
- tsk = current;
unlazy_fpu(tsk);
struct_cpy(&p->thread.i387, &tsk->thread.i387);
@@ -307,6 +306,8 @@ int copy_thread(int nr, unsigned long cl
IO_BITMAP_BYTES);
}
+ p->thread.multiring_mode = (tsk->thread.multiring_mode && p->mm->context.idt.opaque) ? 1 : 0;
+
/*
* Set a new TLS for the child thread?
*/
@@ -319,6 +320,8 @@ int copy_thread(int nr, unsigned long cl
return -EFAULT;
if (LDT_empty(&info))
return -EINVAL;
+ if (LDT_handle_perm(&info, &p->mm->context))
+ return -EPERM;
idx = info.entry_number;
if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
@@ -327,6 +330,12 @@ int copy_thread(int nr, unsigned long cl
desc = p->thread.tls_array + idx - GDT_ENTRY_TLS_MIN;
desc->a = LDT_entry_a(&info);
desc->b = LDT_entry_b(&info);
+ if(unlikely(current->mm->context.idt.opaque))
+ {
+ if(unlikely(info.dpl == 0))
+ return -EINVAL;
+ desc->b = (desc->b & ~(3 << 13)) | (info.dpl << 13);
+ }
}
return 0;
}
@@ -420,6 +429,10 @@ void __switch_to(struct task_struct *pre
*/
tss->esp0 = next->esp0;
+ /* multiring */
+ tss->espss12[0] = next->espss12[0];
+ tss->espss12[1] = next->espss12[1];
+
/*
* Load the per-thread Thread-Local Storage descriptor.
*/
@@ -599,8 +612,12 @@ asmlinkage int sys_set_thread_area(struc
if (copy_from_user(&info, u_info, sizeof(info)))
return -EFAULT;
- idx = info.entry_number;
+ if (LDT_handle_perm(&info, &current->mm->context))
+ return -EPERM;
+
+ idx = info.entry_number;
+
/*
* index -1 means the kernel should try to find and
* allocate an empty descriptor:
@@ -618,20 +635,25 @@ asmlinkage int sys_set_thread_area(struc
desc = t->tls_array + idx - GDT_ENTRY_TLS_MIN;
- /*
- * We must not get preempted while modifying the TLS.
- */
- cpu = get_cpu();
-
if (LDT_empty(&info)) {
desc->a = 0;
desc->b = 0;
} else {
desc->a = LDT_entry_a(&info);
desc->b = LDT_entry_b(&info);
+ if(unlikely(current->mm->context.idt.opaque))
+ {
+ if(unlikely(info.dpl == 0))
+ return -EINVAL;
+ desc->b = (desc->b & ~(3 << 13)) | (info.dpl << 13);
+ }
}
- load_TLS(t, cpu);
+ /*
+ * We must not get preempted while modifying the TLS.
+ */
+ cpu = get_cpu();
+ load_TLS(t, cpu);
put_cpu();
return 0;
@@ -656,6 +678,7 @@ asmlinkage int sys_set_thread_area(struc
#define GET_LIMIT_PAGES(desc) (((desc)->b >> 23) & 1)
#define GET_PRESENT(desc) (((desc)->b >> 15) & 1)
#define GET_USEABLE(desc) (((desc)->b >> 20) & 1)
+#define GET_DPL(desc) (((desc)->b >> 13) & 3)
asmlinkage int sys_get_thread_area(struct user_desc *u_info)
{
@@ -679,6 +702,7 @@ asmlinkage int sys_get_thread_area(struc
info.limit_in_pages = GET_LIMIT_PAGES(desc);
info.seg_not_present = !GET_PRESENT(desc);
info.useable = GET_USEABLE(desc);
+ info.dpl = GET_DPL(desc);
if (copy_to_user(u_info, &info, sizeof(info)))
return -EFAULT;
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/ptrace.c linux-2.5.44_multiring/arch/i386/kernel/ptrace.c
--- linux-2.5.44/arch/i386/kernel/ptrace.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/ptrace.c 2002-10-27 00:44:03.000000000 +0200
@@ -76,24 +76,24 @@ static int putreg(struct task_struct *ch
{
switch (regno >> 2) {
case FS:
- if (value && (value & 3) != 3)
+ if (value && !(value & 3))
return -EIO;
child->thread.fs = value;
return 0;
case GS:
- if (value && (value & 3) != 3)
+ if (value && !(value & 3))
return -EIO;
child->thread.gs = value;
return 0;
case DS:
case ES:
- if (value && (value & 3) != 3)
+ if (value && !(value & 3))
return -EIO;
value &= 0xffff;
break;
case SS:
case CS:
- if ((value & 3) != 3)
+ if (!(value & 3))
return -EIO;
value &= 0xffff;
break;
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/signal.c linux-2.5.44_multiring/arch/i386/kernel/signal.c
--- linux-2.5.44/arch/i386/kernel/signal.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/signal.c 2002-10-27 00:44:03.000000000 +0200
@@ -162,7 +162,7 @@ restore_sigcontext(struct pt_regs *regs,
#define COPY_SEG_STRICT(seg) \
{ unsigned short tmp; \
err |= __get_user(tmp, &sc->seg); \
- regs->x##seg = tmp|3; }
+ regs->x##seg = (tmp & 3) ? tmp : (tmp | USER_RING); }
#define GET_SEG(seg) \
{ unsigned short tmp; \
@@ -338,7 +338,7 @@ get_sigframe(struct k_sigaction *ka, str
}
/* This is the legacy signal stack switching. */
- else if ((regs->xss & 0xffff) != __USER_DS &&
+ else if ((regs->xss & 0xfffc) != USER_DS_RPL(0) &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer) {
esp = (unsigned long) ka->sa.sa_restorer;
@@ -350,6 +350,7 @@ get_sigframe(struct k_sigaction *ka, str
static void setup_frame(int sig, struct k_sigaction *ka,
sigset_t *set, struct pt_regs * regs)
{
+ unsigned ring;
struct sigframe *frame;
int err = 0;
@@ -398,10 +399,10 @@ static void setup_frame(int sig, struct
regs->eip = (unsigned long) ka->sa.sa_handler;
set_fs(USER_DS);
- regs->xds = __USER_DS;
- regs->xes = __USER_DS;
- regs->xss = __USER_DS;
- regs->xcs = __USER_CS;
+ ring = get_user_ring();
+ regs->xds = regs->xes = regs->xss = USER_DS_RPL(ring);
+ regs->xcs = USER_CS_RPL(ring);
+ put_user_ring();
regs->eflags &= ~TF_MASK;
#if DEBUG_SIG
@@ -422,6 +423,7 @@ static void setup_rt_frame(int sig, stru
{
struct rt_sigframe *frame;
int err = 0;
+ unsigned ring;
frame = get_sigframe(ka, regs, sizeof(*frame));
@@ -473,10 +475,10 @@ static void setup_rt_frame(int sig, stru
regs->eip = (unsigned long) ka->sa.sa_handler;
set_fs(USER_DS);
- regs->xds = __USER_DS;
- regs->xes = __USER_DS;
- regs->xss = __USER_DS;
- regs->xcs = __USER_CS;
+ ring = get_user_ring();
+ regs->xds = regs->xes = regs->xss = USER_DS_RPL(ring);
+ regs->xcs = USER_CS_RPL(ring);
+ put_user_ring();
regs->eflags &= ~TF_MASK;
#if DEBUG_SIG
@@ -556,7 +558,7 @@ int do_signal(struct pt_regs *regs, sigs
* kernel mode. Just return without doing anything
* if so.
*/
- if ((regs->xcs & 3) != 3)
+ if (!(regs->xcs & 3))
return 1;
if (current->flags & PF_FREEZE) {
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/kernel/traps.c linux-2.5.44_multiring/arch/i386/kernel/traps.c
--- linux-2.5.44/arch/i386/kernel/traps.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/kernel/traps.c 2002-10-27 00:44:03.000000000 +0200
@@ -49,6 +49,8 @@
#include <linux/irq.h>
#include <linux/module.h>
+#include <asm/idt.h>
+
asmlinkage int system_call(void);
asmlinkage void lcall7(void);
asmlinkage void lcall27(void);
@@ -62,6 +64,10 @@ struct desc_struct default_ldt[] = { { 0
* for this.
*/
struct desc_struct idt_table[256] __attribute__((__section__(".data.idt"))) = { {0, 0}, };
+#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_F00F_BUG)
+pte_t* idt_pte;
+pte_t idt_table_pte;
+#endif
asmlinkage void divide_error(void);
asmlinkage void debug(void);
@@ -249,6 +255,30 @@ bad:
printk("\n");
}
+void __init load_idt_table_init(unsigned cpu, pgprot_t prot)
+{
+ struct Xgt_desc_struct map_idt_descr;
+
+ if(use_highidt
+#ifdef CONFIG_X86_F00F_BUG
+ || (pgprot_val(prot) == __PAGE_KERNEL_RO)
+#endif
+ )
+ {
+ unsigned idt_vstart;
+ idt_vstart = __fix_to_virt(FIX_IDT_BEGIN);
+ idt_pte = pte_offset_kernel(pmd_offset(pgd_offset_k(idt_vstart), (idt_vstart)), (idt_vstart));
+
+ idt_table_pte = pfn_pte(__pa(idt_table) >> PAGE_SHIFT, prot);
+ set_pte(idt_pte - cpu, idt_table_pte);
+ map_idt_descr.size = IDT_SIZE - 1;
+ map_idt_descr.address = __fix_to_virt(FIX_IDT_BEGIN + cpu);
+ __asm__ __volatile__("lidt %0": "=m" (map_idt_descr));
+ }
+ else
+ __asm__ __volatile__("lidt %0": "=m" (idt_descr));
+}
+
static void handle_BUG(struct pt_regs *regs)
{
unsigned short ud2;
@@ -814,20 +844,6 @@ asmlinkage void math_emulate(long arg)
#endif /* CONFIG_MATH_EMULATION */
-#ifdef CONFIG_X86_F00F_BUG
-void __init trap_init_f00f_bug(void)
-{
- __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);
-
- /*
- * Update the IDT descriptor and reload the IDT so that
- * it uses the read-only mapped virtual address.
- */
- idt_descr.address = fix_to_virt(FIX_F00F_IDT);
- __asm__ __volatile__("lidt %0": "=m" (idt_descr));
-}
-#endif
-
#define _set_gate(gate_addr,type,dpl,addr) \
do { \
int __d0, __d1; \
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/mach-generic/irq_vectors.h linux-2.5.44_multiring/arch/i386/mach-generic/irq_vectors.h
--- linux-2.5.44/arch/i386/mach-generic/irq_vectors.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/mach-generic/irq_vectors.h 2002-10-27 00:44:03.000000000 +0200
@@ -22,16 +22,19 @@
#ifndef _ASM_IRQ_VECTORS_H
#define _ASM_IRQ_VECTORS_H
+#define MULTIRING_SPECIAL_FIRST_VECTOR 0x20
+#define MULTIRING_SPECIAL_LAST_VECTOR 0x2f
+
/*
* IDT vectors usable for external interrupt sources start
- * at 0x20:
+ * at 0x30:
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x30
#define SYSCALL_VECTOR 0x80
/*
- * Vectors 0x20-0x2f are used for ISA interrupts.
+ * Vectors 0x30-0x3f are used for ISA interrupts.
*/
/*
@@ -49,6 +52,9 @@
#define RESCHEDULE_VECTOR 0xfc
#define CALL_FUNCTION_VECTOR 0xfb
+#define MULTIRING_AUTO_LAST_VECTOR 0xfa
+#define MULTIRING_AUTO_FIRST_VECTOR 0xf1
+
#define THERMAL_APIC_VECTOR 0xf0
/*
* Local APIC timer IRQ vector is on a different priority level,
@@ -58,26 +64,26 @@
#define LOCAL_TIMER_VECTOR 0xef
/*
- * First APIC vector available to drivers: (vectors 0x30-0xee)
- * we start at 0x31 to spread out vectors evenly between priority
+ * First APIC vector available to drivers: (vectors 0x41-0xee)
+ * we start at 0x41 to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
*/
-#define FIRST_DEVICE_VECTOR 0x31
+#define FIRST_DEVICE_VECTOR 0x41
#define FIRST_SYSTEM_VECTOR 0xef
#define TIMER_IRQ 0
/*
- * 16 8259A IRQ's, 208 potential APIC interrupt sources.
+ * 16 8259A IRQ's, 192 potential APIC interrupt sources.
* Right now the APIC is mostly only used for SMP.
* 256 vectors is an architectural limit. (we can have
* more than 256 devices theoretically, but they will
* have to use shared interrupts)
* Since vectors 0x00-0x1f are used/reserved for the CPU,
- * the usable vector space is 0x20-0xff (224 vectors)
+ * the usable vector space is 0x30-0xff (208 vectors)
*/
#ifdef CONFIG_X86_IO_APIC
-#define NR_IRQS 224
+#define NR_IRQS 208
#else
#define NR_IRQS 16
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/mach-visws/irq_vectors.h linux-2.5.44_multiring/arch/i386/mach-visws/irq_vectors.h
--- linux-2.5.44/arch/i386/mach-visws/irq_vectors.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/mach-visws/irq_vectors.h 2002-10-27 00:44:03.000000000 +0200
@@ -1,11 +1,14 @@
#ifndef _ASM_IRQ_VECTORS_H
#define _ASM_IRQ_VECTORS_H
+#define MULTIRING_SPECIAL_FIRST_VECTOR 0x20
+#define MULTIRING_SPECIAL_LAST_VECTOR 0x2f
+
/*
* IDT vectors usable for external interrupt sources start
- * at 0x20:
+ * at 0x30:
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x30
#define SYSCALL_VECTOR 0x80
@@ -28,6 +31,9 @@
#define RESCHEDULE_VECTOR 0xfc
#define CALL_FUNCTION_VECTOR 0xfb
+#define MULTIRING_AUTO_LAST_VECTOR 0xfa
+#define MULTIRING_AUTO_FIRST_VECTOR 0xf1
+
#define THERMAL_APIC_VECTOR 0xf0
/*
* Local APIC timer IRQ vector is on a different priority level,
@@ -37,26 +43,26 @@
#define LOCAL_TIMER_VECTOR 0xef
/*
- * First APIC vector available to drivers: (vectors 0x30-0xee)
- * we start at 0x31 to spread out vectors evenly between priority
+ * First APIC vector available to drivers: (vectors 0x41-0xee)
+ * we start at 0x41 to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
*/
-#define FIRST_DEVICE_VECTOR 0x31
+#define FIRST_DEVICE_VECTOR 0x41
#define FIRST_SYSTEM_VECTOR 0xef
#define TIMER_IRQ 0
/*
- * 16 8259A IRQ's, 208 potential APIC interrupt sources.
+ * 16 8259A IRQ's, 192 potential APIC interrupt sources.
* Right now the APIC is mostly only used for SMP.
* 256 vectors is an architectural limit. (we can have
* more than 256 devices theoretically, but they will
* have to use shared interrupts)
* Since vectors 0x00-0x1f are used/reserved for the CPU,
- * the usable vector space is 0x20-0xff (224 vectors)
+ * the usable vector space is 0x30-0xff (208 vectors)
*/
#ifdef CONFIG_X86_IO_APIC
-#define NR_IRQS 224
+#define NR_IRQS 208
#else
#define NR_IRQS 16
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/math-emu/fpu_entry.c linux-2.5.44_multiring/arch/i386/math-emu/fpu_entry.c
--- linux-2.5.44/arch/i386/math-emu/fpu_entry.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/math-emu/fpu_entry.c 2002-10-27 00:44:03.000000000 +0200
@@ -171,7 +171,7 @@ asmlinkage void math_emulate(long arg)
FPU_EIP += code_base = FPU_CS << 4;
code_limit = code_base + 0xffff; /* Assumes code_base <= 0xffff0000 */
}
- else if ( FPU_CS == __USER_CS && FPU_DS == __USER_DS )
+ else if ( (FPU_CS | 3) == USER_CS_RPL(3) && (FPU_DS | 3) == USER_DS_RPL(3) )
{
addr_modes.default_mode = 0;
}
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/arch/i386/mm/fault.c linux-2.5.44_multiring/arch/i386/mm/fault.c
--- linux-2.5.44/arch/i386/mm/fault.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/arch/i386/mm/fault.c 2002-10-27 00:44:03.000000000 +0200
@@ -157,6 +157,10 @@ asmlinkage void do_page_fault(struct pt_
tsk = current;
+ /* User code at DPL 1 and 2 will also cause the U/S bit to be unset */
+ if(current->thread.multiring_mode && (regs->xcs & 3))
+ error_code |= 4;
+
/*
* We fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
@@ -270,6 +274,7 @@ bad_area:
up_read(&mm->mmap_sem);
/* User mode accesses just cause a SIGSEGV */
+ /* Note: F0 0F C7 C8 in DPL 1 or 2 will cause the if body to be executed (seems unavoidable) */
if (error_code & 4) {
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/fs/exec.c linux-2.5.44_multiring/fs/exec.c
--- linux-2.5.44/fs/exec.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/fs/exec.c 2002-10-27 02:14:07.000000000 +0200
@@ -1021,7 +1021,11 @@ int do_execve(char * filename, char ** a
if (!bprm.mm)
goto out_file;
+#ifdef init_new_context_flags
+ retval = init_new_context_flags(current, bprm.mm, CLONE_EXEC);
+#else
retval = init_new_context(current, bprm.mm);
+#endif
if (retval < 0)
goto out_mm;
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/desc.h linux-2.5.44_multiring/include/asm-i386/desc.h
--- linux-2.5.44/include/asm-i386/desc.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/desc.h 2002-10-27 00:44:03.000000000 +0200
@@ -3,12 +3,16 @@
#include <asm/ldt.h>
#include <asm/segment.h>
+#include <asm/page.h>
+
+#define USER_CSDS_A (((__PAGE_OFFSET - 1) >> PAGE_SHIFT) & 0xffff)
+#define USER_CSDS_B(data, ring) (0x00c09200 | (((__PAGE_OFFSET - 1) >> PAGE_SHIFT) & 0xf0000) | ((!(data)) << 11) | ((ring) << 13))
#ifndef __ASSEMBLY__
#include <asm/mmu.h>
-extern struct desc_struct cpu_gdt_table[NR_CPUS][GDT_ENTRIES];
+extern struct desc_struct cpu_gdt_table[NR_CPUS][GDT_ENTRIES], idt_table[IDT_ENTRIES];
struct Xgt_desc_struct {
unsigned short size;
@@ -61,7 +65,7 @@ static inline void set_ldt_desc(unsigned
((info)->seg_32bit << 22) | \
((info)->limit_in_pages << 23) | \
((info)->useable << 20) | \
- 0x7000)
+ 0x1000 | (USER_RING << 13))
#define LDT_empty(info) (\
(info)->base_addr == 0 && \
@@ -73,6 +77,35 @@ static inline void set_ldt_desc(unsigned
(info)->seg_not_present == 1 && \
(info)->useable == 0 )
+extern int LDT_handle_over_page_offset(mm_context_t* ctx);
+
+static inline int LDT_handle_perm(struct user_desc* info, mm_context_t* ctx)
+{
+ unsigned limit;
+ unsigned maxlim;
+ if(info->base_addr >= __PAGE_OFFSET)
+ return LDT_handle_over_page_offset(ctx);
+
+ limit = info->limit & 0xfffff;
+ if(info->limit_in_pages)
+ limit = (limit << PAGE_SHIFT) + (PAGE_SIZE - 1);
+
+ maxlim = (__PAGE_OFFSET - 1) - info->base_addr;
+ if(limit > maxlim)
+ {
+ if(maxlim <= 0xfffff)
+ {
+ info->limit = maxlim;
+ info->limit_in_pages = 0;
+ }
+ else if(!(info->base_addr & ~PAGE_MASK))
+ info->limit = maxlim >> PAGE_SHIFT;
+ else
+ return LDT_handle_over_page_offset(ctx);
+ }
+ return 0;
+}
+
#if TLS_SIZE != 24
# error update this code.
#endif
@@ -117,6 +150,17 @@ static inline void load_LDT(mm_context_t
put_cpu();
}
+static inline unsigned get_user_ring(void)
+{
+ preempt_disable();
+ return unlikely(current->thread.multiring_mode) ? MULTIRING_USER_RING : USER_RING;
+}
+
+static inline void put_user_ring(void)
+{
+ preempt_enable();
+}
+
#endif /* !__ASSEMBLY__ */
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/fixmap.h linux-2.5.44_multiring/include/asm-i386/fixmap.h
--- linux-2.5.44/include/asm-i386/fixmap.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/fixmap.h 2002-10-27 00:44:03.000000000 +0200
@@ -48,6 +48,10 @@
* future, say framebuffers for the console driver(s) could be
* fix-mapped?
*/
+
+extern int __fix_idt_begin_should_have_been_optimized_away(void);
+extern int __fix_idt_end_should_have_been_optimized_away(void);
+
enum fixed_addresses {
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
@@ -62,9 +66,15 @@ enum fixed_addresses {
FIX_LI_PCIA, /* Lithium PCI Bridge A */
FIX_LI_PCIB, /* Lithium PCI Bridge B */
#endif
-#ifdef CONFIG_X86_F00F_BUG
- FIX_F00F_IDT, /* Virtual mapping for IDT */
+
+#if defined(CONFIG_X86_F00F_BUG) || defined(CONFIG_X86_HIGHIDT)
+ FIX_IDT_BEGIN, /* Virtual mapping for IDT */
+ FIX_IDT_END = FIX_IDT_BEGIN + NR_CPUS - 1,
+#else
+#define FIX_IDT_BEGIN __fix_idt_begin_should_have_been_optimized_away()
+#define FIX_IDT_END __fix_idt_end_should_have_been_optimized_away()
#endif
+
#ifdef CONFIG_X86_CYCLONE
FIX_CYCLONE_TIMER, /*cyclone timer register*/
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/idt.h linux-2.5.44_multiring/include/asm-i386/idt.h
--- linux-2.5.44/include/asm-i386/idt.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/idt.h 2002-10-27 03:18:13.000000000 +0100
@@ -0,0 +1,236 @@
+/*
+ * linux/include/asm-i386/idt.h: multiring IDT inline functions
+ *
+ * Copyright (C) 2002 Luca Barbieri <[email protected]>
+ */
+
+#ifndef __i386_IDT_H
+#define __i386_IDT_H
+
+#include <linux/config.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/smp.h>
+#include <linux/highmem.h>
+#include <asm/atomic.h>
+#include <linux/spinlock.h>
+
+#include <asm/desc.h>
+#include <asm/segment.h>
+#include <asm/mmu.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
+
+#ifdef CONFIG_X86_HIGHIDT
+#define use_highidt 1
+#else
+#define use_highidt 0
+#endif
+
+#ifdef CONFIG_X86_F00F_BUG
+#define cpu_has_f00f_bug boot_cpu_data.f00f_bug
+#else
+#define cpu_has_f00f_bug 0
+#endif
+
+extern pte_t* idt_pte;
+extern unsigned idt_prot;
+extern pte_t idt_table_pte;
+
+/* access to idt_refcnt */
+static inline struct desc_struct* kmap_idt(mm_context_t* ctx)
+{
+ if(use_highidt)
+ {
+ if(!cpu_has_f00f_bug)
+ {
+ unsigned cpu = get_cpu();
+ return (struct desc_struct*)__fix_to_virt(FIX_IDT_BEGIN + cpu);
+ }
+ else
+ {
+ /* we need to kmap because the window is read-only */
+ return kmap(ctx->idt.page);
+ }
+ }
+ return ctx->idt.addr;
+}
+
+static inline rwlock_t* idt_lock(struct desc_struct* idt)
+{
+ return (rwlock_t*)((char*)idt + IDT_SIZE);
+}
+
+/* read access */
+static inline struct desc_struct* kmap_read_idt(mm_context_t* ctx)
+{
+ struct desc_struct* idt = kmap_idt(ctx);
+ read_lock(idt_lock(idt));
+ return idt;
+}
+
+/* write access */
+static inline struct desc_struct* kmap_write_idt(mm_context_t* ctx)
+{
+ struct desc_struct* idt = kmap_idt(ctx);
+ write_lock(idt_lock(idt));
+ return idt;
+}
+
+static inline struct desc_struct* kmap_read_idt_or_table(mm_context_t* ctx)
+{
+ if(current->thread.multiring_mode)
+ return kmap_read_idt(ctx);
+ else
+ return idt_table;
+}
+
+static inline void kunmap_idt(mm_context_t* ctx, struct desc_struct* idt)
+{
+ if(use_highidt)
+ {
+ if(!cpu_has_f00f_bug)
+ put_cpu();
+ else
+ kunmap(ctx->idt.page);
+ }
+}
+
+static inline void kunmap_read_idt(mm_context_t* ctx, struct desc_struct* idt)
+{
+ read_unlock(idt_lock(idt));
+ kunmap_idt(ctx, idt);
+}
+
+static inline void kunmap_write_idt(mm_context_t* ctx, struct desc_struct* idt)
+{
+ write_unlock(idt_lock(idt));
+ kunmap_idt(ctx, idt);
+}
+
+static inline void kunmap_read_idt_or_table(mm_context_t* ctx, struct desc_struct* idt)
+{
+ if(idt != idt_table)
+ return kunmap_read_idt(ctx, idt);
+}
+
+static inline void load_idt_table(unsigned cpu)
+{
+ if(use_highidt || cpu_has_f00f_bug)
+ {
+ set_pte(idt_pte - cpu, idt_table_pte);
+ __flush_tlb_one(__fix_to_virt(FIX_IDT_BEGIN + cpu));
+ }
+ else
+ __asm__ __volatile__("lidt %0": "=m" (idt_descr));
+}
+
+static inline void change_gdt_ring(unsigned cpu, unsigned ring)
+{
+ cpu_gdt_table[cpu][GDT_ENTRY_DEFAULT_USER_CS].b = USER_CSDS_B(0, ring);
+ cpu_gdt_table[cpu][GDT_ENTRY_DEFAULT_USER_DS].b = USER_CSDS_B(1, ring);
+}
+
+extern void multiring_init_task(task_t* tsk);
+
+static inline void load_IDT_nolock_inline(task_t* tsk, mm_context_t* ctx, unsigned cpu)
+{
+ if(!ctx->idt.opaque)
+ {
+ load_idt_table(cpu);
+ change_gdt_ring(cpu, 3);
+ return;
+ }
+
+ change_gdt_ring(cpu, MULTIRING_USER_RING);
+ if(use_highidt || cpu_has_f00f_bug)
+ {
+ set_pte(idt_pte - cpu, cpu_has_f00f_bug ? pfn_pte(__pa(ctx->idt.addr) >> PAGE_SHIFT, PAGE_KERNEL_RO) : mk_pte(ctx->idt.page, PAGE_KERNEL));
+ __flush_tlb_one(__fix_to_virt(FIX_IDT_BEGIN + cpu));
+ }
+ else
+ {
+ struct Xgt_desc_struct map_idt_descr;
+ map_idt_descr.size = IDT_SIZE - 1;
+ map_idt_descr.address = (unsigned long)ctx->idt.addr;
+ __asm__ __volatile__("lidt %0": "=m" (map_idt_descr));
+ }
+ if(unlikely(!tsk->thread.multiring_mode))
+ {
+ multiring_init_task(tsk);
+ tsk->thread.multiring_mode = 1;
+ }
+}
+
+static inline void load_IDT_inline(task_t* tsk, mm_context_t* ctx)
+{
+ unsigned cpu = get_cpu();
+ load_IDT_nolock_inline(tsk, ctx, cpu);
+ put_cpu();
+}
+
+extern void load_IDT_nolock(mm_context_t* ctx, unsigned cpu);
+
+static inline void load_IDT(mm_context_t* ctx)
+{
+ unsigned cpu = get_cpu();
+ load_IDT_nolock(ctx, cpu);
+ put_cpu();
+}
+
+static inline atomic_t* idt_refcnt(struct desc_struct* idt)
+{
+ return (atomic_t*)((char*)idt + IDT_SIZE + sizeof(rwlock_t));
+}
+
+static inline struct desc_struct* __alloc_idt(union idt* idtu)
+{
+ if(use_highidt)
+ {
+ /* Use high memory */
+ idtu->page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
+ return (struct desc_struct*)kmap(idtu->page);
+ }
+ else
+ {
+ /* TODO: use an aligned 2KB allocator instead */
+ return idtu->addr = (struct desc_struct*)__get_free_page(GFP_KERNEL);
+ }
+}
+
+static inline struct desc_struct* alloc_idt(union idt* idtu)
+{
+ struct desc_struct* idt = __alloc_idt(idtu);
+ if(idt)
+ {
+ *idt_lock(idt) = RW_LOCK_UNLOCKED;
+ atomic_set(idt_refcnt(idt), 1);
+ }
+ return idt;
+}
+
+static inline void kunmap_new_idt(union idt* idtu, struct desc_struct* idt)
+{
+ if(use_highidt)
+ kunmap(idtu->page);
+}
+
+static inline void __free_idt(mm_context_t* ctx)
+{
+ if(use_highidt)
+ __free_page(ctx->idt.page);
+ else
+ free_page((unsigned long)ctx->idt.addr);
+}
+
+static inline void free_idt(mm_context_t* ctx)
+{
+ struct desc_struct* idt = kmap_idt(ctx);
+ int free = atomic_dec_and_test(idt_refcnt(idt));
+ kunmap_idt(ctx, idt);
+ if(free)
+ __free_idt(ctx);
+}
+
+#endif
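
As an aside, the intended locking pattern for the helpers above is the
following (a purely illustrative sketch; the function name is made up and
does not appear in the patch):

static unsigned long peek_gate_b(mm_context_t *ctx, unsigned vec)
{
	struct desc_struct *idt;
	unsigned long b;

	/* Returns the per-process IDT, read-locked and kmapped/fixmapped as
	   needed, when the task is in multiring mode; otherwise the global
	   idt_table. */
	idt = kmap_read_idt_or_table(ctx);
	b = idt[vec].b;
	kunmap_read_idt_or_table(ctx, idt);
	return b;
}
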
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/ldt.h linux-2.5.44_multiring/include/asm-i386/ldt.h
--- linux-2.5.44/include/asm-i386/ldt.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/ldt.h 2002-10-27 00:44:03.000000000 +0200
@@ -22,6 +22,9 @@ struct user_desc {
unsigned int limit_in_pages:1;
unsigned int seg_not_present:1;
unsigned int useable:1;
+
+ /* has effect only in multiring mode, but is returned in any mode */
+ unsigned int dpl:2;
};
#define MODIFY_LDT_CONTENTS_DATA 0
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/mmu_context.h linux-2.5.44_multiring/include/asm-i386/mmu_context.h
--- linux-2.5.44/include/asm-i386/mmu_context.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/mmu_context.h 2002-10-27 03:13:34.000000000 +0100
@@ -6,12 +6,15 @@
#include <asm/atomic.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
+#include <asm/idt.h>
/*
* possibly do the LDT unload here?
*/
#define destroy_context(mm) do { } while(0)
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+int init_new_context_flags(struct task_struct *tsk, struct mm_struct *mm, unsigned flags);
+#define init_new_context_flags init_new_context_flags
+#define init_new_context(tsk, mm) init_new_context_flags(tsk, mm, 0)
#ifdef CONFIG_SMP
@@ -45,6 +48,9 @@ static inline void switch_mm(struct mm_s
*/
if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT_nolock(&next->context, cpu);
+
+ if (unlikely(prev->context.idt.opaque != next->context.idt.opaque))
+ load_IDT_nolock_inline(tsk, &next->context, cpu);
}
#ifdef CONFIG_SMP
else {
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/mmu.h linux-2.5.44_multiring/include/asm-i386/mmu.h
--- linux-2.5.44/include/asm-i386/mmu.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/mmu.h 2002-10-27 00:44:03.000000000 +0200
@@ -7,10 +7,20 @@
*
* cpu_vm_mask is used to optimize ldt flushing.
*/
+
+union idt
+{
+ struct desc_struct* addr;
+ struct page* page;
+ unsigned long opaque;
+};
+
typedef struct {
int size;
struct semaphore sem;
void *ldt;
+ union idt idt;
+ unsigned char bad_segments;
} mm_context_t;
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/multiring.h linux-2.5.44_multiring/include/asm-i386/multiring.h
--- linux-2.5.44/include/asm-i386/multiring.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/multiring.h 2002-10-27 00:44:03.000000000 +0200
@@ -0,0 +1,194 @@
+/*
+ * linux/include/asm-i386/multiring.h: header for multiple privilege rings support (available to user-mode)
+ *
+ * Copyright (C) 2002 Luca Barbieri <[email protected]>
+ */
+
+#ifndef __i386_MULTIRING_H
+#define __i386_MULTIRING_H
+
+#define MULTIRING_CHECK 0
+
+#define MULTIRING_GET_ESPSS 1
+#define MULTIRING_SET_ESPSS 2
+
+#define MULTIRING_GET 3
+#define MULTIRING_GET_RANGE 4
+
+#define MULTIRING_ELEVATE 5
+#define MULTIRING_SET 6
+#define MULTIRING_COPY 7
+#define MULTIRING_LAST_OP 7
+
+#define MULTIRING_VEC_FREE ((unsigned char)~0)
+
+#ifdef __KERNEL__
+#define multiring_gate desc_struct
+#else
+struct multiring_gate
+{
+ unsigned long a;
+ unsigned long b;
+};
+#endif
+
+#ifndef __KERNEL__
+#ifndef MULTIRING_NO_SYSCALLS
+#ifndef _syscall1
+#include <asm/unistd.h>
+#endif
+
+#define __NR_multiring0 __NR_multiring
+#define __NR_multiring1 __NR_multiring
+#define __NR_multiring2 __NR_multiring
+#define __NR_multiring3 __NR_multiring
+
+_syscall1(int, multiring0, unsigned, op);
+_syscall2(int, multiring1, unsigned, op, unsigned long, arg1);
+_syscall3(int, multiring2, unsigned, op, unsigned long, arg1, unsigned long, arg2);
+_syscall4(int, multiring3, unsigned, op, unsigned long, arg1, unsigned long, arg2, unsigned long, arg3);
+#endif
+
+#ifndef MULTIRING_NO_HELPERS
+/*
+ Check whether the program is in multiring mode.
+
+ Return value:
+ 0: multiring mode
+ 1: normal mode
+ -1/EPERM: normal mode; can't enter multiring mode due to a bad segment
+*/
+static inline int multiring_check(void)
+{
+ return multiring0(MULTIRING_CHECK);
+}
+
+/* Enters multiring mode.
+ This fails if you have previously set up a "bad segment" with modify_ldt, set_thread_area or CLONE_SETTLS.
+ A bad segment is one that either starts in the kernel memory region, or has a page-granularity limit large enough to make it reach into the kernel region while its base address is not page-aligned.
+
+ This will cause the process to get a private IDT that starts as a copy of the global one, but with the SYSCALL_VECTOR DPL set to 1.
+ Multiring processes also have the DPL on the GDT CS and DS descriptors set to 1.
+ Upon return, in all threads, cs/ds/es/ss/fs/gs selectors pointing to the default CS/DS will be changed to have RPL=1.
+
+ Return value:
+ 0: success
+ -1/EPERM: can't enter multiring mode due to a bad segment
+ -1/ENOMEM: couldn't allocate the IDT
+*/
+static inline int multiring_elevate(void)
+{
+ return multiring0(MULTIRING_ELEVATE);
+}
+
+/*
+ Return TSS ESP/SS value for a privilege ring
+
+ Parameters:
+ ring: privilege ring whose tss esp/ss you want to get
+ esp: pointer to returned tss esp value
+ ss: pointer to returned tss ss value
+*/
+static inline int multiring_get_espss(unsigned ring, unsigned* esp, unsigned* ss)
+{
+ return multiring3(MULTIRING_GET_ESPSS, ring, (unsigned long)esp, (unsigned long)ss);
+}
+
+/*
+ Set TSS ESP/SS value for a privilege ring
+ Note: multiring mode is currently not required for this operation, but you should not rely on that remaining the case.
+
+ Parameters:
+ ring: privilege ring whose tss esp/ss you want to set
+ esp: new tss esp value
+ ss: new tss ss value - must have RPL == ring
+*/
+static inline int multiring_set_espss(unsigned ring, unsigned esp, unsigned ss)
+{
+ return multiring3(MULTIRING_SET_ESPSS, ring, esp, ss);
+}
+
+/*
+ Get a single interrupt gate.
+
+ Parameters:
+ vec: vector number
+ gate: pointer to returned multiring_gate
+*/
+static inline int multiring_get(unsigned vec, struct multiring_gate* gate)
+{
+ return multiring2(MULTIRING_GET, vec, (unsigned long)gate);
+}
+
+/*
+ Get a range of interrupt gates.
+
+ Parameters:
+ first: first vector number
+ last: last vector number
+ gates: pointer to returned multiring_gate structs (size must be >= (last - first + 1) * sizeof(struct multiring_gate))
+*/
+static inline int multiring_get_range(unsigned first, unsigned last, struct multiring_gate* gates)
+{
+ return multiring3(MULTIRING_GET_RANGE, first, last, (unsigned long)gates);
+}
+
+/*
+ Set an interrupt gate.
+
+ Parameters:
+ vec: vector number or MULTIRING_VEC_FREE to get a free one (or ENOSPC if none found)
+ a: first 32-bit word of interrupt gate
+ b: second 32-bit word of interrupt gate
+
+ Return value:
+ >= 0: success, return value is vector number
+ -1/EPERM: you tried to set an unmodifiable vector or to create a gate to a non-syscall RPL 0 address, or there was another permission-denied error
+ -1/ENXIO: you weren't in multiring mode
+ -1/EINVAL: you tried to create a task/interrupt gate, a DPL 0 gate, a gate whose reserved bits have the wrong value, or something else considered invalid/unsupported
+ -1/ENOSPC: you specified MULTIRING_VEC_FREE but there was no available free vector
+*/
+static inline int multiring_set(unsigned vec, unsigned long a, unsigned long b)
+{
+ return multiring3(MULTIRING_SET, vec, a, b);
+}
+
+/* like multiring_set with vec=MULTIRING_VEC_FREE */
+static inline int multiring_set_free(unsigned long a, unsigned long b)
+{
+ return multiring_set(MULTIRING_VEC_FREE, a, b);
+}
+
+/* like multiring_set with a=0 b=0 */
+static inline int multiring_free(unsigned vec)
+{
+ return multiring_set(vec, 0, 0);
+}
+
+/*
+ Set an interrupt gate based on another one
+
+ Parameters:
+ vec: vector number or MULTIRING_VEC_FREE to get a free one (or ENOSPC if none found)
+ from: vector number to copy from
+
+ Return value:
+ >= 0: success, return value is vector number
+ -1/ENXIO: you weren't in multiring mode
+ -1/EPERM: you tried to set an unmodifiable vector or to create a gate to a non-syscall RPL 0 address, or there was another permission-denied error
+ -1/ENOSPC: you specified MULTIRING_VEC_FREE but there was no available free vector
+*/
+static inline int multiring_copy(unsigned vec, unsigned from)
+{
+ return multiring2(MULTIRING_COPY, vec, from);
+}
+
+/* like multiring_copy with vec=MULTIRING_VEC_FREE */
+static inline int multiring_copy_free(unsigned from)
+{
+ return multiring_copy(MULTIRING_VEC_FREE, from);
+}
+
+#endif
+#endif
+#endif
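
To show how the helpers documented above are meant to be used from an
ordinary program, here is a minimal, purely illustrative sketch (it assumes
nothing beyond this header and the C library):

#include <stdio.h>
#include <asm/multiring.h>	/* pulls in the _syscall stubs and the helpers */

int main(void)
{
	struct multiring_gate gate;

	if (multiring_check() < 0) {
		/* a "bad segment" was set up earlier: multiring is unavailable */
		perror("multiring_check");
		return 1;
	}
	if (multiring_elevate() < 0) {
		perror("multiring_elevate");
		return 1;
	}
	/* we now run with RPL 1 selectors and a private IDT */
	if (multiring_get(0x80, &gate) == 0)
		printf("syscall gate: a=%08lx b=%08lx\n", gate.a, gate.b);
	return 0;
}
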
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/processor.h linux-2.5.44_multiring/include/asm-i386/processor.h
--- linux-2.5.44/include/asm-i386/processor.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/processor.h 2002-10-27 00:44:04.000000000 +0200
@@ -332,14 +332,17 @@ typedef struct {
unsigned long seg;
} mm_segment_t;
+struct espss
+{
+ unsigned long esp;
+ unsigned long ss;
+};
+
struct tss_struct {
unsigned short back_link,__blh;
unsigned long esp0;
unsigned short ss0,__ss0h;
- unsigned long esp1;
- unsigned short ss1,__ss1h;
- unsigned long esp2;
- unsigned short ss2,__ss2h;
+ struct espss espss12[2];
unsigned long __cr3;
unsigned long eip;
unsigned long eflags;
@@ -367,10 +370,12 @@ struct thread_struct {
/* cached TLS descriptors. */
struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES];
unsigned long esp0;
+ struct espss espss12[2];
unsigned long eip;
unsigned long esp;
unsigned long fs;
unsigned long gs;
+ unsigned char multiring_mode;
/* Hardware debugging registers */
unsigned long debugreg[8]; /* %%db0-7 debug registers */
/* fault info */
@@ -388,7 +393,8 @@ struct thread_struct {
#define INIT_THREAD { \
{ { 0, 0 } , }, \
0, \
- 0, 0, 0, 0, \
+ { {0, 0}, {0, 0} }, \
+ 0, 0, 0, 0, 0, \
{ [0 ... 7] = 0 }, /* debugging registers */ \
0, 0, 0, \
{ { 0, }, }, /* 387 state */ \
@@ -400,7 +406,7 @@ struct thread_struct {
0,0, /* back_link, __blh */ \
sizeof(init_stack) + (long) &init_stack, /* esp0 */ \
__KERNEL_DS, 0, /* ss0 */ \
- 0,0,0,0,0,0, /* stack1, stack2 */ \
+ { {0, 0}, {0, 0} }, /* stack1, stack2 */ \
0, /* cr3 */ \
0,0, /* eip,eflags */ \
0,0,0,0, /* eax,ecx,edx,ebx */ \
@@ -415,10 +421,10 @@ struct thread_struct {
#define start_thread(regs, new_eip, new_esp) do { \
__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
set_fs(USER_DS); \
- regs->xds = __USER_DS; \
- regs->xes = __USER_DS; \
- regs->xss = __USER_DS; \
- regs->xcs = __USER_CS; \
+ regs->xds = USER_DS_RPL(USER_RING); \
+ regs->xes = USER_DS_RPL(USER_RING); \
+ regs->xss = USER_DS_RPL(USER_RING); \
+ regs->xcs = USER_CS_RPL(USER_RING); \
regs->eip = new_eip; \
regs->esp = new_esp; \
} while (0)
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/segment.h linux-2.5.44_multiring/include/asm-i386/segment.h
--- linux-2.5.44/include/asm-i386/segment.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/segment.h 2002-10-27 00:44:04.000000000 +0200
@@ -1,6 +1,12 @@
#ifndef _ASM_SEGMENT_H
#define _ASM_SEGMENT_H
+/* The default user privilege level */
+#define USER_RING 3
+
+/* The multiring most privileged user level */
+#define MULTIRING_USER_RING 1
+
/*
* The layout of the per-CPU GDT under Linux:
*
@@ -43,10 +49,10 @@
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)
#define GDT_ENTRY_DEFAULT_USER_CS 4
-#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS * 8 + 3)
+#define USER_CS_RPL(ring) (GDT_ENTRY_DEFAULT_USER_CS * 8 + (ring))
#define GDT_ENTRY_DEFAULT_USER_DS 5
-#define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS * 8 + 3)
+#define USER_DS_RPL(ring) (GDT_ENTRY_DEFAULT_USER_DS * 8 + (ring))
#define GDT_ENTRY_KERNEL_BASE 12
@@ -76,4 +82,6 @@
*/
#define IDT_ENTRIES 256
+#define IDT_SIZE (IDT_ENTRIES * 8)
+
#endif
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/system.h linux-2.5.44_multiring/include/asm-i386/system.h
--- linux-2.5.44/include/asm-i386/system.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/system.h 2002-10-27 00:44:04.000000000 +0200
@@ -95,6 +95,12 @@ static inline unsigned long _get_base(ch
: :"m" (*(unsigned int *)&(value)))
/*
+ * Save a segment.
+ */
+#define savesegment(seg,value) \
+ asm volatile("movl %%" #seg ",%0":"=m" (*(int *)&(value)))
+
+/*
* Clear and set 'TS' bit respectively
*/
#define clts() __asm__ __volatile__ ("clts")
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/asm-i386/unistd.h linux-2.5.44_multiring/include/asm-i386/unistd.h
--- linux-2.5.44/include/asm-i386/unistd.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/asm-i386/unistd.h 2002-10-27 00:44:04.000000000 +0200
@@ -258,6 +258,7 @@
#define __NR_free_hugepages 251
#define __NR_exit_group 252
#define __NR_lookup_dcookie 253
+#define __NR_multiring 254
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/include/linux/sched.h linux-2.5.44_multiring/include/linux/sched.h
--- linux-2.5.44/include/linux/sched.h 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/include/linux/sched.h 2002-10-27 02:36:47.000000000 +0200
@@ -51,6 +51,14 @@ struct exec_domain;
#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#ifdef __i386__
+#define CLONE_IDT 0x00800000 /* set if IDT is shared between processes (note: CLONE_VM implies CLONE_IDT) */
+#define CLONE_CLEAR_IDT 0x01000000 /* set to clear the IDT */
+#define CLONE_EXEC CLONE_CLEAR_IDT
+#else
+#define CLONE_EXEC 0
+#endif
+
/*
* List of flags we want to share for kernel threads,
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.44/kernel/fork.c linux-2.5.44_multiring/kernel/fork.c
--- linux-2.5.44/kernel/fork.c 2002-10-27 02:38:39.000000000 +0100
+++ linux-2.5.44_multiring/kernel/fork.c 2002-10-27 00:44:04.000000000 +0200
@@ -435,7 +435,11 @@ static int copy_mm(unsigned long clone_f
if (!mm_init(mm))
goto fail_nomem;
- if (init_new_context(tsk,mm))
+#ifdef init_new_context_flags
+ if (init_new_context_flags(tsk, mm, clone_flags))
+#else
+ if (init_new_context(tsk, mm))
+#endif
goto free_pt;
down_write(&oldmm->mmap_sem);
Luca Barbieri <[email protected]> writes:
> Short explaination:
> This patch implements a feature called "x86 multiring", which is a
> shorthand for x86 multiple user-mode privilege rings support.
> It allows user-mode programs to create DPL 1 and 2 segments and get a
> modifiable per-process copy of IDT.
>
> User Mode Linux can use these features to implement a syscall mechanism
> identical to the one used by the kernel-mode kernel, and thus much
> faster than the current one, with free memory protection and with zero
> context switches.
But there are privilege switches.
> Wine could also use it to achieve fast syscall-level emulation of
> Windows NT (and, to a lesser extent, Windows 3.1 and 9x).
>
> Obviously there is some risk of the patch creating security holes.
Let me get the gist of the idea.
To accelerate UML and Wine-type applications:
1) set up segments with restricted limits, so their children cannot
write into their supervisor process even though they share an mm.
2) load a special system call table that switches processor modes
when any system call is activated.
Unless I am mistaken, all of the above can be accomplished without
using the CPU's multiple rings of privilege, which would allow nesting
limited only by the address space reduction of each task.
Eric
> But there are privilege switches.
Of course, they are unavoidable. However, they are as fast as the ones
needed to make kernel syscalls.
> Let me get the gist of the idea.
> To accelerate UML and Wine-type applications:
> 1) set up segments with restricted limits, so their children cannot
> write into their supervisor process even though they share an mm.
> 2) load a special system call table that switches processor modes
> when any system call is activated.
>
> Unless I am mistaken, all of the above can be accomplished without
> using the CPU's multiple rings of privilege, which would allow nesting
> limited only by the address space reduction of each task.
You also need:
3) Prevent less privileged subtasks from loading segments belonging to
more privileged ones
This can be done in hardware using the x86 privilege rings, at the
cost of limitations on the number of subtasks and the inability to have
protected pairs of subtasks where none is more privileged than the other.
Of course, it is also possible to do this in the kernel, or in a
privileged user-mode task using the LDT/TLS system calls, by modifying
descriptor tables on inter-privilege jumps, but this is obviously
significantly slower.
Anyway, hardware-based and kernel-based privilege separation can
coexist perfectly.
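
To make this concrete, here is a rough sketch of what a ring-1 supervisor
(a UML tracing-thread replacement, say) could do with the new interface so
that its ring-3 subtasks trap directly into it. The entry point, the stack
and the gate encoding below are purely illustrative and not taken from the
patch; the authoritative checks are the ones performed by sys_multiring.
USER_CS_RPL()/USER_DS_RPL() are the macros the patch adds to asm/segment.h.

extern void supervisor_entry(void);	/* hypothetical ring-1 entry point */
extern char supervisor_stack[8192];	/* hypothetical ring-1 stack */

static int setup_ring1_supervisor(void)
{
	unsigned long handler = (unsigned long)supervisor_entry;
	unsigned long a, b;

	if (multiring_elevate() < 0)
		return -1;

	/* stack the CPU switches to when a ring-3 subtask traps to ring 1 */
	if (multiring_set_espss(1,
			(unsigned long)(supervisor_stack + sizeof(supervisor_stack)),
			USER_DS_RPL(1)) < 0)
		return -1;

	/* encode a 32-bit trap gate through the (now DPL 1) user CS, with
	   gate DPL 3 so that ring-3 code may invoke it */
	a = ((unsigned long)USER_CS_RPL(1) << 16) | (handler & 0xffff);
	b = (handler & 0xffff0000) | 0x8000 /* present */
		| (3 << 13) /* DPL 3 */ | (0xf << 8) /* trap gate */;

	/* ask for any free settable vector; subtasks then enter the
	   supervisor with "int $vector", much like a kernel syscall */
	return multiring_set_free(a, b);
}
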
On Sunday 27 October 2002 03:48, Luca Barbieri wrote:
> Short explaination:
> This patch implements a feature called "x86 multiring", which is a
> shorthand for x86 multiple user-mode privilege rings support.
> It allows user-mode programs to create DPL 1 and 2 segments and get a
> modifiable per-process copy of IDT.
>
> User Mode Linux can use these features to implement a syscall mechanism
> identical to the one used by the kernel-mode kernel, and thus much
> faster than the current one, with free memory protection and with zero
> context switches.
>
> Wine could also use it to achieve fast syscall-level emulation of
> Windows NT (and, to a lesser extent, Windows 3.1 and 9x).
Karim once talked about doing a flavor of Adeos that would drop a running
kernel into ring 1 as a result of insmodding an Adeos module, which would
allow Adeos to combine an unmodified Linux kernel with a realtime executive.
--
Daniel
Daniel Phillips wrote:
> Karim once talked about doing a flavor of Adeos that would drop a running
> kernel into ring 1 as a result of insmodding an Adeos module, which would
> allow Adeos to combine an unmodified Linux kernel with a realtime executive.
Yes. The initial Adeos design (http://www.opersys.com/adeos/) spelled out
the details for shoving Linux out of ring 0 and into ring 1 without modifying
it. It would still have access to its page tables, but it wouldn't be allowed
to use some key instructions (including cli/sti). In that scenario, the
nanokernel would be the only thing running at ring 0; everything else would
run in ring 1 and above. This includes all non-Linux OSes (see the Adeos paper
for complete details).
Though this is fine, it is very hardware-dependent. Last I checked, for
example, few archs have 4-level rings. If we're assuming all archs are going
to act/look like x86, it may be worth the effort, but I'm not sure this is
a safe bet. (Which doesn't mean some people can't find this useful; there's
been at least one debugger that follows this method:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102675847422778&w=2)
Instead, it's more interesting to run each OS copy in its own separate
physical address space in privileged mode over Adeos. This implies a few
assumptions, but "in Linux we trust" (i.e. it's not doing any random physical
accesses, and if it is, then it needs to be fixed). The other OSes, such as
emulated WinXYZ, can also have their own physically separate address space
and run in unprivileged mode (ring 1 or worse, depending on your willingness
to implement appropriate handlers for the faults generated by the OS not
running in its intended ring 0). Have a look at the "Practical SMP clusters"
document at the URL above for a discussion of a relatively simple method to
get multiple copies of Linux running side-by-side each in their own separate
physical address space and all linked through Adeos.
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================