2007-06-05 15:27:34

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 1/6] lguest example launcher fix

"struct option" arrays handed to getopt_long() are supposed to be
NULL-terminated. (Another patch added an arg and it finally segv'd).

Signed-off-by: Rusty Russell <[email protected]>
---
Documentation/lguest/lguest.c | 1 +
1 file changed, 1 insertion(+)

===================================================================
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -930,6 +930,7 @@ static struct option opts[] = {
{ "tunnet", 1, NULL, 't' },
{ "block", 1, NULL, 'b' },
{ "initrd", 1, NULL, 'i' },
+ { NULL },
};
static void usage(void)
{



2007-06-05 15:27:57

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 2/6] lguest tsc fix

In recent -mm kernels, the TSC capability cannot be disabled,
resulting in a divide by zero error in the normal sched_clock.

The correct fix is to have a special lguest sched_clock
implementation: this is as simple as it gets.

Signed-off-by: Rusty Russell <[email protected]>
---
drivers/lguest/lguest.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

===================================================================
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -351,11 +351,18 @@ static void lguest_time_irq(unsigned int
update_process_times(user_mode_vm(get_irq_regs()));
}

+static u64 sched_clock_base;
static void lguest_time_init(void)
{
set_irq_handler(0, lguest_time_irq);
hcall(LHCALL_TIMER_READ, 0, 0, 0);
+ sched_clock_base = jiffies_64;
enable_lguest_irq(0);
+}
+
+static unsigned long long lguest_sched_clock(void)
+{
+ return (jiffies_64 - sched_clock_base) * (1000000000 / HZ);
}

static void lguest_load_esp0(struct tss_struct *tss,
@@ -494,6 +501,7 @@ __init void lguest_init(void *boot)
paravirt_ops.time_init = lguest_time_init;
paravirt_ops.set_lazy_mode = lguest_lazy_mode;
paravirt_ops.wbinvd = lguest_wbinvd;
+ paravirt_ops.sched_clock = lguest_sched_clock;

hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);

@@ -507,8 +515,6 @@ __init void lguest_init(void *boot)
cpu_detect(&new_cpu_data);
/* Math is always hard! */
new_cpu_data.hard_math = 1;
-
- tsc_disable = 1;

#ifdef CONFIG_X86_MCE
mce_disabled = 1;


2007-06-05 15:28:26

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 3/6] lguest suppress IDE probing

The IDE probe is the slowest part of boot: by suppressing it we cut
boot from from 3 seconds to half a second.

AFAICT, the commandline is the easiest way to suppress the probing.

Signed-off-by: Rusty Russell <[email protected]>
---
drivers/lguest/lguest.c | 6 ++++++
1 file changed, 6 insertions(+)

===================================================================
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -605,6 +605,12 @@ __init void lguest_init(void *boot)
acpi_disabled = 1;
acpi_ht = 0;
#endif
+#ifdef CONFIG_BLK_DEV_IDE
+ /* Saves three seconds off boot time. */
+ strcat(boot_command_line,
+ " ide0=noprobe ide1=noprobe ide2=noprobe"
+ " ide3=noprobe ide4=noprobe ide5=noprobe");
+#endif

add_preferred_console("hvc", 0, NULL);



2007-06-05 15:28:51

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 4/6] lguest don't signal like crazy, use LHREQ_BREAK command

We currently use a "waker" process: a child of the launcher which
selects() on the incoming file descriptors. It sends a SIGUSR1 to the
launcher whenever select() returns to kick the launcher out of the
kernel.

This has nasty side-effects: the waker needs to keep sending signals
to avoid the race, so we nice it to try to make sure the launcher runs
soon. Also the launcher blocks SIGUSR1 when it's not running the
guest, so it doesn't have to deal with other interrupted reads...

It's better to explicitly tell the kernel to break out of the guest,
and this is what we do, with a new LHREQ_BREAK command. This makes
the launcher return -EAGAIN from reading /dev/lguest, and blocks the
waker until the launcher calls LHREQ_BREAK, avoiding the race.

We also take precautions against simultaneous writes or reads on the
/dev/lguest fd. As only root can open these file descriptors it's not
much of a problem, but we want to relax that restriction eventually.

The main improvement is in consistency, rather than raw benchmark
results:

Before:
Time for one context switch via pipe: 9265 (4534 - 9495)
Time for one Copy-on-Write fault: 67687 (14898 - 159125)
Time to exec client once: 1102812 (795843 - 1128250)
Time for one fork/exit/wait: 712000 (400625 - 723156)
Time for gettimeofday(): 16681 (16378 - 35835)
Time to send 4 MB from host: 141317343 (140165578 - 141469500)
Time for one int-0x80 syscall: 272 (272 - 575)
Time for one syscall via libc: 275 (274 - 904)
Time for two PTE updates: 16232 (6430 - 16316)
Time to read from disk (256 kB): 16786750 (16597500 - 31493250)
Time for one disk read: 192656 (189312 - 958687)
Time for inter-guest pingpong: 110453 (104492 - 316429)

After:
Time for one context switch via pipe: 4687 (4563 - 4857)
Time for one Copy-on-Write fault: 44523 (11628 - 77855)
Time to exec client once: 814765 (805796 - 829875)
Time for one fork/exit/wait: 405875 (400562 - 434750)
Time for gettimeofday(): 16644 (16203 - 16931)
Time to send 4 MB from host: 136530000 (121522250 - 151629000)
Time for one int-0x80 syscall: 273 (272 - 274)
Time for one syscall via libc: 279 (277 - 279)
Time for two PTE updates: 6439 (6395 - 6528)
Time to read from disk (256 kB): 16787000 (16641250 - 16861250)
Time for one disk read: 192187 (190515 - 193843)
Time for inter-guest pingpong: 111093 (109203 - 136554)

Signed-off-by: Rusty Russell <[email protected]>
---
Documentation/lguest/lguest.c | 84 +++++++++++++++++----------------------
drivers/lguest/core.c | 7 ++-
drivers/lguest/lg.h | 5 ++
drivers/lguest/lguest_user.c | 59 +++++++++++++++++++++++----
include/linux/lguest_launcher.h | 1
5 files changed, 101 insertions(+), 55 deletions(-)

===================================================================
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -16,7 +16,6 @@
#include <fcntl.h>
#include <stdbool.h>
#include <errno.h>
-#include <signal.h>
#include <ctype.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
@@ -47,6 +46,7 @@ static bool verbose;
static bool verbose;
#define verbose(args...) \
do { if (verbose) printf(args); } while(0)
+static int waker_fd;

struct device_list
{
@@ -341,15 +341,14 @@ static void set_fd(int fd, struct device
devices->max_infd = fd;
}

-/* We send lguest_add signals while input is pending: avoids races. */
-static void wake_parent(int pipefd, struct device_list *devices)
-{
- nice(19);
-
+/* When input arrives, we tell the kernel to kick lguest out with -EAGAIN. */
+static void wake_parent(int pipefd, int lguest_fd, struct device_list *devices)
+{
set_fd(pipefd, devices);

for (;;) {
fd_set rfds = devices->infds;
+ u32 args[] = { LHREQ_BREAK, 1 };

select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
if (FD_ISSET(pipefd, &rfds)) {
@@ -357,25 +356,14 @@ static void wake_parent(int pipefd, stru
if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
exit(0);
FD_CLR(ignorefd, &devices->infds);
- }
- kill(getppid(), SIGUSR1);
- }
-}
-
-/* We don't want signal to kill us, just jerk us out of kernel. */
-static void wakeup(int signo)
-{
-}
-
-static int setup_waker(struct device_list *device_list)
+ } else
+ write(lguest_fd, args, sizeof(args));
+ }
+}
+
+static int setup_waker(int lguest_fd, struct device_list *device_list)
{
int pipefd[2], child;
- struct sigaction act;
-
- act.sa_handler = wakeup;
- sigemptyset(&act.sa_mask);
- act.sa_flags = 0;
- sigaction(SIGUSR1, &act, NULL);

pipe(pipefd);
child = fork();
@@ -384,7 +372,7 @@ static int setup_waker(struct device_lis

if (child == 0) {
close(pipefd[1]);
- wake_parent(pipefd[0], device_list);
+ wake_parent(pipefd[0], lguest_fd, device_list);
}
close(pipefd[0]);

@@ -495,8 +483,13 @@ static bool handle_console_input(int fd,
else if (abort->count == 3) {
struct timeval now;
gettimeofday(&now, NULL);
- if (now.tv_sec <= abort->start.tv_sec+1)
+ if (now.tv_sec <= abort->start.tv_sec+1) {
+ /* Make sure waker is not blocked in BREAK */
+ u32 args[] = { LHREQ_BREAK, 0 };
+ close(waker_fd);
+ write(fd, args, sizeof(args));
exit(2);
+ }
abort->count = 0;
}
} else
@@ -613,7 +606,7 @@ static void handle_output(int fd, unsign
warnx("Pending dma %p, key %p", (void *)dma, (void *)key);
}

-static void handle_input(int fd, int childfd, struct device_list *devices)
+static void handle_input(int fd, struct device_list *devices)
{
struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };

@@ -628,8 +621,8 @@ static void handle_input(int fd, int chi
if (i->handle_input && FD_ISSET(i->fd, &fds)) {
if (!i->handle_input(fd, i)) {
FD_CLR(i->fd, &devices->infds);
- /* Tell child to ignore it too... */
- write(childfd, &i->fd, sizeof(i->fd));
+ /* Tell waker to ignore it too... */
+ write(waker_fd, &i->fd, sizeof(i->fd));
}
}
}
@@ -898,29 +891,28 @@ static void map_device_descriptors(struc
}

static void __attribute__((noreturn))
-run_guest(int lguest_fd, int waker_fd, struct device_list *device_list)
-{
- sigset_t sigset;
-
- sigemptyset(&sigset);
- sigaddset(&sigset, SIGUSR1);
+run_guest(int lguest_fd, struct device_list *device_list)
+{
for (;;) {
+ u32 args[] = { LHREQ_BREAK, 0 };
unsigned long arr[2];
int readval;

- sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+ /* We read from the /dev/lguest device to run the Guest. */
readval = read(lguest_fd, arr, sizeof(arr));
- sigprocmask(SIG_BLOCK, &sigset, NULL);
-
- if (readval == sizeof(arr))
+
+ if (readval == sizeof(arr)) {
handle_output(lguest_fd, arr[0], arr[1], device_list);
- else if (errno == ENOENT) {
+ continue;
+ } else if (errno == ENOENT) {
char reason[1024] = { 0 };
read(lguest_fd, reason, sizeof(reason)-1);
errx(1, "%s", reason);
- } else if (errno != EINTR)
+ } else if (errno != EAGAIN)
err(1, "Running guest failed");
- handle_input(lguest_fd, waker_fd, device_list);
+ handle_input(lguest_fd, device_list);
+ if (write(lguest_fd, args, sizeof(args)) < 0)
+ err(1, "Resetting break");
}
}

@@ -943,7 +935,7 @@ int main(int argc, char *argv[])
int main(int argc, char *argv[])
{
unsigned long mem, pgdir, start, page_offset, initrd_size = 0;
- int c, lguest_fd, waker_fd;
+ int c, lguest_fd;
struct device_list device_list;
void *boot = (void *)0;
const char *initrd_name = NULL;
@@ -1014,7 +1006,7 @@ int main(int argc, char *argv[])
*(int *)(boot + 0x23c) = 1;

lguest_fd = tell_kernel(pgdir, start, page_offset);
- waker_fd = setup_waker(&device_list);
-
- run_guest(lguest_fd, waker_fd, &device_list);
-}
+ waker_fd = setup_waker(lguest_fd, &device_list);
+
+ run_guest(lguest_fd, &device_list);
+}
===================================================================
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -313,7 +313,12 @@ int run_guest(struct lguest *lg, unsigne
}

if (signal_pending(current))
- return -EINTR;
+ return -ERESTARTSYS;
+
+ /* If Waker set break_out, return to Launcher. */
+ if (lg->break_out)
+ return -EAGAIN;
+
maybe_do_interrupt(lg);

try_to_freeze();
===================================================================
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -16,6 +16,7 @@
#include <linux/futex.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
+#include <linux/wait.h>
#include <linux/err.h>
#include <asm/semaphore.h>
#include "irq_vectors.h"
@@ -138,6 +139,10 @@ struct lguest
u32 esp1;
u8 ss1;

+ /* Do we need to stop what we're doing and return to userspace? */
+ int break_out;
+ wait_queue_head_t break_wq;
+
/* Bitmap of what has changed: see CHANGED_* above. */
int changed;
struct lguest_pages *last_pages;
===================================================================
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -30,6 +30,30 @@ static long user_get_dma(struct lguest *
return udma;
}

+/* To force the Guest to stop running and return to the Launcher, the
+ * Waker sets writes LHREQ_BREAK and the value "1" to /dev/lguest. The
+ * Launcher then writes LHREQ_BREAK and "0" to release the Waker. */
+static int break_guest_out(struct lguest *lg, const u32 __user *input)
+{
+ unsigned long on;
+
+ /* Fetch whether they're turning break on or off.. */
+ if (get_user(on, input) != 0)
+ return -EFAULT;
+
+ if (on) {
+ lg->break_out = 1;
+ /* Pop it out (may be running on different CPU) */
+ wake_up_process(lg->tsk);
+ /* Wait for them to reset it */
+ return wait_event_interruptible(lg->break_wq, !lg->break_out);
+ } else {
+ lg->break_out = 0;
+ wake_up(&lg->break_wq);
+ return 0;
+ }
+}
+
/* + irq */
static int user_send_irq(struct lguest *lg, const u32 __user *input)
{
@@ -49,6 +73,10 @@ static ssize_t read(struct file *file, c

if (!lg)
return -EINVAL;
+
+ /* If you're not the task which owns the guest, go away. */
+ if (current != lg->tsk)
+ return -EPERM;

if (lg->dead) {
size_t len;
@@ -75,13 +103,20 @@ static int initialize(struct file *file,
int err, i;
u32 args[4];

- if (file->private_data)
- return -EBUSY;
-
- if (copy_from_user(args, input, sizeof(args)) != 0)
- return -EFAULT;
-
+ /* We grab the Big Lguest lock, which protects the global array
+ * "lguests" and multiple simultaneous initializations. */
mutex_lock(&lguest_lock);
+
+ if (file->private_data) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ if (copy_from_user(args, input, sizeof(args)) != 0) {
+ err = -EFAULT;
+ goto unlock;
+ }
+
i = find_free_guest();
if (i < 0) {
err = -ENOSPC;
@@ -107,10 +142,12 @@ static int initialize(struct file *file,
lg->tsk = current;
get_task_struct(lg->tsk);
lg->mm = get_task_mm(lg->tsk);
+ init_waitqueue_head(&lg->break_wq);
lg->last_pages = NULL;
- mutex_unlock(&lguest_lock);
-
file->private_data = lg;
+
+ mutex_unlock(&lguest_lock);
+
return sizeof(args);

free_regs:
@@ -136,6 +173,10 @@ static ssize_t write(struct file *file,
return -EINVAL;
if (lg && lg->dead)
return -ENOENT;
+
+ /* If you're not the task which owns the Guest, you can only break */
+ if (lg && current != lg->tsk && req != LHREQ_BREAK)
+ return -EPERM;

switch (req) {
case LHREQ_INITIALIZE:
@@ -144,6 +185,8 @@ static ssize_t write(struct file *file,
return user_get_dma(lg, (const u32 __user *)input);
case LHREQ_IRQ:
return user_send_irq(lg, (const u32 __user *)input);
+ case LHREQ_BREAK:
+ return break_guest_out(lg, (const u32 __user *)input);
default:
return -EINVAL;
}
===================================================================
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -68,5 +68,6 @@ enum lguest_req
LHREQ_INITIALIZE, /* + pfnlimit, pgdir, start, pageoffset */
LHREQ_GETDMA, /* + addr (returns &lguest_dma, irq in ->used_len) */
LHREQ_IRQ, /* + irq */
+ LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
};
#endif /* _ASM_LGUEST_USER */


2007-06-05 15:29:21

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 5/6] lguest use TSC

James: Add rudimentary TSC-based clocksource support (based on Rusty's original
patch).
Rusty: Add rudimentary code to handle TSC changing. We can use the native
sched_clock again, we just have to tell it the TSC speed (get_cpu_khz).

(Note on benchmarks: Linux only sets the clock to the nearest second,
so expect it to be up to up to 1000000000 ns out. But the narrowness
of the max - min range indicates consistency.)

Before:
Approximate accuracy of clock: 827460500 (819802000 - 834640500)

After:
Approximate accuracy of clock: 326354500 (326257000 - 327937500)

Signed-off-by: James Morris <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
---
arch/i386/kernel/tsc.c | 1 +
drivers/lguest/core.c | 9 +++++++++
drivers/lguest/lg.h | 1 +
drivers/lguest/lguest.c | 40 ++++++++++++++++++++++++++++++++--------
include/linux/lguest.h | 2 ++
5 files changed, 45 insertions(+), 8 deletions(-)

===================================================================
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -27,6 +27,7 @@ static int tsc_enabled;
* an extra value to store the TSC freq
*/
unsigned int tsc_khz;
+EXPORT_SYMBOL_GPL(tsc_khz);

int tsc_disable;

===================================================================
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -332,6 +333,14 @@ int run_guest(struct lguest *lg, unsigne
continue;
}

+ /* If the Guest hasn't been told the clock multiplier to use or
+ * it's changed, we update it here. */
+ if (unlikely(lg->tsc_khz != tsc_khz) && lg->lguest_data) {
+ lg->tsc_khz = tsc_khz;
+ if (put_user(lg->tsc_khz, &lg->lguest_data->tsc_khz))
+ return -EFAULT;
+ }
+
local_irq_disable();

/* Even if *we* don't want FPU trap, guest might... */
===================================================================
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -160,6 +160,7 @@ struct lguest
unsigned long pending_key; /* address they're sending to */

unsigned int stack_pages;
+ u32 tsc_khz;

struct lguest_dma_info dma[LGUEST_MAX_DMA];

===================================================================
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -25,6 +25,7 @@
#include <linux/screen_info.h>
#include <linux/irq.h>
#include <linux/interrupt.h>
+#include <linux/clocksource.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
#include <linux/lguest_bus.h>
@@ -37,6 +38,7 @@
#include <asm/e820.h>
#include <asm/mce.h>
#include <asm/io.h>
+#include <asm/sched-clock.h>

/* Declarations for definitions in lguest_guest.S */
extern char lguest_noirq_start[], lguest_noirq_end[];
@@ -209,8 +211,8 @@ static void lguest_cpuid(unsigned int *e
case 1: /* Basic feature request. */
/* We only allow kernel to see SSE3, CMPXCHG16B and SSSE3 */
*ecx &= 0x00002201;
- /* Similarly: SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, FPU. */
- *edx &= 0x07808101;
+ /* SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, TSC, FPU. */
+ *edx &= 0x07808111;
/* Host wants to know when we flush kernel pages: set PGE. */
*edx |= 0x00002000;
break;
@@ -345,24 +347,46 @@ static unsigned long lguest_get_wallcloc
return hcall(LHCALL_GET_WALLCLOCK, 0, 0, 0);
}

+/* This is what we tell the kernel is our clocksource. It's the normal "Time
+ * Stamp Counter": the Host tells us what speed it's going at. */
+static struct clocksource lguest_clock = {
+ .name = "lguest",
+ .rating = 400,
+ .read = native_read_tsc,
+ .mask = CLOCKSOURCE_MASK(64),
+ .mult = 0, /* to be set */
+ .shift = 22,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+};
+
static void lguest_time_irq(unsigned int irq, struct irq_desc *desc)
{
+ /* Check in case host TSC has changed rate. */
+ if (unlikely(tsc_khz != lguest_data.tsc_khz)) {
+ tsc_khz = lguest_data.tsc_khz;
+ lguest_clock.mult = clocksource_khz2mult(tsc_khz, 22);
+ __get_cpu_var(sc_data).cyc2ns_scale
+ = (1000000 << CYC2NS_SCALE_FACTOR) / tsc_khz;
+ }
do_timer(hcall(LHCALL_TIMER_READ, 0, 0, 0));
update_process_times(user_mode_vm(get_irq_regs()));
}

-static u64 sched_clock_base;
static void lguest_time_init(void)
{
set_irq_handler(0, lguest_time_irq);
+
+ lguest_clock.mult = clocksource_khz2mult(tsc_khz, 22);
+ clocksource_register(&lguest_clock);
+
hcall(LHCALL_TIMER_READ, 0, 0, 0);
- sched_clock_base = jiffies_64;
enable_lguest_irq(0);
}

-static unsigned long long lguest_sched_clock(void)
-{
- return (jiffies_64 - sched_clock_base) * (1000000000 / HZ);
+static unsigned long lguest_get_cpu_khz(void)
+{
+ /* The Host tells us the TSC speed */
+ return lguest_data.tsc_khz;
}

static void lguest_load_esp0(struct tss_struct *tss,
@@ -501,7 +525,7 @@ __init void lguest_init(void *boot)
paravirt_ops.time_init = lguest_time_init;
paravirt_ops.set_lazy_mode = lguest_lazy_mode;
paravirt_ops.wbinvd = lguest_wbinvd;
- paravirt_ops.sched_clock = lguest_sched_clock;
+ paravirt_ops.get_cpu_khz = lguest_get_cpu_khz;

hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);

===================================================================
--- a/include/linux/lguest.h
+++ b/include/linux/lguest.h
@@ -70,6 +70,8 @@ struct lguest_data
unsigned long reserve_mem;
/* ID of this guest (used by network driver to set ethernet address) */
u16 guestid;
+ /* KHz for the TSC clock. */
+ u32 tsc_khz;

/* Fields initialized by the guest at boot: */
/* Instruction range to suppress interrupts even if enabled */


2007-06-05 15:29:41

by Rusty Russell

[permalink] [raw]
Subject: [PATCH 6/6] lguest use hrtimers

(BTW Thomas, is the check for delta < minimum actually required in our
set_next_event function?)

>From [email protected] Wed May 9 23:37:36 2007

Convert lguest to the hrtimer framework, enabling dynamic ticks and high
resolution timers.

Signed-off-by: James Morris <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
---
drivers/lguest/core.c | 2 -
drivers/lguest/hypercalls.c | 10 +----
drivers/lguest/interrupts_and_traps.c | 34 +++++++++++++++--
drivers/lguest/lg.h | 6 ++-
drivers/lguest/lguest.c | 63 +++++++++++++++++++++++++++++++--
drivers/lguest/lguest_user.c | 3 +
include/linux/lguest.h | 5 ++
7 files changed, 106 insertions(+), 17 deletions(-)

===================================================================
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -329,7 +329,7 @@ int run_guest(struct lguest *lg, unsigne

if (lg->halted) {
set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(1);
+ schedule();
continue;
}

===================================================================
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -44,13 +44,6 @@ static void do_hcall(struct lguest *lg,
else
guest_pagetable_flush_user(lg);
break;
- case LHCALL_TIMER_READ: {
- u32 now = jiffies;
- mb();
- regs->eax = now - lg->last_timer;
- lg->last_timer = now;
- break;
- }
case LHCALL_GET_WALLCLOCK: {
struct timespec ts;
ktime_get_real_ts(&ts);
@@ -84,6 +77,9 @@ static void do_hcall(struct lguest *lg,
break;
case LHCALL_LOAD_TLS:
guest_load_tls(lg, regs->edx);
+ break;
+ case LHCALL_SET_CLOCKEVENT:
+ guest_set_clockevent(lg, regs->edx);
break;
case LHCALL_TS:
lg->ts = regs->edx;
===================================================================
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -72,10 +72,6 @@ void maybe_do_interrupt(struct lguest *l

if (!lg->lguest_data)
return;
-
- /* If timer has changed, set timer interrupt. */
- if (jiffies != lg->last_timer)
- set_bit(0, lg->irqs_pending);

/* Mask out any interrupts they have blocked. */
if (copy_from_user(&blk, lg->lguest_data->blocked_interrupts,
@@ -240,3 +236,33 @@ void copy_traps(const struct lguest *lg,
else
default_idt_entry(&idt[i], i, def[i]);
}
+
+void guest_set_clockevent(struct lguest *lg, unsigned long delta)
+{
+ ktime_t expires;
+
+ if (unlikely(delta == 0)) {
+ /* Clock event device is shutting down. */
+ hrtimer_cancel(&lg->hrt);
+ return;
+ }
+
+ expires = ktime_add_ns(ktime_get_real(), delta);
+ hrtimer_start(&lg->hrt, expires, HRTIMER_MODE_ABS);
+}
+
+static enum hrtimer_restart clockdev_fn(struct hrtimer *timer)
+{
+ struct lguest *lg = container_of(timer, struct lguest, hrt);
+
+ set_bit(0, lg->irqs_pending);
+ if (lg->halted)
+ wake_up_process(lg->tsk);
+ return HRTIMER_NORESTART;
+}
+
+void init_clockdev(struct lguest *lg)
+{
+ hrtimer_init(&lg->hrt, CLOCK_REALTIME, HRTIMER_MODE_ABS);
+ lg->hrt.function = clockdev_fn;
+}
===================================================================
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -134,7 +134,6 @@ struct lguest
u32 cr2;
int halted;
int ts;
- u32 last_timer;
u32 next_hcall;
u32 esp1;
u8 ss1;
@@ -173,6 +172,9 @@ struct lguest
/* The IDT entries: some copied into lguest_ro_state when running. */
struct desc_struct idt[FIRST_EXTERNAL_VECTOR+LGUEST_IRQS];
struct desc_struct syscall_idt;
+
+ /* Virtual clock device */
+ struct hrtimer hrt;

/* Pending virtual interrupts */
DECLARE_BITMAP(irqs_pending, LGUEST_IRQS);
@@ -202,6 +204,8 @@ void setup_default_idt_entries(struct lg
const unsigned long *def);
void copy_traps(const struct lguest *lg, struct desc_struct *idt,
const unsigned long *def);
+void guest_set_clockevent(struct lguest *lg, unsigned long delta);
+void init_clockdev(struct lguest *lg);

/* segments.c: */
void setup_default_gdt_entries(struct lguest_ro_state *state);
===================================================================
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -26,6 +26,7 @@
#include <linux/irq.h>
#include <linux/interrupt.h>
#include <linux/clocksource.h>
+#include <linux/clockchips.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
#include <linux/lguest_bus.h>
@@ -359,8 +360,58 @@ static struct clocksource lguest_clock =
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

+/* We also need a "struct clock_event_device": Linux asks us to set it to go
+ * off some time in the future. Actually, James Morris figured all this out, I
+ * just applied the patch. */
+static int lguest_clockevent_set_next_event(unsigned long delta,
+ struct clock_event_device *evt)
+{
+ if (delta < LG_CLOCK_MIN_DELTA) {
+ if (printk_ratelimit())
+ printk(KERN_DEBUG "%s: small delta %lu ns\n",
+ __FUNCTION__, delta);
+ return -ETIME;
+ }
+ hcall(LHCALL_SET_CLOCKEVENT, delta, 0, 0);
+ return 0;
+}
+
+static void lguest_clockevent_set_mode(enum clock_event_mode mode,
+ struct clock_event_device *evt)
+{
+ switch (mode) {
+ case CLOCK_EVT_MODE_UNUSED:
+ case CLOCK_EVT_MODE_SHUTDOWN:
+ /* A 0 argument shuts the clock down. */
+ hcall(LHCALL_SET_CLOCKEVENT, 0, 0, 0);
+ break;
+ case CLOCK_EVT_MODE_ONESHOT:
+ /* This is what we expect. */
+ break;
+ case CLOCK_EVT_MODE_PERIODIC:
+ BUG();
+ }
+}
+
+/* This describes our primitive timer chip. */
+static struct clock_event_device lguest_clockevent = {
+ .name = "lguest",
+ .features = CLOCK_EVT_FEAT_ONESHOT,
+ .set_next_event = lguest_clockevent_set_next_event,
+ .set_mode = lguest_clockevent_set_mode,
+ .rating = INT_MAX,
+ .mult = 1,
+ .shift = 0,
+ .min_delta_ns = LG_CLOCK_MIN_DELTA,
+ .max_delta_ns = LG_CLOCK_MAX_DELTA,
+};
+
+/* This is the Guest timer interrupt handler (hardware interrupt 0). We just
+ * call the clockevent infrastructure and it does whatever needs doing. */
static void lguest_time_irq(unsigned int irq, struct irq_desc *desc)
{
+ unsigned long flags;
+
/* Check in case host TSC has changed rate. */
if (unlikely(tsc_khz != lguest_data.tsc_khz)) {
tsc_khz = lguest_data.tsc_khz;
@@ -368,8 +419,11 @@ static void lguest_time_irq(unsigned int
__get_cpu_var(sc_data).cyc2ns_scale
= (1000000 << CYC2NS_SCALE_FACTOR) / tsc_khz;
}
- do_timer(hcall(LHCALL_TIMER_READ, 0, 0, 0));
- update_process_times(user_mode_vm(get_irq_regs()));
+
+ /* Don't interrupt us while this is running. */
+ local_irq_save(flags);
+ lguest_clockevent.event_handler(&lguest_clockevent);
+ local_irq_restore(flags);
}

static void lguest_time_init(void)
@@ -379,7 +433,10 @@ static void lguest_time_init(void)
lguest_clock.mult = clocksource_khz2mult(tsc_khz, 22);
clocksource_register(&lguest_clock);

- hcall(LHCALL_TIMER_READ, 0, 0, 0);
+ /* We can't set cpumask in the initializer: damn C limitations! */
+ lguest_clockevent.cpumask = cpumask_of_cpu(0);
+ clockevents_register_device(&lguest_clockevent);
+
enable_lguest_irq(0);
}

===================================================================
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -139,6 +139,7 @@ static int initialize(struct file *file,

setup_regs(lg->regs, args[2]);
setup_guest_gdt(lg);
+ init_clockdev(lg);
lg->tsk = current;
get_task_struct(lg->tsk);
lg->mm = get_task_mm(lg->tsk);
@@ -200,6 +201,8 @@ static int close(struct inode *inode, st
return 0;

mutex_lock(&lguest_lock);
+ /* Cancels the hrtimer set via LHCALL_SET_CLOCKEVENT. */
+ hrtimer_cancel(&lg->hrt);
release_all_dma(lg);
free_guest_pagetable(lg);
put_task_struct(lg->tsk);
===================================================================
--- a/include/linux/lguest.h
+++ b/include/linux/lguest.h
@@ -15,7 +15,7 @@
#define LHCALL_LOAD_IDT_ENTRY 6
#define LHCALL_SET_STACK 7
#define LHCALL_TS 8
-#define LHCALL_TIMER_READ 9
+#define LHCALL_SET_CLOCKEVENT 9
#define LHCALL_HALT 10
#define LHCALL_GET_WALLCLOCK 11
#define LHCALL_BIND_DMA 12
@@ -23,6 +23,9 @@
#define LHCALL_SET_PTE 14
#define LHCALL_SET_PMD 15
#define LHCALL_LOAD_TLS 16
+
+#define LG_CLOCK_MIN_DELTA 100UL
+#define LG_CLOCK_MAX_DELTA ULONG_MAX

#define LGUEST_TRAP_ENTRY 0x1F



2007-06-05 15:32:46

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Wed, Jun 06, 2007 at 12:58:03AM +1000, Rusty Russell wrote:
> The IDE probe is the slowest part of boot: by suppressing it we cut
> boot from from 3 seconds to half a second.
>
> AFAICT, the commandline is the easiest way to suppress the probing.

Switching to libata accomplishes the same and isn't quite as ugly. I'd
just drop this patch and put a note about it in lguest.txt, as libata
seems to be the fashionable thing to do these days.

> Signed-off-by: Rusty Russell <[email protected]>
> ---
> drivers/lguest/lguest.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> ===================================================================
> --- a/drivers/lguest/lguest.c
> +++ b/drivers/lguest/lguest.c
> @@ -605,6 +605,12 @@ __init void lguest_init(void *boot)
> acpi_disabled = 1;
> acpi_ht = 0;
> #endif
> +#ifdef CONFIG_BLK_DEV_IDE
> + /* Saves three seconds off boot time. */
> + strcat(boot_command_line,
> + " ide0=noprobe ide1=noprobe ide2=noprobe"
> + " ide3=noprobe ide4=noprobe ide5=noprobe");
> +#endif
>
> add_preferred_console("hvc", 0, NULL);
>
>

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 15:35:26

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH 4/6] lguest don't signal like crazy, use LHREQ_BREAK command

On Wed, Jun 06, 2007 at 01:00:06AM +1000, Rusty Russell wrote:
> We currently use a "waker" process: a child of the launcher which
> selects() on the incoming file descriptors. It sends a SIGUSR1 to the
> launcher whenever select() returns to kick the launcher out of the
> kernel.

If I break out of lguest with three ctrl-Cs, this leaves one of the
lguest processes running with /dev/lguest held open.

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 16:03:17

by Alan

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Wed, 06 Jun 2007 00:58:03 +1000
Rusty Russell <[email protected]> wrote:

> The IDE probe is the slowest part of boot: by suppressing it we cut
> boot from from 3 seconds to half a second.

NAK NAK NAK NAK NAK

> AFAICT, the commandline is the easiest way to suppress the probing.

Gaa ... Rusty surely you have more taste than that.

See include/asm-foo/ide.h

Add an lguest check to go with the pci check and for the lguest case just
say "no controllers"

Better yet just don't compile in the old IDE stuff, lguest doesn't have a
PCI or ISA bus anyway.

Alternatively make the IDE I/O space return 0xFF and it'll skip them
anyway.

Alan

2007-06-05 16:11:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Tue, Jun 05, 2007 at 05:07:47PM +0100, Alan Cox wrote:
> Add an lguest check to go with the pci check and for the lguest case just
> say "no controllers"
>
> Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> PCI or ISA bus anyway.
>
> Alternatively make the IDE I/O space return 0xFF and it'll skip them
> anyway.

ACK to any of these suggestions... agreed.

Jeff



2007-06-05 17:09:46

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Tue, 5 Jun 2007, Alan Cox wrote:

> Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> PCI or ISA bus anyway.

The guest & host kernels are intended to be identical, so we could expect
pretty much anything to be compiled in to the guest.

The other two suggestions sound better.


- James
--
James Morris
<[email protected]>

2007-06-05 18:15:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2/6] lguest tsc fix

On Wed, Jun 06, 2007 at 12:56:36AM +1000, Rusty Russell wrote:
> In recent -mm kernels, the TSC capability cannot be disabled,
> resulting in a divide by zero error in the normal sched_clock.

That will hopefully change. I hope hpa will just undo this.

>
> The correct fix is to have a special lguest sched_clock
> implementation: this is as simple as it gets.

But gettimeofday might still use it. Is that ok for you?

-Andi

2007-06-06 00:44:53

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 4/6] lguest don't signal like crazy, use LHREQ_BREAK command

On Tue, 2007-06-05 at 10:34 -0500, Matt Mackall wrote:
> On Wed, Jun 06, 2007 at 01:00:06AM +1000, Rusty Russell wrote:
> > We currently use a "waker" process: a child of the launcher which
> > selects() on the incoming file descriptors. It sends a SIGUSR1 to the
> > launcher whenever select() returns to kick the launcher out of the
> > kernel.
>
> If I break out of lguest with three ctrl-Cs, this leaves one of the
> lguest processes running with /dev/lguest held open.

This patch, or the previous version I sent? The previous one had this
issue, so this one takes some care to kill the waker and I haven't seen
it since:

/* Make sure waker is not blocked in BREAK */
u32 args[] = { LHREQ_BREAK, 0 };
close(waker_fd);
write(fd, args, sizeof(args));
exit(2);

Thanks,
Rusty.



2007-06-06 01:00:43

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH 4/6] lguest don't signal like crazy, use LHREQ_BREAK command

On Wed, Jun 06, 2007 at 10:07:46AM +1000, Rusty Russell wrote:
> On Tue, 2007-06-05 at 10:34 -0500, Matt Mackall wrote:
> > On Wed, Jun 06, 2007 at 01:00:06AM +1000, Rusty Russell wrote:
> > > We currently use a "waker" process: a child of the launcher which
> > > selects() on the incoming file descriptors. It sends a SIGUSR1 to the
> > > launcher whenever select() returns to kick the launcher out of the
> > > kernel.
> >
> > If I break out of lguest with three ctrl-Cs, this leaves one of the
> > lguest processes running with /dev/lguest held open.
>
> This patch, or the previous version I sent? The previous one had this
> issue, so this one takes some care to kill the waker and I haven't seen
> it since:
>
> /* Make sure waker is not blocked in BREAK */
> u32 args[] = { LHREQ_BREAK, 0 };
> close(waker_fd);
> write(fd, args, sizeof(args));
> exit(2);

Probably the one you sent earlier.

--
Mathematics is the supreme nostalgia of our time.

2007-06-06 01:11:21

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/6] lguest tsc fix

On Tue, 2007-06-05 at 20:15 +0200, Andi Kleen wrote:
> On Wed, Jun 06, 2007 at 12:56:36AM +1000, Rusty Russell wrote:
> > In recent -mm kernels, the TSC capability cannot be disabled,
> > resulting in a divide by zero error in the normal sched_clock.
>
> That will hopefully change. I hope hpa will just undo this.
>
> >
> > The correct fix is to have a special lguest sched_clock
> > implementation: this is as simple as it gets.
>
> But gettimeofday might still use it. Is that ok for you?

Yes, I don't think it will be any worse than before. Basically the
guest uses a dumb jiffies-based clock. The TSC patch later in this same
series changes it back to use the native sched_clock, and overrides
tsc_khz instead (based on information from the host).

Thanks,
Rusty.


2007-06-06 01:11:41

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Tue, 2007-06-05 at 17:07 +0100, Alan Cox wrote:
> On Wed, 06 Jun 2007 00:58:03 +1000
> Rusty Russell <[email protected]> wrote:
>
> > The IDE probe is the slowest part of boot: by suppressing it we cut
> > boot from from 3 seconds to half a second.
>
> NAK NAK NAK NAK NAK

Hi Alan!

> > AFAICT, the commandline is the easiest way to suppress the probing.
>
> Gaa ... Rusty surely you have more taste than that.

Indeed, but it got attention 8)

> See include/asm-foo/ide.h
>
> Add an lguest check to go with the pci check and for the lguest case just
> say "no controllers"

Actually, Jeremy suggested claiming the entire IO space. That works for
Xen domU too, and makes some amount of sense.

> Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> PCI or ISA bus anyway.

Sure, but the "run the same kernel as guest and host" is a really nice
feature.

> Alternatively make the IDE I/O space return 0xFF and it'll skip them
> anyway.

Hmm, every "in" should be returning 0xFFs, but I still get the delay and
the probing. Xen domU gets it too.

Thanks!
Rusty.

2007-06-06 10:18:38

by Alan

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

> Actually, Jeremy suggested claiming the entire IO space. That works for
> Xen domU too, and makes some amount of sense.

Agreed

> > Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> > PCI or ISA bus anyway.
>
> Sure, but the "run the same kernel as guest and host" is a really nice
> feature.

Modules dear boy, modules ;)

> > Alternatively make the IDE I/O space return 0xFF and it'll skip them
> > anyway.
>
> Hmm, every "in" should be returning 0xFFs, but I still get the delay and
> the probing. Xen domU gets it too.

Can you see in a debugger where it is spending the time. 0xFF should be
taken as "no port, move on nothing to see"

2007-06-07 02:13:34

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Wed, 2007-06-06 at 11:23 +0100, Alan Cox wrote:
> > > Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> > > PCI or ISA bus anyway.
> >
> > Sure, but the "run the same kernel as guest and host" is a really nice
> > feature.
>
> Modules dear boy, modules ;)

For some reason, pulling half the kernel's brains out into a separately
maintained userspace seems to make things less reliable. I always build
in everything I need to boot.

Perhaps this makes me an old-timer.

> > > Alternatively make the IDE I/O space return 0xFF and it'll skip them
> > > anyway.
> >
> > Hmm, every "in" should be returning 0xFFs, but I still get the delay and
> > the probing. Xen domU gets it too.
>
> Can you see in a debugger where it is spending the time. 0xFF should be
> taken as "no port, move on nothing to see"

Well, the code is a little opaque to me, but do_probe() calls msleep(50)
three times. According to gdb this gets called 27 times -> 4.05
seconds.

Cheers,
Rusty.


2007-06-07 14:45:40

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH 3/6] lguest suppress IDE probing

On Thu, Jun 07, 2007 at 12:13:07PM +1000, Rusty Russell wrote:
> On Wed, 2007-06-06 at 11:23 +0100, Alan Cox wrote:
> > > > Better yet just don't compile in the old IDE stuff, lguest doesn't have a
> > > > PCI or ISA bus anyway.
> > >
> > > Sure, but the "run the same kernel as guest and host" is a really nice
> > > feature.
> >
> > Modules dear boy, modules ;)
>
> For some reason, pulling half the kernel's brains out into a separately
> maintained userspace seems to make things less reliable. I always build
> in everything I need to boot.
>
> Perhaps this makes me an old-timer.
>
> > > > Alternatively make the IDE I/O space return 0xFF and it'll skip them
> > > > anyway.
> > >
> > > Hmm, every "in" should be returning 0xFFs, but I still get the delay and
> > > the probing. Xen domU gets it too.
> >
> > Can you see in a debugger where it is spending the time. 0xFF should be
> > taken as "no port, move on nothing to see"
>
> Well, the code is a little opaque to me, but do_probe() calls msleep(50)
> three times. According to gdb this gets called 27 times -> 4.05
> seconds.

Yep. libata somehow manages to avoid most of these, it'd be
interesting to work out why.

--
Mathematics is the supreme nostalgia of our time.