2007-01-09 13:37:36

by Avi Kivity

Subject: [RFC] Stable kvm userspace interface

I had originally hoped to get this in for 2.6.20. It now looks like .20
will have a shorter cycle than usual, and the mmu took a bit longer than
expected, so it's more realistic to aim for 2.6.21.

The current kvm userspace interface has several deficiencies:

- open("/dev/kvm") returns a different object (a new vm) per invocation;
this is "unusual" by Linux standards
- all vcpus share the same inode and struct file, which can cause
scalability problems on very large smps. This isn't a problem for
current hardware, which has moderate core counts and huge vmexit
latencies, not to mention a limit of one vcpu per vm, but I'd like to
future-proof the interface.
- the KVM_VCPU_RUN ioctl() copies a needless chunk of data back and forth
- the PIO handlers communicate by means of registers (for single I/O) or
virtual addresses (for string I/O). Instead the values should be
explicit fields in some structure, and physical addresses should be used
to remove the need to translate addresses in userspace.
- the interrupt code still needs work to properly support the local apic
with Windows guests.
- userspace must rely on delivered signals, which are slow, and cannot
use queued signals (a la pselect()/ppoll()).
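
For readers unfamiliar with the pselect()/ppoll() pattern mentioned in the last item, here is a minimal sketch of it in plain POSIX C. It is not kvm code: the function name wait_with_queued_signal() and the choice of SIGUSR1 are assumptions made for illustration. The signal is kept blocked (so it queues as pending), and the wait call atomically installs a mask that lets it through.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/*
 * Hypothetical helper: returns 1 if a signal that was queued while blocked
 * is delivered inside pselect(), 0 otherwise.
 */
int wait_with_queued_signal(void)
{
    struct sigaction sa, old_sa;
    sigset_t blocked, during_wait, saved;
    struct timespec timeout = { 1, 0 };   /* upper bound on the wait */
    int rc, delivered;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
        return 0;

    /* Block SIGUSR1 so that it stays pending instead of being delivered. */
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blocked, &saved);

    got_signal = 0;
    raise(SIGUSR1);                       /* queued: the handler has not run */

    /* Mask to apply only for the duration of the wait. */
    during_wait = saved;
    sigdelset(&during_wait, SIGUSR1);

    /* pselect() swaps the mask atomically; the pending signal interrupts it. */
    rc = pselect(0, NULL, NULL, NULL, &timeout, &during_wait);
    delivered = (rc == -1 && errno == EINTR && got_signal == 1);

    sigprocmask(SIG_SETMASK, &saved, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return delivered;
}
```

The sigmask field in the proposed kvm_vcpu_area appears intended to give the vcpu run call the same "mask applies only while waiting" behaviour.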

I propose the following as the new, stable, kvm api:

// open a handle to the kvm interface. does not create a vm.
int kvm_fd = open("/dev/kvm", O_RDWR);

// the kvm interface supports just three ioctls:
ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
ioctl(kvm_fd, KVM_GET_MSR_LIST, &msr_list);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

// vm ioctls:
ioctl(vm_fd, KVM_VM_CREATE_MEMORY_REGION, &slot);
ioctl(vm_fd, KVM_VM_GET_DIRTY_LOG, &dirty_log);
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, vcpu_slot_number);

// each vcpu is a separate fd/inode. this ensures no cacheline bouncing
// when the kernel refcounts the inodes on syscalls.

// kvm_vcpu_area contains the exit reasons and associated data, and
// results returned by userspace to resolve the exit reasons.
struct kvm_vcpu_area *vcpu_area = mmap(NULL, PAGE_SIZE, ..., vcpu_fd, 0);

struct kvm_vcpu_area {
    u32 vcpu_area_size;
    u32 exit_reason;

    sigset_t sigmask; // for use during vcpu execution

    union {
        struct kvm_pio pio;
        struct kvm_mmio mmio;
        struct kvm_cpuid cpuid;
        // etc.
        char padding[...];
    };

    struct kvm_irq irq; // acks from vm; injection from userspace
};


// vcpu ioctls

ioctl(vcpu_fd, KVM_VCPU_RUN, 0); // all comms through mmap()ed vcpu_area
ioctl(vcpu_fd, KVM_VCPU_GET_REGS, &regs);
ioctl(vcpu_fd, KVM_VCPU_SET_REGS, &regs);
ioctl(vcpu_fd, KVM_VCPU_GET_SREGS, &sregs);
ioctl(vcpu_fd, KVM_VCPU_SET_SREGS, &sregs);
ioctl(vcpu_fd, KVM_VCPU_GET_MSRS, &msrs);
ioctl(vcpu_fd, KVM_VCPU_SET_MSRS, &msrs);
ioctl(vcpu_fd, KVM_VCPU_DEBUG_GUEST, &debug);
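
To make the intended control flow concrete, the following is a rough sketch of how a userspace monitor might loop over exits under this proposal. Since the interface is only proposed, the real ioctl(vcpu_fd, KVM_VCPU_RUN, 0) call is replaced by a function pointer; vcpu_area_sketch, vcpu_run_fn, scripted_guest, scripted_run() and run_vcpu_until_halt() are all hypothetical names, and only the exit-reason values are taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exit reasons as listed in the proposal. */
enum kvm_exit_reason {
    KVM_EXIT_UNKNOWN = 0,
    KVM_EXIT_EXCEPTION = 1,
    KVM_EXIT_IO = 2,
    KVM_EXIT_CPUID = 3,
    KVM_EXIT_DEBUG = 4,
    KVM_EXIT_HLT = 5,
    KVM_EXIT_MMIO = 6,
    KVM_EXIT_IRQ_WINDOW_OPEN = 7,
    KVM_EXIT_HYPERCALL = 8,
};

/* Hypothetical, cut-down stand-in for the mmap()ed kvm_vcpu_area. */
struct vcpu_area_sketch {
    uint32_t vcpu_area_size;
    uint32_t exit_reason;
};

/* Stand-in for ioctl(vcpu_fd, KVM_VCPU_RUN, 0): fills area, returns 0 or -1. */
typedef int (*vcpu_run_fn)(struct vcpu_area_sketch *area, void *opaque);

/* Run the vcpu until it halts; count the exits that userspace resolved. */
int run_vcpu_until_halt(vcpu_run_fn run, struct vcpu_area_sketch *area,
                        void *opaque, unsigned *handled)
{
    assert(run != NULL && area != NULL && handled != NULL);
    for (;;) {
        if (run(area, opaque) < 0)
            return -1;
        switch (area->exit_reason) {
        case KVM_EXIT_IO:
        case KVM_EXIT_MMIO:
        case KVM_EXIT_CPUID:
            /* A real monitor would emulate here and write results to area. */
            ++*handled;
            break;
        case KVM_EXIT_HLT:
            return 0;
        default:
            return -1;   /* exit reason this sketch does not handle */
        }
    }
}

/* Fake guest that replays a fixed list of exit reasons. */
struct scripted_guest {
    const uint32_t *exits;
    size_t count;
    size_t next;
};

int scripted_run(struct vcpu_area_sketch *area, void *opaque)
{
    struct scripted_guest *guest = opaque;

    if (guest->next >= guest->count)
        return -1;
    area->exit_reason = guest->exits[guest->next++];
    return 0;
}
```

The point of the sketch is that nothing is copied through the run call itself: the exit reason and its data are read from, and answered through, the shared area.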


/* for KVM_VM_CREATE_MEMORY_REGION */
struct kvm_memory_region {
    __u32 slot;
    __u32 flags;
    __u64 guest_phys_addr;
    __u64 memory_size; /* bytes */
};

/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES 1UL


#define KVM_EXIT_TYPE_FAIL_ENTRY 1
#define KVM_EXIT_TYPE_VM_EXIT 2

enum kvm_exit_reason {
    KVM_EXIT_UNKNOWN = 0,
    KVM_EXIT_EXCEPTION = 1,
    KVM_EXIT_IO = 2,
    KVM_EXIT_CPUID = 3,
    KVM_EXIT_DEBUG = 4,
    KVM_EXIT_HLT = 5,
    KVM_EXIT_MMIO = 6,
    KVM_EXIT_IRQ_WINDOW_OPEN = 7,
    KVM_EXIT_HYPERCALL = 8,
};


/* for KVM_GET_REGS and KVM_SET_REGS */
struct kvm_regs {
    // note: no vcpu!

    /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
    __u64 rax, rbx, rcx, rdx;
    __u64 rsi, rdi, rsp, rbp;
    __u64 r8, r9, r10, r11;
    __u64 r12, r13, r14, r15;
    __u64 rip, rflags;
};

struct kvm_segment {
    __u64 base;
    __u32 limit;
    __u16 selector;
    __u8 type;
    __u8 present, dpl, db, s, l, g, avl;
    __u8 unusable;
    __u8 padding;
};

struct kvm_dtable {
    __u64 base;
    __u16 limit;
    __u16 padding[3];
};

/* for KVM_VCPU_GET_SREGS and KVM_VCPU_SET_SREGS */
struct kvm_sregs {
    /* out (KVM_GET_SREGS) / in (KVM_SET_SREGS) */
    struct kvm_segment cs, ds, es, fs, gs, ss;
    struct kvm_segment tr, ldt;
    struct kvm_dtable gdt, idt;
    __u64 cr0, cr2, cr3, cr4, cr8;
};

struct kvm_msr_entry {
    __u32 index;
    __u32 reserved;
    __u64 data;
};

/* for KVM_VCPU_GET_MSRS and KVM_VCPU_SET_MSRS */
struct kvm_msrs {
    __u32 nmsrs; /* number of msrs in entries */
    __u32 padding;

    struct kvm_msr_entry entries[0];
};
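
Because kvm_msrs ends in a variable-length array, the caller has to size the allocation itself. A possible sketch follows, with the struct definitions repeated using <stdint.h> types (and a C99 flexible array member) so that it compiles outside the kernel; alloc_msrs() is a hypothetical helper, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct kvm_msr_entry {
    uint32_t index;
    uint32_t reserved;
    uint64_t data;
};

struct kvm_msrs {
    uint32_t nmsrs;   /* number of msrs in entries */
    uint32_t padding;
    struct kvm_msr_entry entries[];
};

/*
 * Hypothetical helper: allocate a zeroed kvm_msrs with room for n entries
 * and fill in the MSR indices to query. The caller frees it with free().
 */
struct kvm_msrs *alloc_msrs(const uint32_t *indices, uint32_t n)
{
    size_t bytes = sizeof(struct kvm_msrs)
                 + (size_t)n * sizeof(struct kvm_msr_entry);
    struct kvm_msrs *msrs;

    assert(indices != NULL || n == 0);
    msrs = calloc(1, bytes);
    if (msrs == NULL)
        return NULL;

    msrs->nmsrs = n;
    for (uint32_t i = 0; i < n; ++i)
        msrs->entries[i].index = indices[i];
    return msrs;
}
```

The resulting pointer is what would be passed as the argument of the proposed KVM_VCPU_GET_MSRS call, with the kernel filling in each entry's data field.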

/* for KVM_GET_MSR_INDEX_LIST */
struct kvm_msr_list {
    __u32 nmsrs; /* number of msrs in entries */
    __u32 indices[0];
};

struct kvm_breakpoint {
    __u32 enabled;
    __u32 padding;
    __u64 address;
};

/* for KVM_VCPU_DEBUG_GUEST */
struct kvm_debug_guest {
    __u32 enabled;
    __u32 singlestep;
    struct kvm_breakpoint breakpoints[4];
};

/* for KVM_VM_GET_DIRTY_LOG */
struct kvm_dirty_log {
    __u32 slot;
    __u32 padding;
    union {
        void __user *dirty_bitmap; /* one bit per page */
        __u64 padding;
    };
};
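
As an illustration of the "one bit per page" bitmap, here is a small sketch of how userspace might size the buffer and test a page. The 4 KiB page size, the rounding to whole 64-bit words, and the little-endian bit order (bit n of the bitmap is bit n % 8 of byte n / 8) are assumptions of this sketch; dirty_bitmap_bytes() and page_is_dirty() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB guest pages */

/* Bytes needed for a one-bit-per-page bitmap covering memory_size bytes. */
size_t dirty_bitmap_bytes(uint64_t memory_size)
{
    uint64_t pages = (memory_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

    /* Round up to whole 64-bit words so 32- and 64-bit callers agree. */
    return (size_t)((pages + 63) / 64 * 8);
}

/* Returns 1 if the given page number is marked dirty, 0 otherwise. */
int page_is_dirty(const uint8_t *bitmap, uint64_t page)
{
    assert(bitmap != NULL);
    return (bitmap[page / 8] >> (page % 8)) & 1;
}
```

For a slot created with KVM_MEM_LOG_DIRTY_PAGES, a buffer of dirty_bitmap_bytes(memory_size) bytes would be handed to the dirty-log call and then scanned with page_is_dirty().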


Comments and questions are welcome.


Thanks to Arnd Bergmann for his contributions and advice on this issue.

--
error compiling committee.c: too many arguments to function


2007-01-09 13:48:10

by Jeff Garzik

Subject: Re: [RFC] Stable kvm userspace interface

Can we please avoid adding a ton of new ioctls? ioctls inevitably
require 64-bit compat code for certain architectures, whereas
sysfs/procfs does not.

Jeff



2007-01-09 14:02:57

by James Morris

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

On Tue, 9 Jan 2007, Jeff Garzik wrote:

> Can we please avoid adding a ton of new ioctls? ioctls inevitably
> require 64-bit compat code for certain architectures, whereas
> sysfs/procfs does not.

I guess ioctl is not as important now if the API is now always talking to
one VM.


- James
--
James Morris

2007-01-09 14:11:26

by Avi Kivity

Subject: Re: [RFC] Stable kvm userspace interface

Jeff Garzik wrote:
> Can we please avoid adding a ton of new ioctls? ioctls inevitably
> require 64-bit compat code for certain architectures, whereas
> sysfs/procfs does not.
>

I don't see how the procfs or sysfs models fit kvm. wrt compat code,
the current kvm abi (also ioctl based) is 32/64 bit safe without compat
code, and I certainly don't intend to break it.


--
error compiling committee.c: too many arguments to function

2007-01-11 07:26:54

by Arnd Bergmann

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

On Tuesday 09 January 2007 14:37, Avi Kivity wrote:
> struct kvm_vcpu_area {
>     u32 vcpu_area_size;
>     u32 exit_reason;
>
>     sigset_t sigmask;  // for use during vcpu execution

Since Jeff brought up the point of 32 bit compatibility:
When this structure is shared between 64 bit kernel and
32 bit user space, your sigmask should be a __u64 in order
to guarantee compatibility.

Arnd <><

2007-01-11 07:35:12

by Arnd Bergmann

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

On Tuesday 09 January 2007 14:47, Jeff Garzik wrote:
> Can we please avoid adding a ton of new ioctls?  ioctls inevitably
> require 64-bit compat code for certain architectures, whereas
> sysfs/procfs does not.

For performance reasons, an ascii string based interface is not
desirable here; some of these calls should be optimized to
the point of counting cycles.

Sysfs also does not fit the use case at all, and procfs only
makes sense if you really want to keep all information about the
guest as part of the process directory it belongs to.

I still think that in the long term, we should migrate to
new system calls and a special file system for kvm, which
might be non-mountable. Those will of course have the same
32 bit compat problems as the ioctl approach, but so far,
Avi has kept a good watch on avoiding these problems.

As long as we think the interface is likely to change (which it
certainly is right now), I believe that ioctl is the right
interface. We can think about retiring it when the interface has
stabilized enough to be converted to syscalls.

Arnd <><

2007-01-11 08:02:11

by Avi Kivity

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

Arnd Bergmann wrote:
> On Tuesday 09 January 2007 14:37, Avi Kivity wrote:
>
>> struct kvm_vcpu_area {
>> u32 vcpu_area_size;
>> u32 exit_reason;
>>
>> sigset_t sigmask; // for use during vcpu execution
>>
>
> Since Jeff brought up the point of 32 bit compatibility:
> When this structure is shared between 64 bit kernel and
> 32 bit user space, your sigmask should be a __u64 in order
> to guarantee compatibility.
>

Right. Thanks.

--
error compiling committee.c: too many arguments to function

2007-01-11 08:03:50

by Avi Kivity

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

Arnd Bergmann wrote:
> I still think that in the long term, we should migrate to
> new system calls and a special file system for kvm, which
> might be non-mountable.

The inode-per-vm and inode-per-vcpu approach sort-of-implies a
nonmountable special filesystem, so with the proposed change, we'll be
halfway there.



--
error compiling committee.c: too many arguments to function

2007-01-11 08:26:21

by Jeff Garzik

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

Arnd Bergmann wrote:
> On Tuesday 09 January 2007 14:47, Jeff Garzik wrote:
>> Can we please avoid adding a ton of new ioctls? ioctls inevitably
>> require 64-bit compat code for certain architectures, whereas
>> sysfs/procfs does not.
>
> For performance reasons, an ascii string based interface is not
> desirable here; some of these calls should be optimized to
> the point of counting cycles.

sysfs does not require ASCII...

Jeff



2007-01-11 08:32:28

by Avi Kivity

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

Jeff Garzik wrote:
> Arnd Bergmann wrote:
>> On Tuesday 09 January 2007 14:47, Jeff Garzik wrote:
>>> Can we please avoid adding a ton of new ioctls? ioctls inevitably
>>> require 64-bit compat code for certain architectures, whereas
>>> sysfs/procfs does not.
>>
>> For performance reasons, an ascii string based interface is not
>> desirable here; some of these calls should be optimized to
>> the point of counting cycles.
>
> sysfs does not require ASCII...
>

The main kvm ioctl switches the execution mode to guest mode. Just like
a syscall enters kernel mode, ioctl(vcpu_fd, KVM_VCPU_RUN) enters the
guest address space and begins executing guest code.

I don't see how to model that with sysfs.

There are other objections as well. sysfs is a public interface, whereas
kvm is a process private attribute. These objections don't apply to
/proc though.


--
error compiling committee.c: too many arguments to function

2007-01-11 17:47:14

by David Lang

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface

On Thu, 11 Jan 2007, Arnd Bergmann wrote:

> On Tuesday 09 January 2007 14:47, Jeff Garzik wrote:
>> Can we please avoid adding a ton of new ioctls?  ioctls inevitably
>> require 64-bit compat code for certain architectures, whereas
>> sysfs/procfs does not.
>
> For performance reasons, an ascii string based interface is not
> desirable here; some of these calls should be optimized to
> the point of counting cycles.

why is this? most of the API that is being discussed is run once when the VM is
being set up.

there may be some calls that are performance sensitive, but for things like
separating the page tables, the cost of doing the work will swamp any ASCII
conversion costs.

David Lang

> Sysfs also does not fit the use case at all, and procfs only
> makes sense if you really want to keep all information about the
> guest as part of the process directory it belongs to.
>
> I still think that in the long term, we should migrate to
> new system calls and a special file system for kvm, which
> might be non-mountable. Those will of course have the same
> 32 bit compat problems as the ioctl approach, but so far,
> Avi has kept a good watch on avoiding these problems.
>
> As long as we think the interface is likely to change (which it
> certainly is right now), I believe that ioctl is the right
> interface. We can think about retiring it when the interface has
> stabilized enough to be converted to syscalls.
>
> Arnd <><

2007-01-12 11:20:04

by Pavel Machek

Subject: Re: [kvm-devel] [RFC] Stable kvm userspace interface


Hi!

> >>Can we please avoid adding a ton of new ioctls?
> >>ioctls inevitably require 64-bit compat code for
> >>certain architectures, whereas sysfs/procfs does not.
> >
> >For performance reasons, an ascii string based
> >interface is not
> >desirable here; some of these calls should be
> >optimized to
> >the point of counting cycles.
>
> sysfs does not require ASCII...

Yep, but at that point you have the 32- vs. 64-bit nightmare back... and
worse, because sysfs does not have compat handling.

--
Thanks for all the (sleeping) penguins.