LinuxLists.cc - [RFC] kvm: export host NUMA info to guest's scheduler

2012-05-23 06:32:47

Subject: [RFC] kvm: export host NUMA info to guest's scheduler

Currently, the guest can not know the NUMA info of the vcpu, which will
result in performance drawback. For example:
Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
the tasks' pull and push between these vcpus will cost more. But
unfortunately, currently, the guest is just blind to this.

So here is the idea to solve it.
Export host numa info through guest's sched domain to its scheduler.
So the guest's lb will consider the cost.

These patches include:
For guest:
0001-sched-add-virt-sched-domain-for-the-guest.patch
0002-sched-add-virt-domain-device-s-driver.patch
For host:
0001-kvm-collect-vcpus-numa-info-for-guest-s-scheduler.patch
0001-Qemu-add-virt-sched-domain-device.patch

Please give some comments and suggestion.

Thanks and regards,
pingfan

2012-05-23 06:32:53

by Pingfan Liu

[permalink] [raw]

Subject: [PATCH 1/2] sched: add virt sched domain for the guest

From: Liu Ping Fan <[email protected]>

The guest's scheduler can not see the numa info on the host and
this will result to the following scene:
Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
the tasks' pull and push between these vcpus will cost more. But
unfortunately, currently, the guest is just blind to this.

This patch want to export the host numa info to the guest, and help
guest to rebuild its sched domain based on host's info.

--todo:
vcpu's hotplug will be considered.

Signed-off-by: Liu Ping Fan <[email protected]>
---
kernel/cpuset.c | 2 +-
kernel/sched/core.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 5 ++++
3 files changed, 71 insertions(+), 1 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..1246091 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -778,7 +778,7 @@ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
* to a separate workqueue thread, which ends up processing the
* above do_rebuild_sched_domains() function.
*/
-static void async_rebuild_sched_domains(void)
+void async_rebuild_sched_domains(void)
{
queue_work(cpuset_wq, &rebuild_sched_domains_work);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5212ae..3f72c1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6343,6 +6343,60 @@ static struct sched_domain_topology_level default_topology[] = {
{ NULL, },
};

+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+/* fill in by host */
+DEFINE_PER_CPU(int, virt_numa_node);
+/* todo, exchange info about HOST_NUMNODES from host */
+#define HOST_NUMNODES 128
+/* keep map, node->cpumask; todo, make it dynamic allocated */
+static struct cpumask virt_node_to_cpumask_map[HOST_NUMNODES];
+
+static inline int virt_cpu_to_node(int cpu)
+{
+ return per_cpu(virt_numa_node, cpu);
+}
+
+const struct cpumask *virt_cpumask_of_node(int vnode)
+{
+ struct cpumask *msk = &virt_node_to_cpumask_map[vnode];
+ return msk;
+}
+
+static const struct cpumask *virt_cpu_cpu_mask(int cpu)
+{
+ return virt_cpumask_of_node(virt_cpu_to_node(cpu));
+}
+
+static struct sched_domain_topology_level virt_topology[] = {
+ { sd_init_CPU, virt_cpu_cpu_mask, },
+#ifdef CONFIG_NUMA
+ { sd_init_ALLNODES, cpu_allnodes_mask, },
+#endif
+ { NULL, },
+};
+
+static int update_virt_numa_node(void)
+{
+ int i, cpu, apicid, vnode;
+ for (i = 0; i < HOST_NUMNODES; i++)
+ cpumask_clear(&virt_node_to_cpumask_map[i]);
+ for_each_possible_cpu(cpu) {
+ apicid = cpu_physical_id(cpu);
+ vnode = __vapicid_to_vnode[apicid];
+ per_cpu(virt_numa_node, cpu) = vnode;
+ cpumask_set_cpu(cpu, &virt_node_to_cpumask_map[vnode]);
+ }
+ return 0;
+}
+
+int rebuild_virt_sd(void)
+{
+ update_virt_numa_node();
+ async_rebuild_sched_domains();
+ return 0;
+}
+#endif
+
static struct sched_domain_topology_level *sched_domain_topology = default_topology;

static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -6689,9 +6743,11 @@ match1:
/* Build new domains */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < ndoms_cur && !new_topology; j++) {
+#ifndef CONFIG_VIRT_SCHED_DOMAIN
if (cpumask_equal(doms_new[i], doms_cur[j])
&& dattrs_equal(dattr_new, i, dattr_cur, j))
goto match2;
+#endif
}
/* no match - add a new doms_new */
build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
@@ -6837,6 +6893,15 @@ void __init sched_init_smp(void)
{
cpumask_var_t non_isolated_cpus;

+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+ int i;
+ for (i = 0; i < MAX_LOCAL_APIC; i++) {
+ /* pretend all on the same node */
+ __vapicid_to_vnode[i] = 0;
+ }
+ update_virt_numa_node();
+ sched_domain_topology = virt_topology;
+#endif
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
alloc_cpumask_var(&fallback_doms, GFP_KERNEL);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb3acba..232482d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,9 @@

extern __read_mostly int scheduler_running;

+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+extern s16 __vapicid_to_vnode[];
+#endif
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -198,6 +201,8 @@ struct cfs_bandwidth { };

#endif /* CONFIG_CGROUP_SCHED */

+extern void async_rebuild_sched_domains(void);
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
--
1.7.4.4

2012-05-23 06:33:04

by Pingfan Liu

[permalink] [raw]

Subject: [PATCH 2/2] sched: add virt domain device's driver

From: Liu Ping Fan <[email protected]>

A driver plays with Qemu's emulated "virt domain device". They aims to
export the host numa info to the guest.

--todo:
A more proper place to archive this driver?

Signed-off-by: Liu Ping Fan <[email protected]>
---
drivers/virtio/Kconfig | 4 ++
drivers/virtio/Makefile | 1 +
drivers/virtio/vsd.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 129 insertions(+), 0 deletions(-)
create mode 100644 drivers/virtio/vsd.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 1a61939..2ab6faa 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -45,5 +45,9 @@ config VIRTIO_BALLOON
platform device driver.

If unsure, say N.
+config VIRT_SCHED_DOMAIN
+ tristate "virt sched domain driver (EXPERIMENTAL)"
+ ---help---
+ This driver make guest scheduler know the NUMA info on host

endmenu
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 5a4c63c..20a565d 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_VIRTIO) += virtio.o
+obj-$(CONFIG_VIRT_SCHED_DOMAIN) += vsd.o
obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
diff --git a/drivers/virtio/vsd.c b/drivers/virtio/vsd.c
new file mode 100644
index 0000000..628dba0
--- /dev/null
+++ b/drivers/virtio/vsd.c
@@ -0,0 +1,124 @@
+/*
+ * PCI driver for qemu virt sched domain device.
+ *
+ * Copyright IBM Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/pci.h>
+#include <linux/interrupt.h>
+
+#define PCI_DEVICE_ID_CPUSTATE 0x1010
+struct vsd_regs {
+ /* __vapicid_to_vnode[]; addr wr by gst,rd by hst; content wr by hst,
+ * rd by gst
+ */
+ unsigned int apic_to_node;
+ unsigned int smem_size;
+};
+
+struct vsd_stub {
+ struct vsd_regs __iomem *regs;
+};
+
+/* fill in by host */
+s16 __vapicid_to_vnode[MAX_LOCAL_APIC];
+static struct vsd_stub *agent;
+
+extern int rebuild_virt_sd(void);
+
+static irqreturn_t virt_sd_thread_irq(int irq, void *data)
+{
+ rebuild_virt_sd();
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t vsd_irq(int irq, void *data)
+{
+ return IRQ_WAKE_THREAD;
+}
+
+static int __devinit vsd_probe(struct pci_dev *pci_dev,
+ const struct pci_device_id *id)
+{
+ int ret = 0;
+ agent = kzalloc(sizeof(struct vsd_stub), GFP_KERNEL);
+ if (agent == NULL) {
+ ret = -1;
+ goto fail;
+ }
+ ret = pci_enable_device(pci_dev);
+ if (ret) {
+ printk(KERN_WARNING "%s, pci_enable_device fail,ret=0x%x\n",
+ __func__, ret);
+ goto fail;
+ }
+ ret = pci_request_regions(pci_dev, "vsd");
+ if (ret) {
+ printk(KERN_WARNING "%s, pci_request_regions fail,ret=0x%x\n",
+ __func__, ret);
+ goto out_enable_device;
+ }
+ agent->regs = ioremap(pci_dev->resource[0].start,
+ pci_dev->resource[0].end - pci_dev->resource[0].start);
+ if (agent->regs == NULL) {
+ printk(KERN_WARNING "%s, ioremap fail\n", __func__);
+ goto out_req_regions;
+ }
+ agent->regs->apic_to_node = __pa(__vapicid_to_vnode);
+ agent->regs->smem_size = sizeof(__vapicid_to_vnode);
+ ret = request_threaded_irq(pci_dev->irq, vsd_irq, virt_sd_thread_irq,
+ IRQF_SHARED, "virt domain irq", agent);
+ if (ret < 0)
+ goto out_req_regions;
+ return 0;
+out_req_regions:
+ pci_release_regions(pci_dev);
+out_enable_device:
+ pci_disable_device(pci_dev);
+ kfree(agent);
+ agent = NULL;
+fail:
+ printk(KERN_WARNING "%s fail\n", __func__);
+ return ret;
+}
+
+static void __devexit vsd_remove(struct pci_dev *pci_dev)
+{
+}
+
+/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
+static DEFINE_PCI_DEVICE_TABLE(pci_vsd_id_table) = {
+ { PCI_VENDOR_ID_IBM, PCI_DEVICE_ID_CPUSTATE,
+ PCI_ANY_ID, PCI_ANY_ID,
+ PCI_CLASS_SYSTEM_OTHER, 0,
+ 0 },
+ { 0 },
+};
+MODULE_DEVICE_TABLE(pci, pci_vsd_id_table);
+
+static struct pci_driver pci_vsd_driver = {
+ .name = "vsd",
+ .id_table = pci_vsd_id_table,
+ .probe = vsd_probe,
+ .remove = __devexit_p(vsd_remove),
+};
+
+static int __init pci_vsd_init(void)
+{
+ return pci_register_driver(&pci_vsd_driver);
+}
+module_init(pci_vsd_init);
+
+static void __exit pci_vsd_exit(void)
+{
+ pci_unregister_driver(&pci_vsd_driver);
+}
+module_exit(pci_vsd_exit);
+MODULE_DESCRIPTION("vsd");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
--
1.7.4.4

2012-05-23 06:33:14

by Pingfan Liu

[permalink] [raw]

Subject: [PATCH] Qemu: add virt sched domain device

From: Liu Ping Fan <[email protected]>

The device will demand the collection of vcpus' numa info, and
trigger the guest to rebuild the sched domain.

Signed-off-by: Liu Ping Fan <[email protected]>
---
Makefile.target | 1 +
hmp-commands.hx | 16 +++++
hw/qdev.h | 1 +
hw/virt_sd.c | 155 +++++++++++++++++++++++++++++++++++++++++++++
linux-headers/linux/kvm.h | 8 ++-
5 files changed, 180 insertions(+), 1 deletions(-)
create mode 100644 hw/virt_sd.c

diff --git a/Makefile.target b/Makefile.target
index 4fbbabf..fded330 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -265,6 +265,7 @@ obj-i386-y += pci-hotplug.o smbios.o wdt_ib700.o
obj-i386-y += debugcon.o multiboot.o
obj-i386-y += pc_piix.o
obj-i386-y += pc_sysfw.o
+obj-i386-y += virt_sd.o
obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o kvm/ioapic.o kvm/i8254.o
obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 461fa59..47b826c 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1254,6 +1254,22 @@ Change I/O throttle limits for a block drive to @var{bps} @var{bps_rd} @var{bps_
ETEXI

{
+ .name = "guest_numa_notify",
+ .args_type = "",
+ .params = "",
+ .help = "force guest to update numa info based on host",
+ .user_print = monitor_user_noop,
+ .mhandler.cmd_new = do_guest_numa_notify,
+ },
+
+STEXI
+@item device_add @var{config}
+@findex device_add
+
+Add device.
+ETEXI
+
+ {
.name = "block_set_io_throttle",
.args_type = "device:B,bps:l,bps_rd:l,bps_wr:l,iops:l,iops_rd:l,iops_wr:l",
.params = "device bps bps_rd bps_wr iops iops_rd iops_wr",
diff --git a/hw/qdev.h b/hw/qdev.h
index 4e90119..6902474 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -203,6 +203,7 @@ void do_info_qtree(Monitor *mon);
void do_info_qdm(Monitor *mon);
int do_device_add(Monitor *mon, const QDict *qdict, QObject **ret_data);
int do_device_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
+int do_guest_numa_notify(Monitor *mon, const QDict *qdict, QObject **ret_data);

/*** qdev-properties.c ***/

diff --git a/hw/virt_sd.c b/hw/virt_sd.c
new file mode 100644
index 0000000..c3aece4
--- /dev/null
+++ b/hw/virt_sd.c
@@ -0,0 +1,155 @@
+/*
+ * Virt sched domain Support
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * Authors:
+ * Liu Ping Fan <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+*/
+#include "hw.h"
+#include "pci.h"
+#include "kvm.h"
+#include <linux/kvm.h>
+
+/* #define DEBUG_VSD */
+#ifdef DEBUG_VSD
+#define dprintf(fmt, ...) \
+ do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+#else
+#define dprintf(fmt, ...) \
+ do { } while (0)
+#endif
+
+#define PCI_DEVICE_ID_CPUSTATE 0x1010
+
+typedef struct VirtSdState VirtSdState;
+typedef struct Regs Regs;
+
+#define VSD_REGS_SIZE 0x1000
+struct Regs {
+ unsigned int gpa_apic_node;
+ unsigned int size;
+};
+
+struct VirtSdState {
+ PCIDevice dev;
+ MemoryRegion mmio;
+ Regs regs;
+};
+
+static const VMStateDescription vmstate_vsd = {
+ .name = "vsd",
+ .version_id = 1,
+ .minimum_version_id = 0,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static VirtSdState *vsd_dev;
+
+static int update_guest_numa(void)
+{
+ int ret = 0;
+ target_phys_addr_t sz;
+ struct kvm_virt_sd vsd;
+ sz = vsd.sz = vsd_dev->regs.size;
+ vsd.vapic_map = cpu_physical_memory_map(vsd_dev->regs.gpa_apic_node,
+ &sz, 1);
+ ret = kvm_ioctl(kvm_state, KVM_SET_GUEST_NUMA, &vsd);
+ if (ret < 0) {
+ return -1;
+ } else {
+ qemu_set_irq(vsd_dev->dev.irq[0], 1);
+ qemu_set_irq(vsd_dev->dev.irq[0], 0);
+ }
+ return 0;
+}
+
+int do_guest_numa_notify(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+ return update_guest_numa();
+}
+
+static void
+vsd_mmio_write(void *opaque, target_phys_addr_t addr, uint64_t val,
+ unsigned size)
+{
+ VirtSdState *vsd = opaque;
+ dprintf("vsd_mmio_write,addr=0x%lx, val=0x%lx\n", addr, val);
+ switch (addr) {
+ case 0:
+ vsd->regs.gpa_apic_node = val;
+ break;
+ case 4:
+ vsd->regs.size = val;
+ break;
+ default:
+ fprintf(stderr, "reg unimplemented\n");
+ break;
+ }
+}
+
+static uint64_t
+vsd_mmio_read(void *opaque, target_phys_addr_t addr, unsigned size)
+{
+ return 0;
+}
+
+static const MemoryRegionOps vsd_ops = {
+ .read = vsd_mmio_read,
+ .write = vsd_mmio_write,
+ .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static int pci_vsd_init(PCIDevice *dev)
+{
+ uint8_t *pci_cfg = dev->config;
+ VirtSdState *s = DO_UPCAST(VirtSdState, dev, dev);
+ memory_region_init_io(&s->mmio, &vsd_ops, s, "vsd", VSD_REGS_SIZE);
+ vsd_dev = s;
+ pci_cfg[PCI_INTERRUPT_PIN] = 1;
+ pci_cfg[PCI_CAPABILITY_LIST] = 0xdc;
+ pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
+ return 0;
+}
+
+static int pci_vsd_exit(PCIDevice *dev)
+{
+ return 0;
+}
+
+static Property vsd_properties[] = {
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vsd_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+ k->init = pci_vsd_init;
+ k->exit = pci_vsd_exit;
+ k->vendor_id = PCI_VENDOR_ID_IBM;
+ k->device_id = PCI_DEVICE_ID_CPUSTATE;
+ k->revision = 0x10;
+ k->class_id = PCI_CLASS_MEMORY_RAM;
+ dc->props = vsd_properties;
+}
+
+static TypeInfo vsd_info = {
+ .name = "vsd",
+ .parent = TYPE_PCI_DEVICE,
+ .instance_size = sizeof(VirtSdState),
+ .class_init = vsd_class_init,
+};
+
+static void vsd_register_types(void)
+{
+ type_register_static(&vsd_info);
+}
+type_init(vsd_register_types)
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index ee7bd9c..aa5aec3 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -448,7 +448,6 @@ struct kvm_ppc_pvinfo {
__u32 hcall[4];
__u8 pad[108];
};
-
#define KVMIO 0xAE

/* machine type bits, to be used as argument to KVM_CREATE_VM */
@@ -478,6 +477,7 @@ struct kvm_ppc_pvinfo {
#define KVM_TRACE_PAUSE __KVM_DEPRECATED_MAIN_0x07
#define KVM_TRACE_DISABLE __KVM_DEPRECATED_MAIN_0x08

+
/*
* Extension capability list.
*/
@@ -733,6 +733,7 @@ struct kvm_one_reg {
struct kvm_userspace_memory_region)
#define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
#define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
+#define KVM_SET_GUEST_NUMA _IOW(KVMIO, 0x49, struct kvm_virt_sd)

/* enable ucontrol for s390 */
struct kvm_s390_ucas_mapping {
@@ -913,4 +914,9 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
};

+struct kvm_virt_sd {
+ __u64 *vapic_map;
+ __u64 sz;
+};
+
#endif /* __LINUX_KVM_H */
--
1.7.4.4

2012-05-23 06:33:10

by Pingfan Liu

[permalink] [raw]

Subject: [PATCH] kvm: collect vcpus' numa info for guest's scheduler

From: Liu Ping Fan <[email protected]>

The guest's scheduler can not see the numa info on the host and
this will result to the following scene:
Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
the tasks' pull and push between these vcpus will cost more. But
unfortunately, currently, the guest is just blind to this.

This patch want to collect vm's vcpus' numa info.

--todo:
consider about vcpu's initial and hotplug event

Signed-off-by: Liu Ping Fan <[email protected]>
---
arch/x86/kvm/x86.c | 33 +++++++++++++++++++++++++++++++++
include/linux/kvm.h | 6 ++++++
include/linux/kvm_host.h | 4 ++++
virt/kvm/kvm_main.c | 10 ++++++++++
4 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 185a2b8..d907504 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4918,6 +4918,39 @@ void kvm_arch_exit(void)
kvm_mmu_module_exit();
}

+#ifdef CONFIG_VIRT_SD_SUPPORTD
+int kvm_arch_guest_numa_update(struct kvm *kvm, void __user *to, int n)
+{
+ struct kvm_vcpu *vcpup;
+ s16 *apci_ids;
+ int idx, node;
+ int ret = 0;
+ unsigned int cpu;
+ struct pid *pid;
+ struct task_struct *tsk;
+ apci_ids = kmalloc(n, GFP_KERNEL);
+ if (apci_ids == NULL)
+ return -ENOMEM;
+ kvm_for_each_vcpu(idx, vcpup, kvm) {
+ rcu_read_lock();
+ pid = rcu_dereference(vcpup->pid);
+ tsk = get_pid_task(pid, PIDTYPE_PID);
+ rcu_read_unlock();
+ if (tsk) {
+ cpu = task_cpu(tsk);
+ put_task_struct(tsk);
+ node = cpu_to_node(cpu);
+ } else
+ node = NUMA_NO_NODE;
+ apci_ids[vcpup->vcpu_id] = node;
+ }
+ if (copy_to_user(to, apci_ids, n))
+ ret = -EFAULT;
+ kfree(apci_ids);
+ return ret;
+}
+#endif
+
int kvm_emulate_halt(struct kvm_vcpu *vcpu)
{
++vcpu->stat.halt_exits;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 6c322a9..da4c0bc 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -732,6 +732,7 @@ struct kvm_one_reg {
struct kvm_userspace_memory_region)
#define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
#define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
+#define KVM_SET_GUEST_NUMA _IOW(KVMIO, 0x49, struct kvm_virt_sd)

/* enable ucontrol for s390 */
struct kvm_s390_ucas_mapping {
@@ -909,5 +910,10 @@ struct kvm_assigned_msix_entry {
__u16 entry; /* The index of entry in the MSI-X table */
__u16 padding[3];
};
+#define VIRT_SD_SUPPORTD
+struct kvm_virt_sd {
+ __u64 *vapic_map;
+ __u64 sz;
+};

#endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 72cbf08..328aa0c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -526,6 +526,10 @@ void kvm_arch_destroy_vm(struct kvm *kvm);
void kvm_free_all_assigned_devices(struct kvm *kvm);
void kvm_arch_sync_events(struct kvm *kvm);

+#ifdef CONFIG_VIRT_SD_SUPPORTD
+int kvm_arch_guest_numa_update(struct kvm *kvm, void __user *to, int n);
+#endif
+
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9739b53..46292bd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2029,6 +2029,16 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_ioeventfd(kvm, &data);
break;
}
+#ifdef CONFIG_VIRT_SD_SUPPORTD
+ case KVM_SET_GUEST_NUMA: {
+ struct kvm_virt_sd sd;
+ r = -EFAULT;
+ if (copy_from_user(&sd, argp, sizeof sd))
+ goto out;
+ r = kvm_arch_guest_numa_update(kvm, sd.vapic_map, sd.sz);
+ break;
+ }
+#endif
#ifdef CONFIG_KVM_APIC_ARCHITECTURE
case KVM_SET_BOOT_CPU_ID:
r = 0;
--
1.7.4.4

2012-05-23 07:54:16

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <[email protected]>
>
> The guest's scheduler can not see the numa info on the host and
> this will result to the following scene:
> Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
> the tasks' pull and push between these vcpus will cost more. But
> unfortunately, currently, the guest is just blind to this.
>
> This patch want to export the host numa info to the guest, and help
> guest to rebuild its sched domain based on host's info.

Hell no, we're not going to export sched domains, if kvm/qemu wants this
its all in sysfs.

The whole sched_domain stuff is a big enough pain as it is, exporting
this and making it a sodding API is the worst thing ever.

Whatever brainfart made you think this is needed anyway? sysfs contains
the host topology, qemu can already create whatever guest topology you
want (see the -smp and -numa arguments), so what gives?

2012-05-23 08:10:48

by Pingfan Liu

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, May 23, 2012 at 3:54 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <[email protected]>
>>
>> The guest's scheduler can not see the numa info on the host and
>> this will result to the following scene:
>> Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
>> the tasks' pull and push between these vcpus will cost more. But
>> unfortunately, currently, the guest is just blind to this.
>>
>> This patch want to export the host numa info to the guest, and help
>> guest to rebuild its sched domain based on host's info.
>
> Hell no, we're not going to export sched domains, if kvm/qemu wants this
> its all in sysfs.
>
> The whole sched_domain stuff is a big enough pain as it is, exporting
> this and making it a sodding API is the worst thing ever.
>
> Whatever brainfart made you think this is needed anyway? sysfs contains
> the host topology, qemu can already create whatever guest topology you
> want (see the -smp and -numa arguments), so what gives?

I think -numa option will be used to emulate the special virtual
machine to customer, and do not necessary map to host topology.
And even we map them exactly with -numa option, the movement of vcpu
threads among host nodes will break the topology initialized by -numa
option.
So give the guest a opportunity to adjust its topology?

Thanks and regards,
pingfan

2012-05-23 08:23:36

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
> the movement of vcpu
> threads among host nodes will break the topology initialized by -numa
> option.

You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
are assumed static, you cannot make that a dynamic map and pray things
keep working.

Also, have you any idea how expensive it is to rebuild the topology vs
migrating the vcpu?

2012-05-23 08:34:07

by Pingfan Liu

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, May 23, 2012 at 4:23 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
>> the movement of vcpu
>> threads among host nodes will break the topology initialized by -numa
>> option.
>
> You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
> are assumed static, you cannot make that a dynamic map and pray things
> keep working.
>
> Also, have you any idea how expensive it is to rebuild the topology vs
> migrating the vcpu?
>
No, do not rebuild the topology too frequently. Supposing vcpus in
node-A/B/C, now node-B is unplugged,
so we need to migrate some of vcpus from node-B to node-A, or to node-C.
>

2012-05-23 08:48:48

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> so we need to migrate some of vcpus from node-B to node-A, or to
> node-C.

This is absolutely broken, you cannot do that.

A guest task might want to be node affine, it looks at the topology sets
a cpu affinity mask and expects to stay on that node.

But then you come along, and flip one of those cpus to another node. The
guest task will now run on another node and get remote memory accesses.

Similarly for the guest kernel, it assumes cpu:node maps are static, it
will use this for all kinds of things, including the allocation of
per-cpu memory to be node affine to that cpu.

If you go migrate cpus across nodes everything comes down.

Please go do something else, I'll do this.

2012-05-23 09:58:32

by Pingfan Liu

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, May 23, 2012 at 4:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> so we need to migrate some of vcpus from node-B to node-A, or to
>> node-C.
>
> This is absolutely broken, you cannot do that.
>
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
>
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.
>
Oh, I had thought using -smp to handle such situation. The memory
accesses cost problem can be partly handled by kvm,
while opening a gap for guest's scheduler to see the host numa info.

> Similarly for the guest kernel, it assumes cpu:node maps are static, it
> will use this for all kinds of things, including the allocation of
> per-cpu memory to be node affine to that cpu.
>
> If you go migrate cpus across nodes everything comes down.
>
>
> Please go do something else, I'll do this.

OK, thanks.
pingfan

2012-05-23 10:14:12

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, 2012-05-23 at 17:58 +0800, Liu ping fan wrote:
> > Please go do something else, I'll do this.
>
OK so that was to say never, as in dynamic cpu:node relations aren't
going to happen. but tip/sched/numa contain the bits needed to make
vnuma work.

2012-05-23 15:53:05

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On Wed, 2012-05-23 at 08:23 -0700, Dave Hansen wrote:
> On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> > On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> >> > so we need to migrate some of vcpus from node-B to node-A, or to
> >> > node-C.
> > This is absolutely broken, you cannot do that.
> >
> > A guest task might want to be node affine, it looks at the topology sets
> > a cpu affinity mask and expects to stay on that node.
> >
> > But then you come along, and flip one of those cpus to another node. The
> > guest task will now run on another node and get remote memory accesses.
>
> Insane, sure. But, if the node has physically gone away, what do we do?
> I think we've got to either kill the guest, or let it run somewhere
> suboptimal. Sounds like you're advocating killing it. ;)

You all seem terribly confused. If you want a guest that 100% mirrors
the host topology you need hard-binding of all vcpu threads and clearly
you're in trouble if you unplug a host cpu while there's still a vcpu
expecting to run there.

That's an administrator error and you get to keep the pieces, I don't
care.

In case you want simple virt-numa where a number of vcpus constitute a
vnode and have their memory all on the same node the vcpus are ran on,
what does it matter if you unplug something in the host? Just migrate
everything -- including memory.

But what Liu was proposing is completely insane and broken. You cannot
simply remap cpu:node relations. Wanting to do that shows a profound
lack of understanding.

Our kernel assumes that a cpu remains on the same node. All userspace
that does anything with NUMA assumes the same. You cannot change this.

2012-05-23 15:56:20

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH 1/2] sched: add virt sched domain for the guest

On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> > so we need to migrate some of vcpus from node-B to node-A, or to
>> > node-C.
> This is absolutely broken, you cannot do that.
>
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
>
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.

Insane, sure. But, if the node has physically gone away, what do we do?
I think we've got to either kill the guest, or let it run somewhere
suboptimal. Sounds like you're advocating killing it. ;)