2021-07-13 16:20:09

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

These patches are not polished yet but I would like request feedback on this
approach and share performance results with you.

Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
haltpoll driver is enabled inside a virtual machine. This reduces wakeup
latency for events that occur soon after the vCPU becomes idle.

This patch series extends the cpuidle busy wait loop with the new poll_source
API so drivers can participate in polling. Such polling-aware drivers disable
their device's irq during the busy wait loop to avoid the cost of interrupts.
This reduces latency further than regular cpuidle haltpoll, which still relies
on irqs.

Virtio drivers are modified to use the poll_source API so all virtio device
types get this feature. The following virtio-blk fio benchmark results show the
improvement:

IOPS (numjobs=4, iodepth=1, 4 virtqueues)
before poll_source io_poll
4k randread 167102 186049 (+11%) 186654 (+11%)
4k randwrite 162204 181214 (+11%) 181850 (+12%)
4k randrw 159520 177071 (+11%) 177928 (+11%)

The comparison against io_poll shows that cpuidle poll_source achieves
equivalent performance to the block layer's io_poll feature (which I
implemented in a separate patch series [1]).

The advantage of poll_source is that applications do not need to explicitly set
the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
would have spent in cpuidle haltpoll anyway.

The current series does not improve virtio-net. I haven't investigated deeply,
but it is possible that NAPI and poll_source do not combine. See the final
patch for a starting point on making the two work together.

I have not tried this on bare metal but it might help there too. The cost of
disabling a device's irq must be less than the savings from avoiding irq
handling for this optimization to make sense.

[1] https://lore.kernel.org/linux-block/[email protected]/

Stefan Hajnoczi (3):
cpuidle: add poll_source API
virtio: add poll_source virtqueue polling
softirq: participate in cpuidle polling

drivers/cpuidle/Makefile | 1 +
drivers/virtio/virtio_pci_common.h | 7 ++
include/linux/interrupt.h | 2 +
include/linux/poll_source.h | 53 +++++++++++++++
include/linux/virtio.h | 2 +
include/linux/virtio_config.h | 2 +
drivers/cpuidle/poll_source.c | 102 +++++++++++++++++++++++++++++
drivers/cpuidle/poll_state.c | 6 ++
drivers/virtio/virtio.c | 34 ++++++++++
drivers/virtio/virtio_pci_common.c | 86 ++++++++++++++++++++++++
drivers/virtio/virtio_pci_modern.c | 2 +
kernel/softirq.c | 14 ++++
12 files changed, 311 insertions(+)
create mode 100644 include/linux/poll_source.h
create mode 100644 drivers/cpuidle/poll_source.c

--
2.31.1


2021-07-13 16:20:52

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 2/3] virtio: add poll_source virtqueue polling

VIRTIO drivers can cheaply disable interrupts by setting
RING_EVENT_FLAGS_DISABLE in the packed virtqueues's Driver Event
Suppression structure. See "2.7.10 Driver and Device Event Suppression"
in the VIRTIO 1.1 specification for details
(https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html).

Add a per-virtqueue poll_source that disables events in ->start(),
processes completed virtqueue buffers in ->poll(), and re-enables events
in ->stop().

This optimization avoids interrupt injection during cpuidle polling.
Workloads that submit requests and then halt the vCPU until completion
benefit from this.

To enable this optimization:

1. Enable the cpuidle haltpoll driver:
https://www.kernel.org/doc/html/latest/virt/guest-halt-polling.html

2. Enable poll_source on the virtio device:
# echo -n 1 > /sys/block/vda/device/poll_source

Note that this feature currently as no effect on split virtqueues when
VIRTIO_F_EVENT_IDX is negotiated. It may be possible to tweak
virtqueue_disable_cb_split() but I haven't attempted it here.

This optimization has been benchmarked successfully with virtio-blk
devices. Currently it does not improve virtio-net performance, probably
because it doesn't combine with NAPI polling.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/virtio/virtio_pci_common.h | 7 +++
include/linux/virtio.h | 2 +
include/linux/virtio_config.h | 2 +
drivers/virtio/virtio.c | 34 ++++++++++++
drivers/virtio/virtio_pci_common.c | 86 ++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_modern.c | 2 +
6 files changed, 133 insertions(+)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index beec047a8f8d..630746043183 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -21,6 +21,7 @@
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/interrupt.h>
+#include <linux/poll_source.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>
@@ -38,6 +39,9 @@ struct virtio_pci_vq_info {

/* MSI-X vector (or none) */
unsigned msix_vector;
+
+ /* the cpuidle poll_source */
+ struct poll_source poll_source;
};

/* Our device structure */
@@ -102,6 +106,9 @@ static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
return container_of(vdev, struct virtio_pci_device, vdev);
}

+/* enable poll_source API for vq polling */
+int vp_enable_poll_source(struct virtio_device *vdev, bool enable);
+
/* wait for pending irq handlers */
void vp_synchronize_vectors(struct virtio_device *vdev);
/* the notify function used when creating a virt queue */
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..baaa3c8fadb1 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -93,6 +93,7 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
* virtio_device - representation of a device using virtio
* @index: unique position on the virtio bus
* @failed: saved value for VIRTIO_CONFIG_S_FAILED bit (for restore)
+ * @poll_source_enabled: poll_source API enabled for vq polling
* @config_enabled: configuration change reporting enabled
* @config_change_pending: configuration change reported while disabled
* @config_lock: protects configuration change reporting
@@ -107,6 +108,7 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
struct virtio_device {
int index;
bool failed;
+ bool poll_source_enabled;
bool config_enabled;
bool config_change_pending;
spinlock_t config_lock;
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 8519b3ae5d52..5fb78d56fd44 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -72,6 +72,7 @@ struct virtio_shm_region {
* @set_vq_affinity: set the affinity for a virtqueue (optional).
* @get_vq_affinity: get the affinity for a virtqueue (optional).
* @get_shm_region: get a shared memory region based on the index.
+ * @enable_poll_source: enable/disable poll_source API vq polling (optional).
*/
typedef void vq_callback_t(struct virtqueue *);
struct virtio_config_ops {
@@ -97,6 +98,7 @@ struct virtio_config_ops {
int index);
bool (*get_shm_region)(struct virtio_device *vdev,
struct virtio_shm_region *region, u8 id);
+ int (*enable_poll_source)(struct virtio_device *dev, bool enable);
};

/* If driver didn't advertise the feature, it will never appear. */
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 4b15c00c0a0a..22fee71bbe34 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -59,12 +59,44 @@ static ssize_t features_show(struct device *_d,
}
static DEVICE_ATTR_RO(features);

+static ssize_t poll_source_show(struct device *_d,
+ struct device_attribute *attr, char *buf)
+{
+ struct virtio_device *dev = dev_to_virtio(_d);
+ return sprintf(buf, "%d\n", dev->poll_source_enabled);
+}
+
+static ssize_t poll_source_store(struct device *_d,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct virtio_device *dev = dev_to_virtio(_d);
+ bool val;
+ int rc;
+
+ rc = kstrtobool(buf, &val);
+ if (rc)
+ return rc;
+
+ if (val == dev->poll_source_enabled)
+ return count;
+ if (!dev->config->enable_poll_source)
+ return -ENOTSUPP;
+
+ rc = dev->config->enable_poll_source(dev, val);
+ if (rc)
+ return rc;
+ return count;
+}
+static DEVICE_ATTR_RW(poll_source);
+
static struct attribute *virtio_dev_attrs[] = {
&dev_attr_device.attr,
&dev_attr_vendor.attr,
&dev_attr_status.attr,
&dev_attr_modalias.attr,
&dev_attr_features.attr,
+ &dev_attr_poll_source.attr,
NULL,
};
ATTRIBUTE_GROUPS(virtio_dev);
@@ -343,6 +375,8 @@ int register_virtio_device(struct virtio_device *dev)
dev->index = err;
dev_set_name(&dev->dev, "virtio%u", dev->index);

+ dev->poll_source_enabled = false;
+
spin_lock_init(&dev->config_lock);
dev->config_enabled = false;
dev->config_change_pending = false;
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 222d630c41fc..6de372e12afb 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -24,6 +24,83 @@ MODULE_PARM_DESC(force_legacy,
"Force legacy mode for transitional virtio 1 devices");
#endif

+static void vp_poll_source_start(struct poll_source *src)
+{
+ struct virtio_pci_vq_info *info =
+ container_of(src, struct virtio_pci_vq_info, poll_source);
+
+ /* This API does not require a lock */
+ virtqueue_disable_cb(info->vq);
+}
+
+static void vp_poll_source_stop(struct poll_source *src)
+{
+ struct virtio_pci_vq_info *info =
+ container_of(src, struct virtio_pci_vq_info, poll_source);
+
+ /* Poll one last time in case */
+ /* TODO allow driver to provide spinlock */
+ if (!virtqueue_enable_cb(info->vq))
+ vring_interrupt(0 /* ignored */, info->vq);
+}
+
+static void vp_poll_source_poll(struct poll_source *src)
+{
+ struct virtio_pci_vq_info *info =
+ container_of(src, struct virtio_pci_vq_info, poll_source);
+
+ vring_interrupt(0 /* ignored */, info->vq);
+}
+
+static const struct poll_source_ops vp_poll_source_ops = {
+ .start = vp_poll_source_start,
+ .stop = vp_poll_source_stop,
+ .poll = vp_poll_source_poll,
+};
+
+/* call this when irq affinity changes to update cpuidle poll_source */
+/* TODO this function waits for a smp_call_function_single() completion, is that allowed in all caller contexts? */
+/* TODO this function is not thread-safe, do all callers hold the same lock? */
+static int vp_update_poll_source(struct virtio_device *vdev, int index)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ struct poll_source *src = &vp_dev->vqs[index]->poll_source;
+ const struct cpumask *mask;
+
+ if (!list_empty(&src->node))
+ poll_source_unregister(src);
+
+ if (!vdev->poll_source_enabled)
+ return 0;
+
+ mask = vp_get_vq_affinity(vdev, index);
+ if (!mask)
+ return -ENOTTY;
+
+ /* Update the poll_source cpu */
+ src->cpu = cpumask_first(mask);
+
+ /* Don't use poll_source if interrupts are handled on multiple CPUs */
+ if (cpumask_next(src->cpu, mask) < nr_cpu_ids)
+ return 0;
+
+ return poll_source_register(src);
+}
+
+/* TODO add this to virtio_pci_legacy config ops? */
+int vp_enable_poll_source(struct virtio_device *vdev, bool enable)
+{
+ struct virtqueue *vq;
+
+ vdev->poll_source_enabled = enable;
+
+ /* TODO locking */
+ list_for_each_entry(vq, &vdev->vqs, list) {
+ vp_update_poll_source(vdev, vq->index); /* TODO handle errors? */
+ }
+ return 0;
+}
+
/* wait for pending irq handlers */
void vp_synchronize_vectors(struct virtio_device *vdev)
{
@@ -186,6 +263,9 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned index,
if (!info)
return ERR_PTR(-ENOMEM);

+ info->poll_source.ops = &vp_poll_source_ops;
+ INIT_LIST_HEAD(&info->poll_source.node);
+
vq = vp_dev->setup_vq(vp_dev, info, index, callback, name, ctx,
msix_vec);
if (IS_ERR(vq))
@@ -237,6 +317,7 @@ void vp_del_vqs(struct virtio_device *vdev)
int irq = pci_irq_vector(vp_dev->pci_dev, v);

irq_set_affinity_hint(irq, NULL);
+ vp_update_poll_source(vdev, vq->index);
free_irq(irq, vq);
}
}
@@ -342,6 +423,9 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
vqs[i]);
if (err)
goto error_find;
+
+ if (desc)
+ vp_update_poll_source(vdev, i);
}
return 0;

@@ -440,6 +524,8 @@ int vp_set_vq_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
cpumask_copy(mask, cpu_mask);
irq_set_affinity_hint(irq, mask);
}
+
+ vp_update_poll_source(vdev, vq->index);
}
return 0;
}
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 30654d3a0b41..417d568590f2 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -394,6 +394,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
.get_shm_region = vp_get_shm_region,
+ .enable_poll_source = vp_enable_poll_source,
};

static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -411,6 +412,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
.get_shm_region = vp_get_shm_region,
+ .enable_poll_source = vp_enable_poll_source,
};

/* the PCI probing function */
--
2.31.1

2021-07-13 16:21:00

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 1/3] cpuidle: add poll_source API

Introduce an API for adding cpuidle poll callbacks:

struct poll_source_ops {
void (*start)(struct poll_source *src);
void (*stop)(struct poll_source *src);
void (*poll)(struct poll_source *src);
};

int poll_source_register(struct poll_source *src);
int poll_source_unregister(struct poll_source *src);

When cpuidle enters the poll state it invokes ->start() and then invokes
->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
when the busy wait loop finishes.

The ->poll() function should check for activity and cause
TIF_NEED_RESCHED to be set in order to stop the busy wait loop.

This API is intended to be used by drivers that can cheaply poll for
events. Participating in cpuidle polling allows them to avoid interrupt
latencies during periods where the CPU is going to poll anyway.

Note that each poll_source is bound to a particular CPU. The API is
mainly intended to by used by drivers that have multiple queues with irq
affinity.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/cpuidle/Makefile | 1 +
include/linux/poll_source.h | 53 +++++++++++++++++++
drivers/cpuidle/poll_source.c | 99 +++++++++++++++++++++++++++++++++++
drivers/cpuidle/poll_state.c | 6 +++
4 files changed, 159 insertions(+)
create mode 100644 include/linux/poll_source.h
create mode 100644 drivers/cpuidle/poll_source.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index 26bbc5e74123..994f72d6fe95 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
+obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_source.o
obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o

##################################################################################
diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
new file mode 100644
index 000000000000..ccfb424e170b
--- /dev/null
+++ b/include/linux/poll_source.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * poll_source.h - cpuidle busy waiting API
+ */
+#ifndef __LINUX_POLLSOURCE_H__
+#define __LINUX_POLLSOURCE_H__
+
+#include <linux/list.h>
+
+struct poll_source;
+
+struct poll_source_ops {
+ void (*start)(struct poll_source *src);
+ void (*stop)(struct poll_source *src);
+ void (*poll)(struct poll_source *src);
+};
+
+struct poll_source {
+ const struct poll_source_ops *ops;
+ struct list_head node;
+ int cpu;
+};
+
+/**
+ * poll_source_register - Add a poll_source for a CPU
+ */
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+int poll_source_register(struct poll_source *src);
+#else
+static inline int poll_source_register(struct poll_source *src)
+{
+ return 0;
+}
+#endif
+
+/**
+ * poll_source_unregister - Remove a previously registered poll_source
+ */
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+int poll_source_unregister(struct poll_source *src);
+#else
+static inline int poll_source_unregister(struct poll_source *src)
+{
+ return 0;
+}
+#endif
+
+/* Used by the cpuidle driver */
+void poll_source_start(void);
+void poll_source_run_once(void);
+void poll_source_stop(void);
+
+#endif /* __LINUX_POLLSOURCE_H__ */
diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
new file mode 100644
index 000000000000..46100e5a71e4
--- /dev/null
+++ b/drivers/cpuidle/poll_source.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * poll_source.c - cpuidle busy waiting API
+ */
+
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+#include <linux/poll_source.h>
+
+/* The per-cpu list of registered poll sources */
+DEFINE_PER_CPU(struct list_head, poll_source_list);
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_start(void)
+{
+ struct poll_source *src;
+
+ list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+ src->ops->start(src);
+}
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_run_once(void)
+{
+ struct poll_source *src;
+
+ list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+ src->ops->poll(src);
+}
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_stop(void)
+{
+ struct poll_source *src;
+
+ list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+ src->ops->stop(src);
+}
+
+static void poll_source_register_this_cpu(void *opaque)
+{
+ struct poll_source *src = opaque;
+
+ lockdep_assert_irqs_disabled();
+
+ list_add_tail(&src->node, this_cpu_ptr(&poll_source_list));
+}
+
+int poll_source_register(struct poll_source *src)
+{
+ if (!list_empty(&src->node))
+ return -EBUSY;
+
+ /*
+ * There is no race with src->cpu iterating over poll_source_list
+ * because smp_call_function_single() just sets TIF_NEED_RESCHED
+ * instead of sending an IPI during idle.
+ */
+ /* TODO but what happens if the flag isn't set yet when smp_call_function_single() is invoked? */
+ return smp_call_function_single(src->cpu,
+ poll_source_register_this_cpu,
+ src,
+ 1);
+}
+EXPORT_SYMBOL_GPL(poll_source_register);
+
+static void poll_source_unregister_this_cpu(void *opaque)
+{
+ struct poll_source *src = opaque;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * See comment in poll_source_register() about why this does not race
+ * with the idle CPU iterating over poll_source_list.
+ */
+ list_del_init(&src->node);
+}
+
+int poll_source_unregister(struct poll_source *src)
+{
+ return smp_call_function_single(src->cpu,
+ poll_source_unregister_this_cpu,
+ src,
+ 1);
+}
+EXPORT_SYMBOL_GPL(poll_source_unregister);
+
+/* TODO what happens when a CPU goes offline? */
+static int __init poll_source_init(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(poll_source_list, i));
+
+ return 0;
+}
+core_initcall(poll_source_init);
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index f7e83613ae94..aa26870034ac 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/sched/idle.h>
+#include <linux/poll_source.h>

#define POLL_IDLE_RELAX_COUNT 200

@@ -22,9 +23,12 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
unsigned int loop_count = 0;
u64 limit;

+ poll_source_start();
+
limit = cpuidle_poll_time(drv, dev);

while (!need_resched()) {
+ poll_source_run_once();
cpu_relax();
if (loop_count++ < POLL_IDLE_RELAX_COUNT)
continue;
@@ -35,6 +39,8 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
break;
}
}
+
+ poll_source_stop();
}
current_clr_polling();

--
2.31.1

2021-07-13 16:21:38

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 3/3] softirq: participate in cpuidle polling

Normally softirqs are invoked when exiting irqs. When polling in the
cpuidle driver there may be no irq. Therefore pending softirqs go
unnoticed and polling continues without invoking them.

Add a softirq_poll() function to explicitly check for and invoke
softirqs.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
This commit is not needed for virtio-blk. I added it when I realized
virtio-net's NAPI scheduling might not be detected by the cpuidle busy
wait loop because it is unaware of softirqs. However, even after doing
this virtio-net's NAPI polling doesn't combine with cpuidle haltpoll.

Perhaps this patch is still desirable for cpuidle poll_state in case a
softirq is raised?
---
include/linux/interrupt.h | 2 ++
drivers/cpuidle/poll_source.c | 3 +++
kernel/softirq.c | 14 ++++++++++++++
3 files changed, 19 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 4777850a6dc7..9bfdcc466ba8 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -573,6 +573,8 @@ struct softirq_action
asmlinkage void do_softirq(void);
asmlinkage void __do_softirq(void);

+extern void softirq_poll(void);
+
extern void open_softirq(int nr, void (*action)(struct softirq_action *));
extern void softirq_init(void);
extern void __raise_softirq_irqoff(unsigned int nr);
diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
index 46100e5a71e4..ed200feb0daa 100644
--- a/drivers/cpuidle/poll_source.c
+++ b/drivers/cpuidle/poll_source.c
@@ -6,6 +6,7 @@
#include <linux/lockdep.h>
#include <linux/percpu.h>
#include <linux/poll_source.h>
+#include <linux/interrupt.h>

/* The per-cpu list of registered poll sources */
DEFINE_PER_CPU(struct list_head, poll_source_list);
@@ -26,6 +27,8 @@ void poll_source_run_once(void)

list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
src->ops->poll(src);
+
+ softirq_poll();
}

/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 4992853ef53d..f45bf0204218 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -611,6 +611,20 @@ void irq_enter(void)
irq_enter_rcu();
}

+/**
+ * softirq_poll() - invoke pending softirqs
+ *
+ * Normally it is not necessary to explicitly poll for softirqs, but in the
+ * cpuidle driver a polling function may have raised a softirq with no irq exit
+ * to invoke it. Therefore it is necessary to poll for pending softirqs and
+ * invoke them explicitly.
+ */
+void softirq_poll(void)
+{
+ if (!in_interrupt() && local_softirq_pending())
+ invoke_softirq();
+}
+
static inline void tick_irq_exit(void)
{
#ifdef CONFIG_NO_HZ_COMMON
--
2.31.1

2021-07-19 21:44:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC 1/3] cpuidle: add poll_source API

Hi Stefan,

On Tue, Jul 13, 2021 at 05:19:04PM +0100, Stefan Hajnoczi wrote:
> Introduce an API for adding cpuidle poll callbacks:
>
> struct poll_source_ops {
> void (*start)(struct poll_source *src);
> void (*stop)(struct poll_source *src);
> void (*poll)(struct poll_source *src);
> };
>
> int poll_source_register(struct poll_source *src);
> int poll_source_unregister(struct poll_source *src);
>
> When cpuidle enters the poll state it invokes ->start() and then invokes
> ->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
> when the busy wait loop finishes.
>
> The ->poll() function should check for activity and cause
> TIF_NEED_RESCHED to be set in order to stop the busy wait loop.
>
> This API is intended to be used by drivers that can cheaply poll for
> events. Participating in cpuidle polling allows them to avoid interrupt
> latencies during periods where the CPU is going to poll anyway.
>
> Note that each poll_source is bound to a particular CPU. The API is
> mainly intended to by used by drivers that have multiple queues with irq
> affinity.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> drivers/cpuidle/Makefile | 1 +
> include/linux/poll_source.h | 53 +++++++++++++++++++
> drivers/cpuidle/poll_source.c | 99 +++++++++++++++++++++++++++++++++++
> drivers/cpuidle/poll_state.c | 6 +++
> 4 files changed, 159 insertions(+)
> create mode 100644 include/linux/poll_source.h
> create mode 100644 drivers/cpuidle/poll_source.c
>
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index 26bbc5e74123..994f72d6fe95 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
> obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
> obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
> obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
> +obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_source.o
> obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
>
> ##################################################################################
> diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
> new file mode 100644
> index 000000000000..ccfb424e170b
> --- /dev/null
> +++ b/include/linux/poll_source.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * poll_source.h - cpuidle busy waiting API
> + */
> +#ifndef __LINUX_POLLSOURCE_H__
> +#define __LINUX_POLLSOURCE_H__
> +
> +#include <linux/list.h>
> +
> +struct poll_source;
> +
> +struct poll_source_ops {
> + void (*start)(struct poll_source *src);
> + void (*stop)(struct poll_source *src);
> + void (*poll)(struct poll_source *src);
> +};
> +
> +struct poll_source {
> + const struct poll_source_ops *ops;
> + struct list_head node;
> + int cpu;
> +};
> +
> +/**
> + * poll_source_register - Add a poll_source for a CPU
> + */
> +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> +int poll_source_register(struct poll_source *src);
> +#else
> +static inline int poll_source_register(struct poll_source *src)
> +{
> + return 0;
> +}
> +#endif
> +
> +/**
> + * poll_source_unregister - Remove a previously registered poll_source
> + */
> +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> +int poll_source_unregister(struct poll_source *src);
> +#else
> +static inline int poll_source_unregister(struct poll_source *src)
> +{
> + return 0;
> +}
> +#endif
> +
> +/* Used by the cpuidle driver */
> +void poll_source_start(void);
> +void poll_source_run_once(void);
> +void poll_source_stop(void);
> +
> +#endif /* __LINUX_POLLSOURCE_H__ */
> diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
> new file mode 100644
> index 000000000000..46100e5a71e4
> --- /dev/null
> +++ b/drivers/cpuidle/poll_source.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * poll_source.c - cpuidle busy waiting API
> + */
> +
> +#include <linux/lockdep.h>
> +#include <linux/percpu.h>
> +#include <linux/poll_source.h>
> +
> +/* The per-cpu list of registered poll sources */
> +DEFINE_PER_CPU(struct list_head, poll_source_list);
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_start(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->start(src);
> +}
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_run_once(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->poll(src);
> +}
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_stop(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->stop(src);
> +}
> +
> +static void poll_source_register_this_cpu(void *opaque)
> +{
> + struct poll_source *src = opaque;
> +
> + lockdep_assert_irqs_disabled();
> +
> + list_add_tail(&src->node, this_cpu_ptr(&poll_source_list));
> +}
> +
> +int poll_source_register(struct poll_source *src)
> +{
> + if (!list_empty(&src->node))
> + return -EBUSY;
> +
> + /*
> + * There is no race with src->cpu iterating over poll_source_list
> + * because smp_call_function_single() just sets TIF_NEED_RESCHED

Hum... what about

CPU-0 CPU-1

poll_source_start poll_source_register IPI'ing CPU-0

Perhaps a llist can be used?

> while (!need_resched()) {
> + poll_source_run_once();

Probably want to use static_key's for ->start, ->stop and ->poll?

2021-07-20 14:25:19

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 1/3] cpuidle: add poll_source API

On Mon, Jul 19, 2021 at 06:03:55PM -0300, Marcelo Tosatti wrote:
> Hi Stefan,
>
> On Tue, Jul 13, 2021 at 05:19:04PM +0100, Stefan Hajnoczi wrote:
> > Introduce an API for adding cpuidle poll callbacks:
> >
> > struct poll_source_ops {
> > void (*start)(struct poll_source *src);
> > void (*stop)(struct poll_source *src);
> > void (*poll)(struct poll_source *src);
> > };
> >
> > int poll_source_register(struct poll_source *src);
> > int poll_source_unregister(struct poll_source *src);
> >
> > When cpuidle enters the poll state it invokes ->start() and then invokes
> > ->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
> > when the busy wait loop finishes.
> >
> > The ->poll() function should check for activity and cause
> > TIF_NEED_RESCHED to be set in order to stop the busy wait loop.
> >
> > This API is intended to be used by drivers that can cheaply poll for
> > events. Participating in cpuidle polling allows them to avoid interrupt
> > latencies during periods where the CPU is going to poll anyway.
> >
> > Note that each poll_source is bound to a particular CPU. The API is
> > mainly intended to by used by drivers that have multiple queues with irq
> > affinity.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > drivers/cpuidle/Makefile | 1 +
> > include/linux/poll_source.h | 53 +++++++++++++++++++
> > drivers/cpuidle/poll_source.c | 99 +++++++++++++++++++++++++++++++++++
> > drivers/cpuidle/poll_state.c | 6 +++
> > 4 files changed, 159 insertions(+)
> > create mode 100644 include/linux/poll_source.h
> > create mode 100644 drivers/cpuidle/poll_source.c
> >
> > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > index 26bbc5e74123..994f72d6fe95 100644
> > --- a/drivers/cpuidle/Makefile
> > +++ b/drivers/cpuidle/Makefile
> > @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
> > obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
> > obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
> > obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
> > +obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_source.o
> > obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
> >
> > ##################################################################################
> > diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
> > new file mode 100644
> > index 000000000000..ccfb424e170b
> > --- /dev/null
> > +++ b/include/linux/poll_source.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > + * poll_source.h - cpuidle busy waiting API
> > + */
> > +#ifndef __LINUX_POLLSOURCE_H__
> > +#define __LINUX_POLLSOURCE_H__
> > +
> > +#include <linux/list.h>
> > +
> > +struct poll_source;
> > +
> > +struct poll_source_ops {
> > + void (*start)(struct poll_source *src);
> > + void (*stop)(struct poll_source *src);
> > + void (*poll)(struct poll_source *src);
> > +};
> > +
> > +struct poll_source {
> > + const struct poll_source_ops *ops;
> > + struct list_head node;
> > + int cpu;
> > +};
> > +
> > +/**
> > + * poll_source_register - Add a poll_source for a CPU
> > + */
> > +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> > +int poll_source_register(struct poll_source *src);
> > +#else
> > +static inline int poll_source_register(struct poll_source *src)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > +/**
> > + * poll_source_unregister - Remove a previously registered poll_source
> > + */
> > +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> > +int poll_source_unregister(struct poll_source *src);
> > +#else
> > +static inline int poll_source_unregister(struct poll_source *src)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > +/* Used by the cpuidle driver */
> > +void poll_source_start(void);
> > +void poll_source_run_once(void);
> > +void poll_source_stop(void);
> > +
> > +#endif /* __LINUX_POLLSOURCE_H__ */
> > diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
> > new file mode 100644
> > index 000000000000..46100e5a71e4
> > --- /dev/null
> > +++ b/drivers/cpuidle/poll_source.c
> > @@ -0,0 +1,99 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * poll_source.c - cpuidle busy waiting API
> > + */
> > +
> > +#include <linux/lockdep.h>
> > +#include <linux/percpu.h>
> > +#include <linux/poll_source.h>
> > +
> > +/* The per-cpu list of registered poll sources */
> > +DEFINE_PER_CPU(struct list_head, poll_source_list);
> > +
> > +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> > +void poll_source_start(void)
> > +{
> > + struct poll_source *src;
> > +
> > + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> > + src->ops->start(src);
> > +}
> > +
> > +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> > +void poll_source_run_once(void)
> > +{
> > + struct poll_source *src;
> > +
> > + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> > + src->ops->poll(src);
> > +}
> > +
> > +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> > +void poll_source_stop(void)
> > +{
> > + struct poll_source *src;
> > +
> > + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> > + src->ops->stop(src);
> > +}
> > +
> > +static void poll_source_register_this_cpu(void *opaque)
> > +{
> > + struct poll_source *src = opaque;
> > +
> > + lockdep_assert_irqs_disabled();
> > +
> > + list_add_tail(&src->node, this_cpu_ptr(&poll_source_list));
> > +}
> > +
> > +int poll_source_register(struct poll_source *src)
> > +{
> > + if (!list_empty(&src->node))
> > + return -EBUSY;
> > +
> > + /*
> > + * There is no race with src->cpu iterating over poll_source_list
> > + * because smp_call_function_single() just sets TIF_NEED_RESCHED
>
> Hum... what about
>
> CPU-0 CPU-1
>
> poll_source_start poll_source_register IPI'ing CPU-0
>
> Perhaps a llist can be used?

The simplest solution might be queue_work_on(src->cpu, system_hipri_wq,
&work) plus a completion. That way we know poll_source_start/stop/run()
are not executing and there can be no re-entrancy issues. The nice thing
is it's simple - no low-level tricks.

Those are the semantics I wanted with smp_call_function_single() but I
think I chose the wrong API :).

> > while (!need_resched()) {
> > + poll_source_run_once();
>
> Probably want to use static_key's for ->start, ->stop and ->poll?

Good idea, thanks.

Stefan


Attachments:
(No filename) (6.60 kB)
signature.asc (499.00 B)
Download all attachments

2021-07-21 03:32:11

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling


在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> These patches are not polished yet but I would like request feedback on this
> approach and share performance results with you.
>
> Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> latency for events that occur soon after the vCPU becomes idle.
>
> This patch series extends the cpuidle busy wait loop with the new poll_source
> API so drivers can participate in polling. Such polling-aware drivers disable
> their device's irq during the busy wait loop to avoid the cost of interrupts.
> This reduces latency further than regular cpuidle haltpoll, which still relies
> on irqs.
>
> Virtio drivers are modified to use the poll_source API so all virtio device
> types get this feature. The following virtio-blk fio benchmark results show the
> improvement:
>
> IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> before poll_source io_poll
> 4k randread 167102 186049 (+11%) 186654 (+11%)
> 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> 4k randrw 159520 177071 (+11%) 177928 (+11%)
>
> The comparison against io_poll shows that cpuidle poll_source achieves
> equivalent performance to the block layer's io_poll feature (which I
> implemented in a separate patch series [1]).
>
> The advantage of poll_source is that applications do not need to explicitly set
> the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> would have spent in cpuidle haltpoll anyway.
>
> The current series does not improve virtio-net. I haven't investigated deeply,
> but it is possible that NAPI and poll_source do not combine. See the final
> patch for a starting point on making the two work together.
>
> I have not tried this on bare metal but it might help there too. The cost of
> disabling a device's irq must be less than the savings from avoiding irq
> handling for this optimization to make sense.
>
> [1] https://lore.kernel.org/linux-block/[email protected]/


Hi Stefan:

Some questions:

1) What's the advantages of introducing polling at virtio level instead
of doing it at each subsystems? Polling in virtio level may only work
well if all (or most) of the devices are virtio
2) What's the advantages of using cpuidle instead of using a thread (and
leverage the scheduler)?
3) Any reason it's virtio_pci specific not a general virtio one?

Thanks

(Btw, do we need to cc scheduler guys?)


>
> Stefan Hajnoczi (3):
> cpuidle: add poll_source API
> virtio: add poll_source virtqueue polling
> softirq: participate in cpuidle polling
>
> drivers/cpuidle/Makefile | 1 +
> drivers/virtio/virtio_pci_common.h | 7 ++
> include/linux/interrupt.h | 2 +
> include/linux/poll_source.h | 53 +++++++++++++++
> include/linux/virtio.h | 2 +
> include/linux/virtio_config.h | 2 +
> drivers/cpuidle/poll_source.c | 102 +++++++++++++++++++++++++++++
> drivers/cpuidle/poll_state.c | 6 ++
> drivers/virtio/virtio.c | 34 ++++++++++
> drivers/virtio/virtio_pci_common.c | 86 ++++++++++++++++++++++++
> drivers/virtio/virtio_pci_modern.c | 2 +
> kernel/softirq.c | 14 ++++
> 12 files changed, 311 insertions(+)
> create mode 100644 include/linux/poll_source.h
> create mode 100644 drivers/cpuidle/poll_source.c
>

2021-07-21 09:48:19

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
>
> 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > These patches are not polished yet but I would like request feedback on this
> > approach and share performance results with you.
> >
> > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > latency for events that occur soon after the vCPU becomes idle.
> >
> > This patch series extends the cpuidle busy wait loop with the new poll_source
> > API so drivers can participate in polling. Such polling-aware drivers disable
> > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > This reduces latency further than regular cpuidle haltpoll, which still relies
> > on irqs.
> >
> > Virtio drivers are modified to use the poll_source API so all virtio device
> > types get this feature. The following virtio-blk fio benchmark results show the
> > improvement:
> >
> > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > before poll_source io_poll
> > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> >
> > The comparison against io_poll shows that cpuidle poll_source achieves
> > equivalent performance to the block layer's io_poll feature (which I
> > implemented in a separate patch series [1]).
> >
> > The advantage of poll_source is that applications do not need to explicitly set
> > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > would have spent in cpuidle haltpoll anyway.
> >
> > The current series does not improve virtio-net. I haven't investigated deeply,
> > but it is possible that NAPI and poll_source do not combine. See the final
> > patch for a starting point on making the two work together.
> >
> > I have not tried this on bare metal but it might help there too. The cost of
> > disabling a device's irq must be less than the savings from avoiding irq
> > handling for this optimization to make sense.
> >
> > [1] https://lore.kernel.org/linux-block/[email protected]/
>
>
> Hi Stefan:
>
> Some questions:
>
> 1) What's the advantages of introducing polling at virtio level instead of
> doing it at each subsystems? Polling in virtio level may only work well if
> all (or most) of the devices are virtio

I'm not sure I understand the question. cpuidle haltpoll benefits all
devices today, except it incurs interrupt latency. The poll_source API
eliminates the interrupt latency for drivers that can disable device
interrupts cheaply.

This patch adds poll_source to core virtio code so that all virtio
drivers get this feature for free. No driver-specific changes are
needed.

If you mean networking, block layer, etc by "subsystems" then there's
nothing those subsystems can do to help. Whether poll_source can be used
depends on the specific driver, not the subsystem. If you consider
drivers/virtio/ a subsystem, then that's exactly what the patch series
is doing.

> 2) What's the advantages of using cpuidle instead of using a thread (and
> leverage the scheduler)?

In order to combine with the existing cpuidle infrastructure. No new
polling loop is introduced and no additional CPU cycles are spent on
polling.

If cpuidle itself is converted to threads then poll_source would
automatically operate in a thread too, but this patch series doesn't
change how the core cpuidle code works.

> 3) Any reason it's virtio_pci specific not a general virtio one?

Good idea, it is possible to move the virtio_pci changes into virtio.c.

Other transports can't use this feature yet though. Only virtio_pci
supports vq irq affinity. But the code can be generic and if other
transports ever support vq irq affinity they'll get it for free.

> (Btw, do we need to cc scheduler guys?)

I'm not sure. This patch series doesn't change how cpuidle interacts
with the scheduler. The cpuidle maintainers can pull in more people, if
necessary.

Stefan


Attachments:
(No filename) (4.22 kB)
signature.asc (499.00 B)
Download all attachments

2021-07-22 09:06:22

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling


在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
>> 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
>>> These patches are not polished yet but I would like request feedback on this
>>> approach and share performance results with you.
>>>
>>> Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
>>> haltpoll driver is enabled inside a virtual machine. This reduces wakeup
>>> latency for events that occur soon after the vCPU becomes idle.
>>>
>>> This patch series extends the cpuidle busy wait loop with the new poll_source
>>> API so drivers can participate in polling. Such polling-aware drivers disable
>>> their device's irq during the busy wait loop to avoid the cost of interrupts.
>>> This reduces latency further than regular cpuidle haltpoll, which still relies
>>> on irqs.
>>>
>>> Virtio drivers are modified to use the poll_source API so all virtio device
>>> types get this feature. The following virtio-blk fio benchmark results show the
>>> improvement:
>>>
>>> IOPS (numjobs=4, iodepth=1, 4 virtqueues)
>>> before poll_source io_poll
>>> 4k randread 167102 186049 (+11%) 186654 (+11%)
>>> 4k randwrite 162204 181214 (+11%) 181850 (+12%)
>>> 4k randrw 159520 177071 (+11%) 177928 (+11%)
>>>
>>> The comparison against io_poll shows that cpuidle poll_source achieves
>>> equivalent performance to the block layer's io_poll feature (which I
>>> implemented in a separate patch series [1]).
>>>
>>> The advantage of poll_source is that applications do not need to explicitly set
>>> the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
>>> few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
>>> would have spent in cpuidle haltpoll anyway.
>>>
>>> The current series does not improve virtio-net. I haven't investigated deeply,
>>> but it is possible that NAPI and poll_source do not combine. See the final
>>> patch for a starting point on making the two work together.
>>>
>>> I have not tried this on bare metal but it might help there too. The cost of
>>> disabling a device's irq must be less than the savings from avoiding irq
>>> handling for this optimization to make sense.
>>>
>>> [1] https://lore.kernel.org/linux-block/[email protected]/
>>
>> Hi Stefan:
>>
>> Some questions:
>>
>> 1) What's the advantages of introducing polling at virtio level instead of
>> doing it at each subsystems? Polling in virtio level may only work well if
>> all (or most) of the devices are virtio
> I'm not sure I understand the question. cpuidle haltpoll benefits all
> devices today, except it incurs interrupt latency. The poll_source API
> eliminates the interrupt latency for drivers that can disable device
> interrupts cheaply.
>
> This patch adds poll_source to core virtio code so that all virtio
> drivers get this feature for free. No driver-specific changes are
> needed.
>
> If you mean networking, block layer, etc by "subsystems" then there's
> nothing those subsystems can do to help. Whether poll_source can be used
> depends on the specific driver, not the subsystem. If you consider
> drivers/virtio/ a subsystem, then that's exactly what the patch series
> is doing.


I meant, if we choose to use idle poll, we have some several choices:

1) bus level (e.g the virtio)
2) subsystem level (e.g the networking and block)

I'm not sure which one is better.


>
>> 2) What's the advantages of using cpuidle instead of using a thread (and
>> leverage the scheduler)?
> In order to combine with the existing cpuidle infrastructure. No new
> polling loop is introduced and no additional CPU cycles are spent on
> polling.
>
> If cpuidle itself is converted to threads then poll_source would
> automatically operate in a thread too, but this patch series doesn't
> change how the core cpuidle code works.


So networking subsystem can use NAPI busy polling in the process context
which means it can be leveraged by the scheduler.

I'm not sure it's a good idea to poll drivers for a specific bus in the
general cpu idle layer.

Another questions, are those numbers measured by APICV capable machine?

Virtio-net turns on the tx interrupts since 2 years ago. And we don't
see too much difference when measured with a APICV host.


>
>> 3) Any reason it's virtio_pci specific not a general virtio one?
> Good idea, it is possible to move the virtio_pci changes into virtio.c.
>
> Other transports can't use this feature yet though. Only virtio_pci
> supports vq irq affinity. But the code can be generic and if other
> transports ever support vq irq affinity they'll get it for free.


Yes.

Thanks


>
>> (Btw, do we need to cc scheduler guys?)
> I'm not sure. This patch series doesn't change how cpuidle interacts
> with the scheduler. The cpuidle maintainers can pull in more people, if
> necessary.
>
> Stefan

2021-07-26 15:19:49

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
>
> 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > These patches are not polished yet but I would like request feedback on this
> > > > approach and share performance results with you.
> > > >
> > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > latency for events that occur soon after the vCPU becomes idle.
> > > >
> > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > on irqs.
> > > >
> > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > improvement:
> > > >
> > > > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > before poll_source io_poll
> > > > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > > > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > > > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> > > >
> > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > equivalent performance to the block layer's io_poll feature (which I
> > > > implemented in a separate patch series [1]).
> > > >
> > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > would have spent in cpuidle haltpoll anyway.
> > > >
> > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > patch for a starting point on making the two work together.
> > > >
> > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > handling for this optimization to make sense.
> > > >
> > > > [1] https://lore.kernel.org/linux-block/[email protected]/
> > >
> > > Hi Stefan:
> > >
> > > Some questions:
> > >
> > > 1) What's the advantages of introducing polling at virtio level instead of
> > > doing it at each subsystems? Polling in virtio level may only work well if
> > > all (or most) of the devices are virtio
> > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > devices today, except it incurs interrupt latency. The poll_source API
> > eliminates the interrupt latency for drivers that can disable device
> > interrupts cheaply.
> >
> > This patch adds poll_source to core virtio code so that all virtio
> > drivers get this feature for free. No driver-specific changes are
> > needed.
> >
> > If you mean networking, block layer, etc by "subsystems" then there's
> > nothing those subsystems can do to help. Whether poll_source can be used
> > depends on the specific driver, not the subsystem. If you consider
> > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > is doing.
>
>
> I meant, if we choose to use idle poll, we have some several choices:
>
> 1) bus level (e.g the virtio)
> 2) subsystem level (e.g the networking and block)
>
> I'm not sure which one is better.

This API is intended to be driver- or bus-level. I don't think
subsystems can do very much since they don't know the hardware
capabilities (cheap interrupt disabling) and in most cases there's no
advantage of plumbing it through subsystems when drivers can call the
API directly.

> > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > leverage the scheduler)?
> > In order to combine with the existing cpuidle infrastructure. No new
> > polling loop is introduced and no additional CPU cycles are spent on
> > polling.
> >
> > If cpuidle itself is converted to threads then poll_source would
> > automatically operate in a thread too, but this patch series doesn't
> > change how the core cpuidle code works.
>
>
> So networking subsystem can use NAPI busy polling in the process context
> which means it can be leveraged by the scheduler.
>
> I'm not sure it's a good idea to poll drivers for a specific bus in the
> general cpu idle layer.

Why? Maybe because the cpuidle execution environment is a little special?

> Another questions, are those numbers measured by APICV capable machine?

Yes.

> Virtio-net turns on the tx interrupts since 2 years ago. And we don't see
> too much difference when measured with a APICV host.

My understand is NAPI always takes the first interrupt. Polling only
happens on subsequent rounds until there's no more work to do.

There seem to be multiple factors that would influence tx performance
like how full the tx queues are, whether more packets are sent during
NAPI polling, whether you're benchmarking a physical PCIe NIC or a
vhost_net software device, etc.

Regarding APICV and software devices, the benchmark results I posted
show that avoiding the interrupt injection helps even with APICV.

Stefan


Attachments:
(No filename) (5.59 kB)
signature.asc (499.00 B)
Download all attachments

2021-07-26 15:57:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> >
> > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > These patches are not polished yet but I would like request feedback on this
> > > > > approach and share performance results with you.
> > > > >
> > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > >
> > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > on irqs.
> > > > >
> > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > improvement:
> > > > >
> > > > > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > before poll_source io_poll
> > > > > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > > > > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > > > > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> > > > >
> > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > implemented in a separate patch series [1]).
> > > > >
> > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > would have spent in cpuidle haltpoll anyway.
> > > > >
> > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > patch for a starting point on making the two work together.
> > > > >
> > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > handling for this optimization to make sense.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-block/[email protected]/
> > > >
> > > > Hi Stefan:
> > > >
> > > > Some questions:
> > > >
> > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > all (or most) of the devices are virtio
> > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > devices today, except it incurs interrupt latency. The poll_source API
> > > eliminates the interrupt latency for drivers that can disable device
> > > interrupts cheaply.
> > >
> > > This patch adds poll_source to core virtio code so that all virtio
> > > drivers get this feature for free. No driver-specific changes are
> > > needed.
> > >
> > > If you mean networking, block layer, etc by "subsystems" then there's
> > > nothing those subsystems can do to help. Whether poll_source can be used
> > > depends on the specific driver, not the subsystem. If you consider
> > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > is doing.
> >
> >
> > I meant, if we choose to use idle poll, we have some several choices:
> >
> > 1) bus level (e.g the virtio)
> > 2) subsystem level (e.g the networking and block)
> >
> > I'm not sure which one is better.
>
> This API is intended to be driver- or bus-level. I don't think
> subsystems can do very much since they don't know the hardware
> capabilities (cheap interrupt disabling) and in most cases there's no
> advantage of plumbing it through subsystems when drivers can call the
> API directly.
>
> > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > leverage the scheduler)?
> > > In order to combine with the existing cpuidle infrastructure. No new
> > > polling loop is introduced and no additional CPU cycles are spent on
> > > polling.
> > >
> > > If cpuidle itself is converted to threads then poll_source would
> > > automatically operate in a thread too, but this patch series doesn't
> > > change how the core cpuidle code works.
> >
> >
> > So networking subsystem can use NAPI busy polling in the process context
> > which means it can be leveraged by the scheduler.
> >
> > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > general cpu idle layer.
>
> Why? Maybe because the cpuidle execution environment is a little special?

Well, this would be prone to abuse.

The time spent in that driver callback counts as CPU idle time while
it really is the driver running and there is not limit on how much
time the callback can take, while doing costly things in the idle loop
is generally avoided, because on wakeup the CPU needs to be available
to the task needing it as soon as possible. IOW, the callback
potentially add unbounded latency to the CPU wakeup path.

2021-07-26 16:41:21

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Mon, Jul 26, 2021 at 6:04 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> > On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <[email protected]> wrote:
> > >
> > > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > > approach and share performance results with you.
> > > > > > >
> > > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > > >
> > > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > > on irqs.
> > > > > > >
> > > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > > improvement:
> > > > > > >
> > > > > > > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > > before poll_source io_poll
> > > > > > > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > > > > > > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > > > > > > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> > > > > > >
> > > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > > implemented in a separate patch series [1]).
> > > > > > >
> > > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > > >
> > > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > > patch for a starting point on making the two work together.
> > > > > > >
> > > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > > handling for this optimization to make sense.
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-block/[email protected]/
> > > > > >
> > > > > > Hi Stefan:
> > > > > >
> > > > > > Some questions:
> > > > > >
> > > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > > all (or most) of the devices are virtio
> > > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > > eliminates the interrupt latency for drivers that can disable device
> > > > > interrupts cheaply.
> > > > >
> > > > > This patch adds poll_source to core virtio code so that all virtio
> > > > > drivers get this feature for free. No driver-specific changes are
> > > > > needed.
> > > > >
> > > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > > depends on the specific driver, not the subsystem. If you consider
> > > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > > is doing.
> > > >
> > > >
> > > > I meant, if we choose to use idle poll, we have some several choices:
> > > >
> > > > 1) bus level (e.g the virtio)
> > > > 2) subsystem level (e.g the networking and block)
> > > >
> > > > I'm not sure which one is better.
> > >
> > > This API is intended to be driver- or bus-level. I don't think
> > > subsystems can do very much since they don't know the hardware
> > > capabilities (cheap interrupt disabling) and in most cases there's no
> > > advantage of plumbing it through subsystems when drivers can call the
> > > API directly.
> > >
> > > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > > leverage the scheduler)?
> > > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > > polling.
> > > > >
> > > > > If cpuidle itself is converted to threads then poll_source would
> > > > > automatically operate in a thread too, but this patch series doesn't
> > > > > change how the core cpuidle code works.
> > > >
> > > >
> > > > So networking subsystem can use NAPI busy polling in the process context
> > > > which means it can be leveraged by the scheduler.
> > > >
> > > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > > general cpu idle layer.
> > >
> > > Why? Maybe because the cpuidle execution environment is a little special?
> >
> > Well, this would be prone to abuse.
> >
> > The time spent in that driver callback counts as CPU idle time while
> > it really is the driver running and there is not limit on how much
> > time the callback can take, while doing costly things in the idle loop
> > is generally avoided, because on wakeup the CPU needs to be available
> > to the task needing it as soon as possible. IOW, the callback
> > potentially add unbounded latency to the CPU wakeup path.
>
> How is this different from driver interrupt handlers running during
> cpuidle?

The time spent on handling interrupts does not count as CPU idle time.

2021-07-26 17:09:11

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > approach and share performance results with you.
> > > > > >
> > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > >
> > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > on irqs.
> > > > > >
> > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > improvement:
> > > > > >
> > > > > > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > before poll_source io_poll
> > > > > > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > > > > > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > > > > > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> > > > > >
> > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > implemented in a separate patch series [1]).
> > > > > >
> > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > >
> > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > patch for a starting point on making the two work together.
> > > > > >
> > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > handling for this optimization to make sense.
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-block/[email protected]/
> > > > >
> > > > > Hi Stefan:
> > > > >
> > > > > Some questions:
> > > > >
> > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > all (or most) of the devices are virtio
> > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > eliminates the interrupt latency for drivers that can disable device
> > > > interrupts cheaply.
> > > >
> > > > This patch adds poll_source to core virtio code so that all virtio
> > > > drivers get this feature for free. No driver-specific changes are
> > > > needed.
> > > >
> > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > depends on the specific driver, not the subsystem. If you consider
> > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > is doing.
> > >
> > >
> > > I meant, if we choose to use idle poll, we have some several choices:
> > >
> > > 1) bus level (e.g the virtio)
> > > 2) subsystem level (e.g the networking and block)
> > >
> > > I'm not sure which one is better.
> >
> > This API is intended to be driver- or bus-level. I don't think
> > subsystems can do very much since they don't know the hardware
> > capabilities (cheap interrupt disabling) and in most cases there's no
> > advantage of plumbing it through subsystems when drivers can call the
> > API directly.
> >
> > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > leverage the scheduler)?
> > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > polling.
> > > >
> > > > If cpuidle itself is converted to threads then poll_source would
> > > > automatically operate in a thread too, but this patch series doesn't
> > > > change how the core cpuidle code works.
> > >
> > >
> > > So networking subsystem can use NAPI busy polling in the process context
> > > which means it can be leveraged by the scheduler.
> > >
> > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > general cpu idle layer.
> >
> > Why? Maybe because the cpuidle execution environment is a little special?
>
> Well, this would be prone to abuse.
>
> The time spent in that driver callback counts as CPU idle time while
> it really is the driver running and there is not limit on how much
> time the callback can take, while doing costly things in the idle loop
> is generally avoided, because on wakeup the CPU needs to be available
> to the task needing it as soon as possible. IOW, the callback
> potentially add unbounded latency to the CPU wakeup path.

How is this different from driver interrupt handlers running during
cpuidle?

Stefan


Attachments:
(No filename) (5.92 kB)
signature.asc (499.00 B)
Download all attachments

2021-07-27 13:34:13

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

On Mon, Jul 26, 2021 at 06:37:13PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 6:04 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> > > On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > > > approach and share performance results with you.
> > > > > > > >
> > > > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > > > >
> > > > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > > > on irqs.
> > > > > > > >
> > > > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > > > improvement:
> > > > > > > >
> > > > > > > > IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > > > before poll_source io_poll
> > > > > > > > 4k randread 167102 186049 (+11%) 186654 (+11%)
> > > > > > > > 4k randwrite 162204 181214 (+11%) 181850 (+12%)
> > > > > > > > 4k randrw 159520 177071 (+11%) 177928 (+11%)
> > > > > > > >
> > > > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > > > implemented in a separate patch series [1]).
> > > > > > > >
> > > > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > > > >
> > > > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > > > patch for a starting point on making the two work together.
> > > > > > > >
> > > > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > > > handling for this optimization to make sense.
> > > > > > > >
> > > > > > > > [1] https://lore.kernel.org/linux-block/[email protected]/
> > > > > > >
> > > > > > > Hi Stefan:
> > > > > > >
> > > > > > > Some questions:
> > > > > > >
> > > > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > > > all (or most) of the devices are virtio
> > > > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > > > eliminates the interrupt latency for drivers that can disable device
> > > > > > interrupts cheaply.
> > > > > >
> > > > > > This patch adds poll_source to core virtio code so that all virtio
> > > > > > drivers get this feature for free. No driver-specific changes are
> > > > > > needed.
> > > > > >
> > > > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > > > depends on the specific driver, not the subsystem. If you consider
> > > > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > > > is doing.
> > > > >
> > > > >
> > > > > I meant, if we choose to use idle poll, we have some several choices:
> > > > >
> > > > > 1) bus level (e.g the virtio)
> > > > > 2) subsystem level (e.g the networking and block)
> > > > >
> > > > > I'm not sure which one is better.
> > > >
> > > > This API is intended to be driver- or bus-level. I don't think
> > > > subsystems can do very much since they don't know the hardware
> > > > capabilities (cheap interrupt disabling) and in most cases there's no
> > > > advantage of plumbing it through subsystems when drivers can call the
> > > > API directly.
> > > >
> > > > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > > > leverage the scheduler)?
> > > > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > > > polling.
> > > > > >
> > > > > > If cpuidle itself is converted to threads then poll_source would
> > > > > > automatically operate in a thread too, but this patch series doesn't
> > > > > > change how the core cpuidle code works.
> > > > >
> > > > >
> > > > > So networking subsystem can use NAPI busy polling in the process context
> > > > > which means it can be leveraged by the scheduler.
> > > > >
> > > > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > > > general cpu idle layer.
> > > >
> > > > Why? Maybe because the cpuidle execution environment is a little special?
> > >
> > > Well, this would be prone to abuse.
> > >
> > > The time spent in that driver callback counts as CPU idle time while
> > > it really is the driver running and there is not limit on how much
> > > time the callback can take, while doing costly things in the idle loop
> > > is generally avoided, because on wakeup the CPU needs to be available
> > > to the task needing it as soon as possible. IOW, the callback
> > > potentially add unbounded latency to the CPU wakeup path.
> >
> > How is this different from driver interrupt handlers running during
> > cpuidle?
>
> The time spent on handling interrupts does not count as CPU idle time.

Is that taken care of by account_hardirq_enter()?

Drivers using poll_source have two pieces:
1. A small piece of code that polls the device.
2. A larger piece of code that runs when polling succeeds. This is
basically the irq handler.

Would it be acceptable to run #1 from cpuidle but defer #2 so it's not
accounted as idle time?

Thanks,
Stefan


Attachments:
(No filename) (6.94 kB)
signature.asc (499.00 B)
Download all attachments