2017-04-24 14:03:17

by Daniel Lezcano

Subject: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

In the following changes, we track when interrupts occur in order to
statistically compute when the next interrupt is supposed to happen.

Among all the interrupts, it does not make sense to store the timer interrupt
occurrences and try to predict the next one, since we already know the
expiration time.

request_irq() takes an irq flags parameter and the timer drivers use it to
pass the IRQF_TIMER flag, letting us know the interrupt comes from a timer.
Based on this flag, we can discard these interrupts when tracking them.

However, the request_percpu_irq() API does not allow passing a flag, hence
there is no way to specify that the interrupt is a timer.

Add a function request_percpu_irq_flags() where we can specify the flags. The
request_percpu_irq() function becomes a wrapper calling
request_percpu_irq_flags() with a zero flags parameter.

Change the timer drivers using request_percpu_irq() to use
request_percpu_irq_flags() instead, with the IRQF_TIMER flag set.

For now, in order to prevent misuse of this parameter, only the IRQF_TIMER
flag (or zero) is a valid value to pass to request_percpu_irq_flags().

Signed-off-by: Daniel Lezcano <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Patrice Chotard <[email protected]>
Cc: Kukjin Kim <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>

---
Changelog:

V9:
- Clarified the patch description
- Fixed EXPORT_SYMBOL_GPL(request_percpu_irq_flags)
---
arch/arm/kernel/smp_twd.c | 3 ++-
drivers/clocksource/arc_timer.c | 4 ++--
drivers/clocksource/arm_arch_timer.c | 20 ++++++++++++--------
drivers/clocksource/arm_global_timer.c | 4 ++--
drivers/clocksource/exynos_mct.c | 7 ++++---
drivers/clocksource/qcom-timer.c | 4 ++--
drivers/clocksource/time-armada-370-xp.c | 9 +++++----
drivers/clocksource/timer-nps.c | 6 +++---
include/linux/interrupt.h | 11 ++++++++++-
kernel/irq/manage.c | 15 ++++++++++-----
virt/kvm/arm/arch_timer.c | 5 +++--
11 files changed, 55 insertions(+), 33 deletions(-)

diff --git a/arch/arm/kernel/smp_twd.c b/arch/arm/kernel/smp_twd.c
index 895ae51..ce9fdcf 100644
--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -332,7 +332,8 @@ static int __init twd_local_timer_common_register(struct device_node *np)
goto out_free;
}

- err = request_percpu_irq(twd_ppi, twd_handler, "twd", twd_evt);
+ err = request_percpu_irq_flags(twd_ppi, twd_handler, IRQF_TIMER, "twd",
+ twd_evt);
if (err) {
pr_err("twd: can't register interrupt %d (%d)\n", twd_ppi, err);
goto out_free;
diff --git a/drivers/clocksource/arc_timer.c b/drivers/clocksource/arc_timer.c
index 7517f95..993e6af 100644
--- a/drivers/clocksource/arc_timer.c
+++ b/drivers/clocksource/arc_timer.c
@@ -301,8 +301,8 @@ static int __init arc_clockevent_setup(struct device_node *node)
}

/* Needs apriori irq_set_percpu_devid() done in intc map function */
- ret = request_percpu_irq(arc_timer_irq, timer_irq_handler,
- "Timer0 (per-cpu-tick)", evt);
+ ret = request_percpu_irq_flags(arc_timer_irq, timer_irq_handler, IRQF_TIMER,
+ "Timer0 (per-cpu-tick)", evt);
if (ret) {
pr_err("clockevent: unable to request irq\n");
return ret;
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 7a8a411..d9d00b0 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -768,25 +768,29 @@ static int __init arch_timer_register(void)
ppi = arch_timer_ppi[arch_timer_uses_ppi];
switch (arch_timer_uses_ppi) {
case VIRT_PPI:
- err = request_percpu_irq(ppi, arch_timer_handler_virt,
- "arch_timer", arch_timer_evt);
+ err = request_percpu_irq_flags(ppi, arch_timer_handler_virt,
+ IRQF_TIMER, "arch_timer",
+ arch_timer_evt);
break;
case PHYS_SECURE_PPI:
case PHYS_NONSECURE_PPI:
- err = request_percpu_irq(ppi, arch_timer_handler_phys,
- "arch_timer", arch_timer_evt);
+ err = request_percpu_irq_flags(ppi, arch_timer_handler_phys,
+ IRQF_TIMER, "arch_timer",
+ arch_timer_evt);
if (!err && arch_timer_ppi[PHYS_NONSECURE_PPI]) {
ppi = arch_timer_ppi[PHYS_NONSECURE_PPI];
- err = request_percpu_irq(ppi, arch_timer_handler_phys,
- "arch_timer", arch_timer_evt);
+ err = request_percpu_irq_flags(ppi, arch_timer_handler_phys,
+ IRQF_TIMER, "arch_timer",
+ arch_timer_evt);
if (err)
free_percpu_irq(arch_timer_ppi[PHYS_SECURE_PPI],
arch_timer_evt);
}
break;
case HYP_PPI:
- err = request_percpu_irq(ppi, arch_timer_handler_phys,
- "arch_timer", arch_timer_evt);
+ err = request_percpu_irq_flags(ppi, arch_timer_handler_phys,
+ IRQF_TIMER, "arch_timer",
+ arch_timer_evt);
break;
default:
BUG();
diff --git a/drivers/clocksource/arm_global_timer.c b/drivers/clocksource/arm_global_timer.c
index 123ed20..5a72ec1 100644
--- a/drivers/clocksource/arm_global_timer.c
+++ b/drivers/clocksource/arm_global_timer.c
@@ -302,8 +302,8 @@ static int __init global_timer_of_register(struct device_node *np)
goto out_clk;
}

- err = request_percpu_irq(gt_ppi, gt_clockevent_interrupt,
- "gt", gt_evt);
+ err = request_percpu_irq_flags(gt_ppi, gt_clockevent_interrupt,
+ IRQF_TIMER, "gt", gt_evt);
if (err) {
pr_warn("global-timer: can't register interrupt %d (%d)\n",
gt_ppi, err);
diff --git a/drivers/clocksource/exynos_mct.c b/drivers/clocksource/exynos_mct.c
index 670ff0f..a48ca0f 100644
--- a/drivers/clocksource/exynos_mct.c
+++ b/drivers/clocksource/exynos_mct.c
@@ -524,9 +524,10 @@ static int __init exynos4_timer_resources(struct device_node *np, void __iomem *

if (mct_int_type == MCT_INT_PPI) {

- err = request_percpu_irq(mct_irqs[MCT_L0_IRQ],
- exynos4_mct_tick_isr, "MCT",
- &percpu_mct_tick);
+ err = request_percpu_irq_flags(mct_irqs[MCT_L0_IRQ],
+ exynos4_mct_tick_isr,
+ IRQF_TIMER, "MCT",
+ &percpu_mct_tick);
WARN(err, "MCT: can't request IRQ %d (%d)\n",
mct_irqs[MCT_L0_IRQ], err);
} else {
diff --git a/drivers/clocksource/qcom-timer.c b/drivers/clocksource/qcom-timer.c
index ee358cd..8e876fc 100644
--- a/drivers/clocksource/qcom-timer.c
+++ b/drivers/clocksource/qcom-timer.c
@@ -174,8 +174,8 @@ static int __init msm_timer_init(u32 dgt_hz, int sched_bits, int irq,
}

if (percpu)
- res = request_percpu_irq(irq, msm_timer_interrupt,
- "gp_timer", msm_evt);
+ res = request_percpu_irq_flags(irq, msm_timer_interrupt,
+ IRQF_TIMER, "gp_timer", msm_evt);

if (res) {
pr_err("request_percpu_irq failed\n");
diff --git a/drivers/clocksource/time-armada-370-xp.c b/drivers/clocksource/time-armada-370-xp.c
index 4440aef..7405e14 100644
--- a/drivers/clocksource/time-armada-370-xp.c
+++ b/drivers/clocksource/time-armada-370-xp.c
@@ -309,10 +309,11 @@ static int __init armada_370_xp_timer_common_init(struct device_node *np)
/*
* Setup clockevent timer (interrupt-driven).
*/
- res = request_percpu_irq(armada_370_xp_clkevt_irq,
- armada_370_xp_timer_interrupt,
- "armada_370_xp_per_cpu_tick",
- armada_370_xp_evt);
+ res = request_percpu_irq_flags(armada_370_xp_clkevt_irq,
+ armada_370_xp_timer_interrupt,
+ IRQF_TIMER,
+ "armada_370_xp_per_cpu_tick",
+ armada_370_xp_evt);
/* Immediately configure the timer on the boot CPU */
if (res) {
pr_err("Failed to request percpu irq");
diff --git a/drivers/clocksource/timer-nps.c b/drivers/clocksource/timer-nps.c
index da1f798..195f039 100644
--- a/drivers/clocksource/timer-nps.c
+++ b/drivers/clocksource/timer-nps.c
@@ -256,9 +256,9 @@ static int __init nps_setup_clockevent(struct device_node *node)
return ret;

/* Needs apriori irq_set_percpu_devid() done in intc map function */
- ret = request_percpu_irq(nps_timer0_irq, timer_irq_handler,
- "Timer0 (per-cpu-tick)",
- &nps_clockevent_device);
+ ret = request_percpu_irq_flags(nps_timer0_irq, timer_irq_handler,
+ IRQF_TIMER, "Timer0 (per-cpu-tick)",
+ &nps_clockevent_device);
if (ret) {
pr_err("Couldn't request irq\n");
clk_disable_unprepare(clk);
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 53144e7..8f44f23 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -152,8 +152,17 @@ struct irqaction {
unsigned long flags, const char *name, void *dev_id);

extern int __must_check
+request_percpu_irq_flags(unsigned int irq, irq_handler_t handler,
+ unsigned long flags, const char *devname,
+ void __percpu *percpu_dev_id);
+
+static inline int __must_check
request_percpu_irq(unsigned int irq, irq_handler_t handler,
- const char *devname, void __percpu *percpu_dev_id);
+ const char *devname, void __percpu *percpu_dev_id)
+{
+ return request_percpu_irq_flags(irq, handler, 0,
+ devname, percpu_dev_id);
+}

extern void free_irq(unsigned int, void *);
extern void free_percpu_irq(unsigned int, void __percpu *);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ae1c90f..1ba7734 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1951,9 +1951,10 @@ int setup_percpu_irq(unsigned int irq, struct irqaction *act)
}

/**
- * request_percpu_irq - allocate a percpu interrupt line
+ * request_percpu_irq_flags - allocate a percpu interrupt line
* @irq: Interrupt line to allocate
* @handler: Function to be called when the IRQ occurs.
+ * @flags: Interrupt type flags (IRQF_TIMER only)
* @devname: An ascii name for the claiming device
* @dev_id: A percpu cookie passed back to the handler function
*
@@ -1966,8 +1967,9 @@ int setup_percpu_irq(unsigned int irq, struct irqaction *act)
* the handler gets called with the interrupted CPU's instance of
* that variable.
*/
-int request_percpu_irq(unsigned int irq, irq_handler_t handler,
- const char *devname, void __percpu *dev_id)
+int request_percpu_irq_flags(unsigned int irq, irq_handler_t handler,
+ unsigned long flags, const char *devname,
+ void __percpu *dev_id)
{
struct irqaction *action;
struct irq_desc *desc;
@@ -1981,12 +1983,15 @@ int request_percpu_irq(unsigned int irq, irq_handler_t handler,
!irq_settings_is_per_cpu_devid(desc))
return -EINVAL;

+ if (flags && flags != IRQF_TIMER)
+ return -EINVAL;
+
action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
if (!action)
return -ENOMEM;

action->handler = handler;
- action->flags = IRQF_PERCPU | IRQF_NO_SUSPEND;
+ action->flags = flags | IRQF_PERCPU | IRQF_NO_SUSPEND;
action->name = devname;
action->percpu_dev_id = dev_id;

@@ -2007,7 +2012,7 @@ int request_percpu_irq(unsigned int irq, irq_handler_t handler,

return retval;
}
-EXPORT_SYMBOL_GPL(request_percpu_irq);
+EXPORT_SYMBOL_GPL(request_percpu_irq_flags);

/**
* irq_get_irqchip_state - returns the irqchip state of a interrupt.
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 35d7100..602e0a8 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
}

- err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
- "kvm guest timer", kvm_get_running_vcpus());
+ err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
+ IRQF_TIMER, "kvm guest timer",
+ kvm_get_running_vcpus());
if (err) {
kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
host_vtimer_irq, err);
--
1.9.1


2017-04-24 14:03:33

by Daniel Lezcano

Subject: [PATCH V9 3/3] irq: Compute the periodic interval for interrupts

An interrupt typically shows a burst of activity at a periodic time interval,
followed by one or two peaks with a longer interval.

As the time intervals are periodic, statistically speaking they follow a normal
distribution, and each interrupt can be tracked individually.

This patch computes statistics on all interrupts, except the timers, which are
deterministic by nature. The goal is to extract the periodicity of each
interrupt and add it to the last timestamp, giving us the next event.

Taking the earliest prediction gives the expected next wakeup of the system
(assuming no timer expires before then).

As stated in the previous patch, this code is not enabled in the kernel by
default.

Signed-off-by: Daniel Lezcano <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Nicolas Pitre <[email protected]>
---
Changelog:
V9:
- Deal with 48+16 bits encoded values
- Changed irq_stat => irqt_stat to prevent name collision on s390
- Changed div64 by constant IRQ_TIMINGS_SHIFT bit shift for average
- Changed div64 by constant IRQ_TIMINGS_SHIFT bit shift for variance
---
include/linux/interrupt.h | 1 +
kernel/irq/internals.h | 19 +++
kernel/irq/timings.c | 348 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 368 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 853aef7..5d4e43a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -715,6 +715,7 @@ static inline void init_irq_proc(void)
#ifdef CONFIG_IRQ_TIMINGS
void irq_timings_enable(void);
void irq_timings_disable(void);
+u64 irq_timings_next_event(u64 now);
#endif

struct seq_file;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index df51b5e0..1f56c3d 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -242,13 +242,21 @@ struct irq_timings {

DECLARE_PER_CPU(struct irq_timings, irq_timings);

+extern void irq_timings_free(int irq);
+extern int irq_timings_alloc(int irq);
+
static inline void remove_timings(struct irq_desc *desc)
{
desc->istate &= ~IRQS_TIMINGS;
+
+ irq_timings_free(irq_desc_get_irq(desc));
}

static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
{
+ int irq = irq_desc_get_irq(desc);
+ int ret;
+
/*
* We don't need the measurement because the idle code already
* knows the next expiry event.
@@ -256,6 +264,17 @@ static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
if (act->flags & __IRQF_TIMER)
return;

+ /*
+ * In case the timing allocation fails, we just want to warn,
+ * not fail, letting the system boot anyway.
+ */
+ ret = irq_timings_alloc(irq);
+ if (ret) {
+ pr_warn("Failed to allocate irq timing stats for irq%d (%d)",
+ irq, ret);
+ return;
+ }
+
desc->istate |= IRQS_TIMINGS;
}

diff --git a/kernel/irq/timings.c b/kernel/irq/timings.c
index 56cf687..04d62b3 100644
--- a/kernel/irq/timings.c
+++ b/kernel/irq/timings.c
@@ -9,9 +9,14 @@
*
*/
#include <linux/percpu.h>
+#include <linux/slab.h>
#include <linux/static_key.h>
#include <linux/interrupt.h>
+#include <linux/idr.h>
#include <linux/irq.h>
+#include <linux/math64.h>
+
+#include <trace/events/irq.h>

#include "internals.h"

@@ -19,6 +24,18 @@

DEFINE_PER_CPU(struct irq_timings, irq_timings);

+struct irqt_stat {
+ u64 ne; /* next event */
+ u64 lts; /* last timestamp */
+ u64 variance; /* variance */
+ u32 avg; /* mean value */
+ u32 count; /* number of samples */
+ int anomalies; /* number of consecutive anomalies */
+ int valid; /* behaviour of the interrupt */
+};
+
+static DEFINE_IDR(irqt_stats);
+
void irq_timings_enable(void)
{
static_branch_enable(&irq_timing_enabled);
@@ -28,3 +45,334 @@ void irq_timings_disable(void)
{
static_branch_disable(&irq_timing_enabled);
}
+
+/**
+ * irqs_update - update the irq timing statistics with a new timestamp
+ *
+ * @irqs: an irqt_stat struct pointer
+ * @ts: the new timestamp
+ *
+ * ** This function must be called with the local irq disabled **
+ *
+ * The statistics are computed online, in other words, the code is
+ * designed to compute the statistics on a stream of values rather
+ * than doing multiple passes on the values to compute the average,
+ * then the variance. The integer division introduces a loss of
+ * precision but with an acceptable error margin compared with the
+ * results we would get with double precision floating point: we are
+ * dealing with nanoseconds, so big numbers, consequently the mantissa
+ * is negligible, especially when converting the time to microseconds
+ * afterwards.
+ *
+ * The computation happens at idle time. When the CPU is not idle, the
+ * interrupts' timestamps are stored in the circular buffer, when the
+ * CPU goes idle and this routine is called, all the buffer's values
+ * are injected into the statistical model, continuing to extend the
+ * statistics from the previous busy-idle cycle.
+ *
+ * The observations showed a device will trigger a burst of periodic
+ * interrupts followed by one or two peaks of longer time, for
+ * instance when a SD card device flushes its cache, then the periodic
+ * intervals occur again. A one second inactivity period resets the
+ * stats, which gives us the certainty the statistical values won't
+ * exceed 1x10^9, thus the computation won't overflow.
+ *
+ * Basically, the purpose of the algorithm is to watch the periodic
+ * interrupts and eliminate the peaks.
+ *
+ * An interrupt is considered periodically stable if the intervals
+ * between its occurrences follow the normal distribution, thus the
+ * values comply with:
+ *
+ * avg - 3 x stddev < value < avg + 3 x stddev
+ *
+ * Which can be simplified to:
+ *
+ * -3 x stddev < value - avg < 3 x stddev
+ *
+ * abs(value - avg) < 3 x stddev
+ *
+ * In order to save a costly square root computation, we use the
+ * variance. For the record, stddev = sqrt(variance). The equation
+ * above becomes:
+ *
+ * abs(value - avg) < 3 x sqrt(variance)
+ *
+ * And finally we square it:
+ *
+ * (value - avg) ^ 2 < (3 x sqrt(variance)) ^ 2
+ *
+ * (value - avg) x (value - avg) < 9 x variance
+ *
+ * Statistically speaking, any value outside this interval is
+ * considered an anomaly and is discarded. However, a normal
+ * distribution appears when the number of samples is at least 30
+ * (the rule of thumb in statistics, cf. "30 samples" on the
+ * Internet). When there are three consecutive anomalies, the
+ * statistics are reset.
+ *
+ */
+static void irqs_update(struct irqt_stat *irqs, u64 ts)
+{
+ u64 old_ts = irqs->lts;
+ u64 variance = 0;
+ u64 interval;
+ s64 diff;
+
+ /*
+ * The timestamps are absolute time values, we need to compute
+ * the timing interval between two interrupts.
+ */
+ irqs->lts = ts;
+
+ /*
+ * The interval type is u64 in order to deal with the same
+ * type in our computation, which prevents subtle issues with
+ * overflow, sign and division.
+ */
+ interval = ts - old_ts;
+
+ /*
+ * The interrupt triggered more than one second after the
+ * previous one, ending the predictable sequence. In this
+ * case, assume we have the beginning of a sequence and the
+ * timestamp is the first value. As it is impossible to
+ * predict anything at this point, return.
+ *
+ * Note the first timestamp of the sequence will always fall
+ * in this test because the old_ts is zero. That is what we
+ * want as we need another timestamp to compute an interval.
+ */
+ if (interval >= NSEC_PER_SEC) {
+ memset(irqs, 0, sizeof(*irqs));
+ irqs->lts = ts;
+ return;
+ }
+
+ /*
+ * Pre-compute the delta with the average as the result is
+ * used several times in this function.
+ */
+ diff = interval - irqs->avg;
+
+ /*
+ * Increment the number of samples.
+ */
+ irqs->count++;
+
+ /*
+ * Online variance divided by the number of elements if there
+ * is more than one sample. Normally the formula is division
+ * by count - 1 but we assume the number of elements will be
+ * more than 32, and dividing by 32 instead of 31 is precise
+ * enough.
+ */
+ if (likely(irqs->count > 1))
+ variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
+
+ /*
+ * The rule of thumb in statistics for the normal distribution
+ * is having at least 30 samples in order for the model to
+ * apply. Values outside the interval are considered an
+ * anomaly.
+ */
+ if ((irqs->count >= 30) && ((diff * diff) > (9 * variance))) {
+ /*
+ * After three consecutive anomalies, we reset the
+ * stats as it is no longer stable enough.
+ */
+ if (irqs->anomalies++ >= 3) {
+ memset(irqs, 0, sizeof(*irqs));
+ irqs->lts = ts;
+ return;
+ }
+ } else {
+ /*
+ * The anomalies must be consecutive, so at this
+ * point, we reset the anomalies counter.
+ */
+ irqs->anomalies = 0;
+ }
+
+ /*
+ * The interrupt is considered stable enough to try to predict
+ * the next event on it.
+ */
+ irqs->valid = 1;
+
+ /*
+ * Online average algorithm:
+ *
+ * new_average = average + ((value - average) / count)
+ *
+ * The variance computation depends on the new average
+ * to be computed here first.
+ *
+ */
+ irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
+
+ /*
+ * Online variance algorithm:
+ *
+ * new_variance = variance + (value - average) x (value - new_average)
+ *
+ * Warning: irqs->avg is updated with the line above, hence
+ * 'interval - irqs->avg' is no longer equal to 'diff'
+ */
+ irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
+
+ /*
+ * Update the next event
+ */
+ irqs->ne = ts + irqs->avg;
+}
+
+/**
+ * irq_timings_next_event - Return when the next event is supposed to arrive
+ *
+ * *** This function must be called with the local irq disabled ***
+ *
+ * During the last busy cycle, the number of interrupts is incremented
+ * and stored in the irq_timings structure. This information is
+ * necessary to:
+ *
+ * - know if the index in the table wrapped up:
+ *
+ * If more interrupts than the array size happened during the
+ * last busy/idle cycle, the index wrapped around and we have to
+ * begin with the next element in the array, which is the oldest
+ * one in the sequence; otherwise we begin at index 0.
+ *
+ * - have an indication of the interrupts activity on this CPU
+ * (eg. irq/sec)
+ *
+ * The values are 'consumed' after being inserted into the
+ * statistical model, thus the count is reinitialized.
+ *
+ * The array of values **must** be browsed in the time direction, the
+ * timestamp must increase between an element and the next one.
+ *
+ * Returns a nanosecond-based estimate of the earliest expected
+ * interrupt, U64_MAX otherwise.
+ */
+u64 irq_timings_next_event(u64 now)
+{
+ struct irq_timings *irqts = this_cpu_ptr(&irq_timings);
+ struct irqt_stat *irqs;
+ struct irqt_stat __percpu *s;
+ u64 ts, ne = U64_MAX;
+ int index, count, i, irq = 0;
+
+ /*
+ * Number of elements in the circular buffer. If it happens it
+ * was flushed before, then the number of elements could be
+ * smaller than IRQ_TIMINGS_SIZE, so the count is used,
+ * otherwise the array size is used as we wrapped. The index
+ * begins from zero when we did not wrap. That could be done
+ * in a nicer way with the proper circular array structure
+ * type but with the cost of extra computation in the
+ * interrupt handler hot path. We choose efficiency.
+ */
+ if (irqts->count >= IRQ_TIMINGS_SIZE) {
+ count = IRQ_TIMINGS_SIZE;
+ index = irqts->count & IRQ_TIMINGS_MASK;
+ } else {
+ count = irqts->count;
+ index = 0;
+ }
+
+ /*
+ * Inject measured irq/timestamp to the statistical model.
+ */
+ for (i = 0; i < count; i++) {
+
+ ts = irqts->values[(index + i) & IRQ_TIMINGS_MASK];
+
+ irq_timing_decode(ts, &ts, &irq);
+
+ s = idr_find(&irqt_stats, irq);
+ if (s) {
+ irqs = this_cpu_ptr(s);
+ irqs_update(irqs, ts);
+ }
+ }
+
+ /*
+ * Reset the counter, we consumed all the data from our
+ * circular buffer.
+ */
+ irqts->count = 0;
+
+ /*
+ * Look for the earliest next event in the list of interrupts'
+ * statistics.
+ */
+ idr_for_each_entry(&irqt_stats, s, i) {
+
+ irqs = this_cpu_ptr(s);
+
+ if (!irqs->valid)
+ continue;
+
+ if (irqs->ne <= now) {
+ irq = i;
+ ne = now;
+
+ /*
+ * This interrupt must not be used in the future
+ * until new events occur and update the
+ * statistics.
+ */
+ irqs->valid = 0;
+ break;
+ }
+
+ if (irqs->ne < ne) {
+ irq = i;
+ ne = irqs->ne;
+ }
+ }
+
+ return ne;
+}
+
+void irq_timings_free(int irq)
+{
+ struct irqt_stat __percpu *s;
+
+ s = idr_find(&irqt_stats, irq);
+ if (s) {
+ free_percpu(s);
+ idr_remove(&irqt_stats, irq);
+ }
+}
+
+int irq_timings_alloc(int irq)
+{
+ int id;
+ struct irqt_stat __percpu *s;
+
+ /*
+ * Some platforms can have the same private interrupt per cpu,
+ * so this function may be called several times with the
+ * same interrupt number. Just bail out in case the per cpu
+ * stat structure is already allocated.
+ */
+ s = idr_find(&irqt_stats, irq);
+ if (s)
+ return 0;
+
+ s = alloc_percpu(*s);
+ if (!s)
+ return -ENOMEM;
+
+ idr_preload(GFP_KERNEL);
+ id = idr_alloc(&irqt_stats, s, irq, irq + 1, GFP_NOWAIT);
+ idr_preload_end();
+
+ if (id < 0) {
+ free_percpu(s);
+ return id;
+ }
+
+ return 0;
+}
--
1.9.1

2017-04-24 14:03:44

by Daniel Lezcano

Subject: [PATCH V9 2/3] irq: Track the interrupt timings

The interrupt framework gives a lot of information about each interrupt.

It does not keep track of when those interrupts occur though.

This patch provides a means to record the timestamp of each interrupt
occurrence in a per-CPU circular buffer, to help predict the next occurrence
using a statistical model.

Each CPU can store IRQ_TIMINGS_SIZE events <irq, timestamp>; the current
value of IRQ_TIMINGS_SIZE is 32.

Each event is encoded into a single u64, where the upper 48 bits hold the
timestamp and the lower 16 bits hold the irq number.

A static key is introduced so that when irq prediction is switched off at
runtime, the overhead is reduced to near zero.

As a result, most of the code lives in internals.h for inlining reasons, and
very little in the new file timings.c. The latter will grow in the next patch,
which provides the statistical model for the next event prediction.

Note this code is by default *not* compiled in the kernel.

Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Nicolas Pitre <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
V9:
- Changed indentation level by inverting the static key condition
- Encoded interrupt and timestamp into a u64 variable
- Boolean enable instead of refcount for the static key
V8:
- Replaced percpu field in the irqdesc by a percpu array containing the
timings and the associated irq. The function irq_timings_get_next() is no
longer needed, so it is removed
- Removed all unused code resulting from the conversion irqdesc->percpu
timings storage
V7:
- Mentioned in the irq_timings_get_next() function description,
the function must be called inside a rcu read locked section
V6:
- Renamed handle_irq_timings to record_irq_time
- Stored the event time instead of the interval time
- Removed the 'timestamp' field from the timings structure
- Moved _handle_irq_timings content inside record_irq_time
V5:
- Changed comment about 'deterministic' as the comment is confusing
- Added license comment in the header
- Replaced irq_timings_get/put by irq_timings_enable/disable
- Moved IRQS_TIMINGS check in the handle_timings inline function
- Dropped 'if !prev' as it is pointless
- Stored time interval in nsec basis with u64 instead of u32
- Removed redundant store
- Removed the math
V4:
- Added a static key
- Added more comments for irq_timings_get_next()
- Unified some function names to be prefixed by 'irq_timings_...'
- Fixed a rebase error
V3:
- Replaced ktime_get() by local_clock()
- Shared irq are not handled
- Simplified code by adding the timing in the irqdesc struct
- Added a function to browse the irq timings
V2:
- Fixed kerneldoc comment
- Removed data field from the struct irq timing
- Changed the lock section comment
- Removed semi-colon style with empty stub
- Replaced macro by static inline
- Fixed static functions declaration
RFC:
- initial posting
---
include/linux/interrupt.h | 5 +++
kernel/irq/Kconfig | 3 ++
kernel/irq/Makefile | 1 +
kernel/irq/handle.c | 2 ++
kernel/irq/internals.h | 84 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/irq/manage.c | 3 ++
kernel/irq/timings.c | 30 +++++++++++++++++
7 files changed, 128 insertions(+)
create mode 100644 kernel/irq/timings.c

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 8f44f23..853aef7 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -712,6 +712,11 @@ static inline void init_irq_proc(void)
}
#endif

+#ifdef CONFIG_IRQ_TIMINGS
+void irq_timings_enable(void);
+void irq_timings_disable(void);
+#endif
+
struct seq_file;
int show_interrupts(struct seq_file *p, void *v);
int arch_show_interrupts(struct seq_file *p, int prec);
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 3bbfd6a..38e551d 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -81,6 +81,9 @@ config GENERIC_MSI_IRQ_DOMAIN
config HANDLE_DOMAIN_IRQ
bool

+config IRQ_TIMINGS
+ bool
+
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 1d3ee31..efb5f14 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_PM_SLEEP) += pm.o
obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
obj-$(CONFIG_SMP) += affinity.o
+obj-$(CONFIG_IRQ_TIMINGS) += timings.o
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index d3f2490..eb4d3e8 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -138,6 +138,8 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
unsigned int irq = desc->irq_data.irq;
struct irqaction *action;

+ record_irq_time(desc);
+
for_each_action_of_desc(desc, action) {
irqreturn_t res;

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index bc226e7..df51b5e0 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -8,6 +8,7 @@
#include <linux/irqdesc.h>
#include <linux/kernel_stat.h>
#include <linux/pm_runtime.h>
+#include <linux/sched/clock.h>

#ifdef CONFIG_SPARSE_IRQ
# define IRQ_BITMAP_BITS (NR_IRQS + 8196)
@@ -57,6 +58,7 @@ enum {
IRQS_WAITING = 0x00000080,
IRQS_PENDING = 0x00000200,
IRQS_SUSPENDED = 0x00000800,
+ IRQS_TIMINGS = 0x00001000,
};

#include "debug.h"
@@ -226,3 +228,85 @@ static inline int irq_desc_is_chained(struct irq_desc *desc)
static inline void
irq_pm_remove_action(struct irq_desc *desc, struct irqaction *action) { }
#endif
+
+#ifdef CONFIG_IRQ_TIMINGS
+
+#define IRQ_TIMINGS_SHIFT 5
+#define IRQ_TIMINGS_SIZE (1 << IRQ_TIMINGS_SHIFT)
+#define IRQ_TIMINGS_MASK (IRQ_TIMINGS_SIZE - 1)
+
+struct irq_timings {
+ u64 values[IRQ_TIMINGS_SIZE]; /* our circular buffer */
+ unsigned int count; /* Number of interrupts since the last inspection */
+};
+
+DECLARE_PER_CPU(struct irq_timings, irq_timings);
+
+static inline void remove_timings(struct irq_desc *desc)
+{
+ desc->istate &= ~IRQS_TIMINGS;
+}
+
+static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
+{
+ /*
+ * We don't need the measurement because the idle code already
+ * knows the next expiry event.
+ */
+ if (act->flags & __IRQF_TIMER)
+ return;
+
+ desc->istate |= IRQS_TIMINGS;
+}
+
+extern void irq_timings_enable(void);
+extern void irq_timings_disable(void);
+
+extern struct static_key_false irq_timing_enabled;
+
+/*
+ * The interrupt number and the timestamp are encoded into a single
+ * u64 variable to optimize the size.
+ * A 48 bit timestamp and a 16 bit IRQ number are way sufficient.
+ * Who cares about an IRQ after 78 hours of idle time?
+ */
+static inline u64 irq_timing_encode(u64 timestamp, int irq)
+{
+ return (timestamp << 16) | irq;
+}
+
+static inline void irq_timing_decode(u64 value, u64 *timestamp, int *irq)
+{
+ *timestamp = value >> 16;
+ *irq = value & U16_MAX;
+}
+
+/*
+ * The function record_irq_time is only called in one place in the
+ * interrupts handler. We want this function always inline so the code
+ * inside is embedded in the function and the static key branching
+ * code can act at the higher level. Without the explicit
+ * __always_inline we can end up with a function call and a small
+ * overhead in the hotpath for nothing.
+ */
+static __always_inline void record_irq_time(struct irq_desc *desc)
+{
+ if (!static_branch_likely(&irq_timing_enabled))
+ return;
+
+ if (desc->istate & IRQS_TIMINGS) {
+ struct irq_timings *timings = this_cpu_ptr(&irq_timings);
+
+ timings->values[timings->count & IRQ_TIMINGS_MASK] =
+ irq_timing_encode(local_clock(),
+ irq_desc_get_irq(desc));
+
+ timings->count++;
+ }
+}
+#else
+static inline void remove_timings(struct irq_desc *desc) {}
+static inline void setup_timings(struct irq_desc *desc,
+ struct irqaction *act) {}
+static inline void record_irq_time(struct irq_desc *desc) {}
+#endif /* CONFIG_IRQ_TIMINGS */
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1ba7734..2686845 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1372,6 +1372,8 @@ static void irq_release_resources(struct irq_desc *desc)

raw_spin_unlock_irqrestore(&desc->lock, flags);

+ setup_timings(desc, new);
+
/*
* Strictly no need to wake it up, but hung_task complains
* when no hard interrupt wakes the thread up.
@@ -1500,6 +1502,7 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
irq_settings_clr_disable_unlazy(desc);
irq_shutdown(desc);
irq_release_resources(desc);
+ remove_timings(desc);
}

#ifdef CONFIG_SMP
diff --git a/kernel/irq/timings.c b/kernel/irq/timings.c
new file mode 100644
index 0000000..56cf687
--- /dev/null
+++ b/kernel/irq/timings.c
@@ -0,0 +1,30 @@
+/*
+ * linux/kernel/irq/timings.c
+ *
+ * Copyright (C) 2016, Linaro Ltd - Daniel Lezcano <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/static_key.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+
+#include "internals.h"
+
+DEFINE_STATIC_KEY_FALSE(irq_timing_enabled);
+
+DEFINE_PER_CPU(struct irq_timings, irq_timings);
+
+void irq_timings_enable(void)
+{
+ static_branch_enable(&irq_timing_enabled);
+}
+
+void irq_timings_disable(void)
+{
+ static_branch_disable(&irq_timing_enabled);
+}
--
1.9.1
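As a side note for readers of the archive: the encoding scheme in the patch above (a 48-bit local_clock() timestamp and a 16-bit IRQ number packed into one u64, stored in a power-of-two circular buffer indexed through IRQ_TIMINGS_MASK) can be exercised in a standalone userspace sketch. The names mirror the kernel helpers, but this is an illustration built on <stdint.h> types, not kernel code:

```c
/*
 * Standalone sketch of the patch's encoding scheme: a 48-bit
 * timestamp and a 16-bit IRQ number packed into a single u64.
 * Mirrors the kernel helpers above, but runnable in userspace.
 */
#include <assert.h>
#include <stdint.h>

#define IRQ_TIMINGS_SHIFT	5
#define IRQ_TIMINGS_SIZE	(1 << IRQ_TIMINGS_SHIFT)
#define IRQ_TIMINGS_MASK	(IRQ_TIMINGS_SIZE - 1)

static inline uint64_t irq_timing_encode(uint64_t timestamp, int irq)
{
	/* Low 16 bits carry the IRQ number, the rest the timestamp. */
	return (timestamp << 16) | (uint64_t)irq;
}

static inline void irq_timing_decode(uint64_t value, uint64_t *timestamp,
				     int *irq)
{
	*timestamp = value >> 16;
	*irq = (int)(value & 0xffff);
}
```

A count of, say, 37 recorded events wraps to slot 37 & IRQ_TIMINGS_MASK == 5 of the 32-entry buffer, which is how record_irq_time() indexes timings->values.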

2017-04-24 18:33:50

by Krzysztof Kozlowski

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Mon, Apr 24, 2017 at 04:01:31PM +0200, Daniel Lezcano wrote:
> In the next changes, we track when the interrupts occur in order to
> statistically compute when is supposed to happen the next interrupt.
>
> In all the interruptions, it does not make sense to store the timer interrupt
> occurences and try to predict the next interrupt as when know the expiration
> time.
>
> The request_irq() has a irq flags parameter and the timer drivers use it to
> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
> Based on this flag, we can discard these interrupts when tracking them.
>
> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
> if the interrupt type is a timer.
>
> Add a function request_percpu_irq_flags() where we can specify the flags. The
> request_percpu_irq() function is changed to be a wrapper to
> request_percpu_irq_flags() passing a zero flag parameter.
>
> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
> instead with the IRQF_TIMER flag set.
>
> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> flag (or zero) is a valid parameter to be passed to the
> request_percpu_irq_flags() function.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Vineet Gupta <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Patrice Chotard <[email protected]>
> Cc: Kukjin Kim <[email protected]>
> Cc: Krzysztof Kozlowski <[email protected]>
> Cc: Javier Martinez Canillas <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
>
> ---
> Changelog:
>
> V9:
> - Clarified the patch description
> - Fixed EXPORT_SYMBOL_GPL(request_percpu_irq_flags)
> ---
> arch/arm/kernel/smp_twd.c | 3 ++-
> drivers/clocksource/arc_timer.c | 4 ++--
> drivers/clocksource/arm_arch_timer.c | 20 ++++++++++++--------
> drivers/clocksource/arm_global_timer.c | 4 ++--
> drivers/clocksource/exynos_mct.c | 7 ++++---
> drivers/clocksource/qcom-timer.c | 4 ++--
> drivers/clocksource/time-armada-370-xp.c | 9 +++++----
> drivers/clocksource/timer-nps.c | 6 +++---
> include/linux/interrupt.h | 11 ++++++++++-
> kernel/irq/manage.c | 15 ++++++++++-----
> virt/kvm/arm/arch_timer.c | 5 +++--
> 11 files changed, 55 insertions(+), 33 deletions(-)
>

For exynos-mct:
Acked-by: Krzysztof Kozlowski <[email protected]>

Best regards,
Krzysztof

2017-04-24 18:46:55

by Marc Zyngier

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 24/04/17 15:01, Daniel Lezcano wrote:
> In the next changes, we track when the interrupts occur in order to
> statistically compute when is supposed to happen the next interrupt.
>
> In all the interruptions, it does not make sense to store the timer interrupt
> occurences and try to predict the next interrupt as when know the expiration
> time.
>
> The request_irq() has a irq flags parameter and the timer drivers use it to
> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
> Based on this flag, we can discard these interrupts when tracking them.
>
> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
> if the interrupt type is a timer.
>
> Add a function request_percpu_irq_flags() where we can specify the flags. The
> request_percpu_irq() function is changed to be a wrapper to
> request_percpu_irq_flags() passing a zero flag parameter.
>
> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
> instead with the IRQF_TIMER flag set.
>
> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> flag (or zero) is a valid parameter to be passed to the
> request_percpu_irq_flags() function.

[...]

> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 35d7100..602e0a8 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
> }
>
> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
> - "kvm guest timer", kvm_get_running_vcpus());
> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
> + IRQF_TIMER, "kvm guest timer",
> + kvm_get_running_vcpus());
> if (err) {
> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
> host_vtimer_irq, err);
>

How is that useful? This timer is controlled by the guest OS, and not
the host kernel. Can you explain how you intend to make use of that
information in this case?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2017-04-24 18:59:26

by Daniel Lezcano

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
> On 24/04/17 15:01, Daniel Lezcano wrote:
> > In the next changes, we track when the interrupts occur in order to
> > statistically compute when is supposed to happen the next interrupt.
> >
> > In all the interruptions, it does not make sense to store the timer interrupt
> > occurences and try to predict the next interrupt as when know the expiration
> > time.
> >
> > The request_irq() has a irq flags parameter and the timer drivers use it to
> > pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
> > Based on this flag, we can discard these interrupts when tracking them.
> >
> > But, the API request_percpu_irq does not allow to pass a flag, hence specifying
> > if the interrupt type is a timer.
> >
> > Add a function request_percpu_irq_flags() where we can specify the flags. The
> > request_percpu_irq() function is changed to be a wrapper to
> > request_percpu_irq_flags() passing a zero flag parameter.
> >
> > Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
> > instead with the IRQF_TIMER flag set.
> >
> > For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> > flag (or zero) is a valid parameter to be passed to the
> > request_percpu_irq_flags() function.
>
> [...]
>
> > diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> > index 35d7100..602e0a8 100644
> > --- a/virt/kvm/arm/arch_timer.c
> > +++ b/virt/kvm/arm/arch_timer.c
> > @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
> > host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
> > }
> >
> > - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
> > - "kvm guest timer", kvm_get_running_vcpus());
> > + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
> > + IRQF_TIMER, "kvm guest timer",
> > + kvm_get_running_vcpus());
> > if (err) {
> > kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
> > host_vtimer_irq, err);
> >
>
> How is that useful? This timer is controlled by the guest OS, and not
> the host kernel. Can you explain how you intend to make use of that
> information in this case?

Isn't it a source of interrupts on the host kernel?

--

<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2017-04-24 19:15:09

by Marc Zyngier

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 24/04/17 19:59, Daniel Lezcano wrote:
> On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
>> On 24/04/17 15:01, Daniel Lezcano wrote:
>>> In the next changes, we track when the interrupts occur in order to
>>> statistically compute when is supposed to happen the next interrupt.
>>>
>>> In all the interruptions, it does not make sense to store the timer interrupt
>>> occurences and try to predict the next interrupt as when know the expiration
>>> time.
>>>
>>> The request_irq() has a irq flags parameter and the timer drivers use it to
>>> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
>>> Based on this flag, we can discard these interrupts when tracking them.
>>>
>>> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
>>> if the interrupt type is a timer.
>>>
>>> Add a function request_percpu_irq_flags() where we can specify the flags. The
>>> request_percpu_irq() function is changed to be a wrapper to
>>> request_percpu_irq_flags() passing a zero flag parameter.
>>>
>>> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
>>> instead with the IRQF_TIMER flag set.
>>>
>>> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
>>> flag (or zero) is a valid parameter to be passed to the
>>> request_percpu_irq_flags() function.
>>
>> [...]
>>
>>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
>>> index 35d7100..602e0a8 100644
>>> --- a/virt/kvm/arm/arch_timer.c
>>> +++ b/virt/kvm/arm/arch_timer.c
>>> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
>>> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
>>> }
>>>
>>> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
>>> - "kvm guest timer", kvm_get_running_vcpus());
>>> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
>>> + IRQF_TIMER, "kvm guest timer",
>>> + kvm_get_running_vcpus());
>>> if (err) {
>>> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
>>> host_vtimer_irq, err);
>>>
>>
>> How is that useful? This timer is controlled by the guest OS, and not
>> the host kernel. Can you explain how you intend to make use of that
>> information in this case?
>
> Isn't it a source of interruption on the host kernel?

Only to cause an exit of the VM, and not under the control of the host.
This isn't triggering any timer-related action in the host code either.

Your patch series seems to assume some kind of predictability of the
timer interrupt, which can make sense on the host. Here, this interrupt
is shared among *all* guests running on this system.

Maybe you could explain why you think this interrupt is relevant to what
you're trying to achieve?

Thanks,

M.

2017-04-24 20:00:09

by Daniel Lezcano

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Mon, Apr 24, 2017 at 08:14:54PM +0100, Marc Zyngier wrote:
> On 24/04/17 19:59, Daniel Lezcano wrote:
> > On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
> >> On 24/04/17 15:01, Daniel Lezcano wrote:
> >>> In the next changes, we track when the interrupts occur in order to
> >>> statistically compute when is supposed to happen the next interrupt.
> >>>
> >>> In all the interruptions, it does not make sense to store the timer interrupt
> >>> occurences and try to predict the next interrupt as when know the expiration
> >>> time.
> >>>
> >>> The request_irq() has a irq flags parameter and the timer drivers use it to
> >>> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
> >>> Based on this flag, we can discard these interrupts when tracking them.
> >>>
> >>> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
> >>> if the interrupt type is a timer.
> >>>
> >>> Add a function request_percpu_irq_flags() where we can specify the flags. The
> >>> request_percpu_irq() function is changed to be a wrapper to
> >>> request_percpu_irq_flags() passing a zero flag parameter.
> >>>
> >>> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
> >>> instead with the IRQF_TIMER flag set.
> >>>
> >>> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> >>> flag (or zero) is a valid parameter to be passed to the
> >>> request_percpu_irq_flags() function.
> >>
> >> [...]
> >>
> >>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> >>> index 35d7100..602e0a8 100644
> >>> --- a/virt/kvm/arm/arch_timer.c
> >>> +++ b/virt/kvm/arm/arch_timer.c
> >>> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
> >>> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
> >>> }
> >>>
> >>> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
> >>> - "kvm guest timer", kvm_get_running_vcpus());
> >>> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
> >>> + IRQF_TIMER, "kvm guest timer",
> >>> + kvm_get_running_vcpus());
> >>> if (err) {
> >>> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
> >>> host_vtimer_irq, err);
> >>>
> >>
> >> How is that useful? This timer is controlled by the guest OS, and not
> >> the host kernel. Can you explain how you intend to make use of that
> >> information in this case?
> >
> > Isn't it a source of interruption on the host kernel?
>
> Only to cause an exit of the VM, and not under the control of the host.
> This isn't triggering any timer related action on the host code either.
>
> Your patch series seems to assume some kind of predictability of the
> timer interrupt, which can make sense on the host. Here, this interrupt
> is shared among *all* guests running on this system.
>
> Maybe you could explain why you think this interrupt is relevant to what
> you're trying to achieve?

If this interrupt does not happen on the host, we don't care.

The flag IRQF_TIMER is used by the spurious irq handler in the try_one_irq()
function. However, the per-CPU timer interrupt will already have been discarded
earlier in that function because it is per-CPU.

IMO, for consistency reasons, adding IRQF_TIMER makes sense. Other than that,
as the interrupt does not happen on the host, this flag won't be used.

Do you want to drop this change?

-- Daniel




2017-04-25 07:39:17

by Marc Zyngier

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 24/04/17 20:59, Daniel Lezcano wrote:
> On Mon, Apr 24, 2017 at 08:14:54PM +0100, Marc Zyngier wrote:
>> On 24/04/17 19:59, Daniel Lezcano wrote:
>>> On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
>>>> On 24/04/17 15:01, Daniel Lezcano wrote:
>>>>> In the next changes, we track when the interrupts occur in order to
>>>>> statistically compute when is supposed to happen the next interrupt.
>>>>>
>>>>> In all the interruptions, it does not make sense to store the timer interrupt
>>>>> occurences and try to predict the next interrupt as when know the expiration
>>>>> time.
>>>>>
>>>>> The request_irq() has a irq flags parameter and the timer drivers use it to
>>>>> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
>>>>> Based on this flag, we can discard these interrupts when tracking them.
>>>>>
>>>>> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
>>>>> if the interrupt type is a timer.
>>>>>
>>>>> Add a function request_percpu_irq_flags() where we can specify the flags. The
>>>>> request_percpu_irq() function is changed to be a wrapper to
>>>>> request_percpu_irq_flags() passing a zero flag parameter.
>>>>>
>>>>> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
>>>>> instead with the IRQF_TIMER flag set.
>>>>>
>>>>> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
>>>>> flag (or zero) is a valid parameter to be passed to the
>>>>> request_percpu_irq_flags() function.
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
>>>>> index 35d7100..602e0a8 100644
>>>>> --- a/virt/kvm/arm/arch_timer.c
>>>>> +++ b/virt/kvm/arm/arch_timer.c
>>>>> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
>>>>> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
>>>>> }
>>>>>
>>>>> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
>>>>> - "kvm guest timer", kvm_get_running_vcpus());
>>>>> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
>>>>> + IRQF_TIMER, "kvm guest timer",
>>>>> + kvm_get_running_vcpus());
>>>>> if (err) {
>>>>> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
>>>>> host_vtimer_irq, err);
>>>>>
>>>>
>>>> How is that useful? This timer is controlled by the guest OS, and not
>>>> the host kernel. Can you explain how you intend to make use of that
>>>> information in this case?
>>>
>>> Isn't it a source of interruption on the host kernel?
>>
>> Only to cause an exit of the VM, and not under the control of the host.
>> This isn't triggering any timer related action on the host code either.
>>
>> Your patch series seems to assume some kind of predictability of the
>> timer interrupt, which can make sense on the host. Here, this interrupt
>> is shared among *all* guests running on this system.
>>
>> Maybe you could explain why you think this interrupt is relevant to what
>> you're trying to achieve?
>
> If this interrupt does not happen on the host, we don't care.

All interrupts happen on the host. There is no such thing as a HW
interrupt being directly delivered to a guest (at least so far). The
timer is under the control of the guest, which uses it as it sees fit.
When the HW timer expires, the interrupt fires on the host, which
re-injects the interrupt into the guest.

> The flag IRQF_TIMER is used by the spurious irq handler in the try_one_irq()
> function. However the per cpu timer interrupt will be discarded in the function
> before because it is per cpu.

Right. That's not because this is a timer, but because it is per-cpu.
So why do we need this IRQF_TIMER flag, instead of fixing try_one_irq()?

> IMO, for consistency reason, adding the IRQF_TIMER makes sense. Other than
> that, as the interrupt is not happening on the host, this flag won't be used.
>
> Do you want to drop this change?

No, I'd like to understand the above. Why isn't the following patch
doing the right thing?

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 061ba7eed4ed..a4a81c6c7602 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -72,6 +72,7 @@ static int try_one_irq(struct irq_desc *desc, bool force)
* marked polled are excluded from polling.
*/
if (irq_settings_is_per_cpu(desc) ||
+ irq_settings_is_per_cpu_devid(desc) ||
irq_settings_is_nested_thread(desc) ||
irq_settings_is_polled(desc))
goto out;

Thanks,

M.

2017-04-25 08:35:10

by Daniel Lezcano

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Tue, Apr 25, 2017 at 08:38:56AM +0100, Marc Zyngier wrote:
> On 24/04/17 20:59, Daniel Lezcano wrote:
> > On Mon, Apr 24, 2017 at 08:14:54PM +0100, Marc Zyngier wrote:
> >> On 24/04/17 19:59, Daniel Lezcano wrote:
> >>> On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
> >>>> On 24/04/17 15:01, Daniel Lezcano wrote:
> >>>>> In the next changes, we track when the interrupts occur in order to
> >>>>> statistically compute when is supposed to happen the next interrupt.
> >>>>>
> >>>>> In all the interruptions, it does not make sense to store the timer interrupt
> >>>>> occurences and try to predict the next interrupt as when know the expiration
> >>>>> time.
> >>>>>
> >>>>> The request_irq() has a irq flags parameter and the timer drivers use it to
> >>>>> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
> >>>>> Based on this flag, we can discard these interrupts when tracking them.
> >>>>>
> >>>>> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
> >>>>> if the interrupt type is a timer.
> >>>>>
> >>>>> Add a function request_percpu_irq_flags() where we can specify the flags. The
> >>>>> request_percpu_irq() function is changed to be a wrapper to
> >>>>> request_percpu_irq_flags() passing a zero flag parameter.
> >>>>>
> >>>>> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
> >>>>> instead with the IRQF_TIMER flag set.
> >>>>>
> >>>>> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> >>>>> flag (or zero) is a valid parameter to be passed to the
> >>>>> request_percpu_irq_flags() function.
> >>>>
> >>>> [...]
> >>>>
> >>>>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> >>>>> index 35d7100..602e0a8 100644
> >>>>> --- a/virt/kvm/arm/arch_timer.c
> >>>>> +++ b/virt/kvm/arm/arch_timer.c
> >>>>> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
> >>>>> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
> >>>>> }
> >>>>>
> >>>>> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
> >>>>> - "kvm guest timer", kvm_get_running_vcpus());
> >>>>> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
> >>>>> + IRQF_TIMER, "kvm guest timer",
> >>>>> + kvm_get_running_vcpus());
> >>>>> if (err) {
> >>>>> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
> >>>>> host_vtimer_irq, err);
> >>>>>
> >>>>
> >>>> How is that useful? This timer is controlled by the guest OS, and not
> >>>> the host kernel. Can you explain how you intend to make use of that
> >>>> information in this case?
> >>>
> >>> Isn't it a source of interruption on the host kernel?
> >>
> >> Only to cause an exit of the VM, and not under the control of the host.
> >> This isn't triggering any timer related action on the host code either.
> >>
> >> Your patch series seems to assume some kind of predictability of the
> >> timer interrupt, which can make sense on the host. Here, this interrupt
> >> is shared among *all* guests running on this system.
> >>
> >> Maybe you could explain why you think this interrupt is relevant to what
> >> you're trying to achieve?
> >
> > If this interrupt does not happen on the host, we don't care.
>
> All interrupts happen on the host. There is no such thing as a HW
> interrupt being directly delivered to a guest (at least so far). The
> timer is under control of the guest, which uses as it sees fit. When
> the HW timer expires, the interrupt fires on the host, which re-inject
> the interrupt in the guest.

Ah, thanks for the clarification. Interesting.

How can the host know into which guest to re-inject the interrupt?

> > The flag IRQF_TIMER is used by the spurious irq handler in the try_one_irq()
> > function. However the per cpu timer interrupt will be discarded in the function
> > before because it is per cpu.
>
> Right. That's not because this is a timer, but because it is per-cpu.
> So why do we need this IRQF_TIMER flag, instead of fixing try_one_irq()?

When a timer is not per-CPU (e.g. requested with request_irq()), we need this flag, no?

> > IMO, for consistency reason, adding the IRQF_TIMER makes sense. Other than
> > that, as the interrupt is not happening on the host, this flag won't be used.
> >
> > Do you want to drop this change?
>
> No, I'd like to understand the above. Why isn't the following patch
> doing the right thing?

Actually, the explanation is in the next patch of the series (2/3)

[ ... ]

+static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
+{
+ /*
+ * We don't need the measurement because the idle code already
+ * knows the next expiry event.
+ */
+ if (act->flags & __IRQF_TIMER)
+ return;
+
+ desc->istate |= IRQS_TIMINGS;
+}

[ ... ]

+/*
+ * The function record_irq_time is only called in one place in the
+ * interrupts handler. We want this function always inline so the code
+ * inside is embedded in the function and the static key branching
+ * code can act at the higher level. Without the explicit
+ * __always_inline we can end up with a function call and a small
+ * overhead in the hotpath for nothing.
+ */
+static __always_inline void record_irq_time(struct irq_desc *desc)
+{
+ if (!static_branch_likely(&irq_timing_enabled))
+ return;
+
+ if (desc->istate & IRQS_TIMINGS) {
+ struct irq_timings *timings = this_cpu_ptr(&irq_timings);
+
+ timings->values[timings->count & IRQ_TIMINGS_MASK] =
+ irq_timing_encode(local_clock(),
+ irq_desc_get_irq(desc));
+
+ timings->count++;
+ }
+}

[ ... ]

The purpose is to predict the next interrupts on the system that are sources
of wake-up. For now, this patchset focuses on non-timer interrupts (timer
interrupts are discarded).

The following article gives more details: https://lwn.net/Articles/673641/

When the interrupt is set up, we tag it unless it is a timer. So with this
patch there is another use of IRQF_TIMER, where we ignore interrupts coming
from a timer.

As the timer interrupt is delivered to the host, we should not measure it,
since it is a timer, so we set this flag.

The needed information is: "what is the earliest VM timer?". If this
information is already available then there is nothing more to do, otherwise we
should add it in the future.
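To make that goal concrete, here is a deliberately naive userspace sketch of the idea: record the timestamps of past interrupts and extrapolate the next one from the mean inter-arrival interval. The function name and the mean-based estimator are illustrative assumptions; the series itself computes the prediction with a more elaborate statistical model:

```c
/*
 * Naive illustration of interrupt prediction: estimate the next
 * occurrence as the last recorded timestamp plus the mean of the
 * observed inter-arrival intervals. Hypothetical helper, not part
 * of the posted series.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t predict_next_event(const uint64_t *timestamps, size_t count)
{
	uint64_t sum = 0;
	size_t i;

	if (count < 2)
		return 0; /* not enough samples to extrapolate */

	/* Sum the deltas between consecutive interrupt timestamps. */
	for (i = 1; i < count; i++)
		sum += timestamps[i] - timestamps[i - 1];

	return timestamps[count - 1] + sum / (count - 1);
}
```

With samples at t = 100, 200, 300, 400 the mean interval is 100, so the next wake-up would be predicted at t = 500.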

> diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
> index 061ba7eed4ed..a4a81c6c7602 100644
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -72,6 +72,7 @@ static int try_one_irq(struct irq_desc *desc, bool force)
> * marked polled are excluded from polling.
> */
> if (irq_settings_is_per_cpu(desc) ||
> + irq_settings_is_per_cpu_devid(desc) ||
> irq_settings_is_nested_thread(desc) ||
> irq_settings_is_polled(desc))
> goto out;
>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...


2017-04-25 09:10:28

by Marc Zyngier

Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 25/04/17 09:34, Daniel Lezcano wrote:
> On Tue, Apr 25, 2017 at 08:38:56AM +0100, Marc Zyngier wrote:
>> On 24/04/17 20:59, Daniel Lezcano wrote:
>>> On Mon, Apr 24, 2017 at 08:14:54PM +0100, Marc Zyngier wrote:
>>>> On 24/04/17 19:59, Daniel Lezcano wrote:
>>>>> On Mon, Apr 24, 2017 at 07:46:43PM +0100, Marc Zyngier wrote:
>>>>>> On 24/04/17 15:01, Daniel Lezcano wrote:
>>>>>>> In the next changes, we track when the interrupts occur in order to
>>>>>>> statistically compute when is supposed to happen the next interrupt.
>>>>>>>
>>>>>>> In all the interruptions, it does not make sense to store the timer interrupt
>>>>>>> occurences and try to predict the next interrupt as when know the expiration
>>>>>>> time.
>>>>>>>
>>>>>>> The request_irq() has a irq flags parameter and the timer drivers use it to
>>>>>>> pass the IRQF_TIMER flag, letting us know the interrupt is coming from a timer.
>>>>>>> Based on this flag, we can discard these interrupts when tracking them.
>>>>>>>
>>>>>>> But, the API request_percpu_irq does not allow to pass a flag, hence specifying
>>>>>>> if the interrupt type is a timer.
>>>>>>>
>>>>>>> Add a function request_percpu_irq_flags() where we can specify the flags. The
>>>>>>> request_percpu_irq() function is changed to be a wrapper to
>>>>>>> request_percpu_irq_flags() passing a zero flag parameter.
>>>>>>>
>>>>>>> Change the timers using request_percpu_irq() to use request_percpu_irq_flags()
>>>>>>> instead with the IRQF_TIMER flag set.
>>>>>>>
>>>>>>> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
>>>>>>> flag (or zero) is a valid parameter to be passed to the
>>>>>>> request_percpu_irq_flags() function.
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
>>>>>>> index 35d7100..602e0a8 100644
>>>>>>> --- a/virt/kvm/arm/arch_timer.c
>>>>>>> +++ b/virt/kvm/arm/arch_timer.c
>>>>>>> @@ -523,8 +523,9 @@ int kvm_timer_hyp_init(void)
>>>>>>> host_vtimer_irq_flags = IRQF_TRIGGER_LOW;
>>>>>>> }
>>>>>>>
>>>>>>> - err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler,
>>>>>>> - "kvm guest timer", kvm_get_running_vcpus());
>>>>>>> + err = request_percpu_irq_flags(host_vtimer_irq, kvm_arch_timer_handler,
>>>>>>> + IRQF_TIMER, "kvm guest timer",
>>>>>>> + kvm_get_running_vcpus());
>>>>>>> if (err) {
>>>>>>> kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n",
>>>>>>> host_vtimer_irq, err);
>>>>>>>
>>>>>>
>>>>>> How is that useful? This timer is controlled by the guest OS, and not
>>>>>> the host kernel. Can you explain how you intend to make use of that
>>>>>> information in this case?
>>>>>
>>>>> Isn't it a source of interruption on the host kernel?
>>>>
>>>> Only to cause an exit of the VM, and not under the control of the host.
>>>> This isn't triggering any timer related action on the host code either.
>>>>
>>>> Your patch series seems to assume some kind of predictability of the
>>>> timer interrupt, which can make sense on the host. Here, this interrupt
>>>> is shared among *all* guests running on this system.
>>>>
>>>> Maybe you could explain why you think this interrupt is relevant to what
>>>> you're trying to achieve?
>>>
>>> If this interrupt does not happen on the host, we don't care.
>>
>> All interrupts happen on the host. There is no such thing as a HW
>> interrupt being directly delivered to a guest (at least so far). The
>> timer is under control of the guest, which uses as it sees fit. When
>> the HW timer expires, the interrupt fires on the host, which re-inject
>> the interrupt in the guest.
>
> Ah, thanks for the clarification. Interesting.
>
> How can the host know which guest to re-inject the interrupt?

The timer can only fire when the vcpu is running. If it is not running,
a software timer is queued, with a pointer to the vcpu struct.

>>> The flag IRQF_TIMER is used by the spurious irq handler in the try_one_irq()
>>> function. However the per cpu timer interrupt will be discarded in the function
>>> before because it is per cpu.
>>
>> Right. That's not because this is a timer, but because it is per-cpu.
>> So why do we need this IRQF_TIMER flag, instead of fixing try_one_irq()?
>
> When a timer is not per cpu (e.g. request_irq), we need this flag, no?

Sure, but in this series, they all seem to be per-cpu.

>>> IMO, for consistency reason, adding the IRQF_TIMER makes sense. Other than
>>> that, as the interrupt is not happening on the host, this flag won't be used.
>>>
>>> Do you want to drop this change?
>>
>> No, I'd like to understand the above. Why isn't the following patch
>> doing the right thing?
>
> Actually, the explanation is in the next patch of the series (2/3)
>
> [ ... ]
>
> +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
> +{
> + /*
> + * We don't need the measurement because the idle code already
> + * knows the next expiry event.
> + */
> + if (act->flags & __IRQF_TIMER)
> + return;

And that's where this is really wrong for the KVM guest timer. As I
said, this timer is under complete control of the guest, and the rest of
the system doesn't know about it. KVM itself will only find out when the
vcpu does a VM exit for a reason or another, and will just save/restore
the state in order to be able to give the timer to another guest.

The idle code is very much *not* aware of anything concerning that guest
timer.

> +
> + desc->istate |= IRQS_TIMINGS;
> +}
>
> [ ... ]
>
> +/*
> + * The function record_irq_time is only called in one place in the
> + * interrupts handler. We want this function always inline so the code
> + * inside is embedded in the function and the static key branching
> + * code can act at the higher level. Without the explicit
> + * __always_inline we can end up with a function call and a small
> + * overhead in the hotpath for nothing.
> + */
> +static __always_inline void record_irq_time(struct irq_desc *desc)
> +{
> + if (!static_branch_likely(&irq_timing_enabled))
> + return;
> +
> + if (desc->istate & IRQS_TIMINGS) {
> + struct irq_timings *timings = this_cpu_ptr(&irq_timings);
> +
> + timings->values[timings->count & IRQ_TIMINGS_MASK] =
> + irq_timing_encode(local_clock(),
> + irq_desc_get_irq(desc));
> +
> + timings->count++;
> + }
> +}
>
> [ ... ]
>
> The purpose is to predict the next event interrupts on the system which are
> sources of wake-up. For now, this patchset is focused on non-timer interrupts
> (discarding timer interrupts).
>
> The following article gives more details: https://lwn.net/Articles/673641/
>
> When the interrupt is set up, we tag it unless it is a timer. So with this
> patch there is another use of IRQF_TIMER, where we will be ignoring
> interrupts coming from a timer.
>
> As the timer interrupt is delivered to the host, we should not measure it
> since it is a timer, hence this flag is set.
>
> The needed information is: "what is the earliest VM timer?". If this
> information is already available then there is nothing more to do, otherwise we
> should add it in the future.

This information is not readily available. You can only find it when it
is too late (timer has already fired) or when it is not relevant anymore
(guest is sleeping and we've queued a SW timer for it).

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2017-04-25 09:49:38

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Tue, Apr 25, 2017 at 10:10:12AM +0100, Marc Zyngier wrote:

[ ... ]

> >>>> Maybe you could explain why you think this interrupt is relevant to what
> >>>> you're trying to achieve?
> >>>
> >>> If this interrupt does not happen on the host, we don't care.
> >>
> >> All interrupts happen on the host. There is no such thing as a HW
> >> interrupt being directly delivered to a guest (at least so far). The
> >> timer is under control of the guest, which uses as it sees fit. When
> >> the HW timer expires, the interrupt fires on the host, which re-inject
> >> the interrupt in the guest.
> >
> > Ah, thanks for the clarification. Interesting.
> >
> > How can the host know which guest to re-inject the interrupt?
>
> The timer can only fire when the vcpu is running. If it is not running,
> a software timer is queued, with a pointer to the vcpu struct.

I see, thanks.

> >>> The flag IRQF_TIMER is used by the spurious irq handler in the try_one_irq()
> >>> function. However the per cpu timer interrupt will be discarded in the function
> >>> before because it is per cpu.
> >>
> >> Right. That's not because this is a timer, but because it is per-cpu.
> >> So why do we need this IRQF_TIMER flag, instead of fixing try_one_irq()?
> >
> > When a timer is not per cpu, (eg. request_irq), we need this flag, no?
>
> Sure, but in this series, they all seem to be per-cpu.

I think I was unclear. We need to tag interrupts with IRQS_TIMINGS to record
their occurrences, while discarding the timer interrupts. That is done by
checking against IRQF_TIMER when setting up an interrupt.

request_irq() has a flags parameter which has IRQF_TIMER set in the case of
timers. request_percpu_irq() has no flags parameter, so it is not possible to
discard these interrupts, as IRQS_TIMINGS will be set on them.

I don't understand how this is related to the try_one_irq() fix you are
proposing. Am I missing something?

Regarding your description below, the host has no control at all over the
virtual timer and is not able to know the next expiration time, so I don't see
the point of adding the IRQF_TIMER flag to the virtual timer.

I will resend a new version without this change on the virtual timer.

> >>> IMO, for consistency reason, adding the IRQF_TIMER makes sense. Other than
> >>> that, as the interrupt is not happening on the host, this flag won't be used.
> >>>
> >>> Do you want to drop this change?
> >>
> >> No, I'd like to understand the above. Why isn't the following patch
> >> doing the right thing?
> >
> > Actually, the explanation is in the next patch of the series (2/3)
> >
> > [ ... ]
> >
> > +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
> > +{
> > + /*
> > + * We don't need the measurement because the idle code already
> > + * knows the next expiry event.
> > + */
> > + if (act->flags & __IRQF_TIMER)
> > + return;
>
> And that's where this is really wrong for the KVM guest timer. As I
> said, this timer is under complete control of the guest, and the rest of
> the system doesn't know about it. KVM itself will only find out when the
> vcpu does a VM exit for a reason or another, and will just save/restore
> the state in order to be able to give the timer to another guest.
>
> The idle code is very much *not* aware of anything concerning that guest
> timer.

Just for my own curiosity: if there are two VMs (VM1 and VM2), VM1 sets timer1
at <time> and exits, then VM2 runs and sets timer2 at <time+delta>.

Timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
timer is under the control of VM2 and will expire at <time+delta>.

Does the host wake up with the SW timer and switch to VM1, which in turn
restores the timer and jumps into the virtual timer irq handler?

> > +
> > + desc->istate |= IRQS_TIMINGS;
> > +}
> >
> > [ ... ]
> >
> > +/*
> > + * The function record_irq_time is only called in one place in the
> > + * interrupts handler. We want this function always inline so the code
> > + * inside is embedded in the function and the static key branching
> > + * code can act at the higher level. Without the explicit
> > + * __always_inline we can end up with a function call and a small
> > + * overhead in the hotpath for nothing.
> > + */
> > +static __always_inline void record_irq_time(struct irq_desc *desc)
> > +{
> > + if (!static_branch_likely(&irq_timing_enabled))
> > + return;
> > +
> > + if (desc->istate & IRQS_TIMINGS) {
> > + struct irq_timings *timings = this_cpu_ptr(&irq_timings);
> > +
> > + timings->values[timings->count & IRQ_TIMINGS_MASK] =
> > + irq_timing_encode(local_clock(),
> > + irq_desc_get_irq(desc));
> > +
> > + timings->count++;
> > + }
> > +}
> >
> > [ ... ]
> >
> > The purpose is to predict the next event interrupts on the system which are
> > source of wake up. For now, this patchset is focused on interrupts (discarding
> > timer interrupts).
> >
> > The following article gives more details: https://lwn.net/Articles/673641/
> >
> > When the interrupt is setup, we tag it except if it is a timer. So with this
> > patch there is another usage of the IRQF_TIMER where we will be ignoring
> > interrupt coming from a timer.
> >
> > As the timer interrupt is delivered to the host, we should not measure it as it
> > is a timer and set this flag.
> >
> > The needed information is: "what is the earliest VM timer?". If this
> > information is already available then there is nothing more to do, otherwise we
> > should add it in the future.
>
> This information is not readily available. You can only find it when it
> is too late (timer has already fired) or when it is not relevant anymore
> (guest is sleeping and we've queued a SW timer for it).
>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

--

<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2017-04-25 10:21:34

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 25/04/17 10:49, Daniel Lezcano wrote:
> On Tue, Apr 25, 2017 at 10:10:12AM +0100, Marc Zyngier wrote:

[...]

>>> +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
>>> +{
>>> + /*
>>> + * We don't need the measurement because the idle code already
>>> + * knows the next expiry event.
>>> + */
>>> + if (act->flags & __IRQF_TIMER)
>>> + return;
>>
>> And that's where this is really wrong for the KVM guest timer. As I
>> said, this timer is under complete control of the guest, and the rest of
>> the system doesn't know about it. KVM itself will only find out when the
>> vcpu does a VM exit for a reason or another, and will just save/restore
>> the state in order to be able to give the timer to another guest.
>>
>> The idle code is very much *not* aware of anything concerning that guest
>> timer.
>
> Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
> at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
>
> The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
> timer is under control of VM2 and will expire at <time+delta>.
>
> Is the host wake up with the SW timer and switch in VM1 which in turn restores
> the timer and jump in the virtual timer irq handler?

Indeed. The SW timer causes VM1 to wake up, either on the same CPU
(preempting VM2) or on another. The timer is then restored with the
pending virtual interrupt injected, and the guest does what it has to
with it.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2017-04-25 10:22:42

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Tue, Apr 25, 2017 at 11:49:27AM +0200, Daniel Lezcano wrote:

[...]

> >
> > The idle code is very much *not* aware of anything concerning that guest
> > timer.
>
> Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
> at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
>
> The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
> timer is under control of VM2 and will expire at <time+delta>.
>
> Is the host wake up with the SW timer and switch in VM1 which in turn restores
> the timer and jump in the virtual timer irq handler?
>
The thing that may be missing here is that a VCPU thread (a collection
of which makes up a VM) is just a thread from the point of view of
Linux, and whether or not a guest schedules a timer should not affect
the scheduler's decision to run a given thread, if the thread is
runnable.

Whenever we run a VCPU thread, we look at its timer state (in software),
calculate if the guest should see a timer interrupt, and inject one if so
(the hardware arch timer is not involved in this process at all).

We use timers in exactly two scenarios:

1. The hardware arch timers are used to force an exit to the host when
the guest programmed the timer, so we can do the calculation in
software I mentioned above and inject a virtual software-generated
interrupt when the guest expects to see one.

2. The guest goes to sleep (WFI) but has programmed a timer to be woken
up at some point. KVM handles a WFI by blocking the VCPU thread,
which basically means making the thread interruptible and putting it
on a waitqueue. In this case we schedule a software timer to make
the thread runnable again when the software timer fires (and the
scheduler runs that thread when it wants to after that).

If you have a VCPU thread from VM1 blocked, and you run a VCPU thread
from VM2, then the VCPU thread from VM2 will program the hardware arch
timer with the context of the VM2 VCPU thread while running, and this
has nothing to do with the VCPU thread from VM1 at this point, because
it relies on the host Linux time keeping infrastructure to become
runnable some time in the future, and running a guest naturally doesn't
mess with the host's time keeping.

Hope this helps,
-Christoffer

2017-04-25 12:51:14

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Tue, Apr 25, 2017 at 11:21:21AM +0100, Marc Zyngier wrote:
> On 25/04/17 10:49, Daniel Lezcano wrote:
> > On Tue, Apr 25, 2017 at 10:10:12AM +0100, Marc Zyngier wrote:
>
> [...]
>
> >>> +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
> >>> +{
> >>> + /*
> >>> + * We don't need the measurement because the idle code already
> >>> + * knows the next expiry event.
> >>> + */
> >>> + if (act->flags & __IRQF_TIMER)
> >>> + return;
> >>
> >> And that's where this is really wrong for the KVM guest timer. As I
> >> said, this timer is under complete control of the guest, and the rest of
> >> the system doesn't know about it. KVM itself will only find out when the
> >> vcpu does a VM exit for a reason or another, and will just save/restore
> >> the state in order to be able to give the timer to another guest.
> >>
> >> The idle code is very much *not* aware of anything concerning that guest
> >> timer.
> >
> > Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
> > at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
> >
> > The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
> > timer is under control of VM2 and will expire at <time+delta>.
> >
> > Is the host wake up with the SW timer and switch in VM1 which in turn restores
> > the timer and jump in the virtual timer irq handler?
>
> Indeed. The SW timer causes VM1 to wake-up, either on the same CPU
> (preempting VM2) or on another. The timer is then restored with the
> pending virtual interrupt injected, and the guest does what it has to
> with it.

Thanks for the clarification.

So there is a virtual timer with real registers and a real interrupt (waking up
the host) for the running VMs, and SW timers for the non-running VMs.

What is the benefit of having such a mechanism instead of real timers injecting
interrupts into the VM, without the virtual timer + save/restore? Efficiency in
the running VMs when setting up timers (saving privilege-change overhead)?


2017-04-25 12:52:54

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On Tue, Apr 25, 2017 at 12:22:30PM +0200, Christoffer Dall wrote:
> On Tue, Apr 25, 2017 at 11:49:27AM +0200, Daniel Lezcano wrote:
>
> [...]
>
> > >
> > > The idle code is very much *not* aware of anything concerning that guest
> > > timer.
> >
> > Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
> > at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
> >
> > The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
> > timer is under control of VM2 and will expire at <time+delta>.
> >
> > Is the host wake up with the SW timer and switch in VM1 which in turn restores
> > the timer and jump in the virtual timer irq handler?
> >
> The thing that may be missing here is that a VCPU thread (more of which
> in a collection is a VM) is just a thread from the point of view of
> Linux, and whether or not a guest schedules a timer, should not effect
> the scheduler's decision to run a given thread, if the thread is
> runnable.
>
> Whenever we run a VCPU thread, we look at its timer state (in software)
> and calculate if the guest should see a timer interrupt and inject such
> a one (the hardware arch timer is not involved in this process at all).
>
> We use timers in exactly two scenarios:
>
> 1. The hardware arch timers are used to force an exit to the host when
> the guest programmed the timer, so we can do the calculation in
> software I mentioned above and inject a virtual software-generated
> interrupt when the guest expects to see one.
>
> 2. The guest goes to sleep (WFI) but has programmed a timer to be woken
> up at some point. KVM handles a WFI by blocking the VCPU thread,
> which basically means making the thread interruptible and putting it
> on a waitqueue. In this case we schedule a software timer to make
> the thread runnable again when the software timer fires (and the
> scheduler runs that thread when it wants to after that).
>
> If you have a VCPU thread from VM1 blocked, and you run a VCPU thread
> from VM2, then the VCPU thread from VM2 will program the hardware arch
> timer with the context of the VM2 VCPU thread while running, and this
> has nothing to do with the VCPU thread from VM1 at this point, because
> it relies on the host Linux time keeping infrastructure to become
> runnable some time in the future, and running a guest naturally doesn't
> mess with the host's time keeping.
>
> Hope this helps,

Yes, definitely. Thanks for the detailed description.

-- Daniel


2017-04-25 13:22:28

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 25/04/17 13:51, Daniel Lezcano wrote:
> On Tue, Apr 25, 2017 at 11:21:21AM +0100, Marc Zyngier wrote:
>> On 25/04/17 10:49, Daniel Lezcano wrote:
>>> On Tue, Apr 25, 2017 at 10:10:12AM +0100, Marc Zyngier wrote:
>>
>> [...]
>>
>>>>> +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
>>>>> +{
>>>>> + /*
>>>>> + * We don't need the measurement because the idle code already
>>>>> + * knows the next expiry event.
>>>>> + */
>>>>> + if (act->flags & __IRQF_TIMER)
>>>>> + return;
>>>>
>>>> And that's where this is really wrong for the KVM guest timer. As I
>>>> said, this timer is under complete control of the guest, and the rest of
>>>> the system doesn't know about it. KVM itself will only find out when the
>>>> vcpu does a VM exit for a reason or another, and will just save/restore
>>>> the state in order to be able to give the timer to another guest.
>>>>
>>>> The idle code is very much *not* aware of anything concerning that guest
>>>> timer.
>>>
>>> Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
>>> at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
>>>
>>> The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
>>> timer is under control of VM2 and will expire at <time+delta>.
>>>
>>> Is the host wake up with the SW timer and switch in VM1 which in turn restores
>>> the timer and jump in the virtual timer irq handler?
>>
>> Indeed. The SW timer causes VM1 to wake-up, either on the same CPU
>> (preempting VM2) or on another. The timer is then restored with the
>> pending virtual interrupt injected, and the guest does what it has to
>> with it.
>
> Thanks for clarification.
>
> So there is a virtual timer with real registers / interruption (waking up the
> host) for the running VMs and SW timers for non-running VMs.
>
> What is the benefit of having such mechanism instead of real timers injecting
> interrupts in the VM without the virtual timer + save/restore? Efficiency in
> the running VMs when setting up timers (saving privileges change overhead)?


You can't dedicate HW resources to virtual CPUs. It just doesn't scale.
Also, injecting HW interrupts in a guest is pretty hard work, and for
multiple reasons:
- the host needs to be in control of interrupt delivery (don't hog the
CPU with guest interrupts)
- you want to be able to remap interrupts (id X on the host becomes id
Y on the guest),
- you want to deal with migrating vcpus,
- you want to deliver an interrupt to a vcpu that is *not* running.

It *is* doable, but it is not cheap at all from a HW point of view.

M.
--
Jazz is not dead. It just smells funny...

2017-04-25 13:53:38

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V9 1/3] irq: Allow to pass the IRQF_TIMER flag with percpu irq request

On 25/04/2017 15:22, Marc Zyngier wrote:
> On 25/04/17 13:51, Daniel Lezcano wrote:
>> On Tue, Apr 25, 2017 at 11:21:21AM +0100, Marc Zyngier wrote:
>>> On 25/04/17 10:49, Daniel Lezcano wrote:
>>>> On Tue, Apr 25, 2017 at 10:10:12AM +0100, Marc Zyngier wrote:
>>>
>>> [...]
>>>
>>>>>> +static inline void setup_timings(struct irq_desc *desc, struct irqaction *act)
>>>>>> +{
>>>>>> + /*
>>>>>> + * We don't need the measurement because the idle code already
>>>>>> + * knows the next expiry event.
>>>>>> + */
>>>>>> + if (act->flags & __IRQF_TIMER)
>>>>>> + return;
>>>>>
>>>>> And that's where this is really wrong for the KVM guest timer. As I
>>>>> said, this timer is under complete control of the guest, and the rest of
>>>>> the system doesn't know about it. KVM itself will only find out when the
>>>>> vcpu does a VM exit for a reason or another, and will just save/restore
>>>>> the state in order to be able to give the timer to another guest.
>>>>>
>>>>> The idle code is very much *not* aware of anything concerning that guest
>>>>> timer.
>>>>
>>>> Just for my own curiosity, if there are two VM (VM1 and VM2). VM1 sets a timer1
>>>> at <time> and exits, VM2 runs and sets a timer2 at <time+delta>.
>>>>
>>>> The timer1 for VM1 is supposed to expire while VM2 is running. IIUC the virtual
>>>> timer is under control of VM2 and will expire at <time+delta>.
>>>>
>>>> Is the host wake up with the SW timer and switch in VM1 which in turn restores
>>>> the timer and jump in the virtual timer irq handler?
>>>
>>> Indeed. The SW timer causes VM1 to wake-up, either on the same CPU
>>> (preempting VM2) or on another. The timer is then restored with the
>>> pending virtual interrupt injected, and the guest does what it has to
>>> with it.
>>
>> Thanks for clarification.
>>
>> So there is a virtual timer with real registers / interruption (waking up the
>> host) for the running VMs and SW timers for non-running VMs.
>>
>> What is the benefit of having such mechanism instead of real timers injecting
>> interrupts in the VM without the virtual timer + save/restore? Efficiency in
>> the running VMs when setting up timers (saving privileges change overhead)?
>
>
> You can't dedicate HW resources to virtual CPUs. It just doesn't scale.
> Also, injecting HW interrupts in a guest is pretty hard work, and for
> multiple reasons:
> - the host needs to be in control of interrupt delivery (don't hog the
> CPU with guest interrupts)
> - you want to be able to remap interrupts (id X on the host becomes id
> Y on the guest),
> - you want to deal with migrating vcpus,
> - you want deliver an interrupt to a vcpu that is *not* running.
>
> It *is* doable, but it is not cheap at all from a HW point of view.


Ok, I see.

Thanks!

-- Daniel

