hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.
hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.
That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.
cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.
Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy paste.
Micke, please test this patch - it is compile tested only...
Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reported-by: Mikael Beckius <[email protected]>
---
kernel/time/hrtimer.c | 53 +++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 19 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 46bd35d5fc0d..788b9d137de4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -591,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
return expires_next;
}
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+ ktime_t expires_next, soft = KTIME_MAX;
+
+ /*
+ * If the soft interrupt has already been activated, ignore the
+ * soft bases. They will be handled in the already raised soft
+ * interrupt.
+ */
+ if (!cpu_base->softirq_activated) {
+ soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ /*
+ * Update the soft expiry time. clock_settime() might have
+ * affected it.
+ */
+ cpu_base->softirq_expires_next = soft;
+ }
+
+ expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+ /*
+ * If a softirq timer is expiring first, update cpu_base->next_timer
+ * and program the hardware with the soft expiry time.
+ */
+ if (expires_next > soft) {
+ cpu_base->next_timer = cpu_base->softirq_next_timer;
+ expires_next = soft;
+ }
+
+ return expires_next;
+}
+
static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -631,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
ktime_t expires_next;
- /*
- * Find the current next expiration time.
- */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
- if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
- /*
- * When the softirq is activated, hrtimer has to be
- * programmed with the first hard hrtimer because soft
- * timer interrupt could occur too late.
- */
- if (cpu_base->softirq_activated)
- expires_next = __hrtimer_get_next_event(cpu_base,
- HRTIMER_ACTIVE_HARD);
- else
- cpu_base->softirq_expires_next = expires_next;
- }
+ expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -1647,8 +1662,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the next expiry */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+ /* Reevaluate the clock bases for the [soft] next expiry */
+ expires_next = hrtimer_update_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
--
2.20.1
> Micke, please test this patch - it is compile tested only...
Thanks Anna-Maria, it seems to work as well as or better than the change
I suggested at preventing interrupt storms. After running a test I created,
based on traces from a live system, for several thousand iterations,
there are no reproductions. Normally both issues reproduce within
seconds, after only a few iterations.
Also by looking at the source code it seems like your patch should do the
trick.
Micke
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 05f7fcc675f50001a30b8938c05d11ca9f599f8c
Gitweb: https://git.kernel.org/tip/05f7fcc675f50001a30b8938c05d11ca9f599f8c
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Tue, 23 Feb 2021 17:02:40 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 01 Mar 2021 10:17:56 +01:00
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.
hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.
That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.
cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.
Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy paste.
Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Reported-by: Mikael Beckius <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Tested-by: Mikael Beckius <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/hrtimer.c | 60 +++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 21 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852..788b9d1 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
}
/*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
+ * Recomputes cpu_base::*next_timer and returns the earliest expires_next
+ * but does not set cpu_base::*expires_next, that is done by
+ * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating
+ * cpu_base::*expires_next right away, reprogramming logic would no longer
+ * work.
*
* When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
* those timers will get run whenever the softirq gets handled, at the end of
@@ -588,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
return expires_next;
}
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+ ktime_t expires_next, soft = KTIME_MAX;
+
+ /*
+ * If the soft interrupt has already been activated, ignore the
+ * soft bases. They will be handled in the already raised soft
+ * interrupt.
+ */
+ if (!cpu_base->softirq_activated) {
+ soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ /*
+ * Update the soft expiry time. clock_settime() might have
+ * affected it.
+ */
+ cpu_base->softirq_expires_next = soft;
+ }
+
+ expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+ /*
+ * If a softirq timer is expiring first, update cpu_base->next_timer
+ * and program the hardware with the soft expiry time.
+ */
+ if (expires_next > soft) {
+ cpu_base->next_timer = cpu_base->softirq_next_timer;
+ expires_next = soft;
+ }
+
+ return expires_next;
+}
+
static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -628,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
ktime_t expires_next;
- /*
- * Find the current next expiration time.
- */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
- if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
- /*
- * When the softirq is activated, hrtimer has to be
- * programmed with the first hard hrtimer because soft
- * timer interrupt could occur too late.
- */
- if (cpu_base->softirq_activated)
- expires_next = __hrtimer_get_next_event(cpu_base,
- HRTIMER_ACTIVE_HARD);
- else
- cpu_base->softirq_expires_next = expires_next;
- }
+ expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -1644,8 +1662,8 @@ retry:
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the next expiry */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+ /* Reevaluate the clock bases for the [soft] next expiry */
+ expires_next = hrtimer_update_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: eca8f0c80a005aea84df507a446fc0154fc55a32
Gitweb: https://git.kernel.org/tip/eca8f0c80a005aea84df507a446fc0154fc55a32
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Tue, 23 Feb 2021 17:02:40 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 06 Mar 2021 12:53:47 +01:00
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.
hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.
That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.
cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.
Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy paste.
Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Reported-by: Mikael Beckius <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Tested-by: Mikael Beckius <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/hrtimer.c | 60 +++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 21 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852..788b9d1 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
}
/*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
+ * Recomputes cpu_base::*next_timer and returns the earliest expires_next
+ * but does not set cpu_base::*expires_next, that is done by
+ * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating
+ * cpu_base::*expires_next right away, reprogramming logic would no longer
+ * work.
*
* When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
* those timers will get run whenever the softirq gets handled, at the end of
@@ -588,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
return expires_next;
}
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+ ktime_t expires_next, soft = KTIME_MAX;
+
+ /*
+ * If the soft interrupt has already been activated, ignore the
+ * soft bases. They will be handled in the already raised soft
+ * interrupt.
+ */
+ if (!cpu_base->softirq_activated) {
+ soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ /*
+ * Update the soft expiry time. clock_settime() might have
+ * affected it.
+ */
+ cpu_base->softirq_expires_next = soft;
+ }
+
+ expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+ /*
+ * If a softirq timer is expiring first, update cpu_base->next_timer
+ * and program the hardware with the soft expiry time.
+ */
+ if (expires_next > soft) {
+ cpu_base->next_timer = cpu_base->softirq_next_timer;
+ expires_next = soft;
+ }
+
+ return expires_next;
+}
+
static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -628,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
ktime_t expires_next;
- /*
- * Find the current next expiration time.
- */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
- if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
- /*
- * When the softirq is activated, hrtimer has to be
- * programmed with the first hard hrtimer because soft
- * timer interrupt could occur too late.
- */
- if (cpu_base->softirq_activated)
- expires_next = __hrtimer_get_next_event(cpu_base,
- HRTIMER_ACTIVE_HARD);
- else
- cpu_base->softirq_expires_next = expires_next;
- }
+ expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -1644,8 +1662,8 @@ retry:
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the next expiry */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+ /* Reevaluate the clock bases for the [soft] next expiry */
+ expires_next = hrtimer_update_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 46eb1701c046cc18c032fa68f3c8ccbf24483ee4
Gitweb: https://git.kernel.org/tip/46eb1701c046cc18c032fa68f3c8ccbf24483ee4
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Tue, 23 Feb 2021 17:02:40 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 08 Mar 2021 09:37:01 +01:00
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.
hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.
That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.
cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.
Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy paste.
Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Reported-by: Mikael Beckius <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Tested-by: Mikael Beckius <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/hrtimer.c | 60 +++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 21 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852..788b9d1 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
}
/*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
+ * Recomputes cpu_base::*next_timer and returns the earliest expires_next
+ * but does not set cpu_base::*expires_next, that is done by
+ * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating
+ * cpu_base::*expires_next right away, reprogramming logic would no longer
+ * work.
*
* When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
* those timers will get run whenever the softirq gets handled, at the end of
@@ -588,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
return expires_next;
}
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+ ktime_t expires_next, soft = KTIME_MAX;
+
+ /*
+ * If the soft interrupt has already been activated, ignore the
+ * soft bases. They will be handled in the already raised soft
+ * interrupt.
+ */
+ if (!cpu_base->softirq_activated) {
+ soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ /*
+ * Update the soft expiry time. clock_settime() might have
+ * affected it.
+ */
+ cpu_base->softirq_expires_next = soft;
+ }
+
+ expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+ /*
+ * If a softirq timer is expiring first, update cpu_base->next_timer
+ * and program the hardware with the soft expiry time.
+ */
+ if (expires_next > soft) {
+ cpu_base->next_timer = cpu_base->softirq_next_timer;
+ expires_next = soft;
+ }
+
+ return expires_next;
+}
+
static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -628,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
ktime_t expires_next;
- /*
- * Find the current next expiration time.
- */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
- if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
- /*
- * When the softirq is activated, hrtimer has to be
- * programmed with the first hard hrtimer because soft
- * timer interrupt could occur too late.
- */
- if (cpu_base->softirq_activated)
- expires_next = __hrtimer_get_next_event(cpu_base,
- HRTIMER_ACTIVE_HARD);
- else
- cpu_base->softirq_expires_next = expires_next;
- }
+ expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -1644,8 +1662,8 @@ retry:
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the next expiry */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+ /* Reevaluate the clock bases for the [soft] next expiry */
+ expires_next = hrtimer_update_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
This patch (or at least the version of it that showed up in 5.10.24
LTS when combined with Android Common Kernel and some arm phone
platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
SuperSpeed+ 10gbps USB3 connection.
The regression is not there in 5.10.23, and is present in 5.10.24 and
5.10.26. Reverting just this one patch is confirmed to restore
performance (both on 5.10.24 and 5.10.26).
We don't know the cause, as we know nothing about hrtimers... but we
lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
firing as often as it should...
Unfortunately I have no idea how to replicate this on commonly
available hardware (or with a pure stable or 5.11/5.12 Linus tree)
since it requires a gadget capable usb3.1 10gbps controller (which I
only have access to in combination with a 5.10-based arm+dwc3 soc).
(the regression is visible with just usb3.0, but it's smaller due to
usb3.0 topping out at just under 4gbps, though it still drops to
2.2gbps -- and this still doesn't help since usb3 gadget capable
controllers are nearly as rare)
- Maciej & Lorenzo
On Tue, Apr 13, 2021 at 09:55:08AM -0700, Maciej Żenczykowski wrote:
> This patch (or at least the version of it that showed up in 5.10.24
> LTS when combined with Android Common Kernel and some arm phone
> platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
> down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
> SuperSpeed+ 10gbps USB3 connection.
>
> The regression is not there in 5.10.23, and is present in 5.10.24 and
> 5.10.26. Reverting just this one patch is confirmed to restore
> performance (both on 5.10.24 and 5.10.26).
>
> We don't know the cause, as we know nothing about hrtimers... but we
> lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
> firing as often as it should...
>
> Unfortunately I have no idea how to replicate this on commonly
> available hardware (or with a pure stable or 5.11/5.12 Linus tree)
> since it requires a gadget capable usb3.1 10gbps controller (which I
> only have access to in combination with a 5.10-based arm+dwc3 soc).
>
> (the regression is visible with just usb3.0, but it's smaller due to
> usb3.0 topping out at just under 4gbps, though it still drops to
> 2.2gbps -- and this still doesn't help since usb3 gadget capable
> controllers are nearly as rare)
>
To give context, the commit is now 46eb1701c046 ("hrtimer: Update
softirq_expires_next correctly after __hrtimer_get_next_event()") and is
attached below.
The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
not be happening, we can easily fix it but I guess no one has actually
had fast USB devices that noticed this until now :)
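Roughly, the pattern in question looks like this (paraphrased from
f_ncm.c, not quoted verbatim):

#define TX_TIMEOUT_NSECS	300000

static struct sk_buff *ncm_wrap_ntb(struct gether *port,
				    struct sk_buff *skb)
{
	struct f_ncm *ncm = func_to_ncm(&port->func);

	/* ... NTB bookkeeping elided ... */

	/*
	 * Delay the data until the NTB is full or the timer expires:
	 * the 300us timer is rearmed for every packet buffered.
	 */
	hrtimer_start(&ncm->task_timer, TX_TIMEOUT_NSECS,
		      HRTIMER_MODE_REL_SOFT);

	return NULL;	/* skb was queued into the pending NTB */
}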
thanks,
greg k-h
----------
commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4
Author: Anna-Maria Behnsen <[email protected]>
Date: Tue Feb 23 17:02:40 2021 +0100
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.
hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.
That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.
cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.
Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy paste.
Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Reported-by: Mikael Beckius <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Tested-by: Mikael Beckius <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..788b9d137de4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
}
/*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
+ * Recomputes cpu_base::*next_timer and returns the earliest expires_next
+ * but does not set cpu_base::*expires_next, that is done by
+ * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating
+ * cpu_base::*expires_next right away, reprogramming logic would no longer
+ * work.
*
* When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
* those timers will get run whenever the softirq gets handled, at the end of
@@ -588,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
return expires_next;
}
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+ ktime_t expires_next, soft = KTIME_MAX;
+
+ /*
+ * If the soft interrupt has already been activated, ignore the
+ * soft bases. They will be handled in the already raised soft
+ * interrupt.
+ */
+ if (!cpu_base->softirq_activated) {
+ soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ /*
+ * Update the soft expiry time. clock_settime() might have
+ * affected it.
+ */
+ cpu_base->softirq_expires_next = soft;
+ }
+
+ expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+ /*
+ * If a softirq timer is expiring first, update cpu_base->next_timer
+ * and program the hardware with the soft expiry time.
+ */
+ if (expires_next > soft) {
+ cpu_base->next_timer = cpu_base->softirq_next_timer;
+ expires_next = soft;
+ }
+
+ return expires_next;
+}
+
static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -628,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
ktime_t expires_next;
- /*
- * Find the current next expiration time.
- */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
- if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
- /*
- * When the softirq is activated, hrtimer has to be
- * programmed with the first hard hrtimer because soft
- * timer interrupt could occur too late.
- */
- if (cpu_base->softirq_activated)
- expires_next = __hrtimer_get_next_event(cpu_base,
- HRTIMER_ACTIVE_HARD);
- else
- cpu_base->softirq_expires_next = expires_next;
- }
+ expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -1644,8 +1662,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the next expiry */
- expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+ /* Reevaluate the clock bases for the [soft] next expiry */
+ expires_next = hrtimer_update_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
On Wed, Apr 14, 2021 at 2:14 AM Greg KH <[email protected]> wrote:
> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> attached below.
>
> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
> not be happening, we can easily fix it but I guess no one has actually
> had fast USB devices that noticed this until now :)
AIUI the purpose of this timer is to support packet aggregation. USB
transfers are relatively expensive/high latency. 6 Gbps is 500k
1500-byte packets per second, or one every 2us. So f_ncm buffers as
many packets as will fit into 16k (usually, 10 1500-byte packets), and
only initiates a USB transfer when those packets have arrived. That
ends up doing only one transfer every 20us. It sets a 300us timer to
ensure that if the 10 packets haven't arrived, it still sends out
whatever it has when the timer fires. The timer is set/updated on
every packet buffered by ncm.
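As a quick sanity check on those numbers (a user-space sketch, not
driver code):

#include <stdio.h>

int main(void)
{
	double pps = 6e9 / (1500 * 8);	/* ~500k 1500-byte packets/s */
	double gap_us = 1e6 / pps;	/* ~2us between packets      */
	int per_ntb = 16384 / 1500;	/* ~10 packets per 16k NTB   */

	printf("%.0f pps, %.0fus/packet, one NTB every %.0fus\n",
	       pps, gap_us, gap_us * per_ntb);
	return 0;
}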
Is this somehow much more expensive in 5.10.24 than it was before?
Even if this driver is somehow "holding it wrong", might there not be
other workloads that have a similar problem? What about regressions on
those workloads?
On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> On Wed, Apr 14, 2021 at 2:14 AM Greg KH <[email protected]> wrote:
>> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
>> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
>> attached below.
>>
>> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
>> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
>> not be happening, we can easily fix it but I guess no one has actually
>> had fast USB devices that noticed this until now :)
>
> AIUI the purpose of this timer is to support packet aggregation. USB
> transfers are relatively expensive/high latency. 6 Gbps is 500k
> 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> many packets as will fit into 16k (usually, 10 1500-byte packets), and
> only initiates a USB transfer when those packets have arrived. That
> ends up doing only one transfer every 20us. It sets a 300us timer to
> ensure that if the 10 packets haven't arrived, it still sends out
> whatever it has when the timer fires. The timer is set/updated on
> every packet buffered by ncm.
>
> Is this somehow much more expensive in 5.10.24 than it was before?
> Even if this driver is somehow "holding it wrong", might there not be
> other workloads that have a similar problem? What about regressions on
> those workloads?
Let's put the question of whether this hrtimer usage is sensible or not
aside for now.
I stared at the change for a while and did some experiments to recreate
the problem, but that didn't get me anywhere.
Could you please do the following?
Enable tracing and enable the following tracepoints:
timers/hrtimer_cancel
timers/hrtimer_start
timers/hrtimer_expire_entry
irq/softirq_raise
irq/softirq_enter
irq/softirq_exit
and function tracing filtered on ncm_wrap_ntb() and
package_for_tx() only (to reduce the noise).
Run the test on a kernels with and without that commit and collect trace
data for both.
That should give me a pretty clear picture what's going on.
Thanks,
tglx
On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner <[email protected]> wrote:
>
> On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> > On Wed, Apr 14, 2021 at 2:14 AM Greg KH <[email protected]> wrote:
> >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> >> attached below.
> >>
> >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> >> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
> >> not be happening, we can easily fix it but I guess no one has actually
> >> had fast USB devices that noticed this until now :)
> >
> > AIUI the purpose of this timer is to support packet aggregation. USB
> > transfers are relatively expensive/high latency. 6 Gbps is 500k
> > 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> > many packets as will fit into 16k (usually, 10 1500-byte packets), and
> > only initiates a USB transfer when those packets have arrived. That
> > ends up doing only one transfer every 20us. It sets a 300us timer to
> > ensure that if the 10 packets haven't arrived, it still sends out
> > whatever it has when the timer fires. The timer is set/updated on
> > every packet buffered by ncm.
> >
> > Is this somehow much more expensive in 5.10.24 than it was before?
> > Even if this driver is somehow "holding it wrong", might there not be
> > other workloads that have a similar problem? What about regressions on
> > those workloads?
>
> Let's put the question of whether this hrtimer usage is sensible or not
> aside for now.
>
> I stared at the change for a while and did some experiments to recreate
> the problem, but that didn't get me anywhere.
>
> Could you please do the following?
>
> Enable tracing and enable the following tracepoints:
>
> timers/hrtimer_cancel
> timers/hrtimer_start
> timers/hrtimer_expire_entry
> irq/softirq_raise
> irq/softirq_enter
> irq/softirq_exit
>
> and function tracing filtered on ncm_wrap_ntb() and
> package_for_tx() only (to reduce the noise).
>
> Run the test on a kernels with and without that commit and collect trace
> data for both.
>
> That should give me a pretty clear picture what's going on.
Lorenzo is trying to get the traces you asked for, or rather he’s
trying to get confirmation he can post them.
Our initial observation of these results seems to suggest that
updating the timer (hrtimer_start, which seems to also call
hrtimer_cancel) takes twice as long as it used to.
My gut feeling is that softirq_activated is usually false, and the old
code in such a case calls just __hrtimer_get_next_event(,
HRTIMER_ACTIVE_ALL). While the new code will first call
__hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then
__hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD)
Perhaps __hrtimer_get_next_event() should return both soft and hard
event times in one function call?
Or perhaps hrtimer_start should not call hrtimer_cancel?
We've also been looking at the f_ncm driver itself, and trying to
reduce the number of hrtimer_cancel/start calls.
It's pretty easy to reduce this by a factor of 10x (we’re not yet 100%
certain this is race free), which does drastically improve
performance.
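The gist of that experiment, as a sketch (assuming it is safe to let
the 300us deadline run from the first buffered packet instead of the
last one; this is not the shipped driver code):

	/* Only arm the flush timer when it is not already pending, so
	 * it is started roughly once per NTB instead of once per packet.
	 */
	if (!hrtimer_is_queued(&ncm->task_timer))
		hrtimer_start(&ncm->task_timer, TX_TIMEOUT_NSECS,
			      HRTIMER_MODE_REL_SOFT);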
However, it still only takes the regression from 60% to 20%.
- Maciej
On Mon, Apr 19 2021 at 20:12, Maciej Żenczykowski wrote:
> On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner <[email protected]> wrote:
>> Run the test on a kernels with and without that commit and collect trace
>> data for both.
>>
>> That should give me a pretty clear picture what's going on.
>
> Lorenzo is trying to get the traces you asked for, or rather he’s
> trying to get confirmation he can post them.
>
> Our initial observation of these results seems to suggest that
> updating the timer (hrtimer_start, which seems to also call
> hrtimer_cancel) takes twice as long as it used to.
Which contradicts my measurements. The change in complexity is marginal
and the overhead in cycles/instructions is close to noise. It's
measurable in a microbenchmark, but it's in the < 1% range which is far
away from the 60% you are seeing.
> My gut feeling is that softirq_activated is usually false, and the old
> code in such a case calls just __hrtimer_get_next_event(,
> HRTIMER_ACTIVE_ALL). While the new code will first call
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD)
>
> Perhaps __hrtimer_get_next_event() should return both soft and hard
> event times in one function call?
> Or perhaps hrtimer_start should not call hrtimer_cancel?
Perhaps we do a proper analysis first :)
Thanks,
tglx
On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner <[email protected]> wrote:
> Enable tracing and enable the following tracepoints:
> [...]
Sorry for the delay. I had to learn a bit about how to use the tracing
infrastructure. I don't know if I can post them here, but to my untrained
eye, one big difference between the old (fast) code and the new (slow)
code is that the new code calls tick_program_event() much more. It
looks like that makes most of the difference.
With the old code, hrtimer_start_range_ns almost never calls
tick_program_event at all, but the new code seems to call it twice on
every timer update.
On Tue, Apr 20 2021 at 17:15, Lorenzo Colitti wrote:
> On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner <[email protected]> wrote:
>> Enable tracing and enable the following tracepoints:
>> [...]
>
> Sorry for the delay. I had to learn a bit about how to use the tracing
> infrastructure. I don't know if I can post them here, but to my untrained
> eye, one big difference between the old (fast) code and the new (slow)
> code is that the new code calls tick_program_event() much more. It
> looks like that makes most of the difference.
>
> With the old code, hrtimer_start_range_ns almost never calls
> tick_program_event at all, but the new code seems to call it twice on
> every timer update.
Yes, I figured out why that happens by now, but the old behaviour was
just incorrect. So now there are clearly two issues:
1) hrtimer is contrary to timer_list not really suited for high
frequency start/cancel/start/... cycles of a timer. It's optimized
for the start and expire precisely case.
2) The cost for reprogramming depends on the underlying hardware. With
halfways sane timer hardware it's minimal and as I measured in a
micro benchmark in the 1% range. Now when your system ends up
having one of the real timer trainwrecks which can be found in
drivers/clocksource/ which are even worse than the x86 HPET horror,
then it's going to be painful and a performance issue.
I assume that's an ARM64 system. ARM64 CPUs have an architected per
CPU timer where the reprogramming is pretty fast as it's next to
the CPU, but who knows what your system is using.
Now in the meantime I looked into __hrtimer_start_range_ns() whether
that double reprogram can be avoided without creating a total trainwreck
and imposing penalty on all sensible use cases. Completely untested
patch below should do the trick and it's not ugly enough that I hate it
with a passion.
Even if that makes your problem go away #1 is still beyond suboptimal
and #2 is something you really want to look into.
Thanks,
tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
* remove hrtimer, called with base lock held
*/
static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool restart, bool keep_local)
{
u8 state = timer->state;
if (state & HRTIMER_STATE_ENQUEUED) {
- int reprogram;
+ bool reprogram;
/*
* Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
debug_deactivate(timer);
reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ /*
+ * If the timer is not restarted then reprogramming is
+ * required if the timer is local. If it is local and about
+ * to be restarted, avoid programming it twice (on removal
+ * and a moment later when it's requeued).
+ */
if (!restart)
state = HRTIMER_STATE_INACTIVE;
+ else
+ reprogram &= !keep_local;
__remove_hrtimer(timer, base, state, reprogram);
return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
struct hrtimer_clock_base *base)
{
struct hrtimer_clock_base *new_base;
+ bool force_local, first;
- /* Remove an active timer from the queue: */
- remove_hrtimer(timer, base, true);
+ /*
+ * If the timer is on the local cpu base and is the first expiring
+ * timer then this might end up reprogramming the hardware twice
+ * (on removal and on enqueue). To avoid that, prevent the
+ * reprogram on removal: keep the timer local to the current CPU
+ * and enforce reprogramming after it is queued no matter whether
+ * it is the new first expiring timer again or not.
+ */
+ force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ force_local &= base->cpu_base->next_timer == timer;
+
+ /*
+ * Remove an active timer from the queue. In case it is not queued
+ * on the current CPU, make sure that remove_hrtimer() updates the
+ * remote data correctly.
+ *
+ * If it's on the current CPU and the first expiring timer, then
+ * skip reprogramming, keep the timer local and enforce
+ * reprogramming later if it was the first expiring timer. This
+ * avoids programming the underlying clock event twice (once at
+ * removal and once after enqueue).
+ */
+ remove_hrtimer(timer, base, true, force_local);
if (mode & HRTIMER_MODE_REL)
tim = ktime_add_safe(tim, base->get_time());
@@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru
hrtimer_set_expires_range_ns(timer, tim, delta_ns);
/* Switch the timer base, if necessary: */
- new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+ if (!force_local) {
+ new_base = switch_hrtimer_base(timer, base,
+ mode & HRTIMER_MODE_PINNED);
+ } else {
+ new_base = base;
+ }
- return enqueue_hrtimer(timer, new_base, mode);
+ first = enqueue_hrtimer(timer, new_base, mode);
+ if (!force_local)
+ return first;
+
+ /*
+ * Timer was forced to stay on the current CPU to avoid
+ * reprogramming on removal and enqueue. Force reprogram the
+ * hardware by evaluating the new first expiring timer.
+ */
+ hrtimer_force_reprogram(new_base->cpu_base, 1);
+ return 0;
}
/**
@@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer
base = lock_hrtimer_base(timer, &flags);
if (!hrtimer_callback_running(timer))
- ret = remove_hrtimer(timer, base, false);
+ ret = remove_hrtimer(timer, base, false, false);
unlock_hrtimer_base(timer, &flags);
On Tue, Apr 20, 2021 at 11:19 PM Thomas Gleixner <[email protected]> wrote:
> 1) hrtimer is contrary to timer_list not really suited for high
> frequency start/cancel/start/... cycles of a timer. It's optimized
> for the start and expire precisely case.
Ack. This is not what the f_ncm gadget code needs. It just needs a
timer to fire "about 300us after the last packet submitted". Under
load the timer will almost never fire since there will always be
another packet coming. So the speed of adding/updating the timer is
much more important than the accuracy. We will try to move it to
timer_list.
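Something along these lines, perhaps (an untested sketch; the names
are made up):

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list flush_timer;

static void flush_timeout(struct timer_list *t)
{
	/* send out whatever is buffered in the pending NTB */
}

static void flush_setup(void)
{
	timer_setup(&flush_timer, flush_timeout, 0);
}

static void packet_buffered(void)
{
	/* called per packet; rearming a timer_list timer is cheap */
	mod_timer(&flush_timer, jiffies + usecs_to_jiffies(300));
}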
> I assume that's an ARM64 system. ARM64 CPUs have an architected per
> CPU timer where the reprogramming is pretty fast as it's next to
> the CPU, but who knows what your system is using.
This system appears to be using timer hardware that is... slow to
program (microseconds). We're looking at whether it's possible to use
the arch timer instead.
> Now in the meantime I looked into __hrtimer_start_range_ns() whether
> that double reprogram can be avoided without creating a total trainwreck
> and imposing penalty on all sensible use cases. Completely untested
> patch below should do the trick and it's not ugly enough that I hate it
> with a passion.
I tested this and in my simple benchmark, the double calls are gone
and hrtimer_start calls tick_program_event approximately once (more
precisely, 90.06% of the time; sometimes it doesn't call it at all).
This is not enough to cancel the regression we're seeing because the
previous code would pretty much never call tick_program_event at all.
But it's definitely better.
On Wed, Apr 21, 2021 at 11:08 PM Lorenzo Colitti <[email protected]> wrote:
> load the timer will almost never fire since there will always be
> another packet coming. So the speed of adding/updating the timer is
> much more important than the accuracy. We will try to move it to
> timer_list.
... but actually, that's not acceptable, because AFAICS timer_list
counts in jiffies so its resolution might be as low as 10ms. It's not
acceptable to delay network traffic by up to 10ms under load. Not sure
if there is another timer API that can be used here.
On Wed, Apr 21, 2021 at 11:40:54PM +0900, Lorenzo Colitti wrote:
> On Wed, Apr 21, 2021 at 11:08 PM Lorenzo Colitti <[email protected]> wrote:
> > load the timer will almost never fire since there will always be
> > another packet coming. So the speed of adding/updating the timer is
> > much more important than the accuracy. We will try to move it to
> > timer_list.
>
> ... but actually, that's not acceptable, because AFAICS timer_list
> counts in jiffies so its resolution might be as low as 10ms. It's not
> acceptable to delay network traffic by up to 10ms under load. Not sure
> if there is another timer API that can be used here.
What do other network drivers do? This has to be a common problem, why
is this USB driver somehow unique here?
thanks,
greg k-h
On Wed, Apr 21 2021 at 23:08, Lorenzo Colitti wrote:
> On Tue, Apr 20, 2021 at 11:19 PM Thomas Gleixner <[email protected]> wrote:
>> 1) hrtimer is contrary to timer_list not really suited for high
>> frequency start/cancel/start/... cycles of a timer. It's optimized
>> for the start and expire precisely case.
>
> Ack. This is not what the f_ncm gadget code needs. It just needs a
> timer to fire "about 300us after the last packet submitted". Under
> load the timer will almost never fire since there will always be
> another packet coming. So the speed of adding/updating the timer is
> much more important than the accuracy. We will try to move it to
> timer_list.
>
>> I assume that's an ARM64 system. ARM64 CPUs have an architected per
>> CPU timer where the reprogramming is pretty fast as it's next to
>> the CPU, but who knows what your system is using.
>
> This system appears to be using timer hardware that is... slow to
> program (microseconds). We're looking at whether it's possible to use
> the arch timer instead.
>
>> Now in the meantime I looked into __hrtimer_start_range_ns() whether
>> that double reprogram can be avoided without creating a total trainwreck
>> and imposing penalty on all sensible use cases. Completely untested
>> patch below should do the trick and it's not ugly enough that I hate it
>> with a passion.
>
> I tested this and in my simple benchmark, the double calls are gone
> and hrtimer_start calls tick_program_event approximately once (more
> precisely, 90.06% of the time; sometimes it doesn't call it at all).
> This is not enough to cancel the regression we're seeing because the
> previous code would pretty much never call tick_program_event at all.
> But it's definitely better.
It's hardly a regression when the behaviour relies on a bug in the core
code which should have never been there in the first place.
If you look at what you are doing it's obvious that reprogramming has to
take place quite often.
Your timeout is relative 300us and it's moved every N microseconds
ahead. So if there is not much other hrtimer activity on that CPU which
is often the case and the only other relevant hrtimer which is armed
(and expires) is the tick, then it depends on the HZ setting how often
reprogramming is required.
Let's assume HZ=1000 and a rearm period of 20us, which means the timer is
rearmed 50 times per tick, and reprogramming is needed for each rearm
until the tick timer becomes the first expiring timer, i.e. until 300us
before the tick. So 700/20 = 35 times per 1ms the hardware needs to be
updated, because the first expiry time moves 20us farther into the future
on every update. That's 70%.
So depending on how long that takes on your machine 35 updates can be a
significant amount of time.
With HZ=250 that's 200 rearms per tick, and the reprogramming count is
then 3700/20 = 185, i.e. 92.5%, which is close enough to the 90% you
are seeing...
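The same arithmetic as a sketch in code (assuming, as above, a 20us
rearm period, a 300us relative timeout and the tick as the only other
armed timer):

/* Fraction of rearms which must reprogram the clockevent hardware. */
static unsigned int reprogram_percent(unsigned int hz)
{
	unsigned int period_us = 1000000 / hz;		/* tick period      */
	unsigned int rearms = period_us / 20;		/* rearms per tick  */
	unsigned int reprog = (period_us - 300) / 20;	/* rearms hitting HW */

	return reprog * 100 / rearms;	/* 70 at HZ=1000, ~92 at HZ=250 */
}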
Now, if your timer hardware needs several microseconds to be programmed
then it's obvious that you lose a lot of CPU time.
Just for comparison. In a VM I'm experimenting with right now the
reprogramming time is ~500ns which is still a lot of cycles, but
compared to 5us it is faster by an order of magnitude. And on a sane
machine on bare metal it's way faster and therefore less noticeable.
The use case still sucks, but that's not a problem of hrtimers as they
are not designed for this use case.
Now looking at the usecase. The comment above the hrtimer callback,
which is the only hint there, says:
* The transmit should only be run if no skb data has been sent for a
* certain duration.
which is useless word salad.
So looking at the commit which added that muck: 6d3865f9d41f ("usb:
gadget: NCM: Add transmit multi-frame."). The changelog mutters:
It has a time out of 300ms to ensure that smaller number of packets
still maintain a normal latency.
while the timeout is defined as:
+#define TX_TIMEOUT_NSECS 300000
which is clearly 300us and not 300ms. So much for useful information in
changelogs.
But at least the rest of the commit message gives a clue why this is
useful:
This adds multi-frame support to the NCM NTB's for
the gadget driver. This allows multiple network
packets to be put inside a single USB NTB with a
maximum size of 16kB.
So the timer ensures that for batching xmit packets into a single NTB
the maximum delay is 300us if the NTB is not filled up on time.
Though there is ZERO explanation why 300us is a sensible number and why
it's a brilliant idea to use a hrtimer which is clearly documented to be
not suitable for this.
I'm pretty sure that the original author did not develop that on
hardware which used a timer based on a hardware trainwreck, but alone
the fact that:
- rbtree reshuffling _is_ always required
- high frequency reprogramming of the timer hardware might be required
- in case that CONFIG_HIGH_RES_TIMERS is off or not enabled at runtime
for whatever reason this is still subject to the CONFIG_HZ setting
tells me that this is clearly well thought out and based on "works for
me" engineering.
Plus at the time when this was hacked up, the bug which was fixed by the
commit you refer to did not exist yet, as it was introduced with the
softirq based expiry changes years later.
Back then the timer expiry was scheduling a tasklet to ensure that
->ndo_start_xmit() is called from softirq context.
The softirq based expiry changes which introduced the correctness bug
removed that tasklet indirection.
And going back to 4.14 yields the proof of this. The same microbenchmark
as described above i.e. modifying the timer every 20us has exactly the
same effect vs. reprogramming the clock event device:
It's programmed twice on every hrtimer_start() ~90% of the time
So no. The real regression was introduced with:
5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
and related changes, but that had not a negative effect on that USB
driver usage, quite the contrary it papered over it.
commit 46eb1701c0 fixed that regression and thereby unearthed the issue
with that USB driver usage in combination with that timer
hardware.
If 5da70160462e et al. would not had introduced that bug then the
problems with that hrtimer usage in combination with that timer hardware
trainwreck would have been noticed way earlier. IOW it would never have
worked at all.
Aside from that, any use case which does not use the _SOFT variant of
hrtimers was not affected by 5da70160462e et al. Which in turn means
that all !_SOFT usecases are not abusing hrtimers and NONE of the
_SOFT usecases cares about correctness and accuracy at all.
So pretty please stop claiming that this is a regression. It was broken
forever, but papered over and therefore just not noticed.
IOW, the problem was introduced with 6d3865f9d41f, i.e. 6+ years
ago. The reasons why this pops up now are:
- There has been an unnoticed bug in the hrtimer code since
  5da70160462e, i.e. for 3 years.
- It's only observable on hardware which has abysmal per-CPU timers.
TBH, I've been telling HW people for 15+ years now that clockevents and
clocksources are performance critical and relevant for correctness, but
obviously they don't listen. Otherwise we would not have code in tree
which has comments like this:
hpet_clkevt_set_next_event(...)
* HPETs are a complete disaster. The compare register is
or any other clock event which is compare-equal based and has to do
readback-based interrupt loss prevention like
pxa_osmr0_set_next_event() and other trainwrecks.
Along with the insanities of watchdogs to monitor correctness and
clocksources/events which stop working when entering deep power states.
Not to mention gems like this:
exynos4_mct_write()
/* Wait maximum 1 ms until written values are applied */
along with all the magic patches which make this the preferred device
for the very wrong reasons.
I have no idea which particular trainwreck you are dealing with because
you didn't say, but TBH I don't want to know at all. It's probably just
another incarnation of broken which is not covered by the above, but
fits nicely.
That said, even if you manage to avoid the timer hardware trainwreck
which your system is depending on right now, the underlying problem
remains that hrtimers are not the proper tool for high frequency
modification and never will be.
There are smarter approaches to that USB/NTB problem, but that's beyond
the discussion at hand.
Thanks,
tglx
On Thu, Apr 22 2021 at 02:08, Thomas Gleixner wrote:
> On Wed, Apr 21 2021 at 23:08, Lorenzo Colitti wrote:
> That said, even if you manage to avoid the timer hardware trainwreck
> which your system is depending on right now, the underlying problem
> remains that hrtimers are not the proper tool for high frequency
> modification and never will be.
>
> There are smarter approaches to that USB/NTB problem, but that's beyond
> the discussion at hand.
And because I was looking at it anyway, here are some hints:
Buffer size: 16k
Timeout: 300us
Let's assume small packets with a size of 256 bytes each. The packets
arrive with a period of ~299us, i.e. right before the timer fires, which
means the timer won't fire because it's moved 300us ahead each time and
the buffer is only sent to the wire when it's full.
Since 16k / 256 bytes = 64 packets fill the buffer, the 1st packet in
the buffer is sent 63 * 299us = 18.8ms after it was queued.
How is that supposed to be correct? I know, throughput is what matters
and correctness is overrated.
For the high volume traffic case the timer is not needed at all if
the logic is done properly.
xmit()
{
        if (!packet_fits_in_buffer())
                send_buffer_and_cancel_timer_if_armed();

        if (first_packet_in_buffer())
                buffer_deadline = now() + MAX_DELAY;

        add_packet_to_buffer();

        /* Are more packets available for xmit at the core? */
        if (!xmit_queue_empty())
                return;

        /*
         * Last packet from network core for now, check how long the
         * first packet sits in the buffer.
         */
        if (buffer_deadline <= now()) {
                send_buffer_and_cancel_timer_if_armed();
                return;
        }

        if (!timer_armed())
                arm_timer(buffer_deadline);
}
IOW, as long as the network core has packets outstanding for xmit, there
is no point in arming the timer at all.
But yes, that needs more work and thoughts than the 'bolt and duct tape'
approach which led to the current situation.
Aside from that, it would certainly be interesting to start very simply
with an experiment which does not use a timer in the first place:
xmit()
{
        if (!packet_fits_in_buffer())
                send_buffer();

        add_packet_to_buffer();

        /* Are more packets available for xmit at the core? */
        if (!xmit_queue_empty())
                return;

        /* Last packet from network core for now. Drain it. */
        send_buffer();
}
I wouldn't be surprised if that'd turn out to be the right thing to do
both latency-wise and throughput-wise.
But that'd not be convoluted enough, and of course the commit which
introduced that magic does not explain anything at all, so I can't tell
if that was actually tried or not.
Thanks,
tglx
On Thu, Apr 22, 2021 at 9:08 AM Thomas Gleixner <[email protected]> wrote:
> Just for comparison. In a VM I'm experimenting with right now the
> reprogramming time is ~500ns, which is still a lot of cycles, but
> compared to 5us, faster by an order of magnitude. And on the same
> machine on bare metal it's way faster and therefore less noticeable.
FWIW, on this hardware, ftrace says that arming the arm64 architected
timer takes 0.7us. Definitely better than 2-3us, but still not free.
This is not a high-end desktop or server, but it's also not super
slow, low-power hardware.
> * The transmit should only be run if no skb data has been sent for a
> * certain duration.
>
> which is useless word salad.
You're the one who wrote that comment - see b1a31a5f5f27. You'll
forgive me for being amused. :-)
Thanks for the history/analysis/suggestions. I think it's a fact that
this is a regression in performance: this particular code has
performed well for a couple of years now. The fact that the good
performance only existed due to a correctness bug in the hrtimer code
definitely does make it harder to argue that the fix should be
reverted.
That said: if you have a fix for the double reprogram, then that fix
should probably be applied? 0.5us is not free, and even if hrtimers
aren't designed for frequent updates, touching the hardware twice as
often does seem like a bad idea, since, as you point out, there's a
*lot* of hardware that is slow.
Separately, we're also going to look at making ncm better. (In defense
of the original author, in 2014 I don't think anyone would even have
dreamed of USB being fast enough for this to be a problem.) The first
thing we're going to try to do is set the timer once per NTB instead
of once per packet (so, 10x less). My initial attempt to do that
causes the link to die after a while and I need to figure out why
before I can send a patch up. I'm suspicious of the threading, which
uses non-atomic variables (timer_force_tx, ncm->skb_tx_data) to
synchronize control flow between the timer and the transmit function,
which can presumably run on different CPUs. That seems wrong since
either core could observe stale variables. But perhaps there are
memory barriers that I'm not aware of.
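For illustration, the kind of handoff I would have expected, sketched
with a made-up flag bit (set_bit()/test_and_clear_bit() are the real
bitops API; everything else here is hypothetical):

#include <linux/bitops.h>

/* Hypothetical flag word; the driver currently uses plain bools. */
#define NCM_TX_DATA_PENDING     0
static unsigned long ncm_tx_flags;

/* xmit path: note that NTB data is pending */
static void ncm_mark_pending(void)
{
        set_bit(NCM_TX_DATA_PENDING, &ncm_tx_flags);
}

/*
 * Timer callback, possibly on another CPU: claim the pending data.
 * test_and_clear_bit() is a value-returning atomic RMW and therefore
 * fully ordered, so both sides agree on who sends the NTB.
 */
static bool ncm_claim_pending(void)
{
        return test_and_clear_bit(NCM_TX_DATA_PENDING, &ncm_tx_flags);
}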
The idea of getting rid of the timer by doing aggregation based on
transmit queue lengths seems like a much larger effort, but probably
one that is required to actually improve performance substantially
beyond what it is now.
On Thu, Apr 22 2021 at 23:20, Lorenzo Colitti wrote:
> On Thu, Apr 22, 2021 at 9:08 AM Thomas Gleixner <[email protected]> wrote:
>> Just for comparison. In a VM I'm experimenting with right now the
>> reprogramming time is ~500ns, which is still a lot of cycles, but
>> compared to 5us, faster by an order of magnitude. And on the same
>> machine on bare metal it's way faster and therefore less noticeable.
>
> FWIW, on this hardware, ftrace says that arming the arm64 architected
> timer takes 0.7us. Definitely better than 2-3us, but still not free.
> This is not a high-end desktop or server, but it's also not super
> slow, low-power hardware.
>
>> * The transmit should only be run if no skb data has been sent for a
>> * certain duration.
>>
>> which is useless word salad.
>
> You're the one who wrote that comment - see b1a31a5f5f27. You'll
> forgive me for being amused. :-)
Rightfully so! I still call it word salad :)
> Thanks for the history/analysis/suggestions. I think it's a fact that
> this is a regression in performance: this particular code has
> performed well for a couple of years now. The fact that the good
> performance only existed due to a correctness bug in the hrtimer code
> definitely does make it harder to argue that the fix should be
> reverted.
We tend to disagree about the naming conventions here, but we seem at
least to agree that reverting a fix for a correctness bug (which has way
worse implications than slowing down a gruesome driver) is not going to
happen.
> That said: if you have a fix for the double reprogram, then that fix
> should probably be applied? 0.5us is not free, and even if hrtimers
> aren't designed for frequent updates, touching the hardware twice as
> often does seem like a bad idea, since, as you point out, there's a
> *lot* of hardware that is slow.
That's an obvious improvement, but not a fix. And I checked quite a few
hrtimer users, and there are only a few which ever rearm a queued timer,
and that happens infrequently.
> Separately, we're also going to look at making ncm better. (In defense
> of the original author, in 2014 I don't think anyone would even have
> dreamed of USB being fast enough for this to be a problem.) The first
> thing we're going to try to do is set the timer once per NTB instead
> of once per packet (so, 10x less). My initial attempt to do that
> causes the link to die after a while and I need to figure out why
> before I can send a patch up. I'm suspicious of the threading, which
> uses non-atomic variables (timer_force_tx, ncm->skb_tx_data) to
> synchronize control flow between the timer and the transmit function,
> which can presumably run on different CPUs. That seems wrong since
> either core could observe stale variables. But perhaps there are
> memory barriers that I'm not aware of.
Not that I see any.
> The idea of getting rid of the timer by doing aggregation based on
> transmit queue lengths seems like a much larger effort, but probably
> one that is required to actually improve performance substantially
> beyond what it is now.
I don't think it's a huge effort. netdev_xmit_more() should tell you
what you need to know.
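Roughly, as a sketch (the ncm_* helpers are made up here;
netdev_xmit_more() is the real interface):

static netdev_tx_t ncm_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct ncm_dev *ncm = netdev_priv(dev); /* hypothetical type */

        if (!ncm_packet_fits(ncm, skb))
                ncm_send_ntb(ncm);              /* flush the full NTB */

        ncm_add_to_ntb(ncm, skb);

        /*
         * netdev_xmit_more() is true as long as the core has more
         * packets queued in this xmit batch, so only flush on the
         * last one. No timer needed for the high volume case.
         */
        if (!netdev_xmit_more())
                ncm_send_ntb(ncm);

        return NETDEV_TX_OK;
}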
Thanks,
tglx
If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.
If the new expiry time ends up being the first expiry again then the clock
event device has to be reprogrammed again.
Avoid this by checking whether the timer is the first to expire and, in
that case, keeping the timer on the current CPU and delaying the
reprogramming up to the point where the timer has been enqueued again.
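IOW, the sequence to avoid is roughly (a call flow sketch, not literal
code):

hrtimer_start(timer)            /* timer is the first expiring timer */
  remove_hrtimer()
    __remove_hrtimer(..., reprogram)
      -> clockevent programmed to the next timer   /* hardware write #1 */
  enqueue_hrtimer()             /* timer is first again */
    hrtimer_reprogram()
      -> clockevent programmed back to timer       /* hardware write #2 */

With keep_local/force_local set, write #1 is skipped and the hardware is
touched only once, after the enqueue.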
Reported-by: Lorenzo Colitti <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/hrtimer.c | 60 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 53 insertions(+), 7 deletions(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
* remove hrtimer, called with base lock held
*/
static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool restart, bool keep_local)
{
u8 state = timer->state;
if (state & HRTIMER_STATE_ENQUEUED) {
- int reprogram;
+ bool reprogram;
/*
* Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
debug_deactivate(timer);
reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ /*
+ * If the timer is not restarted then reprogramming is
+ * required if the timer is local. If it is local and about
+ * to be restarted, avoid programming it twice (on removal
+ * and a moment later when it's requeued).
+ */
if (!restart)
state = HRTIMER_STATE_INACTIVE;
+ else
+ reprogram &= !keep_local;
__remove_hrtimer(timer, base, state, reprogram);
return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
struct hrtimer_clock_base *base)
{
struct hrtimer_clock_base *new_base;
+ bool force_local, first;
- /* Remove an active timer from the queue: */
- remove_hrtimer(timer, base, true);
+ /*
+ * If the timer is on the local cpu base and is the first expiring
+ * timer then this might end up reprogramming the hardware twice
+ * (on removal and on enqueue). To avoid that, prevent the
+ * reprogram on removal: keep the timer local to the current CPU
+ * and enforce reprogramming after it is queued no matter whether
+ * it is the new first expiring timer again or not.
+ */
+ force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+ force_local &= base->cpu_base->next_timer == timer;
+
+ /*
+ * Remove an active timer from the queue. In case it is not queued
+ * on the current CPU, make sure that remove_hrtimer() updates the
+ * remote data correctly.
+ *
+ * If it's on the current CPU and the first expiring timer, then
+ * skip reprogramming, keep the timer local and enforce
+ * reprogramming later if it was the first expiring timer. This
+ * avoids programming the underlying clock event twice (once at
+ * removal and once after enqueue).
+ */
+ remove_hrtimer(timer, base, true, force_local);
if (mode & HRTIMER_MODE_REL)
tim = ktime_add_safe(tim, base->get_time());
@@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru
hrtimer_set_expires_range_ns(timer, tim, delta_ns);
/* Switch the timer base, if necessary: */
- new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+ if (!force_local) {
+ new_base = switch_hrtimer_base(timer, base,
+ mode & HRTIMER_MODE_PINNED);
+ } else {
+ new_base = base;
+ }
- return enqueue_hrtimer(timer, new_base, mode);
+ first = enqueue_hrtimer(timer, new_base, mode);
+ if (!force_local)
+ return first;
+
+ /*
+ * Timer was forced to stay on the current CPU to avoid
+ * reprogramming on removal and enqueue. Force reprogram the
+ * hardware by evaluating the new first expiring timer.
+ */
+ hrtimer_force_reprogram(new_base->cpu_base, 1);
+ return 0;
}
/**
@@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer
base = lock_hrtimer_base(timer, &flags);
if (!hrtimer_callback_running(timer))
- ret = remove_hrtimer(timer, base, false);
+ ret = remove_hrtimer(timer, base, false, false);
unlock_hrtimer_base(timer, &flags);
On Mon, Apr 26, 2021 at 10:49:33AM +0200, Thomas Gleixner wrote:
> If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
> the timer has to be canceled first and then added back. If the timer is the
> first expiring timer then on removal the clockevent device is reprogrammed
> to the next expiring timer to avoid that the pending expiry fires needlessly.
>
> If the new expiry time ends up being the first expiry again then the clock
> event device has to be reprogrammed again.
>
> Avoid this by checking whether the timer is the first to expire and, in
> that case, keeping the timer on the current CPU and delaying the
> reprogramming up to the point where the timer has been enqueued again.
>
> Reported-by: Lorenzo Colitti <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> kernel/time/hrtimer.c | 60 ++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 53 insertions(+), 7 deletions(-)
>
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
> * remove hrtimer, called with base lock held
> */
> static inline int
> -remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
> +remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
> + bool restart, bool keep_local)
> {
> u8 state = timer->state;
>
> if (state & HRTIMER_STATE_ENQUEUED) {
> - int reprogram;
> + bool reprogram;
>
> /*
> * Remove the timer and force reprogramming when high
> @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
> debug_deactivate(timer);
> reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>
> + /*
> + * If the timer is not restarted then reprogramming is
> + * required if the timer is local. If it is local and about
> + * to be restarted, avoid programming it twice (on removal
> + * and a moment later when it's requeued).
> + */
> if (!restart)
> state = HRTIMER_STATE_INACTIVE;
> + else
> + reprogram &= !keep_local;
reprogram = reprogram && !keep_local;
perhaps?
>
> __remove_hrtimer(timer, base, state, reprogram);
> return 1;
> @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
> struct hrtimer_clock_base *base)
> {
> struct hrtimer_clock_base *new_base;
> + bool force_local, first;
>
> - /* Remove an active timer from the queue: */
> - remove_hrtimer(timer, base, true);
> + /*
> + * If the timer is on the local cpu base and is the first expiring
> + * timer then this might end up reprogramming the hardware twice
> + * (on removal and on enqueue). To avoid that, prevent the
> + * reprogram on removal: keep the timer local to the current CPU
> + * and enforce reprogramming after it is queued no matter whether
> + * it is the new first expiring timer again or not.
> + */
> + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
> + force_local &= base->cpu_base->next_timer == timer;
Using bitwise ops on a bool is cute and all, but isn't that more
readable when written like:
force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
base->cpu_base->next_timer == timer;
> +
> + /*
> + * Remove an active timer from the queue. In case it is not queued
> + * on the current CPU, make sure that remove_hrtimer() updates the
> + * remote data correctly.
> + *
> + * If it's on the current CPU and the first expiring timer, then
> + * skip reprogramming, keep the timer local and enforce
> + * reprogramming later if it was the first expiring timer. This
> + * avoids programming the underlying clock event twice (once at
> + * removal and once after enqueue).
> + */
> + remove_hrtimer(timer, base, true, force_local);
>
> if (mode & HRTIMER_MODE_REL)
> tim = ktime_add_safe(tim, base->get_time());
> @@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru
> hrtimer_set_expires_range_ns(timer, tim, delta_ns);
>
> /* Switch the timer base, if necessary: */
> - new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
> + if (!force_local) {
> + new_base = switch_hrtimer_base(timer, base,
> + mode & HRTIMER_MODE_PINNED);
> + } else {
> + new_base = base;
> + }
>
> - return enqueue_hrtimer(timer, new_base, mode);
> + first = enqueue_hrtimer(timer, new_base, mode);
> + if (!force_local)
> + return first;
> +
> + /*
> + * Timer was forced to stay on the current CPU to avoid
> + * reprogramming on removal and enqueue. Force reprogram the
> + * hardware by evaluating the new first expiring timer.
> + */
> + hrtimer_force_reprogram(new_base->cpu_base, 1);
> + return 0;
> }
There is an unfortunate amount of duplication between
hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
don't work however :/ Still, does that in_hrtirq optimization make sense
to have in force_reprogram ?
On Mon, Apr 26, 2021 at 11:40:46AM +0200, Peter Zijlstra wrote:
> There is an unfortunate amount of duplication between
> hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
> don't work however :/ Still, does that in_hrtirq optimization make sense
> to have in force_reprogram ?
Something like so perhaps?
---
kernel/time/hrtimer.c | 75 ++++++++++++++++++++++-----------------------------
1 file changed, 32 insertions(+), 43 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..fb8d2c58443a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,21 +652,27 @@ static inline int hrtimer_hres_active(void)
return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
}
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+__hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
+ struct hrtimer *next_timer, ktime_t expires_next)
{
- ktime_t expires_next;
+ /*
+ * If the hrtimer interrupt is running, then it will
+ * reevaluate the clock bases and reprogram the clock event
+ * device. The callbacks are always executed in hard interrupt
+ * context so we don't need an extra check for a running
+ * callback.
+ */
+ if (cpu_base->in_hrtirq)
+ return;
- expires_next = hrtimer_update_next_event(cpu_base);
+ if (expires_next > cpu_base->expires_next)
+ return;
if (skip_equal && expires_next == cpu_base->expires_next)
return;
+ cpu_base->next_timer = next_timer;
cpu_base->expires_next = expires_next;
/*
@@ -689,7 +695,23 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
return;
- tick_program_event(cpu_base->expires_next, 1);
+ tick_program_event(expires_next, 1);
+}
+
+/*
+ * Reprogram the event source with checking both queues for the
+ * next event
+ * Called with interrupts disabled and base->lock held
+ */
+static void
+hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+{
+ ktime_t expires_next;
+
+ expires_next = hrtimer_update_next_event(cpu_base);
+
+ __hrtimer_force_reprogram(cpu_base, skip_equal,
+ cpu_base->next_timer, expires_next);
}
/* High resolution timer related functions */
@@ -835,40 +857,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
if (base->cpu_base != cpu_base)
return;
- /*
- * If the hrtimer interrupt is running, then it will
- * reevaluate the clock bases and reprogram the clock event
- * device. The callbacks are always executed in hard interrupt
- * context so we don't need an extra check for a running
- * callback.
- */
- if (cpu_base->in_hrtirq)
- return;
-
- if (expires >= cpu_base->expires_next)
- return;
-
- /* Update the pointer to the next expiring timer */
- cpu_base->next_timer = timer;
- cpu_base->expires_next = expires;
-
- /*
- * If hres is not active, hardware does not have to be
- * programmed yet.
- *
- * If a hang was detected in the last timer interrupt then we
- * do not schedule a timer which is earlier than the expiry
- * which we enforced in the hang detection. We want the system
- * to make progress.
- */
- if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
- return;
-
- /*
- * Program the timer hardware. We enforce the expiry for
- * events which are already in the past.
- */
- tick_program_event(expires, 1);
+ __hrtimer_force_reprogram(cpu_base, true, timer, expires);
}
/*
On Mon, Apr 26 2021 at 11:40, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 10:49:33AM +0200, Thomas Gleixner wrote:
>> If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
>> the timer has to be canceled first and then added back. If the timer is the
>> first expiring timer then on removal the clockevent device is reprogrammed
>> to the next expiring timer to avoid that the pending expiry fires needlessly.
>> /*
>> * Remove the timer and force reprogramming when high
>> @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
>> debug_deactivate(timer);
>> reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>>
>> + /*
>> + * If the timer is not restarted then reprogramming is
>> + * required if the timer is local. If it is local and about
>> + * to be restarted, avoid programming it twice (on removal
>> + * and a moment later when it's requeued).
>> + */
>> if (!restart)
>> state = HRTIMER_STATE_INACTIVE;
>> + else
>> + reprogram &= !keep_local;
>
> reprogram = reprogram && !keep_local;
>
> perhaps?
Maybe
>>
>> __remove_hrtimer(timer, base, state, reprogram);
>> return 1;
>> @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
>> struct hrtimer_clock_base *base)
>> {
>> struct hrtimer_clock_base *new_base;
>> + bool force_local, first;
>>
>> - /* Remove an active timer from the queue: */
>> - remove_hrtimer(timer, base, true);
>> + /*
>> + * If the timer is on the local cpu base and is the first expiring
>> + * timer then this might end up reprogramming the hardware twice
>> + * (on removal and on enqueue). To avoid that, prevent the
>> + * reprogram on removal: keep the timer local to the current CPU
>> + * and enforce reprogramming after it is queued no matter whether
>> + * it is the new first expiring timer again or not.
>> + */
>> + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>> + force_local &= base->cpu_base->next_timer == timer;
>
> Using bitwise ops on a bool is cute and all, but isn't that more
> readable when written like:
>
> force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
> base->cpu_base->next_timer == timer;
>
Which results in an extra conditional branch.
>> + /*
>> + * Timer was forced to stay on the current CPU to avoid
>> + * reprogramming on removal and enqueue. Force reprogram the
>> + * hardware by evaluating the new first expiring timer.
>> + */
>> + hrtimer_force_reprogram(new_base->cpu_base, 1);
>> + return 0;
>> }
>
> There is an unfortunate amount of duplication between
> hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
> don't work however :/ Still, does that in_hrtirq optimization make sense
> to have in force_reprogram ?
Yes, no, do not know. Let me have a look.
Thanks,
tglx
On Mon, Apr 26, 2021 at 02:33:00PM +0200, Thomas Gleixner wrote:
> >> + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
> >> + force_local &= base->cpu_base->next_timer == timer;
> >
> > Using bitwise ops on a bool is cute and all, but isn't that more
> > readable when written like:
> >
> > force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
> > base->cpu_base->next_timer == timer;
> >
>
> Which results in an extra conditional branch.
Srsly, compiler not smart enough?
On Mon, Apr 26 2021 at 14:40, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 02:33:00PM +0200, Thomas Gleixner wrote:
>
>> >> + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>> >> + force_local &= base->cpu_base->next_timer == timer;
>> >
>> > Using bitwise ops on a bool is cute and all, but isn't that more
>> > readable when written like:
>> >
>> > force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
>> > base->cpu_base->next_timer == timer;
>> >
>>
>> Which results in an extra conditional branch.
>
> Srsly, compiler not smart enough?
gcc 8.3 is definitely not ...
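For illustration, the two variants in a standalone snippet (not the
kernel code):

#include <stdbool.h>

extern bool cond_a(void);
extern bool cond_b(void);

bool with_bitwise(void)
{
        bool r = cond_a();

        /* Unconditional AND: both sides are always evaluated, so no
         * branch is required. */
        r &= cond_b();
        return r;
}

bool with_logical(void)
{
        /*
         * Short-circuit semantics: cond_b() may only run when cond_a()
         * is true, which a compiler can implement with a conditional
         * branch even when cond_b() has no side effects.
         */
        return cond_a() && cond_b();
}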
On Mon, Apr 26 2021 at 14:25, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 11:40:46AM +0200, Peter Zijlstra wrote:
>> There is an unfortunate amount of duplication between
>> hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
>> don't work however :/ Still, does that in_hrtirq optimization make sense
>> to have in force_reprogram ?
No. It needs to be in hrtimer_reprogram().