Ingo,
Please pull the sched/0hz-v2 branch that can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/0hz-v2
HEAD: 9b14d5204490f9acd03998a5e406ecadb87cddba
Changes in v4:
* Remove the nohz_offload option, just stick with the existing interface,
the change is transparent. Suggested by Luiz.
* Automatically pin workqueues to housekeepers.
---
Now that scheduler_tick() has become resilient to the absence of
ticks, current->sched_class->task_tick() is the last piece that needs
at least a 1Hz tick to keep scheduler stats alive.
This patchset offloads this residual 1Hz tick to workqueues. This way
the nohz_full CPUs no longer have any tick (assuming nothing else
requires one), as their residual 1Hz tick gets handled by the housekeepers.
Nothing special is required for testing, just use the usual kernel
parameters, say on CPUs 1-7:
"nohz_full=1-7"
or
"isolcpus=nohz_offload,domain,1-7"
Thanks,
Frederic
---
Frederic Weisbecker (6):
sched: Rename init_rq_hrtick to hrtick_rq_init
nohz: Allow to check if remote CPU tick is stopped
sched/isolation: Isolate workqueues when "nohz_full=" is set
sched/isolation: Residual 1Hz scheduler tick offload
sched/nohz: Remove the 1 Hz tick code
sched/isolation: Tick offload documentation
Documentation/admin-guide/kernel-parameters.txt | 6 +-
include/linux/sched/isolation.h | 1 +
include/linux/sched/nohz.h | 4 -
include/linux/tick.h | 2 +
kernel/sched/core.c | 104 +++++++++++++++++-------
kernel/sched/idle_task.c | 1 -
kernel/sched/isolation.c | 8 +-
kernel/sched/sched.h | 13 +--
kernel/time/tick-sched.c | 13 +--
kernel/workqueue.c | 3 +-
10 files changed, 103 insertions(+), 52 deletions(-)
Do that rename in order to normalize the hrtick namespace.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 644fa2e..d72d0e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -333,7 +333,7 @@ void hrtick_start(struct rq *rq, u64 delay)
}
#endif /* CONFIG_SMP */
-static void init_rq_hrtick(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
{
#ifdef CONFIG_SMP
rq->hrtick_csd_pending = 0;
@@ -351,7 +351,7 @@ static inline void hrtick_clear(struct rq *rq)
{
}
-static inline void init_rq_hrtick(struct rq *rq)
+static inline void hrtick_rq_init(struct rq *rq)
{
}
#endif /* CONFIG_SCHED_HRTICK */
@@ -5955,7 +5955,7 @@ void __init sched_init(void)
rq->last_sched_tick = 0;
#endif
#endif /* CONFIG_SMP */
- init_rq_hrtick(rq);
+ hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
--
2.7.4
As we prepare for offloading the residual 1Hz scheduler ticks to
workqueues, let's affine those to housekeepers so that they don't
interrupt the CPUs that don't want to be disturbed.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/sched/isolation.h | 1 +
kernel/sched/isolation.c | 4 +++-
kernel/workqueue.c | 3 ++-
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d849431..4a6582c 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -12,6 +12,7 @@ enum hk_flags {
HK_FLAG_SCHED = (1 << 3),
HK_FLAG_TICK = (1 << 4),
HK_FLAG_DOMAIN = (1 << 5),
+ HK_FLAG_WQ = (1 << 6),
};
#ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b71b436..8f1c1de 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -3,6 +3,7 @@
* any CPU: unbound workqueues, timers, kthreads and any offloadable work.
*
* Copyright (C) 2017 Red Hat, Inc., Frederic Weisbecker
+ * Copyright (C) 2017-2018 SUSE, Frederic Weisbecker
*
*/
@@ -119,7 +120,8 @@ static int __init housekeeping_nohz_full_setup(char *str)
{
unsigned int flags;
- flags = HK_FLAG_TICK | HK_FLAG_TIMER | HK_FLAG_RCU | HK_FLAG_MISC;
+ flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER |
+ HK_FLAG_RCU | HK_FLAG_MISC;
return housekeeping_setup(str, flags);
}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 43d18cb..c296dc9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5546,7 +5546,8 @@ int __init workqueue_init_early(void)
WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
- cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_FLAG_DOMAIN));
+ cpumask_copy(wq_unbound_cpumask,
+ housekeeping_cpumask(HK_FLAG_DOMAIN | HK_FLAG_WQ));
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
--
2.7.4
Update the documentation to reflect the 1Hz tick offload changes.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 46b26bf..534afb1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1746,7 +1746,11 @@
specified in the flag list (default: domain):
nohz
- Disable the tick when a single task runs.
+ Disable the tick when a single task runs. A residual 1Hz
+ tick is offloaded to workqueues that you need to affine
+ to housekeeping through the sysfs file
+ /sys/devices/virtual/workqueue/cpumask or by using the
+ domain flag below.
domain
Isolate from the general SMP balancing and scheduling
algorithms. Note that performing domain isolation this way
--
2.7.4
Now that the 1Hz tick is offloaded to workqueues, we can safely remove
the residual code that used to handle it locally.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/sched/nohz.h | 4 ----
kernel/sched/core.c | 29 -----------------------------
kernel/sched/idle_task.c | 1 -
kernel/sched/sched.h | 11 +----------
kernel/time/tick-sched.c | 6 ------
5 files changed, 1 insertion(+), 50 deletions(-)
diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 3d3a97d..0942172 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -37,8 +37,4 @@ extern void wake_up_nohz_cpu(int cpu);
static inline void wake_up_nohz_cpu(int cpu) { }
#endif
-#ifdef CONFIG_NO_HZ_FULL
-extern u64 scheduler_tick_max_deferment(void);
-#endif
-
#endif /* _LINUX_SCHED_NOHZ_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c79500c..cbc3001 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3033,35 +3033,9 @@ void scheduler_tick(void)
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq);
#endif
- rq_last_tick_reset(rq);
}
#ifdef CONFIG_NO_HZ_FULL
-/**
- * scheduler_tick_max_deferment
- *
- * Keep at least one tick per second when a single
- * active task is running because the scheduler doesn't
- * yet completely support full dynticks environment.
- *
- * This makes sure that uptime, CFS vruntime, load
- * balancing, etc... continue to move forward, even
- * with a very low granularity.
- *
- * Return: Maximum deferment in nanoseconds.
- */
-u64 scheduler_tick_max_deferment(void)
-{
- struct rq *rq = this_rq();
- unsigned long next, now = READ_ONCE(jiffies);
-
- next = rq->last_sched_tick + HZ;
-
- if (time_before_eq(next, now))
- return 0;
-
- return jiffies_to_nsecs(next - now);
-}
struct tick_work {
int cpu;
@@ -6028,9 +6002,6 @@ void __init sched_init(void)
rq->last_load_update_tick = jiffies;
rq->nohz_flags = 0;
#endif
-#ifdef CONFIG_NO_HZ_FULL
- rq->last_sched_tick = 0;
-#endif
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664..a64fc92 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -48,7 +48,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
- rq_last_tick_reset(rq);
}
static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a3b82c..9a25047 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -699,9 +699,7 @@ struct rq {
#endif /* CONFIG_SMP */
unsigned long nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
-#ifdef CONFIG_NO_HZ_FULL
- unsigned long last_sched_tick;
-#endif
+
/* capture load from *all* tasks on this cpu: */
struct load_weight load;
unsigned long nr_load_updates;
@@ -1639,13 +1637,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
sched_update_tick_dependency(rq);
}
-static inline void rq_last_tick_reset(struct rq *rq)
-{
-#ifdef CONFIG_NO_HZ_FULL
- rq->last_sched_tick = jiffies;
-#endif
-}
-
extern void update_rq_clock(struct rq *rq);
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 97c4317..03ebe89 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -748,12 +748,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
delta = KTIME_MAX;
}
-#ifdef CONFIG_NO_HZ_FULL
- /* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
- delta = min(delta, scheduler_tick_max_deferment());
-#endif
-
/* Calculate the next expiry time */
if (delta < (KTIME_MAX - basemono))
expires = basemono + delta;
--
2.7.4
This check is racy but provides a good heuristic to determine whether
a CPU may need a remote tick or not.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/tick.h | 2 ++
kernel/time/tick-sched.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7cc3592..944c829 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
+extern int tick_nohz_tick_stopped_cpu(int cpu);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline int tick_nohz_tick_stopped_cpu(int cpu) { return 0; }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7cc7ab..97c4317 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -486,6 +486,13 @@ int tick_nohz_tick_stopped(void)
return __this_cpu_read(tick_cpu_sched.tick_stopped);
}
+int tick_nohz_tick_stopped_cpu(int cpu)
+{
+ struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+ return ts->tick_stopped;
+}
+
/**
* tick_nohz_update_jiffies - update jiffies when idle was interrupted
*
--
2.7.4
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruptions at all, or want
to minimize them.
The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles them remotely.
Note that in the case of using isolcpus, it's still up to the user to
affine the global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domains isolation
"isolcpus=nohz,domain".
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/isolation.c | 4 +++
kernel/sched/sched.h | 2 ++
3 files changed, 84 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d72d0e9..c79500c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
return jiffies_to_nsecs(next - now);
}
-#endif
+
+struct tick_work {
+ int cpu;
+ struct delayed_work work;
+};
+
+static struct tick_work __percpu *tick_work_cpu;
+
+static void sched_tick_remote(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct tick_work *twork = container_of(dwork, struct tick_work, work);
+ int cpu = twork->cpu;
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+
+ /*
+ * Handle the tick only if it appears the remote CPU is running
+ * in full dynticks mode. The check is racy by nature, but
+ * missing a tick or having one too much is no big deal.
+ */
+ if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
+ rq_lock_irq(rq, &rf);
+ update_rq_clock(rq);
+ rq->curr->sched_class->task_tick(rq, rq->curr, 0);
+ rq_unlock_irq(rq, &rf);
+ }
+
+ queue_delayed_work(system_unbound_wq, dwork, HZ);
+}
+
+static void sched_tick_start(int cpu)
+{
+ struct tick_work *twork;
+
+ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
+ return;
+
+ WARN_ON_ONCE(!tick_work_cpu);
+
+ twork = per_cpu_ptr(tick_work_cpu, cpu);
+ twork->cpu = cpu;
+ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void sched_tick_stop(int cpu)
+{
+ struct tick_work *twork;
+
+ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
+ return;
+
+ WARN_ON_ONCE(!tick_work_cpu);
+
+ twork = per_cpu_ptr(tick_work_cpu, cpu);
+ cancel_delayed_work_sync(&twork->work);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+int __init sched_tick_offload_init(void)
+{
+ tick_work_cpu = alloc_percpu(struct tick_work);
+ if (!tick_work_cpu) {
+ pr_err("Can't allocate remote tick struct\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+#else
+static void sched_tick_start(int cpu) { }
+static void sched_tick_stop(int cpu) { }
+#endif /* CONFIG_NO_HZ_FULL */
#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
defined(CONFIG_PREEMPT_TRACER))
@@ -5713,6 +5788,7 @@ int sched_cpu_starting(unsigned int cpu)
{
set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
+ sched_tick_start(cpu);
return 0;
}
@@ -5724,6 +5800,7 @@ int sched_cpu_dying(unsigned int cpu)
/* Handle pending wakeups and then migrate everything off */
sched_ttwu_pending();
+ sched_tick_stop(cpu);
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 8f1c1de..d782302 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -13,6 +13,7 @@
#include <linux/kernel.h>
#include <linux/static_key.h>
#include <linux/ctype.h>
+#include "sched.h"
DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
EXPORT_SYMBOL_GPL(housekeeping_overriden);
@@ -61,6 +62,9 @@ void __init housekeeping_init(void)
static_branch_enable(&housekeeping_overriden);
+ if (housekeeping_flags & HK_FLAG_TICK)
+ sched_tick_offload_init();
+
/* We need at least one CPU to handle housekeeping work */
WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..5a3b82c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1587,6 +1587,7 @@ extern void post_init_entity_util_avg(struct sched_entity *se);
#ifdef CONFIG_NO_HZ_FULL
extern bool sched_can_stop_tick(struct rq *rq);
+extern int __init sched_tick_offload_init(void);
/*
* Tick may be needed by tasks in the runqueue depending on their policy and
@@ -1611,6 +1612,7 @@ static inline void sched_update_tick_dependency(struct rq *rq)
tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
}
#else
+static inline int sched_tick_offload_init(void) { return 0; }
static inline void sched_update_tick_dependency(struct rq *rq) { }
#endif
--
2.7.4
On Fri, 19 Jan 2018 01:02:14 +0100
Frederic Weisbecker <[email protected]> wrote:
> Ingo,
>
> Please pull the sched/0hz-v2 branch that can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> sched/0hz-v2
>
> HEAD: 9b14d5204490f9acd03998a5e406ecadb87cddba
>
> Changes in v4:
>
> * Remove the nohz_offload option, just stick with the existing interface,
> the change is transparent. Suggested by Luiz.
>
> * Automatically pin workqueues to housekeepers.
I've been testing this series and the tick doesn't go completely away
for me: it ticks at roughly 8-second intervals.
I've debugged this down to the clocksource_watchdog() timer, which is
created by clocksource_start_watchdog(). This timer cycles over all online
CPUs. I couldn't find a way to disable it. It seems to always be enabled
on x86 by CONFIG_CLOCKSOURCE_WATCHDOG since commit 6471b825c4.
Since the 1Hz tick offload worked for you, I must be missing a way
to disable this timer, or the kernel thinks my CPU has an unstable
TSC (which it doesn't, AFAIK).
[...]
On Wed, Jan 24, 2018 at 10:46:08AM -0500, Luiz Capitulino wrote:
> On Fri, 19 Jan 2018 01:02:14 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > [...]
>
> I've been testing this series and the tick doesn't go completely away
> for me: it ticks at roughly 8-second intervals.
>
> I've debugged this down to the clocksource_watchdog() timer, which is
> created by clocksource_start_watchdog(). This timer cycles over all online
> CPUs. I couldn't find a way to disable it. It seems to always be enabled
> on x86 by CONFIG_CLOCKSOURCE_WATCHDOG since commit 6471b825c4.
>
> Since the 1Hz tick offload worked for you, I must be missing a way
> to disable this timer, or the kernel thinks my CPU has an unstable
> TSC (which it doesn't, AFAIK).
It's beyond the scope of this patchset but indeed that's right: I run my
kernels with tsc=reliable because my CPUs don't have the TSC_RELIABLE flag.
That's the only way I found to shut down the tick completely on my test
machine, otherwise I keep getting that clocksource watchdog.
You can try "tsc=reliable" but that's at your own risk, and it's hard
to tell exactly what those risks are depending on your CPU model (and
perhaps BIOS?).
You likely already had that watchdog timer before this patchset but didn't
notice because the 1Hz tick was a more frequent annoyance.
Thanks.
Hi,
Any chance this can either get pulled or declined as the merge window is
starting?
Thanks.
On Fri, Jan 19, 2018 at 01:02:14AM +0100, Frederic Weisbecker wrote:
> Ingo,
>
> Please pull the sched/0hz-v2 branch that can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> sched/0hz-v2
>
> HEAD: 9b14d5204490f9acd03998a5e406ecadb87cddba
>
> [...]
On Mon, 29 Jan 2018 02:10:26 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Wed, Jan 24, 2018 at 10:46:08AM -0500, Luiz Capitulino wrote:
> > On Fri, 19 Jan 2018 01:02:14 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > [...]
> >
> > [...]
> >
> > Since the 1Hz tick offload worked for you, I must be missing a way
> > to disable this timer, or the kernel thinks my CPU has an unstable
> > TSC (which it doesn't, AFAIK).
>
> It's beyond the scope of this patchset but indeed that's right: I run my
> kernels with tsc=reliable because my CPUs don't have the TSC_RELIABLE flag.
> That's the only way I found to shut down the tick completely on my test
> machine, otherwise I keep getting that clocksource watchdog.
>
> You can try "tsc=reliable" but that's at your own risk, and it's hard
> to tell exactly what those risks are depending on your CPU model (and
> perhaps BIOS?).
Cool, passing tsc=reliable worked for me. I finally got the tick to
go completely away. While I agree that fixing that is beyond the scope
of this series, I think we should improve it anyway since it will probably
come up for people trying the new nohz_full=.
If this has any value:
Tested-by: Luiz Capitulino <[email protected]>
> You likely already had that watchdog timer before this patchset but didn't
> notice because the 1Hz tick was a more frequent annoyance.
That's possible, but I'm not sure the clocksource watchdog timer was
there before commit 6471b825c4.
On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruptions at all, or want
> to minimize them.
>
> The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> outsource these scheduler ticks to the global workqueue so that a
> housekeeping CPU handles them remotely.
>
> Note that in the case of using isolcpus, it's still up to the user to
> affine the global workqueues to the housekeeping CPUs through
> /sys/devices/virtual/workqueue/cpumask or domains isolation
> "isolcpus=nohz,domain".
I would very much like a few words on why sched_class::task_tick() is
safe to call remote -- from a quick look I think it actually is, but it
would be good to have some words here.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..c79500c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
>
> return jiffies_to_nsecs(next - now);
> }
> -#endif
> +
> +struct tick_work {
> + int cpu;
> + struct delayed_work work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> + int cpu = twork->cpu;
> + struct rq *rq = cpu_rq(cpu);
> + struct rq_flags rf;
> +
> + /*
> + * Handle the tick only if it appears the remote CPU is running
> + * in full dynticks mode. The check is racy by nature, but
> + * missing a tick or having one too much is no big deal.
> + */
> + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> + rq_lock_irq(rq, &rf);
> + update_rq_clock(rq);
> + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> + rq_unlock_irq(rq, &rf);
> + }
> +
> + queue_delayed_work(system_unbound_wq, dwork, HZ);
Do we want something that tracks the actual inter-arrival time of
this work, such that we can detect and warn if the book-keeping thing is
failing to keep up?
> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> + struct tick_work *twork;
> +
> + if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> + return;
This all looks very static :-(, you can't reconfigure this nohz_full
crud after boot?
> + WARN_ON_ONCE(!tick_work_cpu);
> +
> + twork = per_cpu_ptr(tick_work_cpu, cpu);
> + twork->cpu = cpu;
> + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}
Similarly, I think we want a few words about how unbound workqueues are
expected to behave vs NUMA.
AFAICT unbound workqueues by default prefer to run on a cpu in the same
node, but if no cpu is available, it doesn't go looking for the nearest
node that does have a cpu, it just punts to whatever random cpu.
On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> + queue_delayed_work(system_unbound_wq, dwork, HZ);
Also, why HZ? No clues were given.
On Mon, Jan 29, 2018 at 02:10:26AM +0100, Frederic Weisbecker wrote:
> It's beyond the scope of this patchset but indeed that's right: I run my
> kernels with tsc=reliable because my CPUs don't have the TSC_RELIABLE flag.
> That's the only way I found to shut down the tick completely on my test
> machine, otherwise I keep getting that clocksource watchdog.
>
> You can try "tsc=reliable" but that's at your own risk, and it's hard
> to tell exactly what those risks are depending on your CPU model (and
> perhaps BIOS?).
BIOS, anything Nehalem and later, BIOS monkeys are to blame. There is
one exception to that, and that is very large socket count machines, and
these people typically already know what they got themselves into.
If your machine never triggers the watchdog with your workload, booting
with tsc=reliable is safe (for that workload).
But if you trip the watchdog, booting with tsc=reliable is a very bad
idea indeed.
On Mon, Jan 29, 2018 at 10:33:16AM -0500, Luiz Capitulino wrote:
> Cool, passing tsc=reliable worked for me. I finally got the tick to
> go completely away. While I agree that fixing that is beyond the scope
> of this series, I think we should improve it anyway since it will probably
> come up for people trying the new nohz_full=.
The only way to fix that is to audit all BIOS code :/ Short of that, we
need to periodically test the TSC on each CPU/SOCKET to verify it's still
in step with expectations.
Sadly MSR_IA32_TSC and MSR_IA32_TSC_ADJUST are writable registers and
BIOS monkeys occasionally write to them for 'raisins-of-insanity'.
On Mon, 29 Jan 2018 16:54:31 +0100
Peter Zijlstra <[email protected]> wrote:
> On Mon, Jan 29, 2018 at 10:33:16AM -0500, Luiz Capitulino wrote:
> > [...]
>
> The only way to fix that is to audit all BIOS code :/ Short of that, we
> need to periodically test the TSC on each CPU/SOCKET to verify it's still
> in step with expectations.
Oh, OK. I thought we could find a way to avoid the timer in the kernel
for certain CPUs like mine.
Thanks for the explanation.
>
> Sadly MSR_IA32_TSC and MSR_IA32_TSC_ADJUST are writable registers and
> BIOS monkeys occasionally write to them for 'raisins-of-insanity'.
>
On Mon, Jan 29, 2018 at 04:38:39PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> > [...]
>
> I would very much like a few words on why sched_class::task_tick() is
> safe to call remote -- from a quick look I think it actually is, but it
> would be good to have some words here.
Let's rather say I can't prove that it is safe, given the amount of code
involved throughout the various flavours of scheduler features.
But as far as I've checked, several times, nothing seems to be accessed
locally in the ::task_tick() path. Everything looks fetched from the target
runqueue struct while it is locked.
If we ever find local references such as "current" or "__this_cpu_*" in the path,
we'll have to fix them.
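To make that concrete, here is a hypothetical sketch (not actual kernel code, and update_stats() is an invented placeholder) contrasting the kind of local reference that would break under remote execution with the rq-relative access the remote tick relies on:

/* Hypothetical sketch: the pattern a remotely-safe task_tick() must avoid. */
static void task_tick_example(struct rq *rq, struct task_struct *curr, int queued)
{
	/*
	 * BROKEN if run remotely: "current" would be the housekeeping
	 * CPU's workqueue worker, not the task being ticked, and
	 * __this_cpu_*() would touch the housekeeper's per-CPU data:
	 *
	 *	update_stats(current);
	 */

	/* SAFE: everything is fetched from the locked target runqueue. */
	update_stats(rq->curr);
}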
>
> > [...]
> > + queue_delayed_work(system_unbound_wq, dwork, HZ);
>
> Do we want something that tracks the actual inter-arrival time of
> this work, such that we can detect and warn if the book-keeping thing is
> failing to keep up?
Yeah, perhaps we can have some sort of check to make sure we got a tick within
some reasonable delay since the last sched-in of the current remote task.
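One possible shape for such a check, sketched here under stated assumptions (it is not part of this series, and the three-second tolerance is arbitrary): under the rq lock, just before calling task_tick(), compare the task clock against the current task's last exec_start update and warn if the remote tick lagged far beyond the expected HZ period.

/* Sketch only: would run in sched_tick_remote() under the rq lock. */
static void check_remote_tick_lag(struct rq *rq)
{
	u64 delta = rq_clock_task(rq) - rq->curr->se.exec_start;

	/* The work is requeued every HZ; tolerate some workqueue delay. */
	WARN_ON_ONCE(delta > 3 * (u64)NSEC_PER_SEC);
}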
>
> > +}
> > +
> > +static void sched_tick_start(int cpu)
> > +{
> > + struct tick_work *twork;
> > +
> > + if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > + return;
>
> This all looks very static :-(, you can't reconfigure this nohz_full
> crud after boot?
Unfortunately yes. In fact making the nohz interface dynamically available
through cpuset is the next big step.
>
> > + WARN_ON_ONCE(!tick_work_cpu);
> > +
> > + twork = per_cpu_ptr(tick_work_cpu, cpu);
> > + twork->cpu = cpu;
> > + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +}
>
> Similarly, I think we want a few words about how unbound workqueues are
> expected to behave vs NUMA.
>
> AFAICT unbound workqueues by default prefer to run on a cpu in the same
> node, but if no cpu is available, it doesn't go looking for the nearest
> node that does have a cpu, it just punts to whatever random cpu.
Yes, and in fact you just made me look into wq_select_unbound_cpu() and
it looks worse than that. If the current CPU is not in wq_unbound_cpumask,
a random one is picked from that global cpumask without trying a nearby
one in the current node.
Looks like room for improvement on the workqueue side. I'll see what I can do.
Thanks.
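For the record, a rough sketch of the direction discussed above (an assumption about a possible fix, not an actual patch; the real wq_select_unbound_cpu() internals differ): try the intersection of wq_unbound_cpumask with the caller's NUMA node before punting to the global mask.

/* Sketch: prefer an unbound-allowed CPU on the caller's node. */
static int wq_select_unbound_cpu_sketch(int cpu)
{
	int new_cpu;

	if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
		return cpu;

	/* First try a CPU allowed for unbound work on the local node. */
	new_cpu = cpumask_any_and(wq_unbound_cpumask,
				  cpumask_of_node(cpu_to_node(cpu)));
	if (new_cpu < nr_cpu_ids)
		return new_cpu;

	/* Fall back to any online CPU in the global unbound mask. */
	return cpumask_any_and(wq_unbound_cpumask, cpu_online_mask);
}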
On Mon, Jan 29, 2018 at 05:48:33PM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 29, 2018 at 04:38:39PM +0100, Peter Zijlstra wrote:
> > I would very much like a few words on why sched_class::task_tick() is
> > safe to call remote -- from a quick look I think it actually is, but it
> > would be good to have some words here.
>
> Let's rather say I can't prove that it is safe, given the amount of code that
> is behind throughout the various flavour of scheduler features.
>
> But as far as I checked several times, it seems that nothing is accessed locally
> on ::scheduler_tick(). Everything looks fetched from the runqueue struct target
> while it is locked.
>
> If we ever find local references such as "current" or "__this_cpu_*" in the path,
> we'll have to fix them.
Sure, but at least state which issues you audited the code for. That
tells me you know wth you were doing and gives more trust than blindly
changing random code ;-)
> > > +static void sched_tick_start(int cpu)
> > > +{
> > > + struct tick_work *twork;
> > > +
> > > + if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > > + return;
> >
> > This all looks very static :-(, you can't reconfigure this nohz_full
> > crud after boot?
>
> Unfortunately yes. In fact making the nohz interface dynamically available
> through cpuset is the next big step.
OK, fair enough.
> > > + WARN_ON_ONCE(!tick_work_cpu);
> > > +
> > > + twork = per_cpu_ptr(tick_work_cpu, cpu);
> > > + twork->cpu = cpu;
> > > + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > > + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > > +}
> >
> > Similarly, I think we want a few words about how unbound workqueues are
> > expected to behave vs NUMA.
> >
> > AFAICT unbound workqueues by default prefer to run on a cpu in the same
> > node, but if no cpu is available, it doesn't go looking for the nearest
> > node that does have a cpu, it just punts to whatever random cpu.
>
> Yes, and in fact you just made me look into wq_select_unbound_cpu() and
> it looks worse than that. If the current CPU is not in wq_unbound_cpumask,
> a random one is picked from that global cpumask without trying a nearby
> one in the current node.
>
> Looks like room for improvement on the workqueue side. I'll see what I can do.
Great, thanks!
Hi, Frederic!
>>>>> On Mon, 29 Jan 2018 02:10:26 +0100, Frederic Weisbecker wrote:
> On Wed, Jan 24, 2018 at 10:46:08AM -0500, Luiz Capitulino wrote:
[...]
>> Since the 1Hz tick offload worked for you, I must be missing
>> a way to disable this timer or the kernel thinks my CPU
>> has an unstable TSC (which it doesn't, AFAIK).
> It's beyond the scope of this patchset but indeed that's
> right: I run my kernels with tsc=reliable because my CPUs
> don't have the TSC_RELIABLE flag. That's the only way I found
> to shut down the tick completely on my test machine, otherwise
> I keep getting that clocksource watchdog.
[...]
Thanks, that helps. But I have an accounting problem:
if I run a user busy loop on the nohz CPU, the task accounting works
correctly (top shows the task taking 100% CPU), but the CPU accounting is
wrong (the CPU shows as 100% idle, in the per-core view as well).
If I understand correctly, the stats are updated by account_user_time()
-> task_group_account_field(), but there is no call to it in the case of
offloading (it is called from irqtime_account_process_tick,
account_process_tick and vtime_user_exit).
Moreover, task_group_account_field() uses __this_cpu_add(), which will be
wrong for offloading.
For testing I used kcpustat_cpu(task_cpu(p)) in
task_group_account_field() and added a call to account_user_time(curr, delta)
in sched_tick_remote(), which fixes it for me, but what would be the
proper fix?
--
WBR,
Yauheni Kaliuta
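For reference, a rough reconstruction of the testing hack described above. The delta computation is a guess filled in here, and as the follow-up explains, this is racy and not a proper fix:

/* In task_group_account_field(): index the stats of the task's CPU
 * instead of the executing CPU (replacing the __this_cpu_add()). */
kcpustat_cpu(task_cpu(p)).cpustat[index] += tmp;

/* In sched_tick_remote(), under the rq lock: account the time the
 * remote task ran since its last accounting update (guessed delta). */
delta = rq_clock_task(rq) - rq->curr->se.exec_start;
account_user_time(rq->curr, delta);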
On Tue, May 22, 2018 at 10:10:19PM +0300, Yauheni Kaliuta wrote:
> Hi, Frederic!
>
> >>>>> On Mon, 29 Jan 2018 02:10:26 +0100, Frederic Weisbecker wrote:
> [...]
>
> Thanks, that helps. But I have an accounting problem:
>
> if I run a user busy loop on the nohz CPU, the task accounting works
> correctly (top shows the task taking 100% CPU), but the CPU accounting is
> wrong (the CPU shows as 100% idle, in the per-core view as well).
>
> If I understand correctly, the stats are updated by account_user_time()
> -> task_group_account_field(), but there is no call to it in the case of
> offloading (it is called from irqtime_account_process_tick,
> account_process_tick and vtime_user_exit).
Ah, I forgot about kcpustat accounting. I remember I wanted to fix that a
few years ago but forgot about it again when I removed the last tick. That
thing was lurking behind the 1Hz tick.
>
> Moreover, task_group_account_field() uses __this_cpu_add(), which will be
> wrong for offloading.
>
> For testing I used kcpustat_cpu(task_cpu(p)) in
> task_group_account_field() and added a call to account_user_time(curr, delta)
> in sched_tick_remote(), which fixes it for me, but what would be the
> proper fix?
Yeah, unfortunately that's unsafe. Task accounting is not designed for remote
updates. You could race with an update from another CPU, especially the local
updater.
I fear we need to take the same approach as task cputime, which is to use a
seqcount for updates. Then the reader would fetch the kcpustat values plus the
delta vtime from the task executing.
Things can get complicated once we dive into corner cases: CPUTIME_IRQ,
CPUTIME_SOFTIRQ and CPUTIME_STEAL. At least we don't need to care about
CPUTIME_IDLE and CPUTIME_IOWAIT, which have their own deltas.
I'm trying that.
Thanks.
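A very rough sketch of the reader side described above, with invented names throughout (vtime_seq and vtime_user_delta() are illustrative assumptions, not existing APIs):

/*
 * Sketch only: the running task would update its vtime under
 * write_seqcount_begin/end(); a remote reader combines the flushed
 * kcpustat value with the in-flight delta, retrying on races.
 */
static u64 kcpustat_user_remote(int cpu)
{
	struct task_struct *curr = cpu_rq(cpu)->curr;	/* simplified: needs RCU */
	unsigned int seq;
	u64 user;

	do {
		seq = read_seqcount_begin(&curr->vtime_seq);	/* hypothetical */
		user = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
		user += vtime_user_delta(curr);			/* hypothetical */
	} while (read_seqcount_retry(&curr->vtime_seq, seq));

	return user;
}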
On Fri, 25 May 2018 04:56:25 +0200
Frederic Weisbecker <[email protected]> wrote:
> On Tue, May 22, 2018 at 10:10:19PM +0300, Yauheni Kaliuta wrote:
> [...]
>
> I'm trying that.
Cool! Needless to say, we can help with testing once you have patches.