Date: Tue, 19 Dec 2017 10:19:11 +0100
From: Peter Zijlstra
To: Frederic Weisbecker
Cc: LKML, Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, "Paul E. McKenney", Ingo Molnar, Wanpeng Li,
	Mike Galbraith, Rik van Riel
Subject: Re: [PATCH 4/5] sched/isolation: Residual 1Hz scheduler tick offload
Message-ID: <20171219091911.tg2k4w7mgv2bcmeb@hirez.programming.kicks-ass.net>
References: <1513653838-31314-1-git-send-email-frederic@kernel.org>
	<1513653838-31314-5-git-send-email-frederic@kernel.org>
In-Reply-To: <1513653838-31314-5-git-send-email-frederic@kernel.org>

On Tue, Dec 19, 2017 at 04:23:57AM +0100, Frederic Weisbecker wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for Real-Time tasks that can't stand any interruption at all.

I'm not sure that is accurate. RT doesn't necessarily have much to do
with this. The tick is by definition very deterministic and thus should
not be a problem.

> Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> these scheduler ticks to the global workqueue so that a housekeeping
> CPU handles that tick remotely.

The global workqueue sounds horrific; surely you want at least one such
housekeeping CPU per node or something?

> Note it's still up to the user to affine the global workqueues to the
> housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> domains isolation.

Not sure I understand what this means... from what I can tell you're
using an unbound workqueue, and there's no way to split the ticks up to
node-local CPUs.
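Something like the completely untested sketch below is what I mean;
tick_housekeeping_cpu() doesn't exist, it's only made up here to
illustrate picking a node-local housekeeping CPU for the tick work
instead of letting the unbound workqueue place it anywhere:

static int tick_housekeeping_cpu(int cpu)
{
	const struct cpumask *hk_mask = housekeeping_cpumask(HK_FLAG_TICK_SCHED);
	int hk_cpu;

	/* Prefer a housekeeping CPU on the same node as @cpu... */
	hk_cpu = cpumask_any_and(cpumask_of_node(cpu_to_node(cpu)), hk_mask);
	if (hk_cpu < nr_cpu_ids)
		return hk_cpu;

	/* ... and fall back to any housekeeping CPU otherwise. */
	return cpumask_any(hk_mask);
}

Then sched_tick_start() could do:

	/* Queue the tick work on a known housekeeping CPU. */
	queue_delayed_work_on(tick_housekeeping_cpu(cpu), system_wq,
			      &twork->work, HZ);

(and sched_tick_remote() would have to requeue itself with
queue_delayed_work_on() as well, of course).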
> +static void sched_tick_remote(struct work_struct *work)
> +{
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> +	struct rq *rq = cpu_rq(twork->cpu);
> +	struct rq_flags rf;
> +
> +	rq_lock_irq(rq, &rf);
> +	update_rq_clock(rq);
> +	rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> +	rq_unlock_irq(rq, &rf);
> +
> +	queue_delayed_work(system_unbound_wq, dwork, HZ);
> +}
> +
> +void sched_tick_start(int cpu)
> +{
> +	struct tick_work *twork;
> +
> +	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> +		return;
> +
> +	WARN_ON_ONCE(!tick_work_cpu);
> +
> +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> +	twork->cpu = cpu;
> +	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +
> +	return;
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +void sched_tick_stop(int cpu)
> +{
> +	struct tick_work *twork;
> +
> +	if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> +		return;
> +
> +	WARN_ON_ONCE(!tick_work_cpu);
> +
> +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> +	cancel_delayed_work_sync(&twork->work);
> +
> +	return;
> +}
> +#endif /* CONFIG_HOTPLUG_CPU */

This seems daft in that you _always_ run this remote tick, even when the
CPU in question is not in nohz (full) mode.
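An (equally untested) way to at least not poke CPUs that are only
idling would be to look at the remote rq's current task before ticking
it, something like:

static void sched_tick_remote(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct tick_work *twork = container_of(dwork, struct tick_work, work);
	struct rq *rq = cpu_rq(twork->cpu);
	struct task_struct *curr;
	struct rq_flags rf;

	rq_lock_irq(rq, &rf);
	curr = rq->curr;
	/* Skip the remote tick entirely while the CPU is only idling. */
	if (!is_idle_task(curr)) {
		update_rq_clock(rq);
		curr->sched_class->task_tick(rq, curr, 0);
	}
	rq_unlock_irq(rq, &rf);

	queue_delayed_work(system_unbound_wq, dwork, HZ);
}

That still doesn't cover the case where the CPU simply has its normal
tick running; catching that would need a way to check remotely whether
the tick is stopped, which we don't have today.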