Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6F90921720
Date:   Mon, 29 Jan 2018 17:48:33 +0100
From:   Frederic Weisbecker <frederic@kernel.org>
To:     Peter Zijlstra <peterz@infradead.org>
Cc:     Ingo Molnar <mingo@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Chris Metcalf <cmetcalf@mellanox.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Luiz Capitulino <lcapitulino@redhat.com>,
        Christoph Lameter <cl@linux.com>,
        "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>,
        Wanpeng Li <kernellwp@gmail.com>,
        Mike Galbraith <efault@gmx.de>, Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload
Message-ID: <20180129164832.GC2942@lerouge>
References: <1516320140-13189-1-git-send-email-frederic@kernel.org>
 <1516320140-13189-5-git-send-email-frederic@kernel.org>
 <20180129153839.GT2269@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180129153839.GT2269@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Mon, Jan 29, 2018 at 04:38:39PM +0100, Peter Zijlstra wrote:
1;4205;0c> On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However this residual tick is a burden
> > for bare metal tasks that can't stand any interruption at all, or want
> > to minimize them.
> > 
> > The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> > outsource these scheduler ticks to the global workqueue so that a
> > housekeeping CPU handles those remotely.
> > 
> > Note that in the case of using isolcpus, it's still up to the user to
> > affine the global workqueues to the housekeeping CPUs through
> > /sys/devices/virtual/workqueue/cpumask or domains isolation
> > "isolcpus=nohz,domain".
> 
> I would very much like a few words on why sched_class::task_tick() is
> safe to call remote -- from a quick look I think it actually is, but it
> would be good to have some words here.

Let's rather say I can't prove that it is safe, given the amount of code that
is behind throughout the various flavour of scheduler features.

But as far as I checked several times, it seems that nothing is accessed locally
on ::scheduler_tick(). Everything looks fetched from the runqueue struct target
while it is locked.

If we ever find local references such as "current" or "__this_cpu_*" in the path,
we'll have to fix them.

> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d72d0e9..c79500c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
> >  
> >  	return jiffies_to_nsecs(next - now);
> >  }
> > -#endif
> > +
> > +struct tick_work {
> > +	int			cpu;
> > +	struct delayed_work	work;
> > +};
> > +
> > +static struct tick_work __percpu *tick_work_cpu;
> > +
> > +static void sched_tick_remote(struct work_struct *work)
> > +{
> > +	struct delayed_work *dwork = to_delayed_work(work);
> > +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > +	int cpu = twork->cpu;
> > +	struct rq *rq = cpu_rq(cpu);
> > +	struct rq_flags rf;
> > +
> > +	/*
> > +	 * Handle the tick only if it appears the remote CPU is running
> > +	 * in full dynticks mode. The check is racy by nature, but
> > +	 * missing a tick or having one too much is no big deal.
> > +	 */
> > +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > +		rq_lock_irq(rq, &rf);
> > +		update_rq_clock(rq);
> > +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > +		rq_unlock_irq(rq, &rf);
> > +	}
> > +
> > +	queue_delayed_work(system_unbound_wq, dwork, HZ);
> 
> Do we want something that tracks the actual interrer arrival time of
> this work, such that we can detect and warn if the book-keeping thing is
> failing to keep up?

Yeah perhaps we can have some sort of check to make sure we got a tick after
some reasonable delay since the last sched in of the current remote task.

> 
> > +}
> > +
> > +static void sched_tick_start(int cpu)
> > +{
> > +	struct tick_work *twork;
> > +
> > +	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > +		return;
> 
> This all looks very static :-(, you can't reconfigure this nohz_full
> crud after boot?

Unfortunately yes. In fact making the nohz interface dynamically available
through cpuset is the next big step.

> 
> > +	WARN_ON_ONCE(!tick_work_cpu);
> > +
> > +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > +	twork->cpu = cpu;
> > +	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > +	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +}
> 
> Similarly, I think we want a few words about how unbound workqueues are
> expected to behave vs NUMA.
> 
> AFAICT unbound workqueues by default prefer to run on a cpu in the same
> node, but if no cpu is available, it doesn't go looking for the nearest
> node that does have a cpu, it just punts to whatever random cpu.

Yes, and in fact you just made me look into wq_select_unbound_cpu() and
it looks worse than that. If the current CPU is not in the wq_unbound_cpumask,
a random one is picked up from that global cpumask without trying a near
one in the current node.

Looks like room for improvement on the workqueue side. I'll see what I can do.

Thanks.