Date: Thu, 23 Sep 2021 08:24:03 -0400
From: Phil Auld
To: Vincent Guittot
Cc: Mike Galbraith, Mel Gorman, Peter Zijlstra, Ingo Molnar,
    Valentin Schneider, Aubrey Li, Barry Song, Srikar Dronamraju, LKML
Subject: Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
References: <20210921103621.GM3959@techsingularity.net>
 <20210922132002.GX3959@techsingularity.net>
 <20210922150457.GA3959@techsingularity.net>
 <20210922173853.GB3959@techsingularity.net>

On Thu, Sep 23, 2021 at 10:40:48AM +0200 Vincent Guittot wrote:
> On Thu, 23 Sept 2021 at 03:47, Mike Galbraith wrote:
> >
> > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman wrote:
> > > >
> > > > I'm not seeing an alternative suggestion that could be turned into
> > > > an implementation. The current value for sched_wakeup_granularity
> > > > was set 12 years ago and was exposed for tuning, which is no longer
> > > > the case.
> > > > The intent was to allow some dynamic adjustment between
> > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > > > over-scheduling in the worst case without disabling preemption
> > > > entirely (which the first version did).
> >
> > I don't think those knobs were ever _intended_ for general-purpose
> > tuning, but they did get used that way by some folks.
> >
> > > > Should we just ignore this problem and hope it goes away, or just
> > > > let people keep poking silly values into debugfs via tuned?
> > >
> > > We should certainly not add a bandaid, because people will continue
> > > to poke silly values in the end. And increasing
> > > sysctl_sched_wakeup_granularity based on the number of running
> > > threads is not the right solution.
> >
> > Watching my desktop box stack up large piles of very short-running
> > threads, I agree, instantaneous load looks like a non-starter.
> >
> > > According to your description of the problem, namely that the
> > > current task doesn't get enough time to move forward,
> > > sysctl_sched_min_granularity should be part of the solution.
> > > Something like the below will ensure that current gets a chance to
> > > move forward:
> >
> > Nah, progress is guaranteed; the issue is a zillion very similar
> > short-running threads preempting each other with no win to be had,
> > thus spending cycles in the scheduler that are utterly wasted. It's a
> > valid issue. The trouble is teaching the scheduler to recognize that
> > situation without mucking up other situations where there IS a win
> > for even very short-running threads, say, doing a synchronous
> > handoff; preemption is cheaper than scheduling off if the waker is
> > going to be awakened again in very short order.
> >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 9bf540f04c2d..39d4e4827d3d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >         int scale = cfs_rq->nr_running >= sched_nr_latency;
> > >         int next_buddy_marked = 0;
> > >         int cse_is_idle, pse_is_idle;
> > > +       unsigned long delta_exec;
> > >
> > >         if (unlikely(se == pse))
> > >                 return;
> > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > >                 return;
> > >
> > >         update_curr(cfs_rq_of(se));
> > > +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > +       /*
> > > +        * Ensure that current got a chance to move forward
> > > +        */
> > > +       if (delta_exec < sysctl_sched_min_granularity)
> > > +               return;
> > > +
> > >         if (wakeup_preempt_entity(se, pse) == 1) {
> > >                 /*
> > >                  * Bias pick_next to pick the sched entity that is
> >
> > Yikes! If you do that, you may as well go the extra nanometer and rip
> > wakeup preemption out entirely, same result, impressive diffstat.
>
> This patch is mainly there to show that there are other ways to ensure
> progress without using some load heuristic.
> sysctl_sched_min_granularity has the problem of scaling with the number
> of CPUs, which can generate large values. At the least we should use
> normalized_sysctl_sched_min_granularity, or even a smaller value, but
> wakeup preemption still happens with this change. It only ensures that
> we don't waste time preempting each other without any chance to do
> actual work.
>

It's capped at 8 CPUs, which is pretty easy to reach these days, so the
values don't get too large; that scaling is almost a no-op on modern
machines.
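
For reference, the scaling in question works roughly like the following
standalone sketch, modeled on get_update_sysctl_factor() in
kernel/sched/core.c; the 0.75ms base value and the log2 default are
assumptions here, not taken from this thread:

#include <stdio.h>

/*
 * Sketch of the kernel's tunable-scaling logic. The online CPU count
 * is capped at 8 before the factor is computed, so with the default
 * log2 scaling the factor never exceeds 1 + ilog2(8) = 4.
 */
enum scaling { SCALING_NONE, SCALING_LOG, SCALING_LINEAR };

static unsigned int ilog2u(unsigned int x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

static unsigned int scaling_factor(unsigned int online_cpus, enum scaling mode)
{
	unsigned int cpus = online_cpus < 8 ? online_cpus : 8;	/* the cap */

	switch (mode) {
	case SCALING_NONE:
		return 1;
	case SCALING_LINEAR:
		return cpus;
	case SCALING_LOG:
	default:
		return 1 + ilog2u(cpus);
	}
}

int main(void)
{
	/* Assumed 0.75ms base for sysctl_sched_min_granularity. */
	const unsigned long base_ns = 750000UL;
	unsigned int cpus;

	for (cpus = 1; cpus <= 64; cpus *= 2)
		printf("%2u cpus -> factor %u -> min_granularity %lu ns\n",
		       cpus, scaling_factor(cpus, SCALING_LOG),
		       base_ns * scaling_factor(cpus, SCALING_LOG));
	return 0;
}

With 8 or more CPUs online that prints a factor of 4, i.e. 3ms from the
0.75ms base, so the scaling is effectively a constant on today's boxes.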
Cheers,
Phil

> a 100us value should even be enough to fix Mel's problem without
> impacting common wakeup preemption cases.
>
> >
> > -Mike
>
--
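
Read concretely, Vincent's 100us suggestion would turn the check from
his earlier diff into something like the following untested sketch; the
constant and its name are invented for illustration, not taken from any
posted patch:

/* Invented name; a fixed floor replacing the CPU-scaled sysctl. */
#define WAKEUP_PREEMPT_FLOOR_NS	100000UL	/* 100us */

	update_curr(cfs_rq_of(se));
	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
	/* Ensure that current got a chance to move forward. */
	if (delta_exec < WAKEUP_PREEMPT_FLOOR_NS)
		return;

Unlike sysctl_sched_min_granularity, such a floor would not grow with
the machine size, which is the property Vincent is after.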