Date: Mon, 30 Jan 2023 14:26:20 -1000
From: Tejun Heo
To: Josh Don
Cc: torvalds@linux-foundation.org, mingo@redhat.com, peterz@infradead.org,
    juri.lelli@redhat.com, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
    martin.lau@kernel.org, brho@google.com, pjt@google.com,
    derkling@google.com, haoluo@google.com, dvernet@meta.com,
    dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com,
    linux-kernel@vger.kernel.org, bpf@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support
References: <20230128001639.3510083-1-tj@kernel.org>
 <20230128001639.3510083-28-tj@kernel.org>

Hello,

On Mon, Jan 30, 2023 at 01:38:15PM -0800, Josh Don wrote:
> > The core-sched support is composed of the following parts:
>
> Thanks, this looks pretty reasonable overall.
>
> One meta comment is that I think we can short-circuit from
> touch_core_sched() when we have sched_core_disabled().

Yeah, touch_core_sched() is really cheap (it's just an assignment from an
rq field to a task field), but sched_core_disabled() is also just a static
branch. Will update.
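For reference, a minimal sketch of that short-circuit, assuming
touch_core_sched() keeps its current shape of copying a per-rq ordering
field into the task; the rq-side field name below is made up purely for
illustration:

static void touch_core_sched(struct rq *rq, struct task_struct *p)
{
	/*
	 * Nothing consumes the timestamp while core-sched is off, so skip
	 * the store entirely. sched_core_disabled() is a static branch and
	 * effectively free on the hot path.
	 */
	if (sched_core_disabled())
		return;

	/* hypothetical rq field standing in for the real source of the stamp */
	p->scx.core_sched_at = rq->scx.core_sched_seq;
}

Whether the check lives inside the helper or at the call sites doesn't
matter much; keeping it in the helper leaves the callers unchanged.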
> Reviewed-by: Josh Don
>
> > +	/*
> > +	 * While core-scheduling, rq lock is shared among
> > +	 * siblings but the debug annotations and rq clock
> > +	 * aren't. Do pinning dance to transfer the ownership.
> > +	 */
> > +	WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> > +	rq_unpin_lock(rq, rf);
> > +	rq_pin_lock(srq, &srf);
> > +
> > +	update_rq_clock(srq);
>
> Unfortunate that we have to do this superfluous update; maybe we can
> save/restore the clock flags from before the pinning shenanigans?

So, this one isn't really superfluous. There are two rq's involved - self
and sibling. self's rq clock is saved and restored through rq_unpin_lock()
and rq_repin_lock(). We're transferring the lock ownership from self to
sibling without actually unlocking and relocking the lock, as they should
be sharing the same lock; however, that doesn't mean that the two queues
share rq clocks, so the sibling needs to update its rq clock upon getting
the lock transferred to it.

It might make sense to make the siblings share the rq clock when core-sched
is enabled, but that's probably for some other time.

> > +static struct task_struct *pick_task_scx(struct rq *rq)
> > +{
> > +	struct task_struct *curr = rq->curr;
> > +	struct task_struct *first = first_local_task(rq);
> > +
> > +	if (curr->scx.flags & SCX_TASK_QUEUED) {
> > +		/* is curr the only runnable task? */
> > +		if (!first)
> > +			return curr;
> > +
> > +		/*
> > +		 * Does curr trump first? We can always go by core_sched_at
> > +		 * for this comparison as it represents global FIFO ordering
> > +		 * when the default core-sched ordering is in use and
> > +		 * local-DSQ FIFO ordering otherwise.
> > +		 */
> > +		if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> > +						     first->scx.core_sched_at))
> > +			return curr;
>
> So is this to handle the case where we have something running on 'rq'
> to match the cookie of our sibling (which had priority), but now we
> want to switch to running the first thing in the local queue, which
> has a different cookie (and is now the highest priority entity)? Maybe
> being slightly more specific in the comment would help :)

pick_task_scx() is to pick the next best candidate for the rq. The
candidates then compete to determine the next cookie. Here, as long as only
one task gets dispatched at any given time, the condition check shouldn't
really trigger - ie. if curr has slice remaining, balance_one() wouldn't
have populated the local DSQ anyway and first would be NULL.

However, the BPF scheduler is free to dispatch whatever tasks at any time
(e.g. scx_example_central), so it's possible that a task with an earlier
timestamp has been dispatched to the local DSQ since curr started
executing, in which case we likely want to return the first task on the DSQ
as the CPU's candidate.

IOW, the time_before64() check is there mostly to cover unusual cases. Most
should either hit the !first check earlier or fail the curr->scx.slice
test. Will update the comment to clarify.

Thanks.

-- 
tejun
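As a standalone illustration of the ordering primitive used above:
time_before64(a, b) in the kernel reduces to a signed comparison of the u64
difference, which is what lets core_sched_at serve as a FIFO key even for
large or wrapping clock values. The timestamps below are made up and only
demonstrate the two scenarios discussed; this is not part of the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Equivalent of the kernel's time_before64(a, b): "a happened before b",
 * evaluated as a signed difference so it stays correct across wraparound.
 */
static bool before64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

int main(void)
{
	/* made-up values standing in for scx.core_sched_at */
	uint64_t curr_at = 1000;	/* curr started running at 1000 */
	uint64_t first_at = 5000;	/* first was dispatched later */

	/* common case: curr is older, so with slice left it stays the pick */
	printf("curr keeps the CPU: %d\n", before64(curr_at, first_at));

	/* unusual case: the BPF scheduler later dispatched an older task */
	first_at = 500;
	printf("curr keeps the CPU: %d\n", before64(curr_at, first_at));
	return 0;
}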