2024-03-23 02:37:46

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

Hello Tejun,

On Fri, Nov 10, 2023 at 04:47:38PM -1000, Tejun Heo wrote:
[..]
> +/*
> + * Omitted operations:
> + *
> + * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task
> + * isn't tied to the CPU at that point.
> + *
> + * - migrate_task_rq: Unncessary as task to cpu mapping is transient.
> + *
> + * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
> + * their current sched_class. Call them directly from sched core instead.
> + *
> + * - task_woken, switched_from: Unnecessary.
> + */
> +DEFINE_SCHED_CLASS(ext) = {
> + .enqueue_task = enqueue_task_scx,
> + .dequeue_task = dequeue_task_scx,
> + .yield_task = yield_task_scx,
> + .yield_to_task = yield_to_task_scx,
> +
> + .wakeup_preempt = wakeup_preempt_scx,

I was wondering about the comment above related to 'wakeup_preempt', could
you clarify why it is not useful (NOOP) in the sched-ext class?

wakeup_preempt() may be called via:
sched_ttwu_pending() ->
ttwu_do_activate() ->
wakeup_preempt()


at which point the enqueue of the task could have already happened via:

sched_ttwu_pending() ->
ttwu_do_activate() ->
activate_task() ->
enqueue_task()

But the comment above says "task isn't tied to the CPU" ?

Apologies in advance if I missed something as I just recently starting
looking into the sched-ext patches.

thanks!

- Joel



2024-03-23 22:12:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

Hello, Joel.

On Fri, Mar 22, 2024 at 10:37:32PM -0400, Joel Fernandes wrote:
..
> I was wondering about the comment above related to 'wakeup_preempt', could
> you clarify why it is not useful (NOOP) in the sched-ext class?
>
> wakeup_preempt() may be called via:
> sched_ttwu_pending() ->
> ttwu_do_activate() ->
> wakeup_preempt()
>
>
> at which point the enqueue of the task could have already happened via:
>
> sched_ttwu_pending() ->
> ttwu_do_activate() ->
> activate_task() ->
> enqueue_task()
>
> But the comment above says "task isn't tied to the CPU" ?

In sched_ext, a scheduling queue isn't tied to a particular CPU. For
example, it's trivial to share a single global scheduling queue across the
whole system or any subset of CPUs. To support this behavior, tasks can be
hot-migrated in the dispatch path just before it starts executing:

https://github.com/sched-ext/sched_ext/blob/sched_ext/kernel/sched/ext.c#L1335

So, the CPU picked by ops.select_cpu() in the enqueue path often doesn't
determine the CPU the task is going to execute on. If the picked CPU matches
the CPU the task is eventually going to run on, there's a small performance
advantage as the later hot migration can be avoided.

As the task isn't actually tied to the CPU being picked, it's a bit awkward
to ask "does this task preempt this CPU?" Instead, preemption is implemented
by calling scx_bpf_kick_cpu() w/ SCX_KICK_PREEMPT or using the
SCX_ENQ_PREEMPT flag from the enqueue path which allows preempting any CPU.

> Apologies in advance if I missed something as I just recently starting
> looking into the sched-ext patches.

No worries. Happy to answer any questions.

Thanks.

--
tejun

2024-04-25 21:31:51

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

On Sat, Mar 23, 2024 at 6:12 PM Tejun Heo <[email protected]> wrote:
>
> Hello, Joel.
>
> On Fri, Mar 22, 2024 at 10:37:32PM -0400, Joel Fernandes wrote:
> ...
> > I was wondering about the comment above related to 'wakeup_preempt', could
> > you clarify why it is not useful (NOOP) in the sched-ext class?
> >
> > wakeup_preempt() may be called via:
> > sched_ttwu_pending() ->
> > ttwu_do_activate() ->
> > wakeup_preempt()
> >
> >
> > at which point the enqueue of the task could have already happened via:
> >
> > sched_ttwu_pending() ->
> > ttwu_do_activate() ->
> > activate_task() ->
> > enqueue_task()
> >
> > But the comment above says "task isn't tied to the CPU" ?
>
> In sched_ext, a scheduling queue isn't tied to a particular CPU. For
> example, it's trivial to share a single global scheduling queue across the
> whole system or any subset of CPUs. To support this behavior, tasks can be
> hot-migrated in the dispatch path just before it starts executing:
>
> https://github.com/sched-ext/sched_ext/blob/sched_ext/kernel/sched/ext.c#L1335
>
> So, the CPU picked by ops.select_cpu() in the enqueue path often doesn't
> determine the CPU the task is going to execute on. If the picked CPU matches
> the CPU the task is eventually going to run on, there's a small performance
> advantage as the later hot migration can be avoided.
>
> As the task isn't actually tied to the CPU being picked, it's a bit awkward
> to ask "does this task preempt this CPU?" Instead, preemption is implemented
> by calling scx_bpf_kick_cpu() w/ SCX_KICK_PREEMPT or using the
> SCX_ENQ_PREEMPT flag from the enqueue path which allows preempting any CPU.
>

Got it. I took some time to look at it some more. Now I am wondering
why check_preempt_curr() has to be separately implemented for a class
and why the enqueue() handler of each class cannot take care of
preempting curr via setting resched flags.

The only reason I can see is that, activate_task() is not always
followed by a check_preempt_curr() and sometimes there is an
unconditional resched_curr() happening following the call to
activate_task().

But such issues don't affect sched_ext in its current form I guess.

Btw, if sched_ext were to be implemented as a higher priority class
above CFS [1], then check_preempt_curr() may preempt without even
calling the class's check_preempt_curr() :

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
if (p->sched_class == rq->curr->sched_class)
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class))
resched_curr(rq);

But if I understand, sched_ext is below CFS at the moment, so that
should not be an issue.

[1] By the way, now that I brought up the higher priority class thing,
I might as well discuss it here :-D :

One of my use cases is about scheduling high priority latency sensitive threads:
I think if sched_ext could have 2 classes, one lower than CFS and one
above CFS, that would be beneficial to those who want a gradual
transition to use scx, instead of switching all tasks to scx at once.

One reason is EAS (in CFS). It may be beneficial for people to use
the existing EAS for everything but latency critical tasks (much like
how people use RT class for those). This is quite involved and
reimplementing EAS in BPF may be quite a project. Not that it
shouldn't be implemented that way, but EAS is about a decade old with
all kinds of energy modeling, math and what not. Having scx higher
than cfs alongside the lower one is less of an invasive approach than
switching everything on the system to scx.

Do you have any opinions on that? If it makes sense, I can work on
such an implementation.

Another reason for this is, general purpose systems run very varied
workloads, and big dramatic changes are likely to be reverted due to
power and performance regressions. Hence, the request for a higher
scx, so that we (high priority task scx users) can take baby steps.

thanks,

- Joel

2024-04-26 16:59:50

by Barret Rhoden

[permalink] [raw]
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

On 4/25/24 17:28, Joel Fernandes wrote:
> But if I understand, sched_ext is below CFS at the moment, so that
> should not be an issue.
>
> [1] By the way, now that I brought up the higher priority class thing,
> I might as well discuss it here :-D :
>
> One of my use cases is about scheduling high priority latency sensitive threads:
> I think if sched_ext could have 2 classes, one lower than CFS and one
> above CFS, that would be beneficial to those who want a gradual
> transition to use scx, instead of switching all tasks to scx at once.

The way we initially went with Ghost (which is the Google project
similar to sched_ext) was to run Ghost below CFS. That was a "safety
first" approach, where we didn't have a lot of faith in the schedulers
we were writing and wanted to convince people (including ourselves) that
we wouldn't completely hose a machine.

For the same reason you pointed out, we eventually wanted to run our
schedulers above CFS. Currently, Ghost has the option to run above or
below CFS, and we're pretty close to being able to run all of our
schedulers above CFS. Once we do that, I imagine we'll drop the "above
or below" aspect, since it's a bit more complicated.

It's actually even more complicated than that - ghost also has a
separate "ghost agent" sched class above CFS, even if ghost itself is
below CFS. This lets us run a single userspace "agent task" on a cpu to
make a scheduling decision.

Anyway, when we port Ghost to run on sched_ext instead of our custom
ghost sched class(es), we'll be running sched_ext above CFS, and I doubt
we'll want to run below CFS at all. Though whether that's a local patch
we carry, or the default upstream probably depends on what everyone else
wants too.

Thanks,
Barret



2024-04-26 21:58:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

Hello, Joel.

On Thu, Apr 25, 2024 at 05:28:40PM -0400, Joel Fernandes wrote:
> Got it. I took some time to look at it some more. Now I am wondering
> why check_preempt_curr() has to be separately implemented for a class
> and why the enqueue() handler of each class cannot take care of
> preempting curr via setting resched flags.
>
> The only reason I can see is that, activate_task() is not always
> followed by a check_preempt_curr() and sometimes there is an
> unconditional resched_curr() happening following the call to
> activate_task().
>
> But such issues don't affect sched_ext in its current form I guess.

There's ttwu_runnable() path which just changes the target task's state and
then checks for preemption. The path doesn't involve enqueueing but can
still preempt. Maybe SCX might need to support this in the future too but it
doesn't seem pressing.

> Btw, if sched_ext were to be implemented as a higher priority class
> above CFS [1], then check_preempt_curr() may preempt without even
> calling the class's check_preempt_curr() :
>
> void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
> {
> if (p->sched_class == rq->curr->sched_class)
> rq->curr->sched_class->check_preempt_curr(rq, p, flags);
> else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> resched_curr(rq);
>
> But if I understand, sched_ext is below CFS at the moment, so that
> should not be an issue.
>
> [1] By the way, now that I brought up the higher priority class thing,
> I might as well discuss it here :-D :
>
> One of my use cases is about scheduling high priority latency sensitive threads:
> I think if sched_ext could have 2 classes, one lower than CFS and one
> above CFS, that would be beneficial to those who want a gradual
> transition to use scx, instead of switching all tasks to scx at once.
>
> One reason is EAS (in CFS). It may be beneficial for people to use
> the existing EAS for everything but latency critical tasks (much like
> how people use RT class for those). This is quite involved and
> reimplementing EAS in BPF may be quite a project. Not that it
> shouldn't be implemented that way, but EAS is about a decade old with
> all kinds of energy modeling, math and what not. Having scx higher
> than cfs alongside the lower one is less of an invasive approach than
> switching everything on the system to scx.

I see.

> Do you have any opinions on that? If it makes sense, I can work on
> such an implementation.
>
> Another reason for this is, general purpose systems run very varied
> workloads, and big dramatic changes are likely to be reverted due to
> power and performance regressions. Hence, the request for a higher
> scx, so that we (high priority task scx users) can take baby steps.

Yeah, as a use case, it makes sense to me. Would it suffice to be able to
choose between above or below the fair class tho?

Thanks.

--
tejun